# Friday, 25 May 2018

One of the challenges of working with data is what to do with missing data.

A missing column in a dataset can be, but is not limited to the following:

  • No text or numbers between 2 column delimiters
  • An empty string ("")
  • A blank string (e.g., "   ")
  • The number 0
  • A special indicator, such as "NA" or "NONE"
  • An inconsistent data type, such as a number where a string is expected
  • A value that makes no sense in the context of the data.

The last one requires some domain knowledge about the data, so it is often difficult to spot.

There are two strategies for dealing with missing data

  1. Delete or ignore the entire row
  2. Replace the column with a reasonable value.

If only a few rows contain missing data, it may be efficient to simply delete these rows.

But if many rows contain missing data, it probably makes sense to keep them as other columns may contain valuable information. In this case, we will want to replace the missing data with a reasonable value.

But what is a reasonable value?

Options include replacing the column with an average value, such as the mean or median of the non-missing values. Of course, this is only valid for numeric data that is ordinal, that is data in which higher numbers indicate a higher value and not simply a discrete category.

The Pandas library contains some simple functions for deleting rows and replacing values. The fillna function is the simplest way to do this.

# Replace all missing values with 0
df.fillna(0)

# Replace all missing values with the string 'Missing'
df.fillna('Missing')
  

You can delete invalid or missing rows by overwriting a dataset with a filtered version of that set, as in the following examples

# Delete all rows with area = 0 
df = df[df.area != 0]

# Delete all rows with null area 
df = df[df.area.notnull()] 
  

But for values that are not missing, but are inappropriate (e.g., using 0 to represent a missing data point, when 0 could be a valid measurement), we can use the map function.

Below, we use the map function to want to replace any value in the 'area' column that has a value of 0 with the mean value for this column.

# Replace 0 area with the mean
mean_area = df['area'].mean()   
df['area'] = df['area'].map({0: mean_area})    
  

In this article, we discussed ways to use Pandas handle missing data in a dataframe.

Comments are closed.