-
Book Overview & Buying
-
Table Of Contents
Exploratory Data Analysis with Python Cookbook
By :
Dealing with missing values is a common problem we will typically face when analyzing data. A missing value is a value within a field or variable that is not present, even though it is expected to be. There are several reasons why this could have happened, but a common reason is that the data value wasn’t provided at the point of data collection. As we explore and analyze data, missing values can easily lead to inaccurate or biased conclusions; therefore, they need to be taken care of. Missing values are typically represented by blank spaces, but in pandas, they are represented by NaN.
Several techniques can be used to deal with missing values. In this recipe, we will focus on dropping missing values using the dropna method in pandas.
We will work with the full Marketing Campaign data in this recipe.
We will drop rows and columns using the pandas library:
pandas library:import pandas as pd
.csv into a dataframe using read_csv. Then, subset the dataframe to include only relevant columns:marketing_data = pd.read_csv("data/marketing_campaign.csv")marketing_data = marketing_data[['ID', 'Year_Birth', 'Education','Income']]
marketing_data.head()
ID Year_Birth Education Income
0 5524 1957 Graduation 58138.0
1 2174 1954 Graduation 46344.0
2 4141 1965 Graduation 71613.0
3 6182 1984 Graduation 26646.0
4 5324 1981 PhD 58293.0
marketing_data.shape
(2240, 4)
isnull and sum methods:marketing_data.isnull().sum()
ID 0
Year_Birth 0
Education 0
Income 24
dropna method:marketing_data_withoutna = marketing_data.dropna(how = 'any')
marketing_data_withoutna.shape
(2216, 4)
Good job! We have dropped missing values from our dataset.
In step 1, we import pandas and refer to it as pd. In step 2, we use read_csv to load the .csv file into a pandas dataframe and call it marketing_data. We also subset the dataframe to include only four relevant columns. In step 3, we inspect the dataset using the head and shape methods.
In step 4, we use the isnull and sum methods to check for missing values. These methods give us the number of rows with missing values within each column in our dataset. Columns with zero have no rows with missing values.
In step 5, we use the dropna method to drop missing values. For the how parameter, we supply 'any' as the value to indicate that we want to drop rows that have missing values in any of the columns. An alternative value to use is 'all', which ensures all the columns of a row have missing values before dropping the row. We then check the number of rows and columns using the shape method. We can see that the final dataset has 24 fewer rows.
As highlighted previously, there are several reasons why a value may be missing in our dataset. Understanding the reason can point us to optimal solutions to resolve this problem. Missing values shouldn’t be addressed with a one-size-fits-all approach. Chapter 9 dives into the details of how to optimally deal with missing values and outliers by providing several techniques.
You can check out a detailed approach to dealing with missing values and outliers in Chapter 9, Dealing with Outliers and Missing Values.