Now, let's perform some data manipulation steps:
- First, we will read the data in HousePrices.csv from our current working directory and create our first DataFrame for manipulation. We name the DataFrame housepricesdata, as follows:
housepricesdata = pd.read_csv("HousePrices.csv")
- Let's now take a look at our DataFrame and see how it looks:
# See first five observations from top
housepricesdata.head(5)
You might not be able to see all the rows and columns, because the output is truncated by default when a DataFrame is large. To view all of the rows and columns for any output in Jupyter, execute the following commands:
# Setting options to display all rows and columns
pd.options.display.max_rows = None
pd.options.display.max_columns = None
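Setting these options changes the display for the rest of the session. If you only want the full view for a single output, one possible approach is pandas' option_context() context manager, which restores the previous settings when the block ends. The following is a minimal sketch; the toy DataFrame and its column names are made up for illustration and stand in for housepricesdata:

```python
import pandas as pd

# Toy frame with 30 columns standing in for housepricesdata (values are made up)
toy = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

# Options set here apply only inside the with-block
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_view = repr(toy)  # the text representation now includes every column
    inside = pd.get_option("display.max_columns")

# Outside the block, the previous setting is back in force
outside = pd.get_option("display.max_columns")
```

This keeps the default truncation for large outputs elsewhere in the notebook.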
- We can see the dimensions of the DataFrame with shape. shape is an attribute of the pandas DataFrame:
housepricesdata.shape
With the preceding command, we can see the number of rows and columns, as follows:
(1460, 81)
Here, we can see that the DataFrame has 1460 observations and 81 columns.
- Let's take a look at the datatypes of the variables in the DataFrame:
housepricesdata.dtypes
In the following code block, we can see the datatypes of each variable in the DataFrame:
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
LotConfig object
LandSlope object
...
BedroomAbvGr int64
KitchenAbvGr int64
KitchenQual object
TotRmsAbvGrd int64
SaleCondition object
SalePrice int64
Length: 81, dtype: object
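With 81 columns, it can help to split the column names by datatype rather than read the full listing. A minimal sketch using select_dtypes() follows; the toy DataFrame is made up for illustration and only borrows three column names from the dataset:

```python
import pandas as pd

# Toy frame standing in for housepricesdata (values are made up)
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotFrontage": [65.0, 80.0, 68.0],
    "LotArea": [8450, 9600, 11250],
})

# Columns holding integers or floats
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Everything else, which here means the text (label) columns
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(non_numeric_columns)
```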
We're now all ready to start with our data manipulation, which we can do in many different ways. In this section, we'll look at a few ways in which we can manipulate and prepare our data for the purpose of analysis.
Let's start by summarizing our data.
- By default, the describe() method shows summary statistics for the numerical variables only:
housepricesdata.describe()
We can see the output in the following screenshot:
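The non-numerical variables can be summarized in a similar way. As a sketch, and assuming a toy DataFrame with made-up values in place of housepricesdata, calling describe() on only the non-numeric columns reports the count, the number of unique levels, the most frequent level, and its frequency:

```python
import pandas as pd

# Toy frame standing in for housepricesdata (values are made up)
toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg", "Reg"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep only the non-numeric columns, then summarize them
label_summary = toy.select_dtypes(exclude="number").describe()

print(label_summary)
```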
- We will remove the Id column, as it is only a row identifier and is not needed for our analysis:
# inplace=True will overwrite the DataFrame after dropping Id column
housepricesdata.drop(['Id'], axis=1, inplace=True)
- Let's now look at the distribution of some of the object type variables, that is, the categorical variables. In the following example, we are going to look at LotShape and LandContour. We can study the other categorical variables of the dataset in the same way as shown in the following code block:
# Build a frequency table for each variable with pd.crosstab()
# Name the count column as "count"
lotshape_frequencies = pd.crosstab(index=housepricesdata["LotShape"], columns="count")
landcontour_frequencies = pd.crosstab(index=housepricesdata["LandContour"], columns="count")
print(lotshape_frequencies)
print("\n") # to keep a blank line for display
print(landcontour_frequencies)
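For a single variable, value_counts() gives the same counts as the crosstab. The sketch below compares the two on a toy LotShape column with made-up values:

```python
import pandas as pd

# Toy column standing in for housepricesdata["LotShape"] (values are made up)
toy = pd.DataFrame({"LotShape": ["Reg", "IR1", "Reg", "IR2", "IR1", "Reg"]})

# Frequency table via crosstab, as in the step above
crosstab_counts = pd.crosstab(index=toy["LotShape"], columns="count")

# Same counts via value_counts(); sort_index() orders the levels alphabetically
level_counts = toy["LotShape"].value_counts().sort_index()

print(crosstab_counts)
print(level_counts)
```

Note that value_counts() returns a Series ordered by frequency unless you sort it, whereas crosstab() returns a DataFrame ordered by level.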
- We will now see how to perform a conversion between datatypes. According to the data definition, variables such as MSSubClass, OverallQual, and OverallCond are categorical. After importing the dataset, however, they appear as integers.
Before typecasting any variable, make sure that it has no missing values.
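One way to carry out this check is to count the missing values in the columns you intend to convert. The following sketch uses a toy DataFrame with made-up values, in which one MSSubClass entry has deliberately been left missing:

```python
import pandas as pd

# Toy frame standing in for housepricesdata (values are made up).
# The missing entry forces MSSubClass to be stored as float64.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, None, 20],
    "OverallQual": [7, 6, 7, 8],
})

columns_to_cast = ["MSSubClass", "OverallQual"]

# Number of missing values in each column we plan to convert
missing_counts = toy[columns_to_cast].isnull().sum()

# Columns with no missing values
safe_to_cast = missing_counts[missing_counts == 0].index.tolist()

print(missing_counts)
print(safe_to_cast)
```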
Here, we'll convert the variables to a categorical datatype:
# Using astype() to cast a pandas object to a specified datatype
# 'category' is the pandas categorical datatype
housepricesdata['MSSubClass'] = housepricesdata['MSSubClass'].astype('category')
housepricesdata['OverallQual'] = housepricesdata['OverallQual'].astype('category')
housepricesdata['OverallCond'] = housepricesdata['OverallCond'].astype('category')
# Check the datatype of MSSubClass after type conversion
print(housepricesdata['MSSubClass'].dtype)
print('\n') # to keep a blank line for display
# Check the distribution of the levels in MSSubClass after conversion
# Make a crosstab with pd.crosstab()
# Name the count column as "count"
print(pd.crosstab(index=housepricesdata["MSSubClass"], columns="count"))
We can see the count of observations for each category of houses, as shown in the following code block:
category
col_0 count
MSSubClass
20 536
30 69
40 4
45 12
50 144
60 299
70 60
75 16
80 58
85 20
90 52
120 87
160 63
180 10
190 30
There are many variables that might not be very useful by themselves, but transforming them gives us a lot of interesting insights. Let's create some new, meaningful variables.
- YearBuilt and YearRemodAdd represent the original construction year and the remodel year, respectively. If we convert them into ages, they will tell us how old the buildings are and how many years it has been since they were remodeled. To do this, we create two new variables, building_age and remodelled_age:
# Importing datetime package for date time operations
import datetime as dt
# using date time package to find the current year
current_year = int(dt.datetime.now().year)
# Subtracting the YearBuilt from current_year to find out the age of the building
building_age = current_year - housepricesdata['YearBuilt']
# Subtracting the YearRemodAdd from current_year to find out the number of
# years since the building was remodelled
remodelled_age = current_year - housepricesdata['YearRemodAdd']
- Now, let's add the two variables to our dataset:
# Adding the two variables to the DataFrame
housepricesdata['building_age'] = building_age
housepricesdata['remodelled_age'] = remodelled_age
# Checking our DataFrame to see if the two variables got added
housepricesdata.head(5)
We notice that building_age and remodelled_age are now added to the DataFrame, as shown in the following screenshot:
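Because current_year comes from the system clock, the two age variables change depending on when the code is run. If you need reproducible values, one option is to fix the reference year explicitly. The sketch below is an illustration only: the toy values are made up, and reference_year = 2010 is a hypothetical choice rather than a value taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for housepricesdata (values are made up)
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1990],
})

# Hypothetical fixed reference year, chosen here only for illustration
reference_year = 2010

# assign() returns a new DataFrame with both age columns added
toy = toy.assign(
    building_age=reference_year - toy["YearBuilt"],
    remodelled_age=reference_year - toy["YearRemodAdd"],
)

print(toy)
```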
Variables that contain label data need to be converted into a numerical form before most machine learning algorithms can use them. We will therefore encode the labels as numbers.
- We need to identify the variables that need encoding, which include Street, LotShape, and LandContour. We will perform one-hot encoding, which represents each level of a categorical variable as a separate binary column. We will use pandas to do this, taking LotShape as the example:
# We use get_dummies() function to one-hot encode LotShape
one_hot_encoded_variables = pd.get_dummies(housepricesdata['LotShape'],prefix='LotShape')
# Print the one-hot encoded variables to see what they look like
print(one_hot_encoded_variables)
We can see the one-hot encoded variables that have been created in the following screenshot:
- Add the one-hot encoded variables to our DataFrame, as follows:
# Adding the newly created one-hot encoded variables to our DataFrame
housepricesdata = pd.concat([housepricesdata,one_hot_encoded_variables],axis=1)
# Let's take a look at the added one-hot encoded variables
# Scroll right to view the added variables
housepricesdata.head(5)
We can see the output that we get after adding the one-hot encoded variables to the DataFrame in the following screenshot:
- Now, let's remove the original LotShape variable, since we have already created its one-hot encoded variables:
# Dropping the original variable after one-hot encoding it
# inplace = True option will overwrite the DataFrame
housepricesdata.drop(['LotShape'],axis=1, inplace=True)
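The encode, concatenate, and drop steps above can also be combined. As a sketch, passing a DataFrame and a columns list to get_dummies() encodes several variables at once and leaves out the original columns in the result. The toy DataFrame below uses made-up values and only borrows column names from the dataset:

```python
import pandas as pd

# Toy frame standing in for housepricesdata (values are made up)
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "LotShape": ["Reg", "IR1", "Reg"],
    "SalePrice": [208500, 181500, 223500],
})

# One-hot encode both label columns in one call; each new column is
# prefixed with the name of the column it came from
encoded = pd.get_dummies(toy, columns=["Street", "LotShape"])

print(encoded.columns.tolist())
```

Columns not listed in columns, such as SalePrice here, are carried over unchanged.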