
## Chapter 5: Ensemble Modeling

### Activity 14: Stacking with Standalone and Ensemble Algorithms

Solution

1. Import the relevant libraries:

```python
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
```
2. Read the data and print the first five rows:

```python
data = pd.read_csv('house_prices.csv')
data.head()
```

The output will be as follows:

Figure 5.19: The first 5 rows

3. Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.

First, we remove all columns where more than 10% of the values are null. To do this, calculate the fraction of missing values by using the .isnull() method to get a mask DataFrame and apply the .mean() method to get the fraction of null values in each column. Multiply the result by 100 to get the series as percentage values.

Then, find the subset of the series having a percentage value lower than 10 and save the index (which will give us the column names) as a list. Print the list to see the columns we get:

```python
perc_missing = data.isnull().mean()*100
cols = perc_missing[perc_missing < 10].index.tolist()
cols
```

The output will be:

Figure 5.20: Output of preprocessing the dataset
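To see why the mean of the null mask gives the fraction of missing values, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration, and the threshold is raised to 30% so that the small example keeps and drops at least one column each.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'b' is 50% null, 'c' is 25% null
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': ['x', None, 'y', 'z'],
})

# .isnull() gives a boolean mask; the column-wise mean of a boolean
# mask is the fraction of True values, that is, the fraction of nulls
toy_perc_missing = toy.isnull().mean() * 100

# Keep only the columns with fewer than 30% nulls (illustrative threshold)
toy_cols = toy_perc_missing[toy_perc_missing < 30].index.tolist()

print(toy_perc_missing.to_dict())  # {'a': 0.0, 'b': 50.0, 'c': 25.0}
print(toy_cols)                    # ['a', 'c']
```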

Because the first column is id, which adds no value to the model, we exclude it as well.

We subset the data to include every column in the cols list except the first element, which is id:

`data = data.loc[:, cols[1:]]`

For the categorical variables, we replace null values with a string, NA, and one-hot encode the columns using pandas' .get_dummies() method, while for the numerical variables we will replace the null values with -1. Then, we combine the numerical and categorical columns to get the final DataFrame:

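```python
# Select the string (object-dtype) columns, fill nulls with 'NA', then one-hot encode
data_obj = pd.get_dummies(data.select_dtypes(include=[object]).fillna('NA'))
# Select the numerical columns and fill nulls with -1
data_num = data.select_dtypes(include=[np.number]).fillna(-1)

data_final = pd.concat([data_obj, data_num], axis=1)
```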
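The effect of this preprocessing is easier to see on a tiny example. The sketch below uses a hypothetical two-column frame (the names zone and area and their values are illustrative, not taken from the dataset) and applies the same fill-then-encode pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical and one numerical column, each with a null
toy = pd.DataFrame({
    'zone': pd.Series(['RL', None, 'RM'], dtype=object),
    'area': [1200.0, np.nan, 900.0],
})

# Categorical columns: nulls become the literal string 'NA',
# which then gets its own indicator column
toy_obj = pd.get_dummies(toy.select_dtypes(include=[object]).fillna('NA'))

# Numerical columns: nulls become the sentinel value -1
toy_num = toy.select_dtypes(include=[np.number]).fillna(-1)

toy_final = pd.concat([toy_obj, toy_num], axis=1)
print(toy_final.columns.tolist())
```

Each distinct category, including the NA placeholder, becomes one indicator column, so the width of the final frame grows with the number of categories.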
4. Divide the dataset into train and validation DataFrames.

We use scikit-learn's train_test_split() method to divide the final DataFrame into training and validation sets in the ratio 4:1. We further split each of the two sets into their respective x and y values to represent the features and target variable respectively:

```python
train, val = train_test_split(data_final, test_size=0.2, random_state=11)

x_train = train.drop(columns=['SalePrice'])
y_train = train['SalePrice'].values

x_val = val.drop(columns=['SalePrice'])
y_val = val['SalePrice'].values
```
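As a quick sanity check of the 4:1 ratio, the following sketch splits a hypothetical ten-row frame with the same test_size and random_state. The feature column and its values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame with one feature and the target column
toy = pd.DataFrame({
    'feature': np.arange(10),
    'SalePrice': np.arange(10) * 1000,
})

# test_size=0.2 sends 2 of the 10 rows to validation and 8 to training
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=11)

toy_x_train = toy_train.drop(columns=['SalePrice'])
toy_y_train = toy_train['SalePrice'].values

print(toy_train.shape, toy_val.shape)  # (8, 2) (2, 2)
```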
5. Initialize dictionaries in which to store train and validation MAE values. We will create two dictionaries, in which we will store the MAE values on the train and validation datasets:

`train_mae_values, val_mae_values = {}, {}`
6. Train a decision tree model and save the scores. We will use scikit-learn's DecisionTreeRegressor class to train a regression model using a single decision tree:

```python
# Decision Tree

dt_params = {
    # 'mae' is the older spelling; scikit-learn 1.0+ names this criterion 'absolute_error'
    'criterion': 'mae',
    'min_samples_leaf': 10,
    'random_state': 11
}

dt = DecisionTreeRegressor(**dt_params)

dt.fit(x_train, y_train)
dt_preds_train = dt.predict(x_train)
dt_preds_val = dt.predict(x_val)

train_mae_values['dt'] = mean_absolute_error(y_true=y_train, y_pred=dt_preds_train)
val_mae_values['dt'] = mean_absolute_error(y_true=y_val, y_pred=dt_preds_val)
```
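The MAE stored here is the average of the absolute differences between the true and predicted values. A minimal sketch with three made-up prices shows that computing it by hand agrees with mean_absolute_error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(manual_mae, sklearn_mae)
```

Because MAE is expressed in the same units as the target, a value here can be read directly as an average error in sale price.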
7. Train a k-nearest neighbors model and save the scores. We will use scikit-learn's KNeighborsRegressor class to train a regression model with k=5:

```python
# k-Nearest Neighbors

knn_params = {
    'n_neighbors': 5
}

knn = KNeighborsRegressor(**knn_params)

knn.fit(x_train, y_train)
knn_preds_train = knn.predict(x_train)
knn_preds_val = knn.predict(x_val)

train_mae_values['knn'] = mean_absolute_error(y_true=y_train, y_pred=knn_preds_train)
val_mae_values['knn'] = mean_absolute_error(y_true=y_val, y_pred=knn_preds_val)
```
8. Train a Random Forest model and save the scores. We will use scikit-learn's RandomForestRegressor class to train a regression model using bagging:

```python
# Random Forest

rf_params = {
    'n_estimators': 50,
    # 'mae' is the older spelling; scikit-learn 1.0+ names this criterion 'absolute_error'
    'criterion': 'mae',
    'max_features': 'sqrt',
    'min_samples_leaf': 10,
    'random_state': 11,
    'n_jobs': -1
}

rf = RandomForestRegressor(**rf_params)

rf.fit(x_train, y_train)
rf_preds_train = rf.predict(x_train)
rf_preds_val = rf.predict(x_val)

train_mae_values['rf'] = mean_absolute_error(y_true=y_train, y_pred=rf_preds_train)
val_mae_values['rf'] = mean_absolute_error(y_true=y_val, y_pred=rf_preds_val)
```
9. Train a gradient boosting model and save the scores. We will use scikit-learn's GradientBoostingRegressor class to train a boosted regression model:

```python
# Gradient Boosting

gbr_params = {
    'n_estimators': 50,
    # 'mae' is accepted only by older scikit-learn releases;
    # newer ones expect 'friedman_mse' or 'squared_error' here
    'criterion': 'mae',
    'max_features': 'sqrt',
    'max_depth': 3,
    'min_samples_leaf': 5,
    'random_state': 11
}

gbr = GradientBoostingRegressor(**gbr_params)

gbr.fit(x_train, y_train)
gbr_preds_train = gbr.predict(x_train)
gbr_preds_val = gbr.predict(x_val)

train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, y_pred=gbr_preds_train)
val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, y_pred=gbr_preds_val)
```
10. Prepare the training and validation datasets for the stacked model, using the four base estimators with the same hyperparameters as in the previous steps. We will create a num_base_predictors variable that holds the number of base estimators in the stacked model; it determines the shape of the training and validation datasets. This step can be coded almost identically to the exercise in the chapter, with a different number (and type) of base estimators.

11. First, we create a new training set with additional columns for predictions from base predictors, in the same way as was done previously:

```python
num_base_predictors = len(train_mae_values) # 4

x_train_with_metapreds = np.zeros((x_train.shape[0], x_train.shape[1]+num_base_predictors))
x_train_with_metapreds[:, :-num_base_predictors] = x_train
x_train_with_metapreds[:, -num_base_predictors:] = -1
```

Then, we train the base models using the k-fold strategy. We save the predictions in each iteration in a list, and iterate over the list to assign the predictions to the columns in that fold:

```python
# Folds are contiguous and unshuffled, so no random_state is needed
kf = KFold(n_splits=5)

for train_indices, val_indices in kf.split(x_train):
    kfold_x_train, kfold_x_val = x_train.iloc[train_indices], x_train.iloc[val_indices]
    kfold_y_train, kfold_y_val = y_train[train_indices], y_train[val_indices]

    predictions = []

    dt = DecisionTreeRegressor(**dt_params)
    dt.fit(kfold_x_train, kfold_y_train)
    predictions.append(dt.predict(kfold_x_val))

    knn = KNeighborsRegressor(**knn_params)
    knn.fit(kfold_x_train, kfold_y_train)
    predictions.append(knn.predict(kfold_x_val))

    rf = RandomForestRegressor(**rf_params)
    rf.fit(kfold_x_train, kfold_y_train)
    predictions.append(rf.predict(kfold_x_val))

    gbr = GradientBoostingRegressor(**gbr_params)
    gbr.fit(kfold_x_train, kfold_y_train)
    predictions.append(gbr.predict(kfold_x_val))

    # Write each model's predictions for this fold into its meta-feature column
    for i, preds in enumerate(predictions):
        x_train_with_metapreds[val_indices, -(i+1)] = preds
```
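The loop above repeats the same fit-and-predict pattern for every base model, so it can also be written as a small helper. The sketch below is one possible refactoring, not the book's code: the function name out_of_fold_predictions, the make_models list of factories, and the synthetic data are all hypothetical. It fills the meta-feature columns in list order, whereas the loop above fills them from the last column backward; either order works as long as the training and validation sets use the same one.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def out_of_fold_predictions(make_models, X, y, n_splits=5):
    """Return an (n_samples, n_models) array of out-of-fold predictions."""
    oof = np.full((X.shape[0], len(make_models)), np.nan)
    for train_idx, val_idx in KFold(n_splits=n_splits).split(X):
        for col, make_model in enumerate(make_models):
            model = make_model()  # fresh, unfitted estimator for every fold
            model.fit(X[train_idx], y[train_idx])
            # Each row is predicted by a model that never saw it during fitting
            oof[val_idx, col] = model.predict(X[val_idx])
    return oof


# Synthetic regression data, used only to exercise the helper
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)

make_models = [
    lambda: DecisionTreeRegressor(min_samples_leaf=5, random_state=11),
    lambda: KNeighborsRegressor(n_neighbors=5),
]

oof_demo = out_of_fold_predictions(make_models, X_demo, y_demo)

# Original features followed by one meta-feature column per base model
X_demo_stacked = np.hstack([X_demo, oof_demo])
print(X_demo_stacked.shape)  # (40, 5)
```

Because every validation fold is predicted exactly once, no entry of the meta-feature array is left unfilled, and the stacked model never trains on a prediction that a base model made for a row it was fitted on.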

After that, we create a new validation set with additional columns for predictions from base predictors:

```python
x_val_with_metapreds = np.zeros((x_val.shape[0], x_val.shape[1]+num_base_predictors))
x_val_with_metapreds[:, :-num_base_predictors] = x_val
x_val_with_metapreds[:, -num_base_predictors:] = -1
```
12. Lastly, we fit the base models on the complete training set to get meta features for the validation set:

```python
predictions = []

dt = DecisionTreeRegressor(**dt_params)
dt.fit(x_train, y_train)
predictions.append(dt.predict(x_val))

knn = KNeighborsRegressor(**knn_params)
knn.fit(x_train, y_train)
predictions.append(knn.predict(x_val))

rf = RandomForestRegressor(**rf_params)
rf.fit(x_train, y_train)
predictions.append(rf.predict(x_val))

gbr = GradientBoostingRegressor(**gbr_params)
gbr.fit(x_train, y_train)
predictions.append(gbr.predict(x_val))

# Same column order as the training meta features
for i, preds in enumerate(predictions):
    x_val_with_metapreds[:, -(i+1)] = preds
```
13. Train a linear regression model as the stacked model. To train the stacked model, we fit the linear regression model on all the columns of the training dataset, plus the meta predictions from the base estimators. We then use the final predictions to calculate the MAE values, which we store in the same train_mae_values and val_mae_values dictionaries:

```python
# Default settings: the features are not normalized before fitting
lr = LinearRegression()
lr.fit(x_train_with_metapreds, y_train)
lr_preds_train = lr.predict(x_train_with_metapreds)
lr_preds_val = lr.predict(x_val_with_metapreds)

train_mae_values['lr'] = mean_absolute_error(y_true=y_train, y_pred=lr_preds_train)
val_mae_values['lr'] = mean_absolute_error(y_true=y_val, y_pred=lr_preds_val)
```
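For comparison, scikit-learn (version 0.22 and later) ships a StackingRegressor class that packages the same idea: out-of-fold predictions from the base estimators feed a final estimator, and passthrough=True also passes the original features through. The sketch below is an alternative to the manual procedure in this activity, not the method the activity uses; it runs on synthetic data with only two base models, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.2, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=11)

stack = StackingRegressor(
    estimators=[
        ('dt', DecisionTreeRegressor(min_samples_leaf=10, random_state=11)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=LinearRegression(),
    cv=5,              # 5-fold out-of-fold predictions as meta features
    passthrough=True,  # final estimator also sees the original features
)
stack.fit(X_tr, y_tr)

stack_preds_val = stack.predict(X_va)
stack_val_mae = mean_absolute_error(y_true=y_va, y_pred=stack_preds_val)
print(stack_val_mae)
```

Writing the procedure by hand, as this activity does, makes each stage visible; the built-in class is shorter and less prone to column-ordering mistakes.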
14. Visualize the train and validation errors for each individual model and the stacked model. First, we convert the dictionaries into two series and combine them as the two columns of a pandas DataFrame:

```python
mae_scores = pd.concat([pd.Series(train_mae_values, name='train'),
                        pd.Series(val_mae_values, name='val')],
                       axis=1)
mae_scores
```

The output will be as follows:

Figure 5.21: The train and validation errors for each individual model and the stacked model
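A table built this way is also a convenient place to compute the gap between validation and training error, which indicates how much a model overfits. The sketch below uses made-up MAE values and placeholder model names (model_a, model_b); they are not results from this activity.

```python
import pandas as pd

# Hypothetical MAE values for two placeholder models
toy_train_mae = {'model_a': 10.0, 'model_b': 8.0}
toy_val_mae = {'model_a': 12.0, 'model_b': 15.0}

# The dictionary keys become the index, so the two series align by model name
toy_scores = pd.concat([pd.Series(toy_train_mae, name='train'),
                        pd.Series(toy_val_mae, name='val')],
                       axis=1)

# A large positive gap means the model fits the training data much better
# than unseen data
toy_scores['gap'] = toy_scores['val'] - toy_scores['train']
print(toy_scores)
```

In this toy table, model_b has the lower training error but the larger gap, so model_a would be the better choice on validation error.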

15. We then plot a bar chart from this DataFrame to visualize the MAE values for the train and validation sets using each model:

```python
mae_scores.plot(kind='bar', figsize=(10,7))
plt.ylabel('MAE')
plt.xlabel('Model')
plt.show()
```

The output will be as follows:

Figure 5.22: Bar chart visualizing the MAE values

As we can see in the plot, the linear regression stacked model has the lowest value of mean absolute error on both training and validation datasets, even compared to the other ensemble models (Random Forest and gradient boosted regressor).