Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!
Chapter 5: Ensemble Modeling


Activity 14: Stacking with Standalone and Ensemble Algorithms

Solution

  1. Import the relevant libraries:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import KFold
    
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
  2. Read the data and print the first five rows:

    data = pd.read_csv('house_prices.csv')
    data.head()

    The output will be as follows:

    Figure 5.19: The first 5 rows

  3. Preprocess the dataset to remove null values and one-hot encode categorical variables to prepare the data for modeling.

    First, we remove all columns where more than 10% of the values are null. To do this, calculate the fraction of missing values by using the .isnull() method to get a mask DataFrame and apply the .mean() method to get the fraction of null values in each column. Multiply the result by 100 to get the series as percentage values.

    Then, find the subset of the series having a percentage value lower than 10 and save the index (which will give us the column names) as a list. Print the list to see the columns we get:

    perc_missing = data.isnull().mean()*100
    cols = perc_missing[perc_missing < 10].index.tolist() 
    cols

    The output will be:

    Figure 5.20: Output of preprocessing the dataset

    As the first column is id, we will exclude this column as well, since it will not add any value to the model.

    We will subset the data to include all columns in the cols list except the first element, which is id:

    data = data.loc[:, cols[1:]]

    For the categorical variables, we replace null values with a string, NA, and one-hot encode the columns using pandas' .get_dummies() method, while for the numerical variables we will replace the null values with -1. Then, we combine the numerical and categorical columns to get the final DataFrame:

    # The np.object alias was removed in NumPy 1.24; the string 'object' selects the same columns
    data_obj = pd.get_dummies(data.select_dtypes(include=['object']).fillna('NA'))
    data_num = data.select_dtypes(include=[np.number]).fillna(-1)
    
    data_final = pd.concat([data_obj, data_num], axis=1)
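
The preprocessing in this step can be tried on a small, made-up frame before running it on the real data. The sketch below is only an illustration: the column values are invented, and it selects the categorical columns with exclude='number' rather than by object dtype, which is an assumption made so the example does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, standing in for house_prices.csv
n = 20
toy = pd.DataFrame({
    'Id': range(1, n + 1),
    'Alley': [None] * 18 + ['Pave', 'Grvl'],          # 90% missing, will be dropped
    'Street': ['Pave', 'Grvl'] * 9 + ['Pave', None],  # 5% missing, will be kept
    'LotArea': [8000.0 + 100 * i for i in range(n)],
    'SalePrice': [150000 + 5000 * i for i in range(n)],
})
toy.loc[3, 'LotArea'] = np.nan                        # 5% missing, will be kept

# Percentage of nulls per column, then keep columns below the 10% threshold
perc_missing = toy.isnull().mean() * 100
cols = perc_missing[perc_missing < 10].index.tolist()

# Drop the leading Id column, as in the step above
toy = toy.loc[:, cols[1:]]

# Categorical columns: fill nulls with 'NA', then one-hot encode
toy_obj = pd.get_dummies(toy.select_dtypes(exclude='number').fillna('NA'))
# Numerical columns: fill nulls with -1
toy_num = toy.select_dtypes(include='number').fillna(-1)

toy_final = pd.concat([toy_obj, toy_num], axis=1)
print(cols)
print(toy_final.columns.tolist())
```

Note that the missing Street value becomes its own Street_NA indicator column, while the missing LotArea value is simply replaced by -1.
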
  4. Divide the dataset into train and validation DataFrames.

    We use scikit-learn's train_test_split() method to divide the final DataFrame into training and validation sets in the ratio 4:1. We further split each of the two sets into their respective x and y values to represent the features and target variable respectively:

    train, val = train_test_split(data_final, test_size=0.2, random_state=11)
    
    x_train = train.drop(columns=['SalePrice'])
    y_train = train['SalePrice'].values
    
    x_val = val.drop(columns=['SalePrice'])
    y_val = val['SalePrice'].values
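
As a quick check of the 4:1 ratio, the sketch below runs the same split on a made-up 100-row frame (the feature column is hypothetical). It also shows that a fixed random_state reproduces the same split on a second call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, one feature and a SalePrice target
toy = pd.DataFrame({
    'feature': np.arange(100),
    'SalePrice': np.arange(100) * 1000,
})

# test_size=0.2 holds out one row in five for validation
train, val = train_test_split(toy, test_size=0.2, random_state=11)

x_train = train.drop(columns=['SalePrice'])
y_train = train['SalePrice'].values
x_val = val.drop(columns=['SalePrice'])
y_val = val['SalePrice'].values

# The same seed yields the same partition
train_again, val_again = train_test_split(toy, test_size=0.2, random_state=11)

print(x_train.shape, x_val.shape)
print(train.index.equals(train_again.index))
```
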
  5. Initialize dictionaries in which to store train and validation MAE values. We will create two dictionaries, in which we will store the MAE values on the train and validation datasets:

    train_mae_values, val_mae_values = {}, {}
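
For reference, the mean absolute error stored in these dictionaries is the average of the absolute differences between true and predicted values. The following sketch, using invented numbers, compares a manual NumPy calculation with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented true and predicted values
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 330.0, 250.0])

# Manual calculation: mean of |y_true - y_pred|
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(manual_mae, sklearn_mae)
```
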
  6. Train a decision tree model and save the scores. We will use scikit-learn's DecisionTreeRegressor class to train a regression model using a single decision tree:

    # Decision Tree
    
    dt_params = {
        'criterion': 'mae',  # renamed 'absolute_error' in scikit-learn 1.0; 'mae' was removed in 1.2
        'min_samples_leaf': 10,
        'random_state': 11
    }
    
    dt = DecisionTreeRegressor(**dt_params)
    
    dt.fit(x_train, y_train)
    dt_preds_train = dt.predict(x_train)
    dt_preds_val = dt.predict(x_val)
    
    train_mae_values['dt'] = mean_absolute_error(y_true=y_train, y_pred=dt_preds_train)
    val_mae_values['dt'] = mean_absolute_error(y_true=y_val, y_pred=dt_preds_val)
  7. Train a k-nearest neighbors model and save the scores. We will use scikit-learn's KNeighborsRegressor class to train a regression model with k=5:

    # k-Nearest Neighbors
    
    knn_params = {
        'n_neighbors': 5
    }
    
    knn = KNeighborsRegressor(**knn_params)
    
    knn.fit(x_train, y_train)
    knn_preds_train = knn.predict(x_train)
    knn_preds_val = knn.predict(x_val)
    
    train_mae_values['knn'] = mean_absolute_error(y_true=y_train, y_pred=knn_preds_train)
    val_mae_values['knn'] = mean_absolute_error(y_true=y_val, y_pred=knn_preds_val)
  8. Train a Random Forest model and save the scores. We will use scikit-learn's RandomForestRegressor class to train a regression model using bagging:

    # Random Forest
    
    rf_params = {
        'n_estimators': 50,
        'criterion': 'mae',  # renamed 'absolute_error' in scikit-learn 1.0; 'mae' was removed in 1.2
        'max_features': 'sqrt',
        'min_samples_leaf': 10,
        'random_state': 11,
        'n_jobs': -1
    }
    
    rf = RandomForestRegressor(**rf_params)
    
    rf.fit(x_train, y_train)
    rf_preds_train = rf.predict(x_train)
    rf_preds_val = rf.predict(x_val)
    
    train_mae_values['rf'] = mean_absolute_error(y_true=y_train, y_pred=rf_preds_train)
    val_mae_values['rf'] = mean_absolute_error(y_true=y_val, y_pred=rf_preds_val)
  9. Train a gradient boosting model and save the scores. We will use scikit-learn's GradientBoostingRegressor class to train a boosted regression model:

    # Gradient Boosting
    
    gbr_params = {
        'n_estimators': 50,
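        'criterion': 'mae',  # deprecated for this estimator since scikit-learn 0.24; newer releases expect 'friedman_mse' or 'squared_error'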
        'max_features': 'sqrt',
        'max_depth': 3,
        'min_samples_leaf': 5,
        'random_state': 11
    }
    
    gbr = GradientBoostingRegressor(**gbr_params)
    
    gbr.fit(x_train, y_train)
    gbr_preds_train = gbr.predict(x_train)
    gbr_preds_val = gbr.predict(x_val)
    
    train_mae_values['gbr'] = mean_absolute_error(y_true=y_train, y_pred=gbr_preds_train)
    val_mae_values['gbr'] = mean_absolute_error(y_true=y_val, y_pred=gbr_preds_val)
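
Steps 6 to 9 repeat the same fit, predict, and score pattern. One possible way to condense it is sketched below. The fit_and_score helper is hypothetical, the data is synthetic (generated with make_regression), and the criterion setting is left at each estimator's default so the sketch does not depend on the installed scikit-learn version.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house price data
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=11)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=11)

# One entry per base model, keyed by the names used in the MAE dictionaries
models = {
    'dt': DecisionTreeRegressor(min_samples_leaf=10, random_state=11),
    'knn': KNeighborsRegressor(n_neighbors=5),
    'rf': RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=11),
    'gbr': GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=11),
}


def fit_and_score(model, x_tr, y_tr, x_va, y_va):
    """Hypothetical helper: fit the model and return (train MAE, validation MAE)."""
    model.fit(x_tr, y_tr)
    train_mae = mean_absolute_error(y_true=y_tr, y_pred=model.predict(x_tr))
    val_mae = mean_absolute_error(y_true=y_va, y_pred=model.predict(x_va))
    return train_mae, val_mae


train_mae_values, val_mae_values = {}, {}
for name, model in models.items():
    train_mae_values[name], val_mae_values[name] = fit_and_score(
        model, x_train, y_train, x_val, y_val)

print(train_mae_values)
print(val_mae_values)
```
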
  10. Prepare the training and validation datasets with the four base estimators having the same hyperparameters that were used in the previous steps. We will create a num_base_predictors variable that represents the number of base estimators we have in the stacked model to help calculate the shape of the datasets for training and validation. This step can be coded almost identically to the exercise in the chapter, with a different number (and type) of base estimators.

  11. First, we create a new training set with additional columns for predictions from base predictors, in the same way as was done previously:

    num_base_predictors = len(train_mae_values) # 4
    
    x_train_with_metapreds = np.zeros((x_train.shape[0], x_train.shape[1]+num_base_predictors))
    x_train_with_metapreds[:, :-num_base_predictors] = x_train
    x_train_with_metapreds[:, -num_base_predictors:] = -1

    Then, we train the base models using the k-fold strategy. We save the predictions in each iteration in a list, and iterate over the list to assign the predictions to the columns in that fold:

    # random_state only has an effect with shuffle=True, and recent scikit-learn raises an error otherwise
    kf = KFold(n_splits=5)
    
    for train_indices, val_indices in kf.split(x_train):
        kfold_x_train, kfold_x_val = x_train.iloc[train_indices], x_train.iloc[val_indices]
        kfold_y_train, kfold_y_val = y_train[train_indices], y_train[val_indices]
        
        predictions = []
        
        dt = DecisionTreeRegressor(**dt_params)
        dt.fit(kfold_x_train, kfold_y_train)
        predictions.append(dt.predict(kfold_x_val))
    
        knn = KNeighborsRegressor(**knn_params)
        knn.fit(kfold_x_train, kfold_y_train)
        predictions.append(knn.predict(kfold_x_val))
    
        rf = RandomForestRegressor(**rf_params)
        rf.fit(kfold_x_train, kfold_y_train)
        predictions.append(rf.predict(kfold_x_val))
    
        gbr = GradientBoostingRegressor(**gbr_params)
        gbr.fit(kfold_x_train, kfold_y_train)
        predictions.append(gbr.predict(kfold_x_val))
        
        for i, preds in enumerate(predictions):
            x_train_with_metapreds[val_indices, -(i+1)] = preds

    After that, we create a new validation set with additional columns for predictions from base predictors:

    x_val_with_metapreds = np.zeros((x_val.shape[0], x_val.shape[1]+num_base_predictors))
    x_val_with_metapreds[:, :-num_base_predictors] = x_val
    x_val_with_metapreds[:, -num_base_predictors:] = -1
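
The k-fold loop in this step produces what are often called out-of-fold predictions: every training row is predicted by a model that never saw that row. As a sketch under stated assumptions (synthetic data, a single decision tree, default split criterion), the manual loop can be compared with scikit-learn's cross_val_predict, which is a swapped-in shortcut rather than the approach used in this activity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=11)

kf = KFold(n_splits=5)
dt_params = {'min_samples_leaf': 10, 'random_state': 11}

# Manual out-of-fold loop, starting from the same -1 placeholder as above
manual_preds = np.full(len(y), -1.0)
for train_indices, val_indices in kf.split(X):
    model = DecisionTreeRegressor(**dt_params)
    model.fit(X[train_indices], y[train_indices])
    manual_preds[val_indices] = model.predict(X[val_indices])

# Shortcut: cross_val_predict runs the same fit/predict cycle per fold
shortcut_preds = cross_val_predict(DecisionTreeRegressor(**dt_params), X, y, cv=kf)

print(np.allclose(manual_preds, shortcut_preds))
```

Because each row falls in exactly one validation fold, no placeholder value should remain in the meta-feature column once the loop finishes.
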
  12. Lastly, we fit the base models on the complete training set to get meta features for the validation set:

    predictions = []
        
    dt = DecisionTreeRegressor(**dt_params)
    dt.fit(x_train, y_train)
    predictions.append(dt.predict(x_val))
    
    knn = KNeighborsRegressor(**knn_params)
    knn.fit(x_train, y_train)
    predictions.append(knn.predict(x_val))
    
    rf = RandomForestRegressor(**rf_params)
    rf.fit(x_train, y_train)
    predictions.append(rf.predict(x_val))
    
    gbr = GradientBoostingRegressor(**gbr_params)
    gbr.fit(x_train, y_train)
    predictions.append(gbr.predict(x_val))
    
    for i, preds in enumerate(predictions):
        x_val_with_metapreds[:, -(i+1)] = preds
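
One detail worth noting is the column order produced by the -(i+1) index: the first prediction in the list lands in the last column, so the meta columns end up in reverse order. This is harmless as long as the training and validation arrays are filled the same way, as they are here. The sketch below uses constant, made-up prediction vectors to make the placement visible.

```python
import numpy as np

num_base_predictors = 4
n_rows, n_features = 3, 2

# Fake predictions in the order dt, knn, rf, gbr; each vector is a constant
# (1.0, 2.0, 3.0, 4.0) so its destination column is easy to spot
predictions = [np.full(n_rows, float(k)) for k in range(1, num_base_predictors + 1)]

x_with_metapreds = np.zeros((n_rows, n_features + num_base_predictors))
x_with_metapreds[:, -num_base_predictors:] = -1

for i, preds in enumerate(predictions):
    x_with_metapreds[:, -(i + 1)] = preds

# The last four columns now hold gbr, rf, knn, dt (reverse of the list order)
meta_block = x_with_metapreds[:, -num_base_predictors:]
print(meta_block[0])
```
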
  13. Train a linear regression model as the stacked model. To train the stacked model, we train the linear regression model on all the columns of the training dataset, plus the meta predictions from the base estimators. We then use the final predictions to calculate the MAE values, which we store in the same train_mae_values and val_mae_values dictionaries:

    # The normalize argument (default False) was removed in scikit-learn 1.2, so it is omitted here
    lr = LinearRegression()
    lr.fit(x_train_with_metapreds, y_train)
    lr_preds_train = lr.predict(x_train_with_metapreds)
    lr_preds_val = lr.predict(x_val_with_metapreds)
    
    train_mae_values['lr'] = mean_absolute_error(y_true=y_train, y_pred=lr_preds_train)
    val_mae_values['lr'] = mean_absolute_error(y_true=y_val, y_pred=lr_preds_val)
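
Newer scikit-learn releases (0.22 and later) also provide a StackingRegressor class that packages a similar workflow. The sketch below is an alternative on synthetic data, not the method used in this activity: setting passthrough=True asks the final estimator to train on the original features as well as the base predictions, which mirrors the approach above. The hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house price data
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=11)
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=11)

stack = StackingRegressor(
    estimators=[
        ('dt', DecisionTreeRegressor(min_samples_leaf=10, random_state=11)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
        ('rf', RandomForestRegressor(n_estimators=50, random_state=11)),
    ],
    final_estimator=LinearRegression(),
    cv=5,              # out-of-fold predictions are used to train the final estimator
    passthrough=True,  # also pass the original features to the final estimator
)

stack.fit(x_train, y_train)
stack_preds_val = stack.predict(x_val)
stack_val_mae = mean_absolute_error(y_true=y_val, y_pred=stack_preds_val)
print(stack_val_mae)
```
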
  14. Visualize the train and validation errors for each individual model and the stacked model. First, we convert the dictionaries into two series and combine them to form two columns of a pandas DataFrame:

    mae_scores = pd.concat([pd.Series(train_mae_values, name='train'), 
                            pd.Series(val_mae_values, name='val')], 
                           axis=1)
    mae_scores

    The output will be as follows:

    Figure 5.21: The train and validation errors for each individual model and the stacked model
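
To see how two dictionaries become a two-column DataFrame, the sketch below uses invented MAE values (they are not results from this activity). The extra gap column and the idxmin() lookup are additions for illustration only.

```python
import pandas as pd

# Invented MAE values, for illustration only
train_mae_values = {'dt': 3.0, 'knn': 4.0, 'lr': 2.0}
val_mae_values = {'dt': 5.0, 'knn': 4.5, 'lr': 2.5}

# Each dictionary becomes a named Series; the keys align as the row index
mae_scores = pd.concat([pd.Series(train_mae_values, name='train'),
                        pd.Series(val_mae_values, name='val')],
                       axis=1)

# A large positive gap between validation and train error hints at overfitting
mae_scores['gap'] = mae_scores['val'] - mae_scores['train']

# Row label of the model with the lowest validation error
best_model = mae_scores['val'].idxmin()

print(mae_scores)
print(best_model)
```
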

  15. We then plot a bar chart from this DataFrame to visualize the MAE values for the train and validation sets using each model:

    mae_scores.plot(kind='bar', figsize=(10,7))
    plt.ylabel('MAE')
    plt.xlabel('Model')
    plt.show()

    The output will be as follows:

    Figure 5.22: Bar chart visualizing the MAE values

As we can see in the plot, the linear regression stacked model has the lowest value of mean absolute error on both training and validation datasets, even compared to the other ensemble models (Random Forest and gradient boosted regressor).