Applied Supervised Learning with Python

Applied Supervised Learning with Python

By : Benjamin Johnston, Ishita Mathur

Buy this Book

Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Buy this Book

Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!

Applied Supervised Learning with Python

Preface

Free Chapter

Python Machine Learning Toolkit

Introduction

Supervised Machine Learning

Jupyter Notebooks

pandas

Data Quality Considerations

Summary

Exploratory Data Analysis and Visualization

Introduction

Summary Statistics and Central Values

Missing Values

Distribution of Values

Relationships within the Data

Summary

Regression Analysis

Introduction

Regression and Classification Problems

Linear Regression

Multiple Linear Regression

Autoregression Models

Summary

Classification

Introduction

Linear Regression as a Classifier

Logistic Regression

Classification Using K-Nearest Neighbors

Classification Using Decision Trees

Summary

Ensemble Modeling

Introduction

Overfitting and Underfitting

Bagging

Boosting

Summary

Model Evaluation

Introduction

Evaluation Metrics

Splitting the Dataset

Performance Improvement Tactics

Summary

Appendix

Chapter 1: Python Machine Learning Toolkit

Chapter 2: Exploratory Data Analysis and Visualization

Chapter 3: Regression Analysis

Chapter 4: Classification

Chapter 5: Ensemble Modeling

Chapter 6: Model Evaluation

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Chapter 3: Regression Analysis

Activity 5: Plotting Data with a Moving Average

Solution

Load the dataset into a pandas DataFrame from the CSV file:
```
df = pd.read_csv('austin_weather.csv')
df.head()
```
The output will show the initial five rows of the austin_weather.csv file:
Figure 3.74: The first five rows of the Austin weather data
Since we only need the Date and TempAvgF columns, we'll remove all others from the dataset:
```
df = df[['Date', 'TempAvgF']]
df.head()
```
The output will be:
Figure 3.75: Date and TempAvgF columns of the Austin weather data
Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column. Note that temperatures are recorded daily:
```
df['Year'] = [int(dt[:4]) for dt in df.Date]
df.head()
```
The output will be:
Figure 3.76: Extracting the year
Repeat this process to extract the month values and store the values as integers in a Month column:
```
df['Month'] = [int(dt[5:7]) for dt in df.Date]
df.head()
```
The output will be:
Figure 3.77: Extracting the month
Copy the first year's worth of data to a DataFrame:
```
df_first_year = df[:365]
df_first_year.head()
```
The output will be as follows:
Figure 3.78: Copied data to new DataFrame

Compute a 20-day moving average filter:

window = 20
rolling = df_first_year.TempAvgF.rolling(window).mean();
rolling.head(n=20)

The output will be:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19    47.75
Name: TempAvgF, dtype: float64

Plot the raw data and the moving average signal, with the x axis as the day number in the year:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(range(1, 366), df_first_year.TempAvgF, label='Raw Data');
ax.plot(range(1, 366), rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.79: Scatter plot of temperature throughout the year

Activity 6: Linear Regression Using the Least Squares Method

Solution

Visualize the measurements:
```
df.head()
```
The output will be as follows:
Figure 3.80: First five rows of activity2_measurements.csv dataset
Visualize the rolling average values:
```
rolling.head(n=30)
```
The output will be as follows:
Figure 3.81: Rolling head average
Create a linear regression model using the default parameters; that is, calculate a y intercept for the model and do not normalize the data:
```
model = LinearRegression()
model
```
The output will be as follows:
```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
```
Now fit the model, where the input data is the day number for the year (1 to 365) and the output is the average temperature. To make later calculations easier, insert a column (DayOfYear) that corresponds with the day of the year for that measurement:
```
df_first_year.loc[:,'DayOfYear'] = [i + 1 for i in df_first_year.index]
df_first_year.head()
```
The output will be as follows:
Figure 3.82: Adding day of year column

Fit the model with the DayOfYear values as the input and df_first_year.TempAvgF as the output:

# Note the year values need to be provided as an N x 1 array
model.fit(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Print the parameters of the model:

print(f'm = {model.coef_[0]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x + {model.intercept_:0.4f}')

The output will be:

m = 0.04909173467448788
c = 60.28196597922625

Model Definition
y = 0.04909x + 60.2820

We can calculate the trendline values by using the first, middle, and last values (days in years) in the linear equation:

trend_x = np.array([
    1,
    182.5,
    365
])

trend_y = model.predict(trend_x.reshape((-1, 1)))
trend_y

The output will be as follows:

array([60.33105771, 69.24120756, 78.20044914])

Plot these values with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.83: Scatterplot of temperature thought the year with the predicted trendline

Evaluate the performance of the model. How well does the model fit the data? Calculate the r2 score to find out:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.1222

Activity 7: Dummy Variables

Solution

Plot the raw data (df) and moving average (rolling):

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be:

Figure 3.84: Scatterplot of Temperature throughout the year

Looking at the preceding plot, there seems to be an inflection point around day 250. Create a dummy variable to introduce this feature into the linear model:
```
df_first_year.loc[:,'inflection'] = [1 * int(i < 250) for i in df_first_year.DayOfYear]
```
Check the first and last samples to confirm that the dummy variable is correct. Check the first five samples:
```
df_first_year.head()
```
The output will be as follows:
Figure 3.85: First five columns
Then, check the last five samples:
```
df_first_year.tail()
```
The output will be:
Figure 3.86: Last five columns

Use a least squares linear regression model and fit the model to the DayOfYear values and the dummy variable to predict TempAvgF:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Compute the r2 score:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be as follows:

r2 score = 0.3631

Using the DayOfYear values, create a set of predictions using the model to construct a trendline:

trend_y = model.predict(df_first_year[['DayOfYear', 'inflection']].values)
trend_y

The output will be:

array([51.60311133, 51.74622654, 51.88934175, 52.03245696, 52.17557217,
       52.31868739, 52.4618026 , 52.60491781, 52.74803302, 52.89114823,
       53.03426345, 53.17737866, 53.32049387, 53.46360908, 53.60672429,
       53.7498395 , 53.89295472, 54.03606993, 54.17918514, 54.32230035,
       54.46541556, 54.60853078, 54.75164599, 54.8947612 , 55.03787641,
…
…
       73.88056649, 74.0236817 , 74.16679692, 74.30991213, 74.45302734,
       74.59614255, 74.73925776, 74.88237297, 75.02548819, 75.1686034 ,
       75.31171861, 75.45483382, 75.59794903, 75.74106425, 75.88417946,
       76.02729467, 76.17040988, 76.31352509, 76.4566403 , 76.59975552,
       76.74287073, 76.88598594, 77.02910115, 77.17221636, 77.31533157])

Plot the trendline against the data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.87: Predicted trendline

Activity 8: Other Model Types with Linear Regression

Solution

Use a sine curve function as the basis of the model:

# Using a sine curve
df_first_year['DayOfYear2'] = np.sin(df_first_year['DayOfYear'] / df_first_year['DayOfYear'].max())
df_first_year.head()

The output will be as follows:

Figure 3.88: First five rows

Fit the model:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Print the parameters of the model:

print(f'a = {model.coef_[0]}')
print(f'm = {model.coef_[1]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x^2 + {model.coef_[1]:0.4}x + {model.intercept_:0.4f}')

The output will be as follows:

a = 634.322313570282
m = -1.4371290614190075
c = 39.93286585807408

Model Definition
y = 634.3x^2 + -1.437x + 39.9329

Compute the r2 value to measure the performance:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.7047

Construct the trendline values:

trend_y = model.predict(df_first_year[['DayOfYear2', 'DayOfYear']].values)
trend_y

The output will be:

array([40.23360397, 40.53432905, 40.83502803, 41.13568788, 41.43629555,
       41.736838  , 42.03730219, 42.33767507, 42.6379436 , 42.93809474,
       43.23811546, 43.5379927 , 43.83771344, 44.13726463, 44.43663324,
       44.73580624, 45.03477059, 45.33351327, 45.63202123, 45.93028146,
       46.22828093, 46.52600661, 46.82344549, 47.12058453, 47.41741073,
…
…
       59.96306563, 59.55705293, 59.14720371, 58.73351024, 58.31596484,
       57.89455987, 57.46928769, 57.04014072, 56.60711138, 56.17019215,
       55.7293755 , 55.28465397, 54.83602011, 54.38346649, 53.92698572,
       53.46657045, 53.00221334, 52.53390709, 52.06164442, 51.58541811,
       51.10522093, 50.62104569, 50.13288526, 49.6407325 , 49.14458033])

Plot the trendline with the raw data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.89: Predicted trendline

Activity 9: Gradient Descent

Solution

Create a generic gradient descent model and normalize the day of year values as between 0 and 1:

grad_model = SGDRegressor(max_iter=None, tol=1e-3)
_x = df_first_year.DayOfYear / df_first_year.DayOfYear.max()

Fit the model:

grad_model.fit(_x.values.reshape((-1, 1)), df_first_year.TempAvgF)

The output will be as follows:

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=None,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)

Print the details of the model:

print(f'm = {grad_model.coef_[0]}')
print(f'c = {grad_model.intercept_[0]}')

print('\nModel Definition')
print(f'y = {grad_model.coef_[0]:0.4}x + {grad_model.intercept_[0]:0.4f}')

The output will be as follows:

m = 26.406162532140563
c = 55.07470859678077

Model Definition
y = 26.41x + 55.0747

Prepare the x (_trend_x) trendline values by dividing them by the maximum. Predict y_trend_values using the gradient descent model:
```
_trend_x = trend_x / trend_x.max()
trend_y = grad_model.predict(_trend_x.reshape((-1, 1)))
trend_y
```
The output will be as follows:
```
array([55.14705425, 68.27778986, 81.48087113])
```

Plot the data and the moving average with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', linestyle='--', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.90: Gradient descent predicted trendline

Activity 10: Autoregressors

Solution

Plot the complete set of average temperature values (df.TempAvgF) with years on the x axis:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values);
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');

The output will be:

Figure 3.91: Plot of temperature through the year

Create a 20-day lag and plot the lagged data on the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(df.TempAvgF.shift(20), c='r', linestyle='--',
    label='Lag 20');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.92: Plot of temperature through the years with a 20-day lag

Construct an autocorrelation plot to see whether the average temperature can be used with an autoregressor. Where is the lag acceptable and where is it not for an autoregressor?
```
plt.figure(figsize=(10, 7))
pd.plotting.autocorrelation_plot(df.TempAvgF);
```
We'll get the following output:
Figure 3.93: Plot of autocorrelation versus lag
The lag is acceptable only when the autocorrelation line lies outside the 99% confidence bounds, as represented by the dashed lines.
Chose an acceptable lag and an unacceptable lag and construct lag plots using these values for acceptable lag:
```
plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=5);
```
We'll get the following output:
Figure 3.94: Plot of acceptable lag
Use these values for unacceptable lag:
```
plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=1000);
```
We'll get the following output:
Figure 3.95: Plot of unacceptable lag
Create an autoregressor model, note the selected lag, calculate the R2 value, and plot the autoregressor model with the original plot. The model is to project past the available data by 1,000 samples:
```
from statsmodels.tsa.ar_model import AR
model = AR(df.TempAvgF)
```

Fit the model to the data:

model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)

The output will be:

Lag: 23
Coefficients: const           1.909395
L1.TempAvgF     0.912076
L2.TempAvgF    -0.334043
L3.TempAvgF     0.157353
L4.TempAvgF     0.025721
L5.TempAvgF     0.041342
L6.TempAvgF     0.030831
L7.TempAvgF    -0.021230
L8.TempAvgF     0.020324
L9.TempAvgF     0.025147
L10.TempAvgF    0.059739
L11.TempAvgF   -0.017337
L12.TempAvgF    0.043553
L13.TempAvgF   -0.027795
L14.TempAvgF    0.053547
L15.TempAvgF    0.013070
L16.TempAvgF   -0.033157
L17.TempAvgF   -0.000072
L18.TempAvgF   -0.026307
L19.TempAvgF    0.025258
L20.TempAvgF    0.038341
L21.TempAvgF    0.007885
L22.TempAvgF   -0.008889
L23.TempAvgF   -0.011080
dtype: float64

Create a set of predictions for 1,000 days after the last sample:

predictions = model_fit.predict(start=model_fit.k_ar, end=len(df) + 1000)
predictions[:10].values

The output will be:

array([54.81171857, 56.89097085, 56.41891585, 50.98627626, 56.11843512,
       53.20665111, 55.13941554, 58.4679288 , 61.92497136, 49.46049801])

Plot the predictions, as well as the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.96: Plot of temperature through the years

Enhance the view to look for differences by showing the 100th to 200th samples:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.xlim([100, 200])
plt.legend();

We'll get the following output:

Figure 3.97: Plot of predictions with the original dataset

Applied Supervised Learning with Python

By : Benjamin Johnston, Ishita Mathur

Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Related Content you might be interested in

Current Title:

Applied Supervised Learning with Python

Data Science for Marketing Analytics

Ensemble Machine Learning Cookbook

Machine Learning with scikit-learn Quick Start Guide

Chapter 3: Regression Analysis

Activity 5: Plotting Data with a Moving Average

Activity 6: Linear Regression Using the Least Squares Method

Activity 7: Dummy Variables

Activity 8: Other Model Types with Linear Regression

Activity 9: Gradient Descent

Activity 10: Autoregressors