#### Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!
Applied Supervised Learning with Python
Preface
Free Chapter
Python Machine Learning Toolkit
Exploratory Data Analysis and Visualization
Regression Analysis
Classification
Ensemble Modeling
Model Evaluation

## Chapter 3: Regression Analysis

### Activity 5: Plotting Data with a Moving Average

Solution

1. Load the dataset into a pandas DataFrame from the CSV file:

The output will show the initial five rows of the austin_weather.csv file:

Figure 3.74: The first five rows of the Austin weather data

2. Since we only need the Date and TempAvgF columns, we'll remove all others from the dataset:

df = df[['Date', 'TempAvgF']]

The output will be:

Figure 3.75: Date and TempAvgF columns of the Austin weather data

3. Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column. Note that temperatures are recorded daily:

df['Year'] = [int(dt[:4]) for dt in df.Date]

The output will be:

Figure 3.76: Extracting the year

4. Repeat this process to extract the month values and store the values as integers in a Month column:

df['Month'] = [int(dt[5:7]) for dt in df.Date]

The output will be:

Figure 3.77: Extracting the month

5. Copy the first year's worth of data to a DataFrame:

df_first_year = df[:365]

The output will be as follows:

Figure 3.78: Copied data to new DataFrame

6. Compute a 20-day moving average filter:

window = 20
rolling = df_first_year.TempAvgF.rolling(window).mean();

The output will be:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19    47.75
Name: TempAvgF, dtype: float64
7. Plot the raw data and the moving average signal, with the x axis as the day number in the year:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(range(1, 366), df_first_year.TempAvgF, label='Raw Data');
ax.plot(range(1, 366), rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.79: Scatter plot of temperature throughout the year

### Activity 6: Linear Regression Using the Least Squares Method

Solution

1. Visualize the measurements:

The output will be as follows:

Figure 3.80: First five rows of activity2_measurements.csv dataset

2. Visualize the rolling average values:

The output will be as follows:

3. Create a linear regression model using the default parameters; that is, calculate a y intercept for the model and do not normalize the data:

model = LinearRegression()
model

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
4. Now fit the model, where the input data is the day number for the year (1 to 365) and the output is the average temperature. To make later calculations easier, insert a column (DayOfYear) that corresponds with the day of the year for that measurement:

df_first_year.loc[:,'DayOfYear'] = [i + 1 for i in df_first_year.index]

The output will be as follows:

Figure 3.82: Adding day of year column

5. Fit the model with the DayOfYear values as the input and df_first_year.TempAvgF as the output:

# Note the year values need to be provided as an N x 1 array
model.fit(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
6. Print the parameters of the model:

print(f'm = {model.coef_[0]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x + {model.intercept_:0.4f}')

The output will be:

m = 0.04909173467448788
c = 60.28196597922625

Model Definition
y = 0.04909x + 60.2820
7. We can calculate the trendline values by using the first, middle, and last values (days in years) in the linear equation:

trend_x = np.array([
1,
182.5,
365
])

trend_y = model.predict(trend_x.reshape((-1, 1)))
trend_y

The output will be as follows:

array([60.33105771, 69.24120756, 78.20044914])
8. Plot these values with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.83: Scatterplot of temperature thought the year with the predicted trendline

9. Evaluate the performance of the model. How well does the model fit the data? Calculate the r2 score to find out:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.1222

### Activity 7: Dummy Variables

Solution

1. Plot the raw data (df) and moving average (rolling):

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be:

Figure 3.84: Scatterplot of Temperature throughout the year

2. Looking at the preceding plot, there seems to be an inflection point around day 250. Create a dummy variable to introduce this feature into the linear model:

df_first_year.loc[:,'inflection'] = [1 * int(i < 250) for i in df_first_year.DayOfYear]
3. Check the first and last samples to confirm that the dummy variable is correct. Check the first five samples:

The output will be as follows:

Figure 3.85: First five columns

Then, check the last five samples:

df_first_year.tail()

The output will be:

Figure 3.86: Last five columns

4. Use a least squares linear regression model and fit the model to the DayOfYear values and the dummy variable to predict TempAvgF:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
5. Compute the r2 score:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be as follows:

r2 score = 0.3631
6. Using the DayOfYear values, create a set of predictions using the model to construct a trendline:

trend_y = model.predict(df_first_year[['DayOfYear', 'inflection']].values)
trend_y

The output will be:

array([51.60311133, 51.74622654, 51.88934175, 52.03245696, 52.17557217,
52.31868739, 52.4618026 , 52.60491781, 52.74803302, 52.89114823,
53.03426345, 53.17737866, 53.32049387, 53.46360908, 53.60672429,
53.7498395 , 53.89295472, 54.03606993, 54.17918514, 54.32230035,
54.46541556, 54.60853078, 54.75164599, 54.8947612 , 55.03787641,
…
…
73.88056649, 74.0236817 , 74.16679692, 74.30991213, 74.45302734,
74.59614255, 74.73925776, 74.88237297, 75.02548819, 75.1686034 ,
75.31171861, 75.45483382, 75.59794903, 75.74106425, 75.88417946,
76.02729467, 76.17040988, 76.31352509, 76.4566403 , 76.59975552,
76.74287073, 76.88598594, 77.02910115, 77.17221636, 77.31533157])
7. Plot the trendline against the data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.87: Predicted trendline

### Activity 8: Other Model Types with Linear Regression

Solution

1. Use a sine curve function as the basis of the model:

# Using a sine curve
df_first_year['DayOfYear2'] = np.sin(df_first_year['DayOfYear'] / df_first_year['DayOfYear'].max())

The output will be as follows:

Figure 3.88: First five rows

2. Fit the model:

# Note the year values need to be provided as an N x 1 array
model = LinearRegression()
model.fit(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)

The output will be as follows:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
3. Print the parameters of the model:

print(f'a = {model.coef_[0]}')
print(f'm = {model.coef_[1]}')
print(f'c = {model.intercept_}')

print('\nModel Definition')
print(f'y = {model.coef_[0]:0.4}x^2 + {model.coef_[1]:0.4}x + {model.intercept_:0.4f}')

The output will be as follows:

a = 634.322313570282
m = -1.4371290614190075
c = 39.93286585807408

Model Definition
y = 634.3x^2 + -1.437x + 39.9329
4. Compute the r2 value to measure the performance:

# Note the year values need to be provided as an N x 1 array
r2 = model.score(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)
print(f'r2 score = {r2:0.4f}')

The output will be:

r2 score = 0.7047
5. Construct the trendline values:

trend_y = model.predict(df_first_year[['DayOfYear2', 'DayOfYear']].values)
trend_y

The output will be:

array([40.23360397, 40.53432905, 40.83502803, 41.13568788, 41.43629555,
41.736838  , 42.03730219, 42.33767507, 42.6379436 , 42.93809474,
43.23811546, 43.5379927 , 43.83771344, 44.13726463, 44.43663324,
44.73580624, 45.03477059, 45.33351327, 45.63202123, 45.93028146,
46.22828093, 46.52600661, 46.82344549, 47.12058453, 47.41741073,
…
…
59.96306563, 59.55705293, 59.14720371, 58.73351024, 58.31596484,
57.89455987, 57.46928769, 57.04014072, 56.60711138, 56.17019215,
55.7293755 , 55.28465397, 54.83602011, 54.38346649, 53.92698572,
53.46657045, 53.00221334, 52.53390709, 52.06164442, 51.58541811,
51.10522093, 50.62104569, 50.13288526, 49.6407325 , 49.14458033])
6. Plot the trendline with the raw data and the moving average:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.89: Predicted trendline

Solution

1. Create a generic gradient descent model and normalize the day of year values as between 0 and 1:

_x = df_first_year.DayOfYear / df_first_year.DayOfYear.max()
2. Fit the model:

The output will be as follows:

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.01, fit_intercept=True, l1_ratio=0.15,
learning_rate='invscaling', loss='squared_loss', max_iter=None,
n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
verbose=0, warm_start=False)
3. Print the details of the model:

print('\nModel Definition')

The output will be as follows:

m = 26.406162532140563
c = 55.07470859678077

Model Definition
y = 26.41x + 55.0747
4. Prepare the x (_trend_x) trendline values by dividing them by the maximum. Predict y_trend_values using the gradient descent model:

_trend_x = trend_x / trend_x.max()
trend_y

The output will be as follows:

array([55.14705425, 68.27778986, 81.48087113])
5. Plot the data and the moving average with the trendline:

fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1]);

# Temp measurements
ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
ax.plot(trend_x, trend_y, c='k', linestyle='--', label='Model: Predicted trendline')

ax.set_title('Daily Mean Temperature Measurements')
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (degF)')
ax.set_xticks(range(1, 366), 10)
ax.legend();

The output will be as follows:

Figure 3.90: Gradient descent predicted trendline

### Activity 10: Autoregressors

Solution

1. Plot the complete set of average temperature values (df.TempAvgF) with years on the x axis:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values);
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');

The output will be:

Figure 3.91: Plot of temperature through the year

2. Create a 20-day lag and plot the lagged data on the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(df.TempAvgF.shift(20), c='r', linestyle='--',
label='Lag 20');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.92: Plot of temperature through the years with a 20-day lag

3. Construct an autocorrelation plot to see whether the average temperature can be used with an autoregressor. Where is the lag acceptable and where is it not for an autoregressor?

plt.figure(figsize=(10, 7))
pd.plotting.autocorrelation_plot(df.TempAvgF);

We'll get the following output:

Figure 3.93: Plot of autocorrelation versus lag

The lag is acceptable only when the autocorrelation line lies outside the 99% confidence bounds, as represented by the dashed lines.

4. Chose an acceptable lag and an unacceptable lag and construct lag plots using these values for acceptable lag:

plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=5);

We'll get the following output:

Figure 3.94: Plot of acceptable lag

Use these values for unacceptable lag:

plt.figure(figsize=(10,7))
ax = pd.plotting.lag_plot(df.TempAvgF, lag=1000);

We'll get the following output:

Figure 3.95: Plot of unacceptable lag

5. Create an autoregressor model, note the selected lag, calculate the R2 value, and plot the autoregressor model with the original plot. The model is to project past the available data by 1,000 samples:

from statsmodels.tsa.ar_model import AR
model = AR(df.TempAvgF)
6. Fit the model to the data:

model_fit = model.fit()
print('Lag: %s' % model_fit.k_ar)
print('Coefficients: %s' % model_fit.params)

The output will be:

Lag: 23
Coefficients: const           1.909395
L1.TempAvgF     0.912076
L2.TempAvgF    -0.334043
L3.TempAvgF     0.157353
L4.TempAvgF     0.025721
L5.TempAvgF     0.041342
L6.TempAvgF     0.030831
L7.TempAvgF    -0.021230
L8.TempAvgF     0.020324
L9.TempAvgF     0.025147
L10.TempAvgF    0.059739
L11.TempAvgF   -0.017337
L12.TempAvgF    0.043553
L13.TempAvgF   -0.027795
L14.TempAvgF    0.053547
L15.TempAvgF    0.013070
L16.TempAvgF   -0.033157
L17.TempAvgF   -0.000072
L18.TempAvgF   -0.026307
L19.TempAvgF    0.025258
L20.TempAvgF    0.038341
L21.TempAvgF    0.007885
L22.TempAvgF   -0.008889
L23.TempAvgF   -0.011080
dtype: float64
7. Create a set of predictions for 1,000 days after the last sample:

predictions = model_fit.predict(start=model_fit.k_ar, end=len(df) + 1000)
predictions[:10].values

The output will be:

array([54.81171857, 56.89097085, 56.41891585, 50.98627626, 56.11843512,
53.20665111, 55.13941554, 58.4679288 , 61.92497136, 49.46049801])
8. Plot the predictions, as well as the original dataset:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.legend();

The output will be:

Figure 3.96: Plot of temperature through the years

9. Enhance the view to look for differences by showing the 100th to 200th samples:

plt.figure(figsize=(10, 7))
plt.plot(df.TempAvgF.values, label='Original Dataset');
plt.plot(predictions, c='g', linestyle=':', label='Predictions');
yrs = [yr for yr in df.Year.unique()]
plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
plt.title('Austin Texas Average Daily Temperature');
plt.xlabel('Year');
plt.ylabel('Temperature (F)');
plt.xlim([100, 200])
plt.legend();

We'll get the following output:

Figure 3.97: Plot of predictions with the original dataset