Book Image

Applied Supervised Learning with Python

By : Benjamin Johnston, Ishita Mathur
Book Image

Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning—the ability of a machine to give right answers based on input data—has revolutionized the way we do business. Applied Supervised Learning with Python provides a rich understanding of how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, the technology used commonly in academic and commercial circles with in-line code running support. With the help of fun examples, you'll gain experience working on the Python machine learning toolkit—from performing basic data cleaning and processing to working with a range of regression and classification algorithms. Once you’ve grasped the basics, you'll learn how to build and train your own models using advanced techniques such as decision trees, ensemble modeling, validation, and error metrics. You'll also learn data visualization techniques using powerful Python libraries such as Matplotlib and Seaborn. This book also covers ensemble modeling and random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped to not only work with machine learning algorithms, but also be able to create some of your own!
Table of Contents (9 chapters)

Chapter 3: Regression Analysis


Activity 5: Plotting Data with a Moving Average

Solution

  1. Load the dataset into a pandas DataFrame from the CSV file:

    df = pd.read_csv('austin_weather.csv')
    df.head()

    The output will show the initial five rows of the austin_weather.csv file:

    Figure 3.74: The first five rows of the Austin weather data

  2. Since we only need the Date and TempAvgF columns, we'll remove all others from the dataset:

    df = df[['Date', 'TempAvgF']]
    df.head()

    The output will be:

    Figure 3.75: Date and TempAvgF columns of the Austin weather data

  3. Initially, we are only interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the Date column, and assign these values to the Year column. Note that temperatures are recorded daily:

    df['Year'] = [int(dt[:4]) for dt in df.Date]
    df.head()

    The output will be:

    Figure 3.76: Extracting the year

  4. Repeat this process to extract the month values and store the values as integers in a Month column:

    df['Month'] = [int(dt[5:7]) for dt in df.Date]
    df.head()

    The output will be:

    Figure 3.77: Extracting the month

  5. Copy the first year's worth of data to a DataFrame:

    df_first_year = df[:365]
    df_first_year.head()

    The output will be as follows:

    Figure 3.78: Copied data to new DataFrame

  6. Compute a 20-day moving average filter:

    window = 20
    rolling = df_first_year.TempAvgF.rolling(window).mean();
    rolling.head(n=20)

    The output will be:

    0       NaN
    1       NaN
    2       NaN
    3       NaN
    4       NaN
    5       NaN
    6       NaN
    7       NaN
    8       NaN
    9       NaN
    10      NaN
    11      NaN
    12      NaN
    13      NaN
    14      NaN
    15      NaN
    16      NaN
    17      NaN
    18      NaN
    19    47.75
    Name: TempAvgF, dtype: float64
  7. Plot the raw data and the moving average signal, with the x axis as the day number in the year:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(range(1, 366), df_first_year.TempAvgF, label='Raw Data');
    ax.plot(range(1, 366), rolling, c='r', label=f'{window} day moving average');
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be as follows:

    Figure 3.79: Scatter plot of temperature throughout the year

Activity 6: Linear Regression Using the Least Squares Method

Solution

  1. Visualize the measurements:

    df.head()

    The output will be as follows:

    Figure 3.80: First five rows of activity2_measurements.csv dataset

  2. Visualize the rolling average values:

    rolling.head(n=30)

    The output will be as follows:

    Figure 3.81: Rolling head average

  3. Create a linear regression model using the default parameters; that is, calculate a y intercept for the model and do not normalize the data:

    model = LinearRegression()
    model

    The output will be as follows:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
  4. Now fit the model, where the input data is the day number for the year (1 to 365) and the output is the average temperature. To make later calculations easier, insert a column (DayOfYear) that corresponds with the day of the year for that measurement:

    df_first_year.loc[:,'DayOfYear'] = [i + 1 for i in df_first_year.index]
    df_first_year.head()

    The output will be as follows:

    Figure 3.82: Adding day of year column

  5. Fit the model with the DayOfYear values as the input and df_first_year.TempAvgF as the output:

    # Note the year values need to be provided as an N x 1 array
    model.fit(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)

    The output will be as follows:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
  6. Print the parameters of the model:

    print(f'm = {model.coef_[0]}')
    print(f'c = {model.intercept_}')
    
    print('\nModel Definition')
    print(f'y = {model.coef_[0]:0.4}x + {model.intercept_:0.4f}')

    The output will be:

    m = 0.04909173467448788
    c = 60.28196597922625
    
    Model Definition
    y = 0.04909x + 60.2820
  7. We can calculate the trendline values by using the first, middle, and last values (days in years) in the linear equation:

    trend_x = np.array([
        1,
        182.5,
        365
    ])
    
    trend_y = model.predict(trend_x.reshape((-1, 1)))
    trend_y

    The output will be as follows:

    array([60.33105771, 69.24120756, 78.20044914])
  8. Plot these values with the trendline:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
    ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
    ax.plot(trend_x, trend_y, c='k', label='Model: Predicted trendline')
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be as follows:

    Figure 3.83: Scatterplot of temperature thought the year with the predicted trendline

  9. Evaluate the performance of the model. How well does the model fit the data? Calculate the r2 score to find out:

    # Note the year values need to be provided as an N x 1 array
    r2 = model.score(df_first_year.DayOfYear.values.reshape((-1, 1)), df_first_year.TempAvgF)
    print(f'r2 score = {r2:0.4f}')

    The output will be:

    r2 score = 0.1222

Activity 7: Dummy Variables

Solution

  1. Plot the raw data (df) and moving average (rolling):

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
    ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be:

    Figure 3.84: Scatterplot of Temperature throughout the year

  2. Looking at the preceding plot, there seems to be an inflection point around day 250. Create a dummy variable to introduce this feature into the linear model:

    df_first_year.loc[:,'inflection'] = [1 * int(i < 250) for i in df_first_year.DayOfYear]
  3. Check the first and last samples to confirm that the dummy variable is correct. Check the first five samples:

    df_first_year.head()

    The output will be as follows:

    Figure 3.85: First five columns

    Then, check the last five samples:

    df_first_year.tail()

    The output will be:

    Figure 3.86: Last five columns

  4. Use a least squares linear regression model and fit the model to the DayOfYear values and the dummy variable to predict TempAvgF:

    # Note the year values need to be provided as an N x 1 array
    model = LinearRegression()
    model.fit(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)

    The output will be as follows:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
  5. Compute the r2 score:

    # Note the year values need to be provided as an N x 1 array
    r2 = model.score(df_first_year[['DayOfYear', 'inflection']], df_first_year.TempAvgF)
    print(f'r2 score = {r2:0.4f}')

    The output will be as follows:

    r2 score = 0.3631
  6. Using the DayOfYear values, create a set of predictions using the model to construct a trendline:

    trend_y = model.predict(df_first_year[['DayOfYear', 'inflection']].values)
    trend_y

    The output will be:

    array([51.60311133, 51.74622654, 51.88934175, 52.03245696, 52.17557217,
           52.31868739, 52.4618026 , 52.60491781, 52.74803302, 52.89114823,
           53.03426345, 53.17737866, 53.32049387, 53.46360908, 53.60672429,
           53.7498395 , 53.89295472, 54.03606993, 54.17918514, 54.32230035,
           54.46541556, 54.60853078, 54.75164599, 54.8947612 , 55.03787641,
    …
    …
           73.88056649, 74.0236817 , 74.16679692, 74.30991213, 74.45302734,
           74.59614255, 74.73925776, 74.88237297, 75.02548819, 75.1686034 ,
           75.31171861, 75.45483382, 75.59794903, 75.74106425, 75.88417946,
           76.02729467, 76.17040988, 76.31352509, 76.4566403 , 76.59975552,
           76.74287073, 76.88598594, 77.02910115, 77.17221636, 77.31533157])
  7. Plot the trendline against the data and the moving average:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
    ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
    ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be as follows:

    Figure 3.87: Predicted trendline

Activity 8: Other Model Types with Linear Regression

Solution

  1. Use a sine curve function as the basis of the model:

    # Using a sine curve
    df_first_year['DayOfYear2'] = np.sin(df_first_year['DayOfYear'] / df_first_year['DayOfYear'].max())
    df_first_year.head()

    The output will be as follows:

    Figure 3.88: First five rows

  2. Fit the model:

    # Note the year values need to be provided as an N x 1 array
    model = LinearRegression()
    model.fit(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)

    The output will be as follows:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
  3. Print the parameters of the model:

    print(f'a = {model.coef_[0]}')
    print(f'm = {model.coef_[1]}')
    print(f'c = {model.intercept_}')
    
    print('\nModel Definition')
    print(f'y = {model.coef_[0]:0.4}x^2 + {model.coef_[1]:0.4}x + {model.intercept_:0.4f}')

    The output will be as follows:

    a = 634.322313570282
    m = -1.4371290614190075
    c = 39.93286585807408
    
    Model Definition
    y = 634.3x^2 + -1.437x + 39.9329
  4. Compute the r2 value to measure the performance:

    # Note the year values need to be provided as an N x 1 array
    r2 = model.score(df_first_year[['DayOfYear2', 'DayOfYear']], df_first_year.TempAvgF)
    print(f'r2 score = {r2:0.4f}')

    The output will be:

    r2 score = 0.7047
  5. Construct the trendline values:

    trend_y = model.predict(df_first_year[['DayOfYear2', 'DayOfYear']].values)
    trend_y

    The output will be:

    array([40.23360397, 40.53432905, 40.83502803, 41.13568788, 41.43629555,
           41.736838  , 42.03730219, 42.33767507, 42.6379436 , 42.93809474,
           43.23811546, 43.5379927 , 43.83771344, 44.13726463, 44.43663324,
           44.73580624, 45.03477059, 45.33351327, 45.63202123, 45.93028146,
           46.22828093, 46.52600661, 46.82344549, 47.12058453, 47.41741073,
    …
    …
           59.96306563, 59.55705293, 59.14720371, 58.73351024, 58.31596484,
           57.89455987, 57.46928769, 57.04014072, 56.60711138, 56.17019215,
           55.7293755 , 55.28465397, 54.83602011, 54.38346649, 53.92698572,
           53.46657045, 53.00221334, 52.53390709, 52.06164442, 51.58541811,
           51.10522093, 50.62104569, 50.13288526, 49.6407325 , 49.14458033])
  6. Plot the trendline with the raw data and the moving average:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
    ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
    ax.plot(df_first_year.DayOfYear, trend_y, c='k', label='Model: Predicted trendline')
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be as follows:

    Figure 3.89: Predicted trendline

Activity 9: Gradient Descent

Solution

  1. Create a generic gradient descent model and normalize the day of year values as between 0 and 1:

    grad_model = SGDRegressor(max_iter=None, tol=1e-3)
    _x = df_first_year.DayOfYear / df_first_year.DayOfYear.max()
  2. Fit the model:

    grad_model.fit(_x.values.reshape((-1, 1)), df_first_year.TempAvgF)

    The output will be as follows:

    SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
           eta0=0.01, fit_intercept=True, l1_ratio=0.15,
           learning_rate='invscaling', loss='squared_loss', max_iter=None,
           n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
           random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
           verbose=0, warm_start=False)
  3. Print the details of the model:

    print(f'm = {grad_model.coef_[0]}')
    print(f'c = {grad_model.intercept_[0]}')
    
    print('\nModel Definition')
    print(f'y = {grad_model.coef_[0]:0.4}x + {grad_model.intercept_[0]:0.4f}')

    The output will be as follows:

    m = 26.406162532140563
    c = 55.07470859678077
    
    Model Definition
    y = 26.41x + 55.0747
  4. Prepare the x (_trend_x) trendline values by dividing them by the maximum. Predict y_trend_values using the gradient descent model:

    _trend_x = trend_x / trend_x.max()
    trend_y = grad_model.predict(_trend_x.reshape((-1, 1)))
    trend_y

    The output will be as follows:

    array([55.14705425, 68.27778986, 81.48087113])
  5. Plot the data and the moving average with the trendline:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_axes([1, 1, 1, 1]);
    
    # Temp measurements
    ax.scatter(df_first_year.DayOfYear, df_first_year.TempAvgF, label='Raw Data');
    ax.plot(df_first_year.DayOfYear, rolling, c='r', label=f'{window} day moving average');
    ax.plot(trend_x, trend_y, c='k', linestyle='--', label='Model: Predicted trendline')
    
    ax.set_title('Daily Mean Temperature Measurements')
    ax.set_xlabel('Day')
    ax.set_ylabel('Temperature (degF)')
    ax.set_xticks(range(1, 366), 10)
    ax.legend();

    The output will be as follows:

    Figure 3.90: Gradient descent predicted trendline

Activity 10: Autoregressors

Solution

  1. Plot the complete set of average temperature values (df.TempAvgF) with years on the x axis:

    plt.figure(figsize=(10, 7))
    plt.plot(df.TempAvgF.values);
    yrs = [yr for yr in df.Year.unique()]
    plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
    plt.title('Austin Texas Average Daily Temperature');
    plt.xlabel('Year');
    plt.ylabel('Temperature (F)');

    The output will be:

    Figure 3.91: Plot of temperature through the year

  2. Create a 20-day lag and plot the lagged data on the original dataset:

    plt.figure(figsize=(10, 7))
    plt.plot(df.TempAvgF.values, label='Original Dataset');
    plt.plot(df.TempAvgF.shift(20), c='r', linestyle='--',
        label='Lag 20');
    yrs = [yr for yr in df.Year.unique()]
    plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
    plt.title('Austin Texas Average Daily Temperature');
    plt.xlabel('Year');
    plt.ylabel('Temperature (F)');
    plt.legend();

    The output will be:

    Figure 3.92: Plot of temperature through the years with a 20-day lag

  3. Construct an autocorrelation plot to see whether the average temperature can be used with an autoregressor. Where is the lag acceptable and where is it not for an autoregressor?

    plt.figure(figsize=(10, 7))
    pd.plotting.autocorrelation_plot(df.TempAvgF);

    We'll get the following output:

    Figure 3.93: Plot of autocorrelation versus lag

    The lag is acceptable only when the autocorrelation line lies outside the 99% confidence bounds, as represented by the dashed lines.

  4. Chose an acceptable lag and an unacceptable lag and construct lag plots using these values for acceptable lag:

    plt.figure(figsize=(10,7))
    ax = pd.plotting.lag_plot(df.TempAvgF, lag=5);

    We'll get the following output:

    Figure 3.94: Plot of acceptable lag

    Use these values for unacceptable lag:

    plt.figure(figsize=(10,7))
    ax = pd.plotting.lag_plot(df.TempAvgF, lag=1000);

    We'll get the following output:

    Figure 3.95: Plot of unacceptable lag

  5. Create an autoregressor model, note the selected lag, calculate the R2 value, and plot the autoregressor model with the original plot. The model is to project past the available data by 1,000 samples:

    from statsmodels.tsa.ar_model import AR
    model = AR(df.TempAvgF)
  6. Fit the model to the data:

    model_fit = model.fit()
    print('Lag: %s' % model_fit.k_ar)
    print('Coefficients: %s' % model_fit.params)

    The output will be:

    Lag: 23
    Coefficients: const           1.909395
    L1.TempAvgF     0.912076
    L2.TempAvgF    -0.334043
    L3.TempAvgF     0.157353
    L4.TempAvgF     0.025721
    L5.TempAvgF     0.041342
    L6.TempAvgF     0.030831
    L7.TempAvgF    -0.021230
    L8.TempAvgF     0.020324
    L9.TempAvgF     0.025147
    L10.TempAvgF    0.059739
    L11.TempAvgF   -0.017337
    L12.TempAvgF    0.043553
    L13.TempAvgF   -0.027795
    L14.TempAvgF    0.053547
    L15.TempAvgF    0.013070
    L16.TempAvgF   -0.033157
    L17.TempAvgF   -0.000072
    L18.TempAvgF   -0.026307
    L19.TempAvgF    0.025258
    L20.TempAvgF    0.038341
    L21.TempAvgF    0.007885
    L22.TempAvgF   -0.008889
    L23.TempAvgF   -0.011080
    dtype: float64
  7. Create a set of predictions for 1,000 days after the last sample:

    predictions = model_fit.predict(start=model_fit.k_ar, end=len(df) + 1000)
    predictions[:10].values

    The output will be:

    array([54.81171857, 56.89097085, 56.41891585, 50.98627626, 56.11843512,
           53.20665111, 55.13941554, 58.4679288 , 61.92497136, 49.46049801])
  8. Plot the predictions, as well as the original dataset:

    plt.figure(figsize=(10, 7))
    plt.plot(df.TempAvgF.values, label='Original Dataset');
    plt.plot(predictions, c='g', linestyle=':', label='Predictions');
    yrs = [yr for yr in df.Year.unique()]
    plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
    plt.title('Austin Texas Average Daily Temperature');
    plt.xlabel('Year');
    plt.ylabel('Temperature (F)');
    plt.legend();

    The output will be:

    Figure 3.96: Plot of temperature through the years

  9. Enhance the view to look for differences by showing the 100th to 200th samples:

    plt.figure(figsize=(10, 7))
    plt.plot(df.TempAvgF.values, label='Original Dataset');
    plt.plot(predictions, c='g', linestyle=':', label='Predictions');
    yrs = [yr for yr in df.Year.unique()]
    plt.xticks(np.arange(0, len(df), len(df) // len(yrs)), yrs);
    plt.title('Austin Texas Average Daily Temperature');
    plt.xlabel('Year');
    plt.ylabel('Temperature (F)');
    plt.xlim([100, 200])
    plt.legend();

    We'll get the following output:

    Figure 3.97: Plot of predictions with the original dataset