Machine Learning Fundamentals

By: Hyatt Saleh
Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 6: Building Your Own Program


Activity 16: Performing the Preparation and Creation Stages for the Bank Marketing Dataset

For the purpose of this demonstration, fixed random_state values (0 for the data splits and 101 for the models) are used in the following solution so that the results are reproducible:

  1. Open a Jupyter Notebook to implement this activity and import pandas:

    import pandas as pd
  2. Load the previously downloaded dataset into the notebook:

    data = pd.read_csv("../datasets/bank-full.csv")

    The first 10 rows of the dataset can be seen using the statement data.head(10):

    Figure 6.6: A screenshot showing the first 10 instances of the dataset

    The missing values are shown as NaN, as explained previously.

  3. Select the metric that's the most appropriate for measuring the performance of the model, considering that the purpose of the study is to detect clients who would subscribe to the term deposit.

    The metric selected to evaluate the performance of the model is precision, as it compares the correctly classified positive labels (true positives) against the total number of instances predicted as positive (true positives plus false positives).
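
    As a quick standalone illustration (using made-up labels, not the dataset), precision can be computed with scikit-learn as follows:

    from sklearn.metrics import precision_score

    # Hypothetical labels: 2 true positives and 1 false positive
    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]

    print(precision_score(y_true, y_pred)) # 2 / 3 = 0.666...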

  4. Preprocess the dataset.

    Handling Missing Values

    Use the following code to check for missing values:

    data.isnull().sum()

    Based on the results, you will observe that only four features contain missing values: job (288), education (1,857), contact (13,020), and poutcome (36,959).

    The first two features can be left unhandled, considering that their missing values represent less than 5% of the entire data. On the other hand, 28.8% of the values are missing from the contact feature, and taking into account that the feature refers to the mode of contact, which is irrelevant for determining whether a person will subscribe to a new product, it is safe to remove this feature from the study. Finally, the poutcome feature is missing 81.7% of its values, which is why this feature is also removed from the study.
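
    As a sanity check, these percentages can be derived directly from the output of isnull() (a minimal sketch using the data variable loaded earlier):

    # Express the missing values of each feature as a percentage of all instances
    missing_ratio = data.isnull().sum() / data.shape[0] * 100
    print(missing_ratio[missing_ratio > 0])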

    Using the following code, the contact and poutcome features are dropped:

    data = data.drop(["contact", "poutcome"], axis=1)

    Converting the Categorical Features into Numeric Form

    For all nominal features, use the following code:

    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    
    features_to_convert = ["job", "marital", "default", "housing", "loan", "month", "y"]
    
    for i in features_to_convert:
      data[i] = enc.fit_transform(data[i].astype("str"))

    The preceding code, as explained in previous chapters, converts all the qualitative features into their numeric forms.
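
    To double-check the conversion, the data types and the integer codes assigned by LabelEncoder can be inspected (a quick sketch):

    # All converted features should now hold integer codes
    print(data[features_to_convert].dtypes)
    print(data["job"].unique())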

    Next, to handle the ordinal feature, we must use the following code:

    data["education"] = data["education"].fillna["unknown"]
    encoder = ["unknown", "primary", "secondary", "tertiary"]
    
    for i, word in enumerate(encoder):
      data["education"] = data["education"].str.replace(word,str(i))
      data["education"] = data["education"].astype("int64")

    Here, the first line converts the NaN values to the word unknown, and the second line sets the order of the values in the feature. Next, a for loop replaces each word with a number that follows that order: 0 replaces the word unknown, 1 replaces primary, and so on. Finally, once every word has been replaced, the whole column is converted into an integer type, since the replace function writes the numbers as strings.
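
    As a side note, the same ordinal encoding can be written more directly with pandas' map function. The following sketch is an alternative that would replace the loop above (it should not be run after it, as the column would already contain integers):

    # Alternative: map each education level straight to its ordinal code
    education_order = {"unknown": 0, "primary": 1, "secondary": 2, "tertiary": 3}
    data["education"] = data["education"].fillna("unknown").map(education_order).astype("int64")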

    Dealing with Outliers

    Use the following code to check for outliers:

    # Dictionary mapping each feature to its outlier count
    outliers = {}
    
    for i in range(data.shape[1]):
      # Values beyond three standard deviations from the mean are flagged as outliers
      min_t = data[data.columns[i]].mean() - (3 * data[data.columns[i]].std())
      max_t = data[data.columns[i]].mean() + (3 * data[data.columns[i]].std())
      count = 0
    
      for j in data[data.columns[i]]:
        if j < min_t or j > max_t:
          count += 1
    
      outliers[data.columns[i]] = [count, data.shape[0] - count]

    By analyzing the results from the preceding code, you will observe that the outliers do not account for more than 5% of the total values in each feature, which is why they can be left unhandled.
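
    The percentages themselves can be printed with a short sketch over the outliers dictionary built above:

    # Share of outliers per feature, to confirm the < 5% claim
    for feature, (count, rest) in outliers.items():
      print(feature, round(count / data.shape[0] * 100, 2))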

  5. Separate the features from the class label and split the dataset into three sets (training, validation, and testing).

    To separate the features from the target value, use the following code:

    X = data.drop("y", axis = 1)
    Y = data["y"]

    Next, to perform a split of the form 60/20/20%, use the following code:

    from sklearn.model_selection import train_test_split
    X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
    X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size = 0.25, random_state = 0)

    The shape of each set is as follows:

    X_train = (27126, 14)
    Y_train = (27126, )
    X_dev = (9042, 14)
    Y_dev = (9042, )
    X_test = (9043, 14)
    Y_test = (9043, )
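
    These shapes can be confirmed by printing each set's shape, for example:

    # Print the shape of each resulting set
    sets = {"X_train": X_train, "Y_train": Y_train, "X_dev": X_dev,
            "Y_dev": Y_dev, "X_test": X_test, "Y_test": Y_test}
    for name in sets:
      print(name, sets[name].shape)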
  6. Apply the Decision Tree and the Multilayer Perceptron algorithms over the dataset and train the models.

    By using the following code, both algorithms can be trained:

    from sklearn.tree import DecisionTreeClassifier
    model_tree = DecisionTreeClassifier(random_state = 101)
    model_tree.fit(X_train, Y_train)
    
    from sklearn.neural_network import MLPClassifier
    model_NN = MLPClassifier(random_state = 101)
    model_NN.fit(X_train, Y_train)
  7. Evaluate both models by using the metric that was selected previously.

    Using the following code, it is possible to measure the precision score of the Decision Tree model:

    from sklearn.metrics import precision_score
    X_sets = [X_train, X_dev, X_test]
    Y_sets = [Y_train, Y_dev, Y_test]
    
    precision = []
    
    for i in range(0, len(X_sets)):
      pred = model_tree.predict(X_sets[i])
      score = precision_score(Y_sets[i], pred)
      precision.append(score)

    The same code can be modified to calculate the score for the Multilayer Perceptron.
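
    For instance, the modified version might look as follows (reusing the X_sets and Y_sets lists defined above):

    precision_NN = []
    
    for i in range(0, len(X_sets)):
      pred = model_NN.predict(X_sets[i])
      score = precision_score(Y_sets[i], pred)
      precision_NN.append(score)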

    The results from the code are shown in the following table:

    Figure 6.7: Precision scores for both models

  8. Fine-tune some of the hyperparameters to fix the issues detected during the evaluation of the model by performing error analysis.

    Although the precision of the Decision Tree over the training set is perfect, comparing it against the results on the other two sets makes it possible to conclude that the model suffers from high variance.

    On the other hand, the Multilayer Perceptron has a similar performance on all three sets, but the overall performance is low, which means that the model is more likely to be suffering from high bias.

    Considering this, for the decision tree model, both the minimum number of samples required to be at a leaf node and the maximum depth of the tree are changed in order to simplify the model. On the other hand, for the Multilayer Perceptron, the number of iterations, the number of hidden layers, the number of units in each layer, and the tolerance for optimization are changed.

    The following code shows the final values used for each hyperparameter; note that arriving at them requires trying several different values:

    from sklearn.tree import DecisionTreeClassifier
    model_tree = DecisionTreeClassifier(random_state = 101, min_samples_leaf = 100, max_depth = 100)
    model_tree.fit(X_train, Y_train)
    
    from sklearn.neural_network import MLPClassifier
    model_NN = MLPClassifier(random_state = 101, max_iter = 1000, hidden_layer_sizes = [100,100,50,25,25], tol=1e-7)
    model_NN.fit(X_train, Y_train)
  9. Compare the final versions of your models and select the one that you consider best fits the data.

    By calculating the precision score for all three sets for the newly trained models, we obtain the following values:

    Figure 6.8: Precision scores for the newly trained models

    Both models achieve an improvement in performance, and by comparing the values, it is possible to conclude that the Multilayer Perceptron outperforms the Decision Tree. Based on this, the Multilayer Perceptron is selected as the better model for solving the data problem.

Activity 17: Saving and Loading the Final Model for the Bank Marketing Dataset

  1. Save the model into a file named final_model.pkl:

    import pickle
    import os
    path = os.getcwd() + "/final_model.pkl"
    file = open(path, "wb")
    pickle.dump(model_NN, file)
    file.close()
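
    Alternatively, a with statement can be used so that the file is closed automatically:

    with open(os.getcwd() + "/final_model.pkl", "wb") as file:
      pickle.dump(model_NN, file)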
  2. Open a new Jupyter Notebook and import the required modules and class:

    from sklearn.neural_network import MLPClassifier
    import pickle
    import os
  3. Load the model:

    path = os.getcwd() + "/final_model.pkl"
    file = open(path, "rb")
    model = pickle.load(file)
    file.close()
  4. Perform a prediction for an individual by using the following values:

    42, 2, 0, 0, 1, 2, 1, 0, 5, 8, 380, 1, -1, 0.

    pred = model.predict([[42,2,0,0,1,2,1,0,5,8,380,1,-1,0]])

    By printing the pred variable, the output is 0, which is the numeric form of No. This means that the individual is not likely to subscribe to the new product.
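
    That mapping can be made explicit with a small sketch (LabelEncoder assigns codes alphabetically, so no becomes 0 and yes becomes 1):

    # Translate the numeric prediction back into the original label
    labels = {0: "no", 1: "yes"}
    print(labels[pred[0]]) # prints "no" for this individual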

Activity 18: Allowing Interaction with the Bank Marketing Dataset Model

  1. In a text editor, create a class that contains two main methods. One should be an initializer that loads the saved model, and the other should be a predict method where the data is fed to the model to retrieve an output:

    import pandas as pd
    import pickle
    import os
    from sklearn.neural_network import MLPClassifier
    
    class NN_Model(object):
    
      def __init__(self):
        path = os.getcwd() + "/model_exercise.pkl"
        file = open(path, "rb")
        self.model = pickle.load(file)
    
      def predict(self, age, job, marital, education, default, balance, housing, loan, day, month, duration, campaign, pdays, previous):
        X = [[age, job, marital, education, default, balance, housing, loan, day, month, duration, campaign, pdays, previous]]
        return self.model.predict(X)
  2. In a Jupyter Notebook, import and initialize the class that you created in the last step. Next, create the variables that will hold the values for the features and use the following values: 42, 2, 0, 0, 1, 2, 1, 0, 5, 8, 380, 1, -1, 0.

    from trainedModel import NN_Model
    
    model = NN_Model()
    
    age = 42
    job = 2
    marital = 0
    education = 0
    default = 1
    balance = 2
    housing = 1
    loan = 0
    day = 5
    month = 8
    duration = 380
    campaign = 1
    pdays = -1
    previous = 0
  3. Perform a prediction by applying the predict method:

    pred = model.predict(age=age, job=job, marital=marital, education=education, default=default, balance=balance, housing=housing, loan=loan, day=day, month=month, duration=duration, campaign=campaign, pdays=pdays, previous=previous)

    By printing the variable, the prediction is equal to 0; that is, the individual with the given features is not likely to subscribe to the product.