Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also being developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 5: Artificial Neural Networks: Predict Annual Income


Activity 14: Training a Multilayer Perceptron for our Census Income Dataset

  1. Using the preprocessed Census Income Dataset, separate the features from the target, creating the variables X and Y:

    X = data.drop("target", axis=1)
    Y = data["target"]

    As explained previously, there are several ways to separate X and Y. What matters is that X contains the features for all instances, while Y contains the class label for all instances.
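
    As a minimal sketch of two equivalent ways to perform this separation, the example below uses a tiny hypothetical DataFrame. The column names age and hours_per_week are assumptions for illustration only, and the positional variant assumes that the target is the last column:

```python
import pandas as pd

# Hypothetical three-row stand-in for the preprocessed dataset.
data = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40, 13, 40],
    "target": [0, 1, 0],
})

# Option 1: drop the target column by name (the approach used in this step).
X_drop = data.drop("target", axis=1)
Y_drop = data["target"]

# Option 2: select columns by position, assuming the target is the last column.
X_iloc = data.iloc[:, :-1]
Y_iloc = data.iloc[:, -1]

# Check that both options give the same features and the same labels.
print(X_drop.equals(X_iloc), Y_drop.equals(Y_iloc))
```

    Dropping by name is usually the safer choice, because it does not depend on the column order of the dataset.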

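  2. Divide the dataset into training, validation, and testing sets, holding out 10% of the instances for validation and 10% for testing. The second call uses test_size=0.1111 because the validation set is taken from the 90% that remains after the first split, and 0.1/0.9 ≈ 0.1111: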

    from sklearn.model_selection import train_test_split
    X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
    X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size=0.1111, random_state=101)

    The shape of the sets created should be as follows:

    X_train = (26048, 9)
    X_dev = (3256, 9)
    X_test = (3257, 9)
    Y_train = (26048, )
    Y_dev = (3256, )
    Y_test = (3257, )
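
    The set sizes above can be checked with a little arithmetic. This is a sketch that assumes train_test_split rounds a fractional test size up to the next whole instance, which is consistent with the shapes listed above. The total of 32561 instances is simply the sum of the three set sizes:

```python
import math

# Total number of instances: the sum of the three set sizes listed above.
n_total = 26048 + 3256 + 3257

# First split: hold out 10% of all instances for testing.
n_test = math.ceil(0.1 * n_total)
n_remaining = n_total - n_test

# Second split: 0.1111 of the remainder is roughly 10% of the full dataset.
n_dev = math.ceil(0.1111 * n_remaining)
n_train = n_remaining - n_dev

print(n_train, n_dev, n_test)
```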
  3. From the neural_network module, import the Multilayer Perceptron Classifier class. Initialize it and train the model over the training data.

    Leave the hyperparameters at their default values. Again, use a random_state equal to 101:

    from sklearn.neural_network import MLPClassifier
    model = MLPClassifier(random_state=101)
    model = model.fit(X_train, Y_train)
  4. Address any warning that may appear after training the model with the default values for the hyperparameters.

    No warning was raised during the training process of the network, which means that the model was able to achieve convergence using the default values for the hyperparameters. Nevertheless, keep in mind that this does not mean that the best model was achieved, and changes in the hyperparameter values may result in better performance of the model.
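
    To illustrate what such a warning looks like and how it could be detected, the sketch below trains a network on synthetic data with a deliberately small max_iter. The synthetic data, the helper fit_and_check, and the value max_iter=5 are assumptions for illustration and are not part of the activity:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption; not the Census Income Dataset).
X_demo, Y_demo = make_classification(n_samples=300, n_features=9, random_state=101)


def fit_and_check(max_iter):
    """Train an MLP and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = MLPClassifier(max_iter=max_iter, random_state=101)
        model.fit(X_demo, Y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned


# Five iterations is deliberately too few, so the warning is expected here.
demo_model, warned_short = fit_and_check(max_iter=5)
print("Warning with max_iter=5:", warned_short)
```

    When this warning does appear, a common first response is to raise max_iter so that the optimizer has more iterations in which to converge.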

  5. Calculate the accuracy of the model for all three sets (training, validation, and testing):

    from sklearn.metrics import accuracy_score
    
    X_sets = [X_train, X_dev, X_test]
    Y_sets = [Y_train, Y_dev, Y_test]
    
    accuracy = []
    
    # Score each set in turn: training, validation, testing
    for X_set, Y_set in zip(X_sets, Y_sets):
        pred = model.predict(X_set)
        score = accuracy_score(Y_set, pred)
        accuracy.append(score)

    The accuracy scores for the three sets should be as follows:

    Train set = 0.8342
    Dev set = 0.8111
    Test set = 0.8252
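
    For reference, accuracy is the fraction of predictions that match the true labels. The following sketch computes it by hand. The function manual_accuracy and the five labels are hypothetical, for illustration only:

```python
def manual_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(manual_accuracy(y_true, y_pred))  # 0.8
```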

Activity 15: Comparing Different Models to Choose the Best Fit for the Census Income Data Problem

  1. Open the Jupyter Notebook that you used to train the models.

  2. Compare the four models based on their accuracy score only.

    By taking the accuracy scores of the models from the previous chapter, together with those of the Multilayer Perceptron trained in this chapter, it is possible to perform a final comparison and choose the model that best solves the data problem. To do so, the following table displays the accuracy scores for all four models:

    Figure 5.17: Accuracy scores of all four models for the Census Income Dataset

    To identify the model with the best performance, begin by comparing the accuracy rates over the training sets. From this, it is possible to conclude that the decision tree model is a better fit to the data problem. Nonetheless, its performance over the validation and testing sets is lower than that achieved by the Multilayer Perceptron, which indicates high variance in the decision tree model.

    Hence, a good approach would be to address the high variance of the decision tree model by simplifying it, for instance by adding a pruning argument. Pruning "trims" the leaves of the tree, discarding some of its detail so that the model generalizes better. Ideally, the model should be able to reach a similar level of accuracy for all three sets, which would make it the best model for the data problem.
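
    As a rough sketch of this idea, the example below compares an unconstrained decision tree with one whose depth is limited through max_depth, a simple way of constraining the tree in scikit-learn (recent versions also offer cost-complexity pruning through ccp_alpha). The synthetic data and the value max_depth=4 are assumptions for illustration, not settings taken from the activity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Census Income Dataset).
X_demo, Y_demo = make_classification(
    n_samples=1000, n_features=9, flip_y=0.1, random_state=101
)
X_tr, X_dv, Y_tr, Y_dv = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=101
)

# Unconstrained tree: free to grow until it memorizes the training set.
full_tree = DecisionTreeClassifier(random_state=101).fit(X_tr, Y_tr)

# Constrained tree: max_depth=4 is an arbitrary illustrative limit.
small_tree = DecisionTreeClassifier(max_depth=4, random_state=101).fit(X_tr, Y_tr)

# Report the depth and the train-validation accuracy gap of each tree.
for name, tree in [("full", full_tree), ("max_depth=4", small_tree)]:
    gap = tree.score(X_tr, Y_tr) - tree.score(X_dv, Y_dv)
    print(f"{name}: depth={tree.get_depth()}, train-dev gap={gap:.3f}")
```

    A smaller gap between training and validation accuracy is the sign that the variance has been reduced. The right limit depends on the dataset and would normally be chosen using the validation set.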

    However, if the model is not able to overcome this variance, and assuming that all the models have been fine-tuned to achieve the maximum performance possible, the Multilayer Perceptron should be the model selected, considering that it performs best over the testing set. This is mainly because the performance of a model over the testing set is what defines its overall performance on unseen data, which means that the model with the higher testing set performance will be more useful in the long term.
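
    The selection rule described above can be sketched in a few lines. The Multilayer Perceptron scores are those reported in Activity 14. The second entry is a hypothetical placeholder for an overfitted model, not a value taken from Figure 5.17, and the function summarize is likewise an assumption for illustration:

```python
# MLP scores come from Activity 14. The second entry is a hypothetical
# placeholder for an overfitted model, NOT a value from Figure 5.17.
scores = {
    "Multilayer Perceptron": {"train": 0.8342, "dev": 0.8111, "test": 0.8252},
    "overfitted model (hypothetical)": {"train": 0.97, "dev": 0.80, "test": 0.80},
}


def summarize(scores):
    """Return the best model by test accuracy and each model's train-test gap."""
    gaps = {name: round(s["train"] - s["test"], 4) for name, s in scores.items()}
    best = max(scores, key=lambda name: scores[name]["test"])
    return best, gaps


best, gaps = summarize(scores)
print("Selected:", best)
print("Train-test gaps:", gaps)
```

    A large train-test gap flags a high-variance model, while the test accuracy decides which model is selected.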