Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to analyze results coherently and improve the performance of an algorithm by tuning its hyperparameters. By the end of this book, you will have gained the skills required to start programming machine learning algorithms.

Chapter 4: Supervised Learning Algorithms: Predict Annual Income


Activity 11: Training a Naïve Bayes Model for our Census Income Dataset

Before working on step 1, make sure that the data has been preprocessed, as follows:

import pandas as pd
data = pd.read_csv("datasets/census_income_dataset.csv")
data = data.drop(["fnlwgt","education","relationship","sex", "race"], axis=1)

After reading the dataset, the five variables considered irrelevant for the study (fnlwgt, education, relationship, sex, and race) are removed, which leaves nine features plus the target.

Next, the remaining qualitative variables are converted into their numerical form via the following code:

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

features_to_convert = ["workclass","marital-status","occupation","native-country","target"]

for i in features_to_convert:
  data[i] = enc.fit_transform(data[i].astype('str'))
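To make the encoding step concrete, here is a minimal sketch on a toy DataFrame (the rows are illustrative, not taken from the census file). It assumes you may want to keep one fitted encoder per column, stored in a hypothetical `encoders` dictionary, so that the integer codes can be mapped back to labels later; the loop above reuses a single encoder, which only remembers the last column it was fitted on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two qualitative columns (illustrative values only)
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "target": ["<=50K", ">50K", "<=50K", "<=50K"],
})

encoders = {}  # hypothetical helper: one fitted encoder per column
for col in list(toy.columns):
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype("str"))
    encoders[col] = enc

# classes_ is sorted, so the position of each label is its integer code
workclass_codes = {
    str(label): code for code, label in enumerate(encoders["workclass"].classes_)
}
print(workclass_codes)

# Map integer codes back to the original labels
print(encoders["target"].inverse_transform([0, 1]))
```

Because the labels are sorted before codes are assigned, "<=50K" receives code 0 and ">50K" receives code 1 in this toy example, which is the convention the activities rely on when they interpret a prediction of zero.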

Once this is complete, you can begin with the steps of the activity:

  1. Using the preprocessed Census Income Dataset, separate the features from the target by creating the variables X and Y:

    X = data.drop("target", axis=1)
    Y = data["target"]

    Note that there are several ways to achieve the separation of X and Y. Use the one that you feel most comfortable with. However, take into account that X should contain the features for all instances, while Y should contain the class label of all instances.

  2. Divide the dataset into training, validation, and testing sets, holding out roughly 10% of all instances for the validation set and another 10% for the testing set:

    from sklearn.model_selection import train_test_split
    
    # Hold out 10% of all instances as the testing set
    X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
    
    # 0.1111 of the remaining 90% is roughly 10% of all instances
    X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size=0.1111, random_state=101)

    The final shape of all sets must match the values shown in the following code:

    X_train = (26048, 9)
    Y_train = (26048, )
    X_dev = (3256, 9)
    Y_dev = (3256, )
    X_test = (3257, 9)
    Y_test = (3257, )
  3. Import the Gaussian Naïve Bayes class, and then use the fit method to train the model over the training sets (X_train and Y_train):

    from sklearn.naive_bayes import GaussianNB
    
    model_NB = GaussianNB()
    model_NB.fit(X_train,Y_train) 
  4. Finally, perform a prediction using the model that you trained previously for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.

    Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:

    pred_1 = model_NB.predict([[39,6,13,4,0,2174,0,40,38]])
    print(pred_1)
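The split-and-train workflow of this activity can be sketched end to end on synthetic data. This is an illustration under stated assumptions rather than the book's solution: the 1,000 random rows and the labeling rule are invented, and the second split uses 0.1111 of the remaining 90% as one way to obtain roughly 10% of all instances for validation. The sketch also scores the model on the validation set, which is the usual purpose of holding that set out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the census data: 1,000 rows, 9 numeric features
rng = np.random.default_rng(101)
X = rng.normal(size=(1000, 9))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)  # invented labeling rule

# First split: hold out 10% of all rows for testing
X_new, X_test, Y_new, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=101
)

# Second split: 0.1111 of the remaining 90% is about 10% of all rows
X_train, X_dev, Y_train, Y_dev = train_test_split(
    X_new, Y_new, test_size=0.1111, random_state=101
)

for name, part in [("train", X_train), ("dev", X_dev), ("test", X_test)]:
    print(name, part.shape)

# Train on the training set and check accuracy on the validation set
model_NB = GaussianNB()
model_NB.fit(X_train, Y_train)
dev_accuracy = model_NB.score(X_dev, Y_dev)
print("validation accuracy:", dev_accuracy)
```

scikit-learn rounds a fractional test size up to a whole number of rows, so printing the shapes after each split is a cheap way to confirm that the subsets have the sizes you expect.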

Activity 12: Training a Decision Tree Model for our Census Income Dataset

The shape of the previously created subsets must be as follows:

X_train = (26048, 9)
Y_train = (26048, )
X_dev = (3256, 9)
Y_dev = (3256, )
X_test = (3257, 9)
Y_test = (3257, )

  1. Using the preprocessed Census Income Dataset that was previously split into the different subsets, import the DecisionTreeClassifier class, and then use the fit method to train the model over the training sets (X_train and Y_train):

    from sklearn.tree import DecisionTreeClassifier
    
    model_tree = DecisionTreeClassifier()
    model_tree.fit(X_train,Y_train) 
  2. Finally, perform a prediction using the model that you trained before for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.

    Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:

    pred_2 = model_tree.predict([[39,6,13,4,0,2174,0,40,38]])
    print(pred_2)
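Two optional refinements to the decision tree step are sketched below on toy data. Both are assumptions on my part, not part of the activity: fixing `random_state` so that repeated runs build the same tree, and wrapping the new instance in a DataFrame that reuses the training column names, which avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame receives a plain list. The column names `f0` to `f8` and the labeling rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data with 9 hypothetical feature columns
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(9)]
X_train = pd.DataFrame(rng.integers(0, 50, size=(200, 9)), columns=cols)
Y_train = (X_train["f0"] < 25).astype(int)  # invented labeling rule

# random_state makes tie-breaking between candidate splits reproducible
model_tree = DecisionTreeClassifier(random_state=101)
model_tree.fit(X_train, Y_train)

# Reuse the training column names for the new instance
new_instance = pd.DataFrame(
    [[39, 6, 13, 4, 0, 2174, 0, 40, 38]], columns=X_train.columns
)
pred = model_tree.predict(new_instance)
print(pred)
```

The predicted class here reflects only the invented rule on `f0`; it says nothing about the census data.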

Activity 13: Training an SVM Model for our Census Income Dataset

The shape of the previously created subsets must be as follows:

X_train = (26048, 9)
Y_train = (26048, )
X_dev = (3256, 9)
Y_dev = (3256, )
X_test = (3257, 9)
Y_test = (3257, )

  1. Using the preprocessed Census Income Dataset that was previously split into the different subsets, import the SVC class, and then use the fit method to train the model over the training sets (X_train and Y_train):

    from sklearn.svm import SVC
    
    model_svm = SVC()
    model_svm.fit(X_train,Y_train)
  2. Finally, perform a prediction using the model that you trained before for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.

    Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:

    pred_3 = model_svm.predict([[39,6,13,4,0,2174,0,40,38]])
    print(pred_3)
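The activity trains SVC directly on the raw features. As a variation that is not part of the activity, the sketch below standardizes the features first, since the default RBF kernel is distance-based and a feature with values in the thousands can dominate features with small values. The synthetic data, the enlarged sixth column, and the labeling rule are assumptions made for illustration only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: 9 features, one of them on a much larger scale
rng = np.random.default_rng(101)
X = rng.normal(size=(300, 9))
X[:, 5] *= 1000.0  # loosely imitates a wide-ranging monetary feature
Y = (X[:, 0] > 0).astype(int)  # invented labeling rule

# The pipeline learns the scaling on the training data and applies
# the same scaling automatically whenever predict is called
model_svm = make_pipeline(StandardScaler(), SVC())
model_svm.fit(X, Y)

train_accuracy = model_svm.score(X, Y)
pred = model_svm.predict(X[:1])
print("training accuracy:", train_accuracy)
print(pred)
```

Bundling the scaler and the classifier in one pipeline means new instances are passed in their original units, just like the nine raw values used in the activity.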