Machine Learning Fundamentals

By: Hyatt Saleh

Overview of this book

As machine learning algorithms become more popular, new tools that optimize these algorithms are also being developed. Machine Learning Fundamentals explains how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms to real-world datasets to discover patterns and profiles, and explore the process of solving an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of an algorithm by tuning its hyperparameters. By the end of this book, you will have gained all the skills required to start programming machine learning algorithms.

Chapter 1: Introduction to scikit-learn


Activity 1: Selecting a Target Feature and Creating a Target Matrix

  1. Load the Titanic dataset using the seaborn library. First, import seaborn, and then use the load_dataset('titanic') function:

    import seaborn as sns
    titanic = sns.load_dataset('titanic')
    titanic.head(10)

    The head(10) call prints the top 10 instances; the output should match the following screenshot:

    Figure 1.23: An image showing the first 10 instances of the Titanic dataset

  2. The preferred target feature could be either survived or alive, mainly because both label whether a person survived the shipwreck. For the following steps, the variable chosen is survived; however, choosing alive would not affect the final shape of the variables, as the quick check below confirms.
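    As an optional sanity check (not part of the original solution), you can confirm that survived and alive encode the same information; the 'yes'/'no' to 1/0 mapping is an assumption based on how seaborn ships the dataset:

    # map the 'yes'/'no' labels in 'alive' to 1/0 and compare them
    # against the binary 'survived' column; every row should match
    same = (titanic['alive'].map({'yes': 1, 'no': 0}) == titanic['survived']).all()
    print(same)  # expected output: True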

  3. Create a variable, X, to store the features by using drop(). As explained previously, the selected target feature is survived, which is why it is dropped from the features matrix.

    Create a variable, Y, to store the target matrix. Use indexing to access only the values from the survived column:

    X = titanic.drop('survived',axis = 1)
    Y = titanic['survived']
  4. Print out the shape of variable X, as follows:

    X.shape
    (891, 14)

    Do the same for variable Y:

    Y.shape
    (891,)
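    The value 14 comes from the original DataFrame: titanic contains 15 columns, so dropping the survived column leaves 14 features. You can confirm this as follows:

    # the full DataFrame has 15 columns; dropping one leaves 14
    print(titanic.shape)  # (891, 15)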

Activity 2: Preprocessing an Entire Dataset

  1. Load the dataset and create the features and target matrices:

    import seaborn as sns
    titanic = sns.load_dataset('titanic')
    X = titanic[['sex','age','fare','class','embark_town','alone']]
    Y = titanic['survived']
    X.shape
    (891, 6)
  2. Check for missing values in all features.

    As we did previously, use isnull() to determine whether a value is missing, and use sum() to sum up the occurrences of missing values along each feature:

    print("Sex: " + str(X['sex'].isnull().sum()))
    print("Age: " + str(X['age'].isnull().sum()))
    print("Fare: " + str(X['fare'].isnull().sum()))
    print("Class: " + str(X['class'].isnull().sum()))
    print("Embark town: " + str(X['embark_town'].isnull().sum()))
    print("Alone: " + str(X['alone'].isnull().sum()))

    The output will look as follows:

    Sex: 0
    Age: 177
    Fare: 0
    Class: 0
    Embark town: 2
    Alone: 0

    As you can see from the preceding output, only two features contain missing values: age and embark_town.
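    A more concise way to obtain the same counts (an equivalent approach using plain pandas) is to sum the null flags of all columns at once:

    # X.isnull() returns a Boolean DataFrame; sum() totals the
    # missing-value flags column by column
    print(X.isnull().sum())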

  3. As age has many missing values, accounting for almost 20% of the total, the missing values should be replaced rather than dropped. Mean imputation will be applied, as shown in the following code:

    #Age: missing values
    mean = X['age'].mean()
    mean = mean.round()
    X['age'].fillna(mean,inplace = True)

    Figure 1.24: A screenshot displaying the output of the preceding code

    After calculating the mean, the missing values are replaced by it using the fillna() function.

    Note

    The preceding warning may appear because the values are being replaced over a slice of the DataFrame: the variable X was created as a slice of the entire DataFrame titanic. As X is the variable that matters for the current activity, it is not an issue to replace the values only over the slice rather than over the entire DataFrame.
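    If you would rather avoid the warning altogether, one option (a variation on the solution above, not the book's code) is to make X an explicit copy and assign the result back instead of using inplace=True:

    # creating X as an explicit copy detaches it from titanic,
    # so later assignments no longer modify a slice
    X = titanic[['sex','age','fare','class','embark_town','alone']].copy()

    # assigning the result back avoids replacing values in place
    X['age'] = X['age'].fillna(mean)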

  4. Given that the number of missing values in the embark_town feature is low, the instances are eliminated from the features matrix:

    Note

    To eliminate the missing values from the embark_town feature, it is required to eliminate the entire instance (observation) from the matrix.

    # Embark_town: missing values
    X = X[X['embark_town'].notnull()]
    X.shape
    (889, 6)

    The notnull() function flags all non-missing values in the object in question. In this case, it is used to obtain the non-missing values of the embark_town feature, and indexing is then used to keep only those rows in the matrix (X).
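    An equivalent one-liner, if you prefer it, is pandas' dropna() with its subset parameter, which removes only the rows whose embark_town value is missing:

    # keep every column, but drop the rows where 'embark_town' is NaN
    X = X.dropna(subset=['embark_town'])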

  5. Discover the outliers present in the numeric features. Let's use three standard deviations as the measure to calculate the min and max thresholds for the numeric features. Using the formula that we have learned, the min and max thresholds are calculated and compared against the min and max values of each feature:

    feature = "age"
    print("Min threshold: " + str(X[feature].mean() - (3 * X[feature].std())),"  Min val: " + str(X[feature].min()))
    print("Max threshold: " + str(X[feature].mean() + (3 * X[feature].std())),"  Max val: " + str(X[feature].max()))

    The output of the preceding code is shown here:

    Min threshold: -9.194052030619016   Min val: 0.42
    Max threshold: 68.62075619259876   Max val: 80.0

    Use the following code to calculate the min and max threshold for the fare feature:

    feature = "fare"
    print("Min threshold: " + str(X[feature].mean() - (3 * X[feature].std())),"  Min val: " + str(X[feature].min()))
    print("Max threshold: " + str(X[feature].mean() + (3 * X[feature].std())),"  Max val: " + str(X[feature].max()))

    The output of the preceding code is shown here:

    Min threshold: -116.99583207273355   Min val: 0.0
    Max threshold: 181.1891938275142   Max val: 512.3292

    As you can see from the preceding outputs, both features stay inside the range at the lower end (the min values are above the min thresholds), but their max values exceed the max thresholds.
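    Since the same formula is applied to every numeric feature, it can be convenient to wrap it in a small helper function; outlier_thresholds is a hypothetical name, not part of the original solution:

    def outlier_thresholds(series, n_std=3):
        # return the (min, max) thresholds located n_std standard
        # deviations away from the mean of a numeric Series
        mean, std = series.mean(), series.std()
        return mean - (n_std * std), mean + (n_std * std)

    min_age, max_age = outlier_thresholds(X['age'])
    min_fare, max_fare = outlier_thresholds(X['fare'])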

  6. The total counts of outliers for the age and fare features are 7 and 20, respectively. Neither amount represents a high percentage of the total number of values, which is why the outliers are eliminated from the features matrix. The following snippet can be used to eliminate the outliers and print the shape of the resulting matrix; a way to reproduce the outlier counts is shown after the snippet:

    # Age: outliers
    max_age = X["age"].mean() + (3 * X["age"].std())
    X = X[X["age"] <= max_age]
    X.shape
    (882, 6)
    
    # Fare: outliers
    max_fare = X["fare"].mean() + (3 * X["fare"].std())
    X = X[X["fare"] <= max_fare]
    X.shape
    (862, 6)
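    To see where the counts of 7 and 20 come from, you can sum the Boolean comparison against each threshold immediately before applying the corresponding filter:

    # comparing a Series to a scalar yields Booleans; sum() counts the Trues,
    # that is, the number of rows above the threshold at that point
    print((X["age"] > max_age).sum())
    print((X["fare"] > max_fare).sum())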
  7. Discover the outliers present in the text features. The value_counts() function is used to count the occurrences of the classes in each feature:

    feature = "alone"
    X[feature].value_counts()
    True     522
    False    340
    
    feature = "class"
    X[feature].value_counts()
    Third     489
    First     190
    Second    183
    
    feature = "alone"
    X[feature].value_counts()
    True     522
    False    340
    
    feature = "embark_town"
    X[feature].value_counts()
    Southampton     632
    Cherbourg       154
    Queenstown       76

    None of the classes for any of the features are considered to be outliers, as they all represent over 5% of the entire dataset.
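    To check the 5% rule directly rather than by eye, value_counts() can return proportions instead of raw counts:

    # normalize=True turns the counts into fractions of the total
    print(X["embark_town"].value_counts(normalize=True))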

  8. Convert all text features into their numeric representations. Use scikit-learn's LabelEncoder class, as shown in the following code:

    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    X["sex"] = enc.fit_transform(X['sex'].astype('str'))
    X["class"] = enc.fit_transform(X['class'].astype('str'))
    X["embark_town"] = enc.fit_transform(X['embark_town'].astype('str'))
    X["alone"] = enc.fit_transform(X['alone'].astype('str'))
  9. Print out the top 5 instances of the features matrix to view the result of the conversion:

    X.head()

    Figure 1.25: A screenshot displaying the first five instances of the features matrix

  10. Finally, apply normalization (or standardization) to the matrix; normalization is used here.

    As you can see from the following code, the min-max normalization formula is only applied to the features that need it. Since normalization rescales values to the range between 0 and 1, the features that already meet that condition (after encoding, sex and alone contain only 0s and 1s) do not need to be normalized:

    X["age"] = (X["age"] - X["age"].min())/(X["age"].max()-X["age"].min())
    X["fare"] = (X["fare"] - X["fare"].min())/(X["fare"].max()-X["fare"].min())
    X["class"] = (X["class"] - X["class"].min())/(X["class"].max()-X["class"].min())
    X["embark_town"] = (X["embark_town"] - X["embark_town"].min())/(X["embark_town"].max()-X["embark_town"].min())
    X.head(10)

    The top 10 rows of the final output are shown in the following screenshot:

    Figure 1.26: A screenshot displaying the first 10 instances of the normalized dataset
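    The same min-max rescaling can also be done with scikit-learn's MinMaxScaler, which is handy when several columns need rescaling at once; this sketch assumes the same four columns as above:

    from sklearn.preprocessing import MinMaxScaler

    # fit_transform learns each column's min and max and rescales
    # its values to the [0, 1] range
    cols = ["age", "fare", "class", "embark_town"]
    X[cols] = MinMaxScaler().fit_transform(X[cols])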