Practical Machine Learning with R

By: Brindha Priyadarshini Jeyaraman, Ludvig Renbo Olsen, Monicah Wambugu

Overview of this book

With huge amounts of data being generated every moment, businesses need applications that apply complex mathematical calculations to data repeatedly and at speed. With machine learning techniques and R, you can develop these kinds of applications efficiently. Practical Machine Learning with R begins by helping you grasp the basics of machine learning methods, highlighting how and why they work. The focus is on getting these algorithms to work in practice rather than on mathematical derivations. As you progress from one chapter to the next, you will gain hands-on experience of building a machine learning solution in R. Using R packages such as rpart, randomForest, and mice (multiple imputation by chained equations), you will learn to implement algorithms including neural network classifiers, decision trees, and linear and non-linear regression. You will also explore machine learning techniques for both supervised and unsupervised learning, learn how to partition datasets, and see how to evaluate and compare the results from each model. By the end of this book, you will be able to approach your business problems by forming a good problem statement, selecting the most appropriate model, and ensuring that you do not overtrain it.

Chapter 2: Data Cleaning and Pre-processing

Activity 6: Pre-processing using Center and Scale


In this activity, we will center and scale the first two columns of the PimaIndiansDiabetes dataset.

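  1. Load the caret and mlbench libraries and the PimaIndiansDiabetes dataset:

    # Load Library caret (provides preProcess())
    library(caret)

    # load the dataset PimaIndiansDiabetes (shipped with mlbench)
    library(mlbench)
    data(PimaIndiansDiabetes)

    View the summary:

    # view the data
    summary(PimaIndiansDiabetes[, 1:2])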

    The output is as follows:

        pregnant         glucose     

    Min.   : 0.000   Min.   :  0.0  

    1st Qu.: 1.000   1st Qu.: 99.0  

    Median : 3.000   Median :117.0  

    Mean   : 3.845   Mean   :120.9  

    3rd Qu.: 6.000   3rd Qu.:140.2  

    Max.   :17.000   Max.   :199.0

  2. Use preProcess() to compute the centering and scaling parameters:

    # to standardise we will scale and center

    params <- preProcess(PimaIndiansDiabetes [,1:2], method=c("center", "scale"))

  3. Transform the dataset using predict():

    # transform the dataset

    new_dataset <- predict(params, PimaIndiansDiabetes [,1:2])

  4. Print the summary of the new dataset:

    # summarize the transformed dataset
    summary(new_dataset)

    The output is as follows:

        pregnant          glucose       

    Min.   :-1.1411   Min.   :-3.7812  

    1st Qu.:-0.8443   1st Qu.:-0.6848  

    Median :-0.2508   Median :-0.1218  

    Mean   : 0.0000   Mean   : 0.0000  

    3rd Qu.: 0.6395   3rd Qu.: 0.6054  

    Max.   : 3.9040   Max.   : 2.4429

    Notice that both columns now have a mean of 0: the values have been centered on the mean and scaled by the standard deviation.
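
To make the transformation concrete, the following base-R sketch reproduces centering and scaling by hand. It is an illustration rather than caret's implementation: it assumes the sample standard deviation from sd() is the scaling factor, and it uses R's built-in mtcars data as a stand-in so that it runs without caret or mlbench.

```r
# Illustrative sketch: center and scale two numeric columns by hand.
# Assumption: the built-in mtcars data stands in for PimaIndiansDiabetes.
df <- mtcars[, c("mpg", "wt")]

# Subtract each column's mean, then divide by its sample standard deviation
standardised <- as.data.frame(
  lapply(df, function(col) (col - mean(col)) / sd(col))
)

# Every column now has mean 0 (up to rounding) and standard deviation 1
print(round(colMeans(standardised), 10))
print(sapply(standardised, sd))

# Base R's scale() applies the same formula; the largest difference should be ~0
print(max(abs(as.matrix(standardised) - scale(df))))
```

Keeping the parameters (mean and standard deviation) separate from the transformation, as preProcess() and predict() do, makes it possible to apply the same shift and scale to new data later.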

Activity 7: Identifying Outliers


  1. Load the dataset:

    mtcars = read.csv("mtcars.csv")

  2. Load the outliers package, which provides the outlier() function:

    #Load the outliers library
    library(outliers)

  3. Detect outliers in the dataset using the outlier() function:

    #Detect outliers
    outlier(mtcars)

    The output is as follows:

        mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear    carb
     33.900   4.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000   5.000   8.000

  4. Display the values at the opposite extreme by setting opposite = TRUE:

    #This detects outliers from the other side
    outlier(mtcars, opposite = TRUE)

    The output is as follows:

       mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
    10.400  8.000 71.100 52.000  2.760  1.513 14.500  0.000  0.000  3.000  1.000

  5. Plot a box plot:

    #View the outliers
    boxplot(mtcars)

    The output is as follows:

Figure 2.36: Outliers in the mtcars dataset.

The circles mark the outliers.
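
The outputs above are easier to interpret once the selection rule is spelled out. According to the outliers package documentation, outlier() reports, for each column, the value that lies farthest from the column mean, and opposite = TRUE reports the value at the other extreme. The base-R sketch below mimics that rule. Note that farthest_from_mean is a hypothetical helper written for this illustration, not a function from any package, and the built-in mtcars data is assumed to match the CSV file.

```r
# Hypothetical helper mimicking the documented behaviour of outliers::outlier()
farthest_from_mean <- function(x, opposite = FALSE) {
  m <- mean(x)
  # TRUE when the minimum lies farther from the mean than the maximum
  min_is_farther <- (m - min(x)) > (max(x) - m)
  # opposite = TRUE flips the choice to the other extreme
  if (xor(min_is_farther, opposite)) min(x) else max(x)
}

# Apply the rule to every column of the built-in mtcars data
print(sapply(mtcars, farthest_from_mean))
print(sapply(mtcars, farthest_from_mean, opposite = TRUE))

# A box plot uses a different rule: points beyond 1.5 * IQR from the box
print(boxplot.stats(mtcars$hp)$out)
```

The two rules answer different questions. The first always returns one value per column, even when nothing is unusual, whereas the box plot rule flags a point only when it falls outside the whiskers, so a column can have no circles at all.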

Activity 8: Oversampling and Undersampling


The detailed solution is as follows:

  1. Read the mushroom CSV file into a data frame named ms and count the values of the bruises column:



    The output is as follows:

       f    t

    4748 3376

  2. Perform downsampling:


    undersampling <- downSample(x = ms[, -ncol(ms)], y = ms$bruises)


    The output is as follows:

       f    t

    3376 3376

  3. Perform oversampling:


    oversampling <- upSample(x = ms[, -ncol(ms)], y = ms$bruises)


    The output is as follows:

       f    t

    4748 4748

    In this activity, we learned to use downSample() and upSample() from the caret package to perform downsampling and oversampling.
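
The idea behind both operations can be sketched in base R without caret. This approximates what downSample() and upSample() do and is not their source code; the toy data frame and its 60/40 class split are made up for illustration.

```r
# Illustrative sketch of random downsampling and upsampling in base R.
# Assumption: a made-up data frame with a 60/40 split stands in for ms.
set.seed(42)
toy <- data.frame(
  x = rnorm(100),
  bruises = factor(rep(c("f", "t"), times = c(60, 40)))
)

# Row indices grouped by class, plus the smallest and largest class sizes
idx_by_class <- split(seq_len(nrow(toy)), toy$bruises)
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Downsampling: keep a random subset of n_min rows from every class
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))

# Upsampling: keep all rows and add random duplicates until each class has n_max
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}))

print(table(toy$bruises[down_idx]))
print(table(toy$bruises[up_idx]))
```

Downsampling discards rows from the majority class, while upsampling repeats rows from the minority class, so the choice is a trade-off between losing information and duplicating it.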

Activity 9: Sampling and OverSampling using ROSE


The detailed solution is as follows:

  1. Load the German credit dataset:

    #load the dataset
    library(caret)
    library(ROSE)
    data(GermanCredit)

  2. View the samples in the German credit dataset:

    #View samples
    head(GermanCredit)

  3. Check the class imbalance in the German credit dataset using the summary() function:

    #View the imbalanced data
    summary(GermanCredit$Class)

    The output is as follows:

    Bad Good

     300  700

  4. Use ROSE to balance the numbers:

    balanced_data <- ROSE(Class ~ ., data = GermanCredit, seed = 3)$data
    table(balanced_data$Class)

    The output is as follows:

    Good  Bad

     480  520

    In this activity, we used ROSE to generate a roughly balanced dataset: the count of the Bad class rose from 300 to 520, while the count of the Good class fell from 700 to 480.
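
ROSE produces its result through a smoothed bootstrap: it draws a class label for each new row, picks a real row of that class, and generates a synthetic neighbor around it. The base-R sketch below is a heavily simplified illustration of that idea for a single numeric feature, not the ROSE package's algorithm. The toy data, the amount column, and the bandwidth factor h are all assumptions made for this example.

```r
# Simplified illustration of a smoothed bootstrap (not the ROSE implementation).
# Assumption: a made-up data frame with 700 Good and 300 Bad rows.
set.seed(3)
toy <- data.frame(
  amount = c(rnorm(700, mean = 3000, sd = 800), rnorm(300, mean = 4000, sd = 1200)),
  Class  = factor(rep(c("Good", "Bad"), times = c(700, 300)))
)

n <- nrow(toy)
p <- 0.5   # probability of drawing the minority class for each new row
h <- 0.25  # hypothetical bandwidth factor controlling the amount of jitter

# Step 1: draw a class label for every synthetic row
new_class <- sample(c("Bad", "Good"), size = n, replace = TRUE, prob = c(p, 1 - p))

# Step 2: for each label, pick a real row of that class and add Gaussian noise
new_amount <- numeric(n)
for (cls in c("Bad", "Good")) {
  rows  <- which(new_class == cls)
  pool  <- toy$amount[toy$Class == cls]
  seeds <- sample(pool, length(rows), replace = TRUE)
  new_amount[rows] <- seeds + rnorm(length(rows), sd = h * sd(pool))
}

balanced <- data.frame(amount = new_amount, Class = factor(new_class))
print(table(balanced$Class))
```

Because each label is drawn at random, the class counts come out close to, but not exactly, equal, which is consistent with the 480/520 split shown above.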