Applied Supervised Learning with R

By: Karthik Ramasubramanian, Jojo Moolayil

Overview of this book

R provides excellent visualization features that are essential for exploring data before using it in automated learning. Applied Supervised Learning with R helps you cover the complete process of employing R to develop applications using supervised machine learning algorithms for your business needs. The book starts by helping you develop your analytical thinking to create a problem statement using business inputs and domain research. You will then learn different evaluation metrics that compare various algorithms, and later progress to using these metrics to select the best algorithm for your problem. After finalizing the algorithm you want to use, you will study the hyperparameter optimization technique to fine-tune your set of optimal parameters. The book demonstrates how you can add different regularization terms to avoid overfitting your model. By the end of this book, you will have gained the advanced skills you need for modeling a supervised machine learning algorithm that precisely fulfills your business needs.

Chapter 1: R for Advanced Analytics


Activity 1: Create an R Markdown File to Read a CSV File and Write a Summary of Data

  1. Start RStudio and navigate to File | New File | R Markdown.

  2. On the New R Markdown window, provide the Title and Author name, as illustrated in the following screenshot. Ensure that you select the Word option under the Default Output Format section:

    Figure 1.13: Creating a new R Markdown file in RStudio

  3. Now, use the read.csv() function to read the bank-full.csv file:

    Figure 1.14: Using the read.csv method to read the data

  4. Finally, print the summary into a Word file using the summary() function:

    Figure 1.15: Final output after using the summary method
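
The code for steps 3 and 4 appears only in the screenshots, so the following is a minimal sketch of what such an R Markdown chunk might contain, not the book's exact code. The temporary file is a hypothetical stand-in for bank-full.csv so that the sketch runs anywhere, and the semicolon separator is an assumption about the real file; adjust the path and sep to match your copy of the data.

```r
# Hypothetical stand-in for bank-full.csv so the sketch is self-contained.
# The semicolon separator is an assumption about the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age;balance;duration",
             "58;2143;261",
             "44;29;151",
             "33;2;76"),
           csv_path)

# Step 3: read the file into a data frame.
df_bank_detail <- read.csv(csv_path, sep = ";")

# Step 4: print summary statistics for every column. Inside an R Markdown
# chunk, this output is what ends up in the knitted Word document.
summary(df_bank_detail)
```

In the actual activity, replace csv_path with the location of bank-full.csv on your machine.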

Activity 2: Create a List of Two Matrices and Access the Values

  1. Create two matrices of size 10 x 4 and 4 x 5, filled with random numbers drawn from a binomial distribution (use the rbinom() function). Name the matrices mat_A and mat_B, respectively:

    mat_A <- matrix(rbinom(n = 40, size = 100, prob = 0.4),nrow = 10, ncol=4)
    mat_B <- matrix(rbinom(n = 20, size = 100, prob = 0.4),nrow = 4, ncol=5)
  2. Now, store the two matrices in a list:

    list_of_matrices <- list(mat_A = mat_A, mat_B =mat_B)
  3. Using the list, access the element at row 4, column 2 of mat_A and store it in variable A, then access the element at row 2, column 1 of mat_B and store it in variable B:

    A <- list_of_matrices[["mat_A"]][4,2]
    B <- list_of_matrices[["mat_B"]][2,1]
  4. Multiply the values A and B (both are single numbers, not matrices) and subtract the product from the element at row 2, column 1 of mat_A:

    list_of_matrices[["mat_A"]][2,1] - (A*B)

    The output is as follows (the exact value will differ between runs because the matrices are randomly generated):

    ## [1] -1554
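
Because rbinom() draws new random numbers on every call, the result of this activity changes from run to run. One way to make it repeatable, sketched below, is to fix the random seed before generating the matrices. The helper make_matrices() and the seed value 123 are illustrative assumptions and are not part of the original activity.

```r
# Hypothetical helper: builds both matrices after fixing the random seed,
# so repeated calls with the same seed return identical matrices.
make_matrices <- function(seed) {
  set.seed(seed)
  list(
    mat_A = matrix(rbinom(n = 40, size = 100, prob = 0.4), nrow = 10, ncol = 4),
    mat_B = matrix(rbinom(n = 20, size = 100, prob = 0.4), nrow = 4, ncol = 5)
  )
}

first_run  <- make_matrices(123)
second_run <- make_matrices(123)

# TRUE: the same seed reproduces the same matrices.
identical(first_run, second_run)

# Same element access and arithmetic as in steps 3 and 4.
A <- first_run[["mat_A"]][4, 2]
B <- first_run[["mat_B"]][2, 1]
result <- first_run[["mat_A"]][2, 1] - (A * B)

# A and B are single elements, so the result is a single number.
length(result)
```

With a fixed seed, anyone rerunning the script obtains the same value for result, which makes the printed output easier to compare.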

Activity 3: Create a DataFrame with Five Summary Statistics for All Numeric Variables from Bank Data Using dplyr and tidyr

  1. Import the dplyr and tidyr packages in the system:

    library(dplyr)
    library(tidyr)
    ## Warning: package 'tidyr' was built under R version 3.2.5
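  2. Create the df DataFrame by converting the bank data to a tibble (this assumes the data has already been read into df_bank_detail):

    # as_tibble() replaces the deprecated tbl_df()
    df <- as_tibble(df_bank_detail)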
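  3. Select the numeric variables age, balance, duration, and pdays from the bank data using select(), and compute the min, 1st quartile, median, 3rd quartile, max, mean, and standard deviation of each using the summarise_all() function:

    # A named list of functions replaces the deprecated funs() helper;
    # each output column is named <variable>_<statistic>, for example age_min
    df_wide <- df %>%
      select(age, balance, duration, pdays) %>%
      summarise_all(list(min = min,
                         q25 = ~ quantile(., 0.25),
                         median = median,
                         q75 = ~ quantile(., 0.75),
                         max = max,
                         mean = mean,
                         sd = sd))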
  4. The result is a wide data frame with a single row and 28 columns (4 variables x 7 measures):

    dim(df_wide)
    ## [1]  1 28
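  5. Reshape the wide data frame df_wide into a tidy format with one row per variable, using the gather(), separate(), and spread() functions of the tidyr package. gather() stacks the 28 columns into name-value pairs, separate() splits each column name at the underscore into the variable and the statistic, and spread() turns the statistics back into columns: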

    df_stats_tidy <- df_wide %>% gather(stat, val) %>%
      separate(stat, into = c("var", "stat"), sep = "_") %>%
      spread(stat, val) %>%
      select(var,min, q25, median, q75, max, mean, sd) # reorder columns
    print(df_stats_tidy)

    The output is as follows:

    ## # A tibble: 4 x 8
    ##        var   min   q25 median   q75    max       mean         sd
    ## *    <chr> <dbl> <dbl>  <dbl> <dbl>  <dbl>      <dbl>      <dbl>
    ## 1      age    18    33     39    48     95   40.93621   10.61876
    ## 2  balance -8019    72    448  1428 102127 1362.27206 3044.76583
    ## 3 duration     0   103    180   319   4918  258.16308  257.52781
    ## 4    pdays    -1    -1     -1    -1    871   40.19783  100.12875
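
As a cross-check on the dplyr and tidyr pipeline above, the same table layout can be produced with base R alone. The following is a minimal sketch under stated assumptions: bank_numeric is a small hypothetical stand-in for the four numeric columns of the bank data, and seven_stats() is an illustrative helper, not a function from the book.

```r
# Hypothetical stand-in for the numeric columns of the bank data.
bank_numeric <- data.frame(
  age      = c(58, 44, 33, 47, 33),
  balance  = c(2143, 29, 2, 1506, 1),
  duration = c(261, 151, 76, 92, 198),
  pdays    = c(-1, -1, -1, -1, -1)
)

# Illustrative helper: the seven statistics used in the activity,
# returned as a named numeric vector.
seven_stats <- function(x) {
  c(min    = min(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    q75    = unname(quantile(x, 0.75)),
    max    = max(x),
    mean   = mean(x),
    sd     = sd(x))
}

# sapply() returns a 7 x 4 matrix with the statistics in rows; transposing
# puts one variable per row, matching the layout of df_stats_tidy.
stats_matrix <- t(sapply(bank_numeric, seven_stats))

df_stats_base <- data.frame(var = rownames(stats_matrix),
                            stats_matrix,
                            row.names = NULL)
print(df_stats_base)
```

Running the same helper on the real bank columns should reproduce the values in df_stats_tidy, which is a quick way to confirm that the reshaping steps did not mix up variables and statistics.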