Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.
Table of Contents (9 chapters)

Chapter 6: Anomaly Detection


Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

Solution:

  1. Load the data as follows:

    data(islands)
  2. Draw a boxplot as follows:

    boxplot(islands)

    Figure 6.21: Boxplot of the islands dataset

    You should notice that the data is extremely fat-tailed, meaning that the median and interquartile range take up a relatively tiny portion of the plot compared to the many observations that R has classified as outliers.

  3. Create a new log-transformed dataset as follows:

    log_islands <- log(islands)
  4. Create a boxplot of the log-transformed data as follows:

    boxplot(log_islands)

    Figure 6.22: Boxplot of the log-transformed dataset

    You should notice that there are only five outliers after the log transformation.

  5. Calculate the interquartile range:

    interquartile_range <- quantile(islands, .75) - quantile(islands, .25)
  6. Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:

    upper_limit <- quantile(islands, .75) + 1.5 * interquartile_range
  7. Classify outliers as any observations above this upper limit:

    outliers <- islands[which(islands > upper_limit)]
  8. Calculate the interquartile range for the log-transformed data:

    interquartile_range_log <- quantile(log_islands, .75) - quantile(log_islands, .25)
  9. Add 1.5 times the interquartile range to the third quartile of the log-transformed data to get its upper limit:

    upper_limit_log <- quantile(log_islands, .75) + 1.5 * interquartile_range_log
  10. Classify outliers as any observations whose log-transformed values exceed this upper limit:

    outliers_log <- islands[which(log_islands > upper_limit_log)]
  11. Print the non-transformed outliers as follows:

    print(outliers)

    For the non-transformed outliers, we obtain the following:

    Figure 6.23: Non-transformed outliers

    Print the log-transformed outliers as follows:

    print(outliers_log)

    For the log-transformed outliers, we obtain the following:

    Figure 6.24: Log-transformed outliers

  12. Calculate the mean and standard deviation of the data:

    island_mean <- mean(islands)
    island_sd <- sd(islands)
  13. Select observations that are more than two standard deviations above the mean:

    outliers <- islands[which(islands > (island_mean + 2 * island_sd))]
    outliers

    We obtain the following outliers:

    Figure 6.25: Screenshot of the outliers

  14. Calculate the mean and standard deviation of the log-transformed data:

    island_mean_log <- mean(log_islands)
    island_sd_log <- sd(log_islands)
  15. Select observations that are more than two standard deviations above the mean of the log-transformed data:

    outliers_log <- log_islands[which(log_islands > (island_mean_log + 2 * island_sd_log))]
  16. Print the log-transformed outliers as follows:

    print(outliers_log)

    The output is as follows:

    Figure 6.26: Log-transformed outliers
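As a cross-check on the two approaches above, the sketch below computes both upper limits side by side (variable names such as `iqr_outliers` and `sd_outliers` are illustrative, not from the book). On this data the parametric threshold sits much higher than the interquartile-range fence, so the two-standard-deviation rule should flag a subset of the landmasses that the non-parametric rule flags.

```r
# Compare the non-parametric (1.5 * IQR) and parametric (mean + 2 * sd)
# outlier rules on the islands data from base R's datasets package.
data(islands)

# Non-parametric rule: anything above Q3 + 1.5 * IQR
iqr <- quantile(islands, .75) - quantile(islands, .25)
upper_limit <- quantile(islands, .75) + 1.5 * iqr
iqr_outliers <- islands[islands > upper_limit]

# Parametric rule: anything more than two standard deviations above the mean
sd_upper <- mean(islands) + 2 * sd(islands)
sd_outliers <- islands[islands > sd_upper]

# The parametric threshold is far higher here, so it flags fewer landmasses
length(iqr_outliers)
length(sd_outliers)
```

Note that `boxplot.stats()` derives its fences from `fivenum()` hinges rather than `quantile()`, so the outliers a boxplot draws can differ slightly from the `quantile()`-based rule used in these steps.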

Activity 15: Using Mahalanobis Distance to Find Anomalies

Solution:

  1. You can load and plot the data as follows:

    data(cars)
    plot(cars)

    The output plot is the following:

    Figure 6.27: Plot of the cars dataset

  2. Calculate the centroid:

    centroid <- c(mean(cars$speed), mean(cars$dist))
  3. Calculate the covariance matrix:

    cov_mat <- cov(cars)
  4. Calculate the inverse of the covariance matrix:

    inv_cov_mat <- solve(cov_mat)
  5. Create a NULL variable, which will hold each of our calculated distances:

    all_distances <- NULL
  6. Loop through the observations and find the Mahalanobis distance between each observation and the centroid of the data:

    k <- 1
    while (k <= nrow(cars)) {
      the_distance <- cars[k, ] - centroid
      mahalanobis_dist <- t(matrix(as.numeric(the_distance))) %*%
        matrix(inv_cov_mat, nrow = 2) %*%
        matrix(as.numeric(the_distance))
      all_distances <- c(all_distances, mahalanobis_dist)
      k <- k + 1
    }
  7. Plot all observations that have particularly high Mahalanobis distances to see our outliers:

    plot(cars)
    points(cars$speed[which(all_distances > quantile(all_distances, .9))],
           cars$dist[which(all_distances > quantile(all_distances, .9))],
           col = 'red', pch = 19)

    We can see the output plot as follows, with the outlier points shown in red:

    Figure 6.28: Plot with outliers marked
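Base R's stats package also ships a vectorized `mahalanobis()` function that returns the same squared distances the while loop computes, which makes a convenient sanity check on the activity. A minimal sketch (names such as `builtin_distances` are illustrative, not from the book):

```r
# Cross-check the hand-rolled loop with stats::mahalanobis(),
# which returns squared Mahalanobis distances for every row at once.
data(cars)

centroid <- colMeans(cars)
cov_mat <- cov(cars)

# Vectorized equivalent of the while loop
builtin_distances <- mahalanobis(cars, center = centroid, cov = cov_mat)

# Recompute the first distance by hand to confirm agreement
d1 <- as.numeric(cars[1, ]) - centroid
manual_d1 <- as.numeric(t(d1) %*% solve(cov_mat) %*% d1)
all.equal(manual_d1, unname(builtin_distances[1]))  # TRUE
```

Flagging the top 10% of `builtin_distances` with `quantile(builtin_distances, .9)` reproduces the red points in Figure 6.28 without any explicit looping.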