## Chapter 6: Anomaly Detection

### Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

Solution:

Load the data as follows:

data(islands)

Draw a boxplot as follows:

boxplot(islands)

You should notice that the data is extremely fat-tailed: the box spanning the interquartile range, along with the median line inside it, takes up a tiny portion of the plot compared to the many observations that R has classified as outliers.

Create a new log-transformed dataset as follows:

log_islands<-log(islands)

Create a boxplot of the log-transformed data as follows:

boxplot(log_islands)

You should notice that there are only five outliers after the log transformation.
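To list the flagged points rather than read them off the plot, one option is `boxplot.stats()`, which applies the same 1.5 times IQR whisker rule that `boxplot()` draws. This is a minimal sketch and not part of the original solution; the names `raw_flagged` and `log_flagged` are illustrative.

```r
data(islands)
log_islands <- log(islands)

# $out holds the observations that fall beyond the whiskers
raw_flagged <- boxplot.stats(islands)$out
log_flagged <- boxplot.stats(log_islands)$out

# Compare how many points each scale flags
length(raw_flagged)
length(log_flagged)

# Names of the islands flagged on the log scale
names(log_flagged)
```

Note that `boxplot.stats()` computes its box edges from hinges rather than from `quantile()`, so its limits can differ slightly from limits calculated by hand.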

Calculate the interquartile range:

interquartile_range<-quantile(islands,.75)-quantile(islands,.25)

Add 1.5 times the interquartile range to the third quartile to get the upper limit of the non-outlier data:

upper_limit<-quantile(islands,.75)+1.5*interquartile_range

Classify outliers as any observations above this upper limit:

outliers<-islands[which(islands>upper_limit)]

Calculate the interquartile range for the log-transformed data:

interquartile_range_log<-quantile(log_islands,.75)-quantile(log_islands,.25)

Add 1.5 times the log-scale interquartile range to the third quartile of the log-transformed data to get the upper limit of the non-outlier data:

upper_limit_log<-quantile(log_islands,.75)+1.5*interquartile_range_log

Classify outliers as any observations whose log-transformed value is above this upper limit:

outliers_log<-islands[which(log_islands>upper_limit_log)]
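The two calculations above follow the same pattern, so they can be wrapped in a small function. The sketch below assumes a hypothetical helper named `upper_iqr_outliers()`, which is not part of the original solution, and uses the default method of `quantile()`.

```r
# Hypothetical helper: returns the observations above Q3 + k * IQR
upper_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), names = FALSE)
  upper_limit <- q[2] + k * (q[2] - q[1])
  x[x > upper_limit]
}

data(islands)

# Outliers on the original scale
outliers <- upper_iqr_outliers(islands)

# Outliers found on the log scale, reported in the original units
# by looking up the flagged island names in the untransformed data
outliers_log <- islands[names(upper_iqr_outliers(log(islands)))]

print(outliers)
print(outliers_log)
```

Looking the flagged names up in `islands` keeps both results in the same units, which makes the two sets easier to compare.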

Print the non-transformed outliers as follows:

print(outliers)

For the non-transformed outliers, we obtain the following:

Print the outliers identified on the log scale (stored in their original units) as follows:

print(outliers_log)

For the log-transformed outliers, we obtain the following:

Calculate the mean and standard deviation of the data:

island_mean<-mean(islands)
island_sd<-sd(islands)

Select observations that are more than two standard deviations above the mean:

outliers<-islands[which(islands>(island_mean+2*island_sd))]
outliers

We obtain the following outliers:

Calculate the mean and standard deviation of the log-transformed data:

island_mean_log<-mean(log_islands)
island_sd_log<-sd(log_islands)

Select observations that are more than two standard deviations above the mean:

outliers_log<-log_islands[which(log_islands>(island_mean_log+2*island_sd_log))]
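The selections above check only the upper tail. A two-sided version of the same rule can be sketched with z-scores, as follows. The helper name `sd_outliers()` and its parameter `k` are assumptions made for illustration, not part of the original solution.

```r
# Hypothetical helper: returns the observations more than k standard
# deviations from the mean, in either direction
sd_outliers <- function(x, k = 2) {
  z <- (x - mean(x)) / sd(x)
  x[abs(z) > k]
}

data(islands)

# Original scale
sd_outliers(islands)

# Log scale (values are returned on the log scale)
sd_outliers(log(islands))
```

For `islands`, the lower limit on the original scale is negative, so no observation can fall below it and the two-sided rule flags the same points as the one-sided rule.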

We print the log-transformed outliers as follows:

print(outliers_log)

The output is as follows:

### Activity 15: Using Mahalanobis Distance to Find Anomalies

Solution:

You can load and plot the data as follows:

data(cars)
plot(cars)

The output plot is the following:

Calculate the centroid:

centroid<-c(mean(cars$speed),mean(cars$dist))

Calculate the covariance matrix:

cov_mat<-cov(cars)

Calculate the inverse of the covariance matrix:

inv_cov_mat<-solve(cov_mat)
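As a quick sanity check, multiplying the inverse by the covariance matrix should give the identity matrix, up to floating-point error. This is a small sketch and not part of the original solution; `identity_check` is an illustrative name.

```r
data(cars)
cov_mat <- cov(cars)
inv_cov_mat <- solve(cov_mat)

# The product should equal the 2 x 2 identity matrix
identity_check <- inv_cov_mat %*% cov_mat

# all.equal() compares with a numerical tolerance; dimnames are ignored
all.equal(identity_check, diag(2), check.attributes = FALSE)
```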

Create a **NULL** variable, which will hold each of our calculated distances:

all_distances<-NULL

We can loop through each observation and compute its Mahalanobis distance from the centroid of the data. The quadratic form used here yields the squared distance, which preserves the ranking of the observations:

k<-1
while(k<=nrow(cars)){
the_distance<-cars[k,]-centroid
mahalanobis_dist<-t(matrix(as.numeric(the_distance)))%*% matrix(inv_cov_mat,nrow=2) %*% matrix(as.numeric(the_distance))
all_distances<-c(all_distances,mahalanobis_dist)
k<-k+1
}
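Base R also provides `mahalanobis()` in the stats package, which returns squared distances for all rows at once. The sketch below is an alternative to the loop rather than the original solution's method. It compares the built-in result with a vectorized version of the same quadratic form; the names `d2_builtin`, `d2_manual`, and `flagged` are illustrative.

```r
data(cars)
x <- as.matrix(cars)
centroid <- colMeans(x)
cov_mat <- cov(x)

# Built-in: squared Mahalanobis distance of every row from the centroid
d2_builtin <- mahalanobis(x, center = centroid, cov = cov_mat)

# Manual: subtract the centroid from each row, then evaluate
# d' S^-1 d row by row without an explicit loop
centered <- sweep(x, 2, centroid)
d2_manual <- rowSums((centered %*% solve(cov_mat)) * centered)

# The two approaches should agree up to floating-point error
all.equal(unname(d2_builtin), unname(d2_manual))

# Rows whose distance exceeds the 90th percentile
flagged <- cars[d2_builtin > quantile(d2_builtin, .9), ]
flagged
```

Because both results are squared distances, taking `sqrt()` would change the values but not which rows exceed the 90th percentile.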

Plot the data, then highlight the observations whose Mahalanobis distance is above the 90th percentile of all distances to see our outliers:

plot(cars)
points(cars$speed[which(all_distances>quantile(all_distances,.9))], cars$dist[which(all_distances>quantile(all_distances,.9))],col='red',pch=19)

We can see the output plot as follows, with the outlier points shown in red: