Book Image

Applied Unsupervised Learning with R

By : Alok Malik, Bradford Tuckfield
Book Image

Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.
Table of Contents (9 chapters)

Chapter 4: Dimension Reduction


Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

Solution:

  1. Before starting our main analysis, we will remove one variable that will not be relevant to us:

    Boston<-Boston[,-12]
  2. We will create dummy variables. We will end up with one original dataset, and one dummy variable dataset. We do that as follows:

    Boston_original<-Boston

    Next, we will create dummy variables for each of the measurements in the original dataset. You can find out the meaning of each of the variables in the dataset in the documentation of the MASS package, available at https://cran.r-project.org/web/packages/MASS/MASS.pdf.

  3. Create dummy variables for whether a town has high or low crime per capita:

    Boston$highcrim<-1*(Boston$indus>median(Boston$crim))
    Boston$lowcrim<-1*(Boston$indus<=median(Boston$crim))

    Create dummy variables for whether a town has a high or low proportion of land zoned for lots over 25,000 feet:

    Boston$highzn<-1*(Boston$zn>median(Boston$zn))
    Boston$lowzn<-1*(Boston$zn<=median(Boston$zn))

    Create dummy variables for whether a town has a high or low proportion of non-retail business acres per town:

    Boston$highindus<-1*(Boston$indus>median(Boston$indus))
    Boston$lowindus<-1*(Boston$indus<=median(Boston$indus))

    Create dummy variables for whether a town borders the Charles River:

    Boston$highchas<-(Boston$chas)
    Boston$lowchas<-(1-Boston$chas)

    Create dummy variables for whether a town has a high or low nitrogen oxide concentration:

    Boston$highnox<-1*(Boston$nox>median(Boston$nox))
    Boston$lownox<-1*(Boston$nox<=median(Boston$nox))

    Create dummy variables for whether a town has a high or low average number of rooms per dwelling:

    Boston$highrm<-1*(Boston$rm>median(Boston$rm))
    Boston$lowrm<-1*(Boston$rm<=median(Boston$rm))

    Create dummy variables for whether a town has a high or low proportion of owner-occupied units built prior to 1940:

    Boston$highage<-1*(Boston$age>median(Boston$age))
    Boston$lowage<-1*(Boston$age<=median(Boston$age))

    Create dummy variables for whether a town has a high or low average distance to five of Boston's employment centers:

    Boston$highdis<-1*(Boston$dis>median(Boston$dis))
    Boston$lowdis<-1*(Boston$dis<=median(Boston$dis))

    Create dummy variables for whether a town has a high or low index of accessibility to radial highways:

    Boston$highrad<-1*(Boston$rad>median(Boston$rad))
    Boston$lowrad<-1*(Boston$rad<=median(Boston$rad))

    Create dummy variables for whether a town has a high or low full-value property tax rate:

    Boston$hightax<-1*(Boston$tax>median(Boston$tax))
    Boston$lowtax<-1*(Boston$tax<=median(Boston$tax))

    Create dummy variables for whether a town has a high or low pupil-teacher ratio:

    Boston$highptratio<-1*(Boston$ptratio>median(Boston$ptratio))
    Boston$lowptratio<-1*(Boston$ptratio<=median(Boston$ptratio))

    Create dummy variables for whether a town has a high or low proportion of lower-status population:

    Boston$highlstat<-1*(Boston$lstat>median(Boston$lstat))
    Boston$lowlstat<-1*(Boston$lstat<=median(Boston$lstat))

    Create dummy variables for whether a town has a high or low median home value:

    Boston$highmedv<-1*(Boston$medv>median(Boston$medv))
    Boston$lowmedv<-1*(Boston$medv<=median(Boston$medv))
  4. Create a dataset that consists entirely of the dummy variables we have just created:

    Bostondummy<-Boston[,14:ncol(Boston)]
  5. Finally, we will restore our Boston_2 dataset to its original form before all of the dummy variables were added:

    Boston<-Boston_original
  6. Calculate the eigenvalues and eigenvectors of the covariance matrix of the dataset, as follows:

    Boston_cov<-cov(Boston)
    Boston_eigen<-eigen(Boston_cov)
    print(Boston_eigen$vectors)

    The output is as follows:

    Figure 4.17: Eigenvectors of the covariance matrix

  7. Print eigen values as follows:

    print(Boston_eigen$values)

    The output is as follows:

    Figure 4.18: Eigenvalues of the covariance matrix

  8. For the third part, we create a simple scree plot based on the eigenvalues:

    plot(Boston_eigen$values,type='o')

    The output is as follows:

    Figure 4.19: Plot of the eigenvalues

  9. Next, we choose the number of eigenvectors we will use (I chose 10), and we transform the dataset to be 10-dimensional, as follows:

    neigen<-10
    transformed<-t(t(as.matrix(Boston_eigen$vectors[,1:neigen])) %*% t(as.matrix(Boston)))
  10. Then, we restore the dataset as much as possible:

    restored<- t(as.matrix(Boston_eigen$vectors[,1:neigen]) %*% t(as.matrix(transformed)))
  11. Finally, we can check how close our restoration is to the original dataset, as follows:

    print(head(restored-Boston))
  12. Here, we need to specify a support threshold (for example, 20%), and complete the first pass through the data:

    support_thresh<-0.2
    firstpass<-unname(which(colMeans(Bostondummy,na.rm=TRUE)>support_thresh))
  13. Here, we complete the second pass through the data:

    secondcand<-t(combn(firstpass,2))
    secondpass<-NULL
    k<-1
    while(k<=nrow(secondcand)){
    support<-mean(Bostondummy[,secondcand[k,1]]*Bostondummy[,secondcand[k,2]],na.rm=TRUE)
    if(support>support_thresh){
    secondpass<-rbind(secondpass,secondcand[k,])
    }
    k<-k+1
    }
  14. Here, we complete the third pass, and then do filtering based on the confidence and lift thresholds:

    thirdpass<-NULL
    k<-1
    while(k<=nrow(secondpass)){
    j<-1
    while(j<=length(firstpass)){
    n<-1
    product<-1
    while(n<=ncol(secondpass)){
    product<-product*Bostondummy[,secondpass[k,n]]
    n<-n+1
    }
    if(!(firstpass[j] %in% secondpass[k,])){
    product<-product*Bostondummy[,firstpass[j]]
    support<-mean(product,na.rm=TRUE)
    if(support>support_thresh){
    thirdpass<-rbind(thirdpass,c(secondpass[k,],firstpass[j]))
    }
    }
    j<-j+1
    }
    k<-k+1
    }
    
    
    thirdpass_conf<-NULL
    k<-1
    while(k<=nrow(thirdpass)){
    
    support<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
    confidence<-mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]]*Bostondummy[,thirdpass[k,3]],na.rm=TRUE)/mean(Bostondummy[,thirdpass[k,1]]*Bostondummy[,thirdpass[k,2]],na.rm=TRUE)
    
    lift<-confidence/mean(Bostondummy[,thirdpass[k,3]],na.rm=TRUE)
    
    thirdpass_conf<-rbind(thirdpass_conf,unname(c(thirdpass[k,],support,confidence,lift)))
    k<-k+1
    }
  15. Our final output is the list of three-item baskets that have passed the support, confidence, and lift thresholds:

    print(head(thirdpass_conf))

    The output is as follows:

    Figure 4.20: Output of the three-item basket