Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.

Chapter 5: Data Comparison Methods


Activity 11: Create an Image Signature for a Photograph of a Person

Solution:

  1. Download the Borges photo to your computer and save it as borges.jpg. Make sure that it is saved in R's working directory. If it is not in R's working directory, then change R's working directory using the setwd() function. Then, you can load this image into a variable called im (short for image), as follows:

    install.packages('imager')
    library('imager')
    filepath<-'borges.jpg'
    im <- imager::load.image(file =filepath) 

    The rest of the code we will explore will use this image, called im. Here, we have loaded the Borges photograph into im. However, you can run the rest of the code on any image, simply by saving the image to your working directory and specifying its path in the filepath variable.

  2. The signature we are developing is meant to be used for grayscale images. So, we will convert this image to grayscale, using functions in the imager package:

    im<-imager::rm.alpha(im)
    im<-imager::grayscale(im)
    im<-imager::imsplit(im,axis = "x", nb = 10)   

    The first line removes the alpha (transparency) channel, if the image has one. The second line is the conversion to grayscale. The last line splits the image along the x axis into 10 equal sections.

  3. The following code creates an empty matrix that we will fill with information about each section of our 10x10 grid:

    matrix <- matrix(nrow = 10, ncol = 10)

    Next, we will run the following loop. The first line of this loop uses the imsplit command, which we used previously to split the image along the x axis into 10 equal parts. This time, for each of the 10 x-axis sections, we split along the y axis, also into 10 equal parts. The inner loop then stores the mean brightness of each of the resulting 100 cells in the matrix:

    for (i in 1:10) {
      is <- imager::imsplit(im = im[[i]], axis = "y", nb = 10)
      for (j in 1:10) {
        matrix[j,i] <- mean(is[[j]])
      }
    }

    The output so far is the matrix variable. We will use this in Step 4.

  4. Get the signature of the Borges photograph by running the following code:

    borges_signature<-get_signature(matrix)
    borges_signature

    The output is as follows:

    Figure 5.12: Matrix of borges_signature

  5. Next, we will calculate a signature using a 9x9 matrix, instead of a 10x10 matrix. We start with the same process we used before. The following lines of code load our Borges image as we did previously. The final line of this code splits the image into equal parts, but instead of 10 equal parts, we set nb=9 so that we split the image into 9 equal parts:

    filepath<-'borges.jpg'
    im <- imager::load.image(file =filepath) 
    im<-imager::rm.alpha(im)
    im<-imager::grayscale(im)
    im<-imager::imsplit(im,axis = "x", nb = 9)

  6. The following code creates an empty matrix that we will fill with information about each section of our 9x9 grid:

    matrix <- matrix(nrow = 9, ncol = 9)

    Note that we use nrow=9 and ncol=9 so that we have a 9x9 matrix to fill with our brightness measurements.

  7. Next, we will run the following loop. The first line of this loop uses the imsplit command. This command was also used earlier to split the x axis into 9 equal parts. This time, for each of the 9 x axis splits, we will do a split along the y axis, also splitting it into 9 equal parts:

    for (i in 1:9) {
      is <- imager::imsplit(im = im[[i]], axis = "y", nb = 9)
      for (j in 1:9) {
        matrix[j,i] <- mean(is[[j]])
      }
    }

    The output so far is the matrix variable. We will use this in Step 8, which repeats Step 4 for the 9x9 grid.

  8. Get a 9x9 signature of the Borges photograph by running the following code:

    borges_signature_ninebynine<-get_signature(matrix)
    borges_signature_ninebynine

    The output is as follows:

    Figure 5.13: Matrix of borges_signature_ninebynine
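
The grid-averaging loops in this activity can be illustrated without an image file. The following is a minimal sketch in base R, not the imager code used above: grid_means is a hypothetical helper, and the toy matrix stands in for a grayscale image with at least nb rows and nb columns. It only mirrors the idea of splitting along x, then along y, and storing the mean brightness of each cell; imager's imsplit may place its section boundaries differently.

```r
# Hypothetical helper (not part of imager): mean brightness of each cell
# in an nb x nb grid laid over a plain numeric matrix.
# Rows of `img` play the role of the y axis, columns the x axis.
grid_means <- function(img, nb) {
  # Assign every row and column index to one of nb equal-width bins
  row_bin <- cut(seq_len(nrow(img)), breaks = nb, labels = FALSE)
  col_bin <- cut(seq_len(ncol(img)), breaks = nb, labels = FALSE)
  out <- matrix(NA_real_, nrow = nb, ncol = nb)
  for (i in seq_len(nb)) {        # sections along x
    for (j in seq_len(nb)) {      # sections along y
      out[j, i] <- mean(img[row_bin == j, col_bin == i])
    }
  }
  out
}

# Toy 4 x 4 "image": dark left half (0), bright right half (1)
toy <- cbind(matrix(0, 4, 2), matrix(1, 4, 2))
brightness <- grid_means(toy, nb = 2)
brightness
```

As in the activity, entry [j, i] holds the mean of the j-th section along y within the i-th section along x, so the left column of brightness describes the dark half of the toy image and the right column the bright half.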

Activity 12: Create an Image Signature for the Watermarked Image

Solution:

  1. Download the watermarked photo to your computer and save it as alamo_marked.jpg. Make sure that it is saved in R's working directory. If it is not in R's working directory, then change R's working directory using the setwd() function. Then, you can load this image into a variable called im (short for image), as follows:

    install.packages('imager')
    library('imager')
    filepath<-'alamo_marked.jpg'
    im <- imager::load.image(file =filepath) 

    The rest of the code we will explore will use this image called im. Here, we have loaded a watermarked picture of the Alamo into im. However, you can run the rest of the code on any image, simply by saving the image to your working directory, and specifying its path in the filepath variable.

  2. The signature we are developing is meant to be used for grayscale images. So, we will convert this image to grayscale by using functions in the imager package:

    im<-imager::rm.alpha(im)
    im<-imager::grayscale(im)
    im<-imager::imsplit(im,axis = "x", nb = 10)   

    The first line removes the alpha (transparency) channel, if the image has one. The second line is the conversion to grayscale. The last line splits the image along the x axis into 10 equal sections.

  3. The following code creates an empty matrix that we will fill with information about each section of our 10x10 grid:

    matrix <- matrix(nrow = 10, ncol = 10)

    Next, we will run the following loop. The first line of this loop uses the imsplit command. This command was also used earlier to split the x axis into 10 equal parts. This time, for each of the 10 x-axis splits, we will do a split along the y axis, also splitting it into 10 equal parts:

    for (i in 1:10) {
      is <- imager::imsplit(im = im[[i]], axis = "y", nb = 10)
      for (j in 1:10) {
        matrix[j,i] <- mean(is[[j]])
      }
    }

    The output so far is the matrix variable. We will use this in Step 4.

  4. We can get the signature of the watermarked photograph by running the following code:

    watermarked_signature<-get_signature(matrix)
    watermarked_signature

    The output is as follows:

    Figure 5.14: Signature of watermarked image

    The final output of this activity is the watermarked_signature variable, which is the analytic signature of the watermarked Alamo photo. If you have completed all of the exercises and activities so far, then you should have three analytic signatures: one called building_signature, one called borges_signature, and one called watermarked_signature.

  5. Now that the signature is stored in watermarked_signature, we can compare it to our original Alamo signature, building_signature, as follows:

    comparison<-mean(abs(watermarked_signature-building_signature))
    comparison

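    In this case, the result we get is 0.015. Because this is the mean absolute difference between corresponding signature entries, a value this close to 0 indicates a very close match between the original image signature and this new image's signature.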

What we have seen is that our analytic signature method returns similar signatures for similar images, and different signatures for different images. This is exactly what we want a signature to do, and so we can judge this method a success.
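
To make the comparison in step 5 concrete, here is a small sketch on made-up matrices. The names sig_a, sig_b, and signature_distance are hypothetical, and the values are invented for illustration; they are not the output of get_signature on any real image.

```r
# Invented stand-ins for two image signatures of the same size
sig_a <- matrix(c(1, 0, -1,
                  0, 1,  1,
                 -1, 0,  0), nrow = 3)
sig_b <- sig_a
sig_b[2, 2] <- 0   # change a single entry (from 1 to 0)

# Hypothetical helper: mean absolute difference between two signatures
signature_distance <- function(x, y) {
  stopifnot(all(dim(x) == dim(y)))  # signatures must have the same shape
  mean(abs(x - y))
}

d_same  <- signature_distance(sig_a, sig_a)  # identical signatures
d_close <- signature_distance(sig_a, sig_b)  # one of nine entries differs by 1
d_same
d_close
```

Identical signatures give a distance of exactly 0, and the distance grows as more entries disagree, which is why a small value is read as a close match.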

Activity 13: Performing Factor Analysis

Solution:

  1. The data file can be downloaded from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson05/Data/factor.csv. Save it to your computer and make sure that it is in R's working directory. If you save it as factor.csv, then you can load it in R by executing the following command:

    factor<-read.csv('factor.csv')

  2. Load the psych package as follows:

    library(psych)

  3. We will be performing factor analysis on the user ratings, which are recorded in columns 2 through 11 of the data. We can select these columns as follows:

    ratings<-factor[,2:11]

  4. Create a correlation matrix of the ratings data as follows:

    ratings_cor<-cor(ratings)

  5. Determine the number of factors we should use by creating a scree plot. A scree plot is produced as one of the outputs of the following command:

    parallel <- fa.parallel(ratings_cor, fm = 'minres', fa = 'fa')

  6. The scree plot looks like the following:

    Figure 5.15: Parallel Analysis Scree Plots

    The scree plot shows one factor whose eigenvalue is much higher than the others. While we are free to choose any number of factors in our analysis, the single factor that is much larger than the others provides good reason to use one factor in our analysis.

  7. We can perform factor analysis as follows, specifying the number of factors in the nfactors parameter:

    factor_analysis<-fa(ratings_cor, nfactors=1)

    This stores the results of our factor analysis in a variable called factor_analysis.

  8. We can examine the results of our factor analysis as follows:

    print(factor_analysis)

    The output looks as follows:

    Figure 5.16: Result of factor analysis

    The numbers under MR1 show us the factor loadings for each category for our single factor. Since we have only one explanatory factor, all of the categories that have positive loadings on this factor are positively correlated with each other. We could interpret this factor as general positivity, since it would indicate that if people rate one category highly, they will also rate other categories highly, and if they rate one category poorly, they are likely to rate other categories poorly.

The only major exception to this rule is Category 10, which records users' average ratings of religious institutions. In this case, the factor loading is large and negative. This indicates that people who rate most other categories highly tend to rate religious institutions poorly, and vice versa. So, maybe we can instead interpret the positivity factor we have found as positivity about recreational activities, since religious institutions are arguably not places for recreation but rather for worship. It seems that, in this dataset, those who are positive about recreational activities are negative about worship, and vice versa. For the categories whose factor loadings are close to 0, we can also conclude that the rule about positivity about recreation holds less strongly. You can see that factor analysis has enabled us to find relationships between the variables in our data that we had not previously suspected.
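
The pattern described above, one dominant factor with a single negatively loaded category, can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not a rerun of the activity: the data are simulated rather than read from factor.csv, the column names cat1 to cat4 are invented, and it uses factanal from base R (maximum likelihood) in place of psych's fa (minres), so the numbers will not match Figure 5.16.

```r
set.seed(1)
n <- 500
latent <- rnorm(n)   # one hidden "positivity" factor

# Simulated ratings: three categories move with the factor, one against it
sim <- data.frame(
  cat1 =  0.8 * latent + rnorm(n, sd = 0.6),
  cat2 =  0.7 * latent + rnorm(n, sd = 0.7),
  cat3 =  0.6 * latent + rnorm(n, sd = 0.8),
  cat4 = -0.7 * latent + rnorm(n, sd = 0.7)
)

# Eigenvalues of the correlation matrix: the quantities a scree plot displays
eigenvalues <- eigen(cor(sim))$values
eigenvalues

# One-factor model fitted by maximum likelihood
fit <- factanal(sim, factors = 1)
loads <- unclass(fit$loadings)[, 1]   # named vector of factor loadings
loads
```

The first eigenvalue stands well above the rest, which is the kind of evidence the scree plot gave for using a single factor. The loading for cat4 has the opposite sign to the other three, matching the role Category 10 plays in the activity. The overall sign of a factor is arbitrary, so only the relative signs of the loadings carry meaning.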