Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.
Table of Contents (9 chapters)

Basic Terminology of Probability Distributions


There are two families of methods in statistics: parametric and non-parametric. Non-parametric methods make no assumption about the shape of the distribution the data comes from. Parametric methods, by contrast, assume that the data follows a particular distribution shape, and the specifics of that shape are described by parameters. The following are the two main parameters that you should be aware of:

  • Mean: This is the average of all values in the distribution.

  • Standard Deviation: This is a measure of how widely the values are spread around the mean of a distribution.
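
As a minimal sketch of these two parameters, the following R snippet computes them for a small, made-up vector. The `values` vector and the variable names are assumptions chosen for illustration. Note that R's built-in `sd()` uses the sample formula (dividing by n - 1), so the population version is computed by hand for comparison.

```r
# Hypothetical sample values, chosen only for illustration
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Mean: the average of all values
avg <- mean(values)

# Standard deviation: sd() uses the sample formula (divides by n - 1)
sample_sd <- sd(values)

# Population standard deviation (divides by n), computed by hand
pop_sd <- sqrt(mean((values - avg)^2))

print(avg)
print(sample_sd)
print(pop_sd)
```

The two standard deviations differ slightly on small samples; the gap shrinks as the number of values grows.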

Most parametric methods in statistics depend in some way on the mean and the standard deviation. The parametric distributions that we're going to study in this chapter are as follows:

  • Uniform distributions

  • Normal distributions

  • Log-normal distributions

  • Binomial distributions

  • Poisson distributions

  • Pareto distributions
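
To get a feel for how these six distributions differ, one option is to draw a sample from each and compare the mean and standard deviation. The sketch below uses base R's random generators. All parameter values, the seed, and the sample size are arbitrary assumptions. Base R has no built-in Pareto sampler, so the Pareto draw here uses inverse transform sampling instead, with hypothetical `x_min` and `alpha` values.

```r
set.seed(42)  # arbitrary seed so the draws are reproducible
n <- 1000     # hypothetical sample size

samples <- list(
  uniform    = runif(n, min = 0, max = 1),
  normal     = rnorm(n, mean = 0, sd = 1),
  log_normal = rlnorm(n, meanlog = 0, sdlog = 1),
  binomial   = rbinom(n, size = 10, prob = 0.5),
  poisson    = rpois(n, lambda = 3)
)

# Base R has no Pareto sampler, so this uses inverse transform sampling:
# if U ~ Uniform(0, 1), then x_min / U^(1 / alpha) follows a Pareto distribution.
x_min <- 1
alpha <- 3
samples$pareto <- x_min / runif(n)^(1 / alpha)

# Compare the mean and standard deviation of each sample
summary_table <- data.frame(
  mean = sapply(samples, mean),
  sd   = sapply(samples, sd)
)
print(round(summary_table, 3))
```

Calling `hist(samples$normal)` or `hist(samples$pareto)` afterwards gives a quick visual check of each shape.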

Uniform Distribution

In the uniform distribution, all values within an interval, say [a, b], are equally probable...
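
The equal-probability property can be checked with base R's `dunif()` and `punif()` functions. This is a small sketch in which the bounds `a = 2` and `b = 6` are hypothetical: the density is the same constant, 1 / (b - a), at every point inside the interval and zero outside it.

```r
a <- 2   # hypothetical lower bound
b <- 6   # hypothetical upper bound

# The density is constant, 1 / (b - a), everywhere inside [a, b]
density_inside <- dunif(c(2.5, 4, 5.5), min = a, max = b)

# Outside the interval the density is zero
density_outside <- dunif(c(1, 7), min = a, max = b)

# The probability of a sub-interval depends only on its width
p_3_to_4 <- punif(4, min = a, max = b) - punif(3, min = a, max = b)

print(density_inside)
print(density_outside)
print(p_3_to_4)
```

With these bounds, every sub-interval of width 1 inside [2, 6] carries the same probability, whichever part of the interval it covers.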