2. Hierarchical Clustering | The Unsupervised Learning Workshop


The Unsupervised Learning Workshop

By: Aaron Jones, Richard Brooker, John Wesley Doyle, Priyanjit Ghosh, Sani Kamal, Ashish Pratik Patil, Philip Solomon, Geetank Raipuria, Christopher Kruger, Benjamin Johnston


Overview of this book

Do you find it difficult to understand how popular companies like WhatsApp and Amazon find valuable insights from large amounts of unorganized data? The Unsupervised Learning Workshop will give you the confidence to deal with cluttered and unlabeled datasets, using unsupervised algorithms in an easy and interactive manner. The book starts by introducing the most popular clustering algorithms of unsupervised learning. You'll find out how hierarchical clustering differs from k-means, along with understanding how to apply DBSCAN to highly complex and noisy data. Moving ahead, you'll use autoencoders for efficient data encoding. As you progress, you’ll use t-SNE models to extract high-dimensional information into a lower dimension for better visualization, in addition to working with topic modeling for implementing natural language processing (NLP). In later chapters, you’ll find key relationships between customers and businesses using Market Basket Analysis, before going on to use Hotspot Analysis for estimating the population density of an area. By the end of this book, you’ll be equipped with the skills you need to apply unsupervised algorithms on cluttered datasets to find useful patterns and insights.


Clustering Refresher

Chapter 1, Introduction to Clustering, covered both the high-level concepts and the in-depth details of one of the most basic clustering algorithms: k-means. Although it is a simple approach, do not discount it; it will be a valuable addition to your toolkit as you continue to explore unsupervised learning. In many real-world use cases, companies make valuable discoveries with the simplest methods, such as k-means or, for supervised learning, linear regression. Consider a large collection of customer data: if you inspected it directly as a table, you would be unlikely to find anything helpful. However, even a simple clustering algorithm can identify which groups within the data are similar and which are dissimilar. As a refresher, let's quickly walk through what clusters are and how k-means works to find them.

Figure 2.1: The attributes that separate supervised and unsupervised problems

If you were given a random collection of data without any guidance, you would probably start your exploration with basic statistics, for example, the mean, median, and mode of each feature. Whether you then choose supervised or unsupervised learning to derive insights depends on the goals you have set for the data. If you determined that one of the features was actually a label and you wanted to see how the remaining features influence it, you would have a supervised learning problem. However, if, after initial exploration, you realized that the data is simply a collection of features without a target in mind (such as a collection of health metrics, purchase invoices from a web store, and so on), then you could analyze it with unsupervised methods.
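As a minimal sketch of that first exploratory step, the snippet below computes the mean, median, and mode of each feature using Python's standard `statistics` module. The feature names and values are hypothetical and were invented purely for illustration; they do not come from the book's datasets.

```python
import statistics

# Hypothetical feature columns, invented for illustration only.
features = {
    "items_per_order": [1, 2, 2, 3, 2, 5, 1],
    "order_total": [12.5, 30.0, 30.0, 45.0, 27.5, 80.0, 15.0],
}

# Compute the three basic statistics for every feature.
summary = {
    name: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }
    for name, values in features.items()
}

for name, stats in summary.items():
    print(name, stats)
```

None of these columns is treated as a label here, which is what would make a follow-up analysis of this data an unsupervised problem.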

A classic example of unsupervised learning is finding clusters of similar customers in a collection of invoices from a web store. Your hypothesis is that by finding out which customers are the most similar, you can create more granular marketing campaigns that appeal to each cluster's interests. One way to find these clusters of similar users is k-means.

The k-means Refresher

k-means clustering finds k clusters in your data by measuring the distance between points, most commonly with the Euclidean metric, although Manhattan, Hamming, Minkowski, and other metrics can also be used. First, k points (called centroids) are randomly initialized in your data, and the distance from each data point to each centroid is calculated. Each data point is assigned to the cluster of its nearest centroid. Once every point has been assigned, the mean of the points in each cluster is calculated and becomes that cluster's new centroid. The assignment and update steps are repeated until the centroids no longer change position or until the maximum number of iterations is reached.
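The loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the book's implementation: the function name `kmeans`, its parameters, and the sample points are assumptions made for this example. It uses Euclidean distance, picks the initial centroids from the data points themselves, and leaves an empty cluster's centroid where it was.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points at random as centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop early once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

With data this cleanly separated, the first three points end up in one cluster and the last three in the other. On less tidy data the result can depend on the random initialization, which is why the sketch exposes a `seed` parameter.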
