The Unsupervised Learning Workshop
Do you find it difficult to understand how popular companies like WhatsApp and Amazon find valuable insights from large amounts of unorganized data? The Unsupervised Learning Workshop will give you the confidence to deal with cluttered and unlabeled datasets, using unsupervised algorithms in an easy and interactive manner.
The book starts by introducing the most popular clustering algorithms of unsupervised learning. You'll find out how hierarchical clustering differs from k-means, along with understanding how to apply DBSCAN to highly complex and noisy data. Moving ahead, you'll use autoencoders for efficient data encoding.
As you progress, you'll use t-SNE models to extract high-dimensional information into a lower dimension for better visualization, in addition to working with topic modeling for implementing Natural Language Processing. In later chapters, you'll find key relationships between customers and businesses using Market Basket Analysis, before going on to use Hotspot Analysis for estimating the population density of an area.
By the end of this book, you'll be equipped with the skills you need to apply unsupervised algorithms on cluttered datasets to find useful patterns and insights.
If you are a data scientist who is just getting started and want to learn how to implement machine learning algorithms to build predictive models, then this book is for you. To expedite the learning process, a solid understanding of the Python programming language is recommended, as you'll be editing classes and functions instead of creating them from scratch.
Chapter 1, Introduction to Clustering, introduces clustering (the most well-known family of unsupervised learning algorithms), before digging into the simplest and most popular clustering algorithm—k-means.
Chapter 2, Hierarchical Clustering, covers another clustering technique, hierarchical clustering, and explains how it differs from k-means. The chapter teaches you two main approaches to this type of clustering: agglomerative and divisive.
Chapter 3, Neighborhood Approaches and DBSCAN, explores clustering approaches that involve neighbors. Unlike the two other clustering approaches, the neighborhood approaches allow outlier points that are not assigned to any particular cluster.
Chapter 4, Dimensionality Reduction and PCA, teaches you how to navigate large feature spaces by leveraging principal component analysis to reduce the number of features while maintaining the explanatory power of the whole feature space.
Chapter 5, Autoencoders, shows you how neural networks can be leveraged to find data encodings. Data encodings are like combinations of features that reduce the dimensionality of the feature space. Autoencoders also decode the data and put it back into its original form.
Chapter 6, t-Distributed Stochastic Neighbor Embedding, discusses the process of reducing high-dimensional datasets down to two or three dimensions for the purpose of visualization. Unlike PCA, t-SNE is a non-linear, probabilistic model.
Chapter 7, Topic Modeling, explores the fundamental methodology of natural language processing. You will learn how to work with text data and fit Latent Dirichlet Allocation and Non-negative Matrix Factorization models to tag topics relevant to the text.
Chapter 8, Market Basket Analysis, explores a classic analytical technique used in retail businesses. You will, in a scalable way, build association rules that explain the relationships between groups of items.
Chapter 9, Hotspot Analysis, teaches you to estimate the true population density of some random variable using sample data. This technique is applicable to many fields, including epidemiology, weather, crime, and demography.
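To make the idea of estimating a density from sample data concrete, here is a minimal sketch of one common approach, Gaussian kernel density estimation, written with only the standard library. The choice of technique, the function name gaussian_kde, the sample values, and the bandwidth are all assumptions made for illustration and are not taken from the chapter.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from 1-D samples."""
    # Normalizing constant so that the estimate integrates to 1.
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return total / norm

    return density


# Made-up sample values; the bandwidth of 0.5 is an arbitrary choice.
estimate = gaussian_kde([1.0, 1.2, 0.8, 3.0], bandwidth=0.5)

# The estimate is higher near the cluster of samples around 1.0.
print(estimate(1.0) > estimate(2.0))
```

A smaller bandwidth gives a spikier estimate that follows individual samples, while a larger one smooths them together.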
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"Plot the coordinate points using the scatterplot functionality we imported from matplotlib.pyplot."
Words that you see on the screen (for example, in menus or dialog boxes) appear in the same format.
A block of code is set as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist
seeds = pd.read_csv('Seed_Data.csv')
New terms and important words are shown like this:
"Unsupervised learning is the field of practice that helps find patterns in cluttered data and is one of the most exciting areas of development in machine learning today."
Long code snippets are truncated and the corresponding names of the code files on GitHub are placed at the top of the truncated code. The permalinks to the entire code are placed below the code snippet. It should look as follows:
Exercise1.04-Exercise1.05.ipynb
def k_means(X, K):
    # Keep track of history so you can see K-Means in action
    centroids_history = []
    labels_history = []
    rand_index = np.random.choice(X.shape[0], K)
    centroids = X[rand_index]
    centroids_history.append(centroids)
The complete code for this step can be found at https://packt.live/2JM8Q1S.
Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.
For example:
history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)
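As a small illustration of this continuation rule, the sketch below writes the same call twice: once split with a backslash, and once relying on the implicit continuation that Python allows inside parentheses. The function describe_fit is a hypothetical stand-in that only echoes its arguments; it is not part of the book's code or of any library.

```python
def describe_fit(epochs, batch_size, validation_split=0.0, shuffle=True):
    """Hypothetical stand-in for a training call; it only echoes its arguments."""
    return (epochs, batch_size, validation_split, shuffle)


# Split with a backslash: Python joins the two physical lines into one.
with_backslash = describe_fit(100, 5, \
                              validation_split=0.2, shuffle=False)

# Inside parentheses the backslash is optional; a plain line break also works.
without_backslash = describe_fit(100, 5,
                                 validation_split=0.2, shuffle=False)

print(with_backslash == without_backslash)
```

Both forms produce the same call, so the backslash in the book's listings is a layout choice and does not change behavior.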
Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:
# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])
Multi-line comments are enclosed by triple quotes, as shown below:
"""
Define a seed for the random number generator to ensure the
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.set_seed(seed)
Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.
For the optimal user experience, we recommend 8 GB RAM.
The following section will help you to install Python in Windows, macOS, and Linux systems.
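Here are the steps to install Python on Linux:
1. Open the Terminal and check whether Python 3 is already installed by running python3 --version.
2. If it is not installed, run the following commands:
sudo apt-get update
sudo apt-get install python3.7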
Here are the steps to install Python on macOS:
1. Open the Terminal, for example by typing terminal in the open search box and hitting Enter.
2. Install the Xcode command-line tools by running xcode-select --install.
3. Install Homebrew by running ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)".
4. Add Homebrew to your PATH environment variable. Open your profile in the command line by running sudo nano ~/.profile and insert export PATH="/usr/local/opt/python/libexec/bin:$PATH" at the bottom.
5. Install Python by running brew install python.
Python does not come with pip (the package manager for Python) pre-installed, so we need to install it manually. Once pip is installed, the remaining libraries can be installed as mentioned in the Installing Libraries section. The steps to install pip are as follows:
1. Download the get-pip.py file.
2. Go to the folder that contains get-pip.py. Open the command line in that folder (Bash for Linux users and Terminal for Mac users).
3. Run the following command:
python get-pip.py
Please note that you should have Python installed before executing this command.
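One way to confirm that a suitable interpreter is available is to ask Python itself. The following is a minimal sketch; the helper name python_is_recent_enough is hypothetical, and treating 3.7 as the minimum version is an assumption based on the install commands shown in this preface.

```python
import sys


def python_is_recent_enough(minimum=(3, 7)):
    """Return True when the running interpreter is at least `minimum`."""
    # Compare only (major, minor), for example (3, 7).
    return sys.version_info[:2] >= minimum


print("Running Python", ".".join(str(part) for part in sys.version_info[:3]))
print("Meets the 3.7 requirement:", python_is_recent_enough())
```

If the script does not start at all, Python is not installed or not on your PATH.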
Once pip is installed, you can install the desired libraries. To install pandas, you can simply execute pip install pandas. To install a specific version of a library, for example, version 0.24.2 of pandas, you can execute pip install pandas==0.24.2.
Anaconda is a Python package manager that allows you to easily install and use the libraries needed for this course.
On Linux, you can download the Anaconda installer from the command line using the curl or wget retrieval tools. The example here shows how to use curl to retrieve the file located at the URL you found on the Anaconda download page:
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
Then run the downloaded script with bash:
bash Anaconda3-2019.03-Linux-x86_64.sh
Running the preceding command will move you to a very user-friendly installation process. You will be prompted on where you want to install Anaconda and how you wish Anaconda to work. In this case, you should just keep all the standard settings.
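Alternatively, click the Download button for the Anaconda installer on the Anaconda download page.
Once Anaconda is installed, create an environment from the command line:
conda create --name my_packt_env python=3.7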
Here, we are naming our environment my_packt_env and specifying the version of Python to be 3.7. This way, you can keep several environments with different Python versions on the same machine, each isolated from the others.
Activate the environment using the activate command:
conda activate my_packt_env
That's it. You are now in your own customized environment that will allow you to install packages as needed for your projects. To exit your environment, you can simply use the conda deactivate command.
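To check from inside Python which environment is in use, a sketch like the following can help. It reads CONDA_DEFAULT_ENV, the environment variable that conda sets when an environment is activated; the helper name active_conda_env is hypothetical, and the lookup returns None when no conda environment is active.

```python
import os
import sys


def active_conda_env(environ=os.environ):
    """Return the active conda environment name, or None outside conda."""
    return environ.get("CONDA_DEFAULT_ENV")


print("Active conda environment:", active_conda_env())
print("Interpreter location:", sys.prefix)
```

After conda activate my_packt_env, the first line should report my_packt_env and the interpreter location should point inside that environment's folder.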
pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install -r requirements.txt. You can find the requirements.txt file at https://packt.live/2CnpCEp.
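After installing, it can be useful to confirm which version of a library actually ended up in the environment. The sketch below uses importlib.metadata from the standard library, which is available from Python 3.8 onward, so it assumes a newer interpreter than the 3.7 used elsewhere in this preface. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("pandas", "numpy"):
    found = installed_version(name)
    print(name, "->", found if found is not None else "not installed")
```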
The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.
In Chapter 9, Hotspot Analysis, the basemap module from mpl_toolkits is used to generate maps. This library can be difficult to install. The easiest way is to install Anaconda, which includes mpl_toolkits. Once Anaconda is installed, basemap can be installed using conda install basemap. If you want to avoid installing libraries repeatedly, and instead want to install them all at once, you can follow the instructions in the next section.
If you install dependencies chapter by chapter, you may end up with library versions that differ from the ones used in the book. To keep your system in sync, we provide a requirements.txt file that contains the versions of the libraries used. Once you have installed the libraries from this file, you don't have to install any other libraries throughout the book. Assuming you have installed Anaconda by now, you can follow these steps:
1. Download the requirements.txt file from GitHub.
2. Go to the folder where requirements.txt is placed and open Command Prompt (Bash for Linux and Terminal for Mac).
3. Execute the following command:
conda install --yes --file requirements.txt --channel conda-forge
It should install all the packages necessary for the coding activities in the book.
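To see which versions a requirements file pins, a short script can read it line by line. This is a simplified sketch that only understands lines of the form name==version and ignores everything else; the helper name parse_pins is hypothetical, and the sample contents are made up for illustration rather than copied from the book's requirements.txt.

```python
def parse_pins(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, pinned = line.split("==", 1)
        pins[name.strip().lower()] = pinned.strip()
    return pins


# Made-up sample contents, not the book's actual requirements.txt.
sample = """
# illustrative pins only
pandas==0.24.2
numpy==1.16.4
matplotlib
"""

print(parse_pins(sample))
```

To inspect a real file, pass its contents instead, for example parse_pins(open("requirements.txt").read()).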
You can find the complete code files of this book at https://packt.live/34kXeMw. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/2ZMUWW0.
We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.
If you have any issues or questions about installation, please email us at [email protected].