Applied Unsupervised Learning with Python

By Benjamin Johnston, Aaron Jones, Christopher Kruger

Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 9: Hotspot Analysis


Activity 21: Estimating Density in One Dimension

Solution:

  1. Open a new notebook and import all the necessary libraries:

    get_ipython().run_line_magic('matplotlib', 'inline')
    
    import matplotlib.pyplot as plt
    import numpy
    import pandas
    import seaborn
    import sklearn.datasets
    import sklearn.model_selection
    import sklearn.neighbors
    
    seaborn.set()
  2. Sample 1,000 data points from the standard normal distribution. Add 3.5 to each of the last 625 values of the sample (that is, indices 375 through 999). To do this, set a random state of 100 using numpy.random.RandomState to guarantee the same sampled values, and then generate the data points with the randn(1000) call:

    rand = numpy.random.RandomState(100)
    vals = rand.randn(1000)  # standard normal
    vals[375:] += 3.5
  3. Plot the 1,000-point sample data as a histogram and add a scatterplot below it:

    fig, ax = plt.subplots(figsize=(14, 10))
    ax.hist(vals, bins=50, density=True, label='Sampled Values')
    ax.plot(vals, -0.005 - 0.01 * numpy.random.random(len(vals)), '+k', label='Individual Points')
    ax.legend(loc='upper right')

    The output is as follows:

    Figure 9.29: A histogram of the random sample with a scatterplot underneath

  4. Define a grid of bandwidth values. Then, define and fit a grid search cross-validation algorithm:

    bandwidths = 10 ** numpy.linspace(-1, 1, 100)
    
    grid = sklearn.model_selection.GridSearchCV(
        estimator=sklearn.neighbors.KernelDensity(kernel="gaussian"),
        param_grid={"bandwidth": bandwidths},
        cv=10
    )
    grid.fit(vals[:, None])
  5. Extract the optimal bandwidth value:

    best_bandwidth = grid.best_params_["bandwidth"]
    
    print(
        "Best Bandwidth Value: {}"
        .format(best_bandwidth)
    )
  6. Replot the histogram from Step 3 and overlay the estimated density:

    fig, ax = plt.subplots(figsize=(14, 10))
    
    ax.hist(vals, bins=50, density=True, alpha=0.75, label='Sampled Values')
    
    x_vec = numpy.linspace(-4, 8, 10000)[:, numpy.newaxis]
    # score_samples returns log-densities, so exponentiate to recover the density
    estimated_density = numpy.exp(grid.best_estimator_.score_samples(x_vec))
    ax.plot(
        x_vec[:, 0], estimated_density,
        '-', linewidth=4, label='Kernel = Gaussian'
    )
    
    ax.legend(loc='upper right')

    The output is as follows:

    Figure 9.30: A histogram of the random sample with the optimal estimated density overlaid
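
    As an optional follow-up that is not part of the original activity, the fitted estimator stored in grid.best_estimator_ can be reused directly. The sketch below assumes the variables from Steps 2 to 5 are still in the notebook's namespace; it checks that the selected bandwidth is not pinned to an edge of the search grid, evaluates the density at a few new points, and draws a small synthetic sample from the estimate:

    # Sanity check: the best bandwidth should fall strictly inside the grid;
    # a value at either edge suggests the search grid should be widened
    print("Grid range: [{:.3f}, {:.3f}]".format(bandwidths.min(), bandwidths.max()))
    print("Selected bandwidth: {:.3f}".format(best_bandwidth))

    # score_samples returns log-densities, so exponentiate to get densities
    new_points = numpy.array([[-2.0], [0.0], [3.5]])
    print(numpy.exp(grid.best_estimator_.score_samples(new_points)))

    # the fitted KernelDensity model can also generate synthetic data
    print(grid.best_estimator_.sample(5, random_state=100))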

Activity 22: Analyzing Crime in London

Solution:

  1. Load the crime data. Use the path where you saved the downloaded directory, create a list of the year-month tags, use the read_csv command to load the individual files iteratively, and then concatenate these files together:

    base_path = (
        "~/Documents/packt/unsupervised-learning-python/"
        "lesson-9-hotspot-models/metro-jul18-dec18/"
        "{yr_mon}/{yr_mon}-metropolitan-street.csv"
    )
    
    print(base_path)
    
    yearmon_list = [
        "2018-0" + str(i) if i <= 9 else "2018-" + str(i) 
        for i in range(7, 13)
    ]
    
    print(yearmon_list)
    
    data_yearmon_list = []
    
    for idx, i in enumerate(yearmon_list):
        df = pandas.read_csv(
            base_path.format(yr_mon=i), 
            header=0
        )
        
        data_yearmon_list.append(df)
        
        if idx == 0:
            print("Month: {}".format(i))
            print("Dimensions: {}".format(df.shape))
            print("Head:\n{}\n".format(df.head(2)))
    
    london = pandas.concat(data_yearmon_list)

    The output is as follows:

    Figure 9.31: An example of one of the individual crime files

    This printed information is just for the first of the loaded files, which contains the crime data recorded by the Metropolitan Police Service for July 2018. This one file alone has nearly 100,000 entries. There is a great deal of interesting information in this dataset, but we will focus on Longitude, Latitude, Month, and Crime type.
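
    Before subsetting the data in the following steps, it can be worth confirming that the four columns we will rely on are present and reasonably complete. The sketch below is an optional check that is not part of the original solution and uses the concatenated london DataFrame created above:

    # Optional check: confirm the four columns of interest exist and
    # count any missing values before subsetting
    cols_of_interest = ["Month", "Longitude", "Latitude", "Crime type"]
    print(london[cols_of_interest].dtypes)
    print(london[cols_of_interest].isnull().sum())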

  2. Print diagnostics of the complete (six months) and concatenated dataset:

    print(
        "Dimensions - Full Data:\n{}\n"
        .format(london.shape)
    )
    print(
        "Unique Months - Full Data:\n{}\n"
        .format(london["Month"].unique())
    )
    print(
        "Number of Unique Crime Types - Full Data:\n{}\n"
        .format(london["Crime type"].nunique())
    )
    print(
        "Unique Crime Types - Full Data:\n{}\n"
        .format(london["Crime type"].unique())
    )
    print(
        "Count Occurrences Of Each Unique Crime Type - Full Type:\n{}\n"
        .format(london["Crime type"].value_counts())
    )

    The output is as follows:

    Figure 9.32: Descriptors of the full crime dataset

  3. Subset the DataFrame down to four variables (Longitude, Latitude, Month, and Crime type):

    london_subset = london[["Month", "Longitude", "Latitude", "Crime type"]]
    london_subset.head(5)

    The output is as follows:

    Figure 9.33: Crime data in DataFrame form subset down to the Longitude, Latitude, Month, and Crime type columns

  4. Using the jointplot function from seaborn, fit and visualize three kernel density estimation models for bicycle theft in July, September, and December 2018:

    crime_bicycle_jul = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-07")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_bicycle_jul, kind="kde")

    The output is as follows:

    Figure 9.34: The estimated joint and marginal densities for bicycle thefts in July 2018

    crime_bicycle_sept = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-09")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_bicycle_sept, kind="kde")

    The output is as follows:

    Figure 9.35: The estimated joint and marginal densities for bicycle thefts in September 2018

    crime_bicycle_dec = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-12")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_bicycle_dec, kind="kde")

    The output is as follows:

    Figure 9.36: The estimated joint and marginal densities for bicycle thefts in December 2018

    From month to month, the density of bicycle thefts stays quite constant. The slight differences between the densities are to be expected, given that each estimate is based on only a single month of data. Given these results, police or criminologists can be reasonably confident in predicting where future bicycle thefts are most likely to occur. A rough numerical comparison of the three months is sketched below.
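
    To move beyond a purely visual comparison, the monthly densities can also be compared numerically. The following sketch is optional and not part of the original solution: it fits a sklearn KernelDensity model to each month's bicycle-theft coordinates (the fixed bandwidth of 0.01 degrees and the grid bounds covering greater London are assumptions) and evaluates all three models on a common grid of locations, so the resulting density vectors can be correlated:

    # Optional numerical comparison of the three monthly densities;
    # the bandwidth and grid bounds below are assumed, illustrative values
    months = ["2018-07", "2018-09", "2018-12"]
    density_by_month = {}

    # common evaluation grid roughly spanning greater London
    lon = numpy.linspace(-0.5, 0.3, 50)
    lat = numpy.linspace(51.3, 51.7, 50)
    lon_grid, lat_grid = numpy.meshgrid(lon, lat)
    grid_points = numpy.column_stack([lon_grid.ravel(), lat_grid.ravel()])

    for month in months:
        coords = london_subset[
            (london_subset["Crime type"] == "Bicycle theft") &
            (london_subset["Month"] == month)
        ][["Longitude", "Latitude"]].dropna()

        kde = sklearn.neighbors.KernelDensity(kernel="gaussian", bandwidth=0.01)
        kde.fit(coords.values)
        density_by_month[month] = numpy.exp(kde.score_samples(grid_points))

    # correlations close to 1 indicate very similar spatial patterns
    print(numpy.corrcoef([density_by_month[m] for m in months]))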

  5. Repeat Step 4; this time, use shoplifting crimes for the months of August, October, and November 2018:

    crime_shoplift_aug = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-08")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_shoplift_aug, kind="kde")

    The output is as follows:

    Figure 9.37: The estimated joint and marginal densities for shoplifting incidents in August 2018

    crime_shoplift_oct = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-10")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_shoplift_oct, kind="kde")

    The output is as follows:

    Figure 9.38: The estimated joint and marginal densities for shoplifting incidents in October 2018

    crime_shoplift_nov = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-11")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_shoplift_nov, kind="kde")

    The output is as follows:

    Figure 9.39: The estimated joint and marginal densities for shoplifting incidents in November 2018

    Like the bicycle theft results, the shoplifting densities are quite stable across the months. The density for August 2018 looks different from the other two months; however, if you compare the longitude and latitude ranges on the axes, you will notice that the density itself is very similar and the plot has simply been shifted and rescaled. The most likely reason is a handful of outliers that forced a much larger plotting region, as demonstrated in the sketch below.
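
    One optional way to confirm this explanation, which is not part of the original solution, is to redraw the August plot with fixed axis limits so that all three months share a comparable plotting region; the limits below are assumed values chosen to roughly cover greater London:

    # Optional: redraw the August density with fixed axis limits so the
    # three months share a comparable plotting region (limits are assumed)
    seaborn.jointplot(
        "Longitude", "Latitude", crime_shoplift_aug, kind="kde",
        xlim=(-0.5, 0.3), ylim=(51.3, 51.7)
    )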

  6. Repeat Step 5; this time use burglary crimes for the months of July, October, and December 2018:

    crime_burglary_jul = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-07")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_burglary_jul, kind="kde")

    The output is as follows:

    Figure 9.40: The estimated joint and marginal densities for burglaries in July 2018

    crime_burglary_oct = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-10")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_burglary_oct, kind="kde")

    The output is as follows:

    Figure 9.41: The estimated joint and marginal densities for burglaries in October 2018

    crime_burglary_dec = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-12")
    ]
    
    seaborn.jointplot("Longitude", "Latitude", crime_burglary_dec, kind="kde")

    The output is as follows:

    Figure 9.42: The estimated joint and marginal densities for burglaries in December 2018

    Once again, the distributions are quite similar across the months. The only difference is that the densities seem to widen or spread slightly from July to December. As always, the noise and the limited information in one-month samples cause small shifts in the estimated densities. A quick way to check this apparent widening numerically is sketched below.
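
    A simple, optional way to quantify the apparent spreading, which is not part of the original solution, is to compare the standard deviations of the burglary coordinates month by month; larger values indicate a wider spatial spread:

    # Optional check: coordinate spread of burglaries by month, a crude
    # proxy for how wide each estimated density is
    burglary = london_subset[london_subset["Crime type"] == "Burglary"]
    print(burglary.groupby("Month")[["Longitude", "Latitude"]].std())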