Interactive Data Visualization with Python - Second Edition

By: Abha Belorkar, Sharath Chandra Guntuku, Shubhangi Hora, Anshu Kumar

Overview of this book

With so much data being continuously generated, developers who can present data as impactful and interesting visualizations are always in demand. Interactive Data Visualization with Python sharpens your data exploration skills and tells you everything there is to know about interactive data visualization in Python. You'll begin by learning how to draw various plots with Matplotlib and Seaborn, the non-interactive data visualization libraries. You'll study different types of visualizations, compare them, and find out how to select a particular type of visualization to suit your requirements. After you get the hang of the various non-interactive visualization libraries, you'll learn the principles of intuitive and persuasive data visualization, and use Bokeh and Plotly to transform your visuals into strong stories. You'll also gain insight into how interactive data and model visualization can optimize the performance of a regression model. By the end of the course, you'll have a new skill set that'll make you the go-to person for transforming data visualizations into engaging and interesting stories.

Plotting with pandas and seaborn

Now that we have a basic sense of how to load and handle data in a pandas DataFrame object, let's get started with making some simple plots from data. While there are several plotting libraries in Python (including matplotlib, plotly, and seaborn), in this chapter, we will mainly explore the pandas and seaborn libraries, which are extremely useful, popular, and easy to use.

Creating Simple Plots to Visualize a Distribution of Variables

matplotlib is a plotting library available in most Python distributions and is the foundation for several plotting packages, including the built-in plotting functionality of pandas and seaborn. matplotlib enables control of every single aspect of a figure and is known to be verbose. The built-in plotting tool of pandas is a useful exploratory tool for generating figures that are not ready for prime time but help you understand the dataset you are working with. seaborn, on the other hand, has APIs to draw a wide variety of aesthetically pleasing plots.

To illustrate certain key concepts and explore the diamonds dataset, we will start with two simple visualizations in this chapter—histograms and bar plots.

Histograms

A histogram of a feature is a plot with the range of the feature on the x-axis and the count of data points with the feature in the corresponding range on the y-axis.
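
To make this definition concrete, here is a minimal sketch of the counting that sits behind a histogram, using NumPy's histogram function. The carat values, the range, and the choice of four bins are made up for illustration and are not taken from the diamonds dataset.

```python
import numpy as np

# Made-up carat values for illustration only (not from the diamonds dataset)
carats = np.array([0.2, 0.3, 0.3, 0.5, 0.7, 0.7, 0.7, 1.0, 1.2, 2.0])

# Split the range [0.0, 2.0] into 4 equal-width bins and count points per bin
counts, edges = np.histogram(carats, bins=4, range=(0.0, 2.0))

# Each (left edge, right edge) pair is a range on the x-axis;
# the matching count is the bar height on the y-axis
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n} diamonds")
```

Every value lands in exactly one bin, so the counts always add up to the number of data points.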

Let's look at the following exercise of plotting a histogram with pandas.

Exercise 8: Plotting and Analyzing a Histogram

In this exercise, we will create a histogram of the frequency of diamonds in the dataset with their respective carat specifications on the x-axis:

  1. Import the necessary modules:
    import seaborn as sns
    import pandas as pd
  2. Import the diamonds dataset from seaborn:
    diamonds_df = sns.load_dataset('diamonds')
  3. Plot a histogram using the diamonds dataset where x axis = carat:
    diamonds_df.hist(column='carat')

    The output is as follows:

    Figure 1.14: Histogram plot

    The y axis in this plot denotes the number of diamonds in the dataset with the carat specification on the x-axis.

    The hist function has a parameter called bins, which sets the number of equal-width bins into which the data points are divided. By default, the bins parameter is set to 10 in pandas. We can change this to a different number if we wish.

  4. Change the bins parameter to 50:
    diamonds_df.hist(column='carat', bins=50)

    The output is as follows:

    Figure 1.15: Histogram with bins = 50

    This is a histogram with 50 bins. Notice how we can see a more fine-grained distribution as we increase the number of bins. It is helpful to try several bin counts to get a reliable picture of the distribution of the feature. The number of bins can range from 1 (where all values are in the same bin) to the number of values (where each value of the feature is in one bin).

  5. Now, let's plot the same histogram using seaborn (distplot is deprecated as of seaborn 0.11; newer versions provide histplot as its replacement):
    sns.distplot(diamonds_df.carat)

    The output is as follows:

    Figure 1.16: Histogram plot using seaborn

    There are two noticeable differences between the pandas hist function and seaborn distplot:

    • pandas sets the bins parameter to a default of 10, but seaborn infers an appropriate bin size based on the statistical distribution of the dataset.
    • By default, the distplot function also includes a smoothed curve over the histogram, called a kernel density estimate.

      Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Usually, a KDE doesn't tell us anything more than what we can infer from the histogram itself. However, it is helpful when comparing multiple histograms on the same plot. If we want to remove the KDE and look at the histogram alone, we can use the kde=False parameter.

  6. Set kde=False to remove the KDE:
    sns.distplot(diamonds_df.carat, kde=False)

    The output is as follows:

    Figure 1.17: Histogram plot with KDE = false

    Also note that the plot became more detailed when the number of bins was increased from 10 to 50. Now, let's try to increase it to 100.

  7. Increase the number of bins to 100:
    sns.distplot(diamonds_df.carat, kde=False, bins=100)

    The output is as follows:

    Figure 1.18: Histogram plot with increased bin size

    The histogram with 100 bins shows a better visualization of the distribution of the variable—we see there are several peaks at specific carat values. Another observation is that most carat values are concentrated toward lower values and the tail is on the right—in other words, it is right-skewed.

    A log transformation helps in identifying more trends. For instance, in the following graph, the x-axis shows log-transformed values of the price variable, and we see that there are two peaks indicating two kinds of diamonds—one with a high price and another with a low price.

  8. Use a log transformation on the histogram:
    import numpy as np
    sns.distplot(np.log(diamonds_df.price), kde=False)

    The output is as follows:

Figure 1.19: Histogram using a log transformation
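
Step 5 of the exercise described the KDE as a non-parametric estimate of the probability density function. As a rough sketch of what that smoothed curve computes, the following places one Gaussian bump on each data point and averages them. This is not seaborn's implementation: the function name, the sample values, and the fixed bandwidth of 0.3 are all assumptions made for illustration, whereas seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = distance from grid point i to sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

# Made-up values for illustration only (not from the diamonds dataset)
samples = [0.3, 0.4, 0.5, 1.0, 1.1]
grid = np.linspace(-2.0, 4.0, 601)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over a wide enough grid
step = grid[1] - grid[0]
area = density.sum() * step
mode = grid[np.argmax(density)]
print(f"area under curve: {area:.3f}, highest point near x = {mode:.2f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve that can hide separate peaks.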

Looking at a histogram, even a naive viewer immediately gets a picture of the distribution of the feature. Specifically, three observations are important in a histogram:

  • Which feature values are more frequent in the dataset (in this case, there is a peak at around 6.8 and another peak between 8.5 and 9; note that these values are log(price))
  • How many peaks exist in the data (the peaks need to be further inspected for possible causes in the context of the data)
  • Whether there are any outliers in the data
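
The first two observations can also be checked numerically. The sketch below log-transforms a handful of made-up prices (not the diamonds data), counts them in unit-width bins, and reports which bins are peaks. The bin edges and the simple "taller than both neighbours" rule are assumptions chosen for illustration.

```python
import numpy as np

# Made-up prices for illustration only (not from the diamonds dataset)
prices = np.array([400, 450, 500, 520, 550, 600,
                   5000, 5500, 6000, 6200, 7000, 18000])
log_prices = np.log(prices)

# Count log-prices in unit-width bins with edges 5.5, 6.5, ..., 10.5
edges = np.arange(5.5, 11.0, 1.0)
counts, _ = np.histogram(log_prices, bins=edges)

# A bin is a peak if it is strictly taller than both of its neighbours
# (flat plateaus are not detected by this simple rule)
padded = np.concatenate(([0], counts, [0]))
is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] > padded[2:])
peak_ranges = [(edges[i], edges[i + 1]) for i in np.flatnonzero(is_peak)]

print(counts)
print(peak_ranges)
```

Here the two peak ranges correspond to a low-priced group and a high-priced group, and the single value in the last bin is the kind of point worth inspecting as a possible outlier.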

Bar Plots

Another type of plot we will look at in this chapter is the bar plot.

In their simplest form, bar plots display counts of categorical variables. More broadly, bar plots are used to depict the relationship between a categorical variable and a numerical variable. Histograms, meanwhile, are plots that show the statistical distribution of a continuous numerical feature.
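
As a small sketch of these two uses, the following computes the numbers that a bar plot would draw, using a made-up DataFrame (the rows are illustrative and are not from the diamonds dataset).

```python
import pandas as pd

# Made-up rows for illustration only (not from the diamonds dataset)
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Good", "Ideal", "Premium"],
    "price": [300, 500, 400, 250, 700, 600],
})

# Simplest form: one bar per category, height = number of rows
counts = df["cut"].value_counts()

# Broader form: one bar per category, height = a summary of a numeric column
mean_price = df.groupby("cut")["price"].mean()

print(counts)
print(mean_price)
```

In both cases the x-axis holds the categories; only the quantity on the y-axis changes.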

Let's work through an exercise on bar plots with the diamonds dataset. First, we shall present the counts of diamonds of each cut quality that exist in the data. Second, we shall look at the price associated with the different types of cut quality (Ideal, Good, Premium, and so on) in the dataset and find out the mean price distribution. We will use both pandas and seaborn to get a sense of how to use the built-in plotting functions in both libraries.

Before generating the plots, let's look at the unique values in the cut and clarity columns, just to refresh our memory.

Exercise 9: Creating a Bar Plot and Calculating the Mean Price Distribution

In this exercise, we'll learn how to create a table using the pandas crosstab function. We'll use a table to generate a bar plot. We'll then explore a bar plot generated using the seaborn library and calculate the mean price distribution. To do so, let's go through the following steps:

  1. Import the necessary modules:
    import seaborn as sns
    import pandas as pd
  2. Import the diamonds dataset from seaborn:
    diamonds_df = sns.load_dataset('diamonds')
  3. Print the unique values of the cut column:
    diamonds_df.cut.unique()

    The output will be as follows:

    array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
  4. Print the unique values of the clarity column:
    diamonds_df.clarity.unique()

    The output will be as follows:

    array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
          dtype=object)

    Note

    unique() returns an array. There are five unique cut qualities and eight unique values in clarity. The number of unique values can be obtained using nunique() in pandas.

  5. To obtain the counts of diamonds of each cut quality, we first create a table using the pandas crosstab() function:
    cut_count_table = pd.crosstab(index=diamonds_df['cut'], columns='count')
    cut_count_table

    The output will be as follows:

    Figure 1.20: Table using the crosstab function
  6. Pass these counts to another pandas function, plot(kind='bar'):
    cut_count_table.plot(kind='bar')

    The output will be as follows:

    Figure 1.21: Bar plot using a pandas DataFrame

    We see that the most common cut quality in the dataset is Ideal, followed by Premium, Very Good, Good, and Fair. Now, let's see how to generate the same plot using seaborn.

  7. Generate the same bar plot using seaborn:
    sns.catplot(x="cut", data=diamonds_df, aspect=1.5, kind="count", color="b")

    The output will be as follows:

    Figure 1.22: Bar plot using seaborn

    Notice how the catplot() function does not require us to create the intermediate count table (using pd.crosstab()), and reduces one step in the plotting process.

  8. Next, here is how we obtain the mean price distribution of different cut qualities using seaborn:
    from numpy import mean
    sns.set(style="whitegrid")  # white background with grid lines
    # Bar height is the mean price for each cut quality
    ax = sns.barplot(x="cut", y="price", data=diamonds_df, estimator=mean)

    The output will be as follows:

    Figure 1.23: Bar plot with the mean price distribution

    Here, the black lines (error bars) on the rectangles indicate the uncertainty around the mean estimate. By default, they show a 95% confidence interval. To change this, use the ci parameter: for instance, ci=68 sets the interval to 68%, and ci='sd' plots the standard deviation of the prices instead. In seaborn 0.12 and later, ci is deprecated in favor of the errorbar parameter, for example errorbar=('ci', 68) or errorbar='sd'.

  9. Reorder the x axis bars using order:
    ax = sns.barplot(x="cut", y="price", data=diamonds_df, estimator=mean, ci=68, order=['Ideal','Good','Very Good','Fair','Premium'])

    The output will be as follows:

Figure 1.24: Bar plot with proper order
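
The confidence intervals drawn as error bars in this exercise are estimated by bootstrapping. As a rough sketch of the idea (not seaborn's actual implementation), the following resamples a set of made-up prices with replacement and takes percentiles of the resampled means. The function name, the sample values, the seed, and the number of resamples are assumptions for illustration.

```python
import numpy as np

def bootstrap_ci(values, level=95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Keep the central `level` percent of the resampled means
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Made-up prices for illustration only (not from the diamonds dataset)
prices = [326, 334, 423, 554, 2757, 2759, 3950, 4100, 5200, 9800]

low95, high95 = bootstrap_ci(prices, level=95)
low68, high68 = bootstrap_ci(prices, level=68)
print(f"95% interval: {low95:.0f} to {high95:.0f}")
print(f"68% interval: {low68:.0f} to {high68:.0f}")
```

Because both calls use the same seed and therefore the same resamples, the 68% interval always sits inside the 95% interval, which is why lowering the confidence level shortens the error bars.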

Grouped bar plots can be very useful for visualizing the variation of a particular feature within different groups. Now that you have looked into tweaking the plot parameters of a bar plot, let's see how to generate a bar plot grouped by a specific feature.

Exercise 10: Creating Bar Plots Grouped by a Specific Feature

In this exercise, we will use the diamonds dataset to generate the distribution of prices with respect to color for each cut quality. In Exercise 9, Creating a Bar Plot and Calculating the Mean Price Distribution, we looked at the price distribution for diamonds of different cut qualities. Now, we would like to look at the variation in each color:

  1. Import the necessary modules—in this case, only seaborn:
    #Import seaborn
    import seaborn as sns
  2. Load the dataset:
    diamonds_df = sns.load_dataset('diamonds')
  3. Use the hue parameter to plot nested groups:
    ax = sns.barplot(x="cut", y="price", hue='color', data=diamonds_df)

    The output is as follows:

Figure 1.25: Grouped bar plot with legends

Here, we can observe that the price patterns for diamonds of different colors are similar for each cut quality. For instance, for Ideal diamonds, the prices across colors follow much the same pattern as they do for Premium and the other cut qualities.
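
To compare these patterns as numbers rather than by eye, one option is to compute the table of group means that a grouped bar plot draws. The sketch below does this with a pandas groupby on a made-up DataFrame (the rows are illustrative and are not from the diamonds dataset).

```python
import pandas as pd

# Made-up rows for illustration only (not from the diamonds dataset)
df = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Premium", "Premium", "Premium"],
    "color": ["D", "D", "J", "D", "J", "J"],
    "price": [300, 500, 900, 400, 1000, 1200],
})

# One mean per (cut, color) pair; unstack moves the colors into columns
mean_table = df.groupby(["cut", "color"])["price"].mean().unstack("color")
print(mean_table)
```

Each row of the table corresponds to one group on the x-axis and each column to one hue level, so reading across a row shows how the mean price varies by color within a single cut quality.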