Histograms are the easiest way to visually inspect the distribution of your data. In this recipe, we will show you how to do this in PySpark.
To execute this recipe, you need to have a working Spark environment. Also, we will be working off of the no_outliers
DataFrame we created in the Handling outliers recipe, so we assume you have followed the steps to handle duplicates, missing observations, and outliers.
No other prerequisites are required.
There are two ways to produce histograms in PySpark:
- Select the feature you want to visualize, .collect() it on the driver, and then use matplotlib's native .hist(...) method to draw the histogram
- Calculate the counts in each histogram bin in PySpark and return only the counts to the driver for visualization
The former solution will work for small datasets (such as ours in this chapter) but it will break your driver if the data is too big. Moreover, there's a good reason why we distribute the data so we can...