Descriptive statistics are the most fundamental measures you can calculate on your data. In this recipe, we will learn how easy it is to get familiar with our dataset in PySpark.
To execute this recipe, you need a working Spark environment. We will also be working with the no_outliers
DataFrame created in the Handling outliers recipe, so we assume you have already followed the steps to handle duplicates, missing observations, and outliers.
No other prerequisites are required.
Calculating the descriptive statistics for your data is extremely easy in PySpark. Here's how:
# features is the list of column names to summarize
descriptive_stats = no_outliers.describe(features)
That's it!
The preceding code barely needs an explanation. The .describe(...)
method takes a list of the columns you want to calculate descriptive statistics for and returns a DataFrame with the basic descriptive statistics: count, mean, standard deviation, minimum value, and maximum value. Since the result is itself a DataFrame, you can call .show() on it to print the summary.