Handling missing values is one of the more complex areas of data science. A variety of techniques are used, depending on the type of missing data and the business use case at hand.
These techniques range from simple logic-based methods to advanced statistical methods such as regression and KNN. Irrespective of the method used, however, we end up performing one of the following two operations on the data that contains missing values:
Removing the records with missing values from the data
Imputing the missing value entries with some constant value
In this section, we will explore how to perform both of these operations on PySpark DataFrames.