Missing observations are the second most common issue found in datasets. They arise for many reasons, as we alluded to in the introduction. In this recipe, we will learn how to deal with them.
To execute this recipe, you need a working Spark environment. We will also be working with the new_id DataFrame we created in the previous recipe, so we assume you have followed the steps there to remove the duplicated records.
No other prerequisites are required.
Since our data has two dimensions (rows and columns), we need to check the percentage of data missing in each row and in each column to determine what to keep, what to drop, and what to (potentially) impute:
- To count the missing observations per row, convert the DataFrame to an RDD, map each row to its Id together with the number of None values it contains, and build a new DataFrame from the result (the MissingCount column name here is ours, chosen for readability):

    (
        spark.createDataFrame(
            new_id
            .rdd
            .map(lambda row: (
                row['Id'],
                sum([c is None for c in row])
            )),
            ['Id', 'MissingCount']
        )
        .show()
    )
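Before running this at scale, it can help to see the counting logic on plain Python data. The sketch below uses a few hypothetical rows (the column names and values are illustrative, not from the original dataset) and shows both the per-row count the snippet above computes and the per-column fraction of missing values mentioned earlier:

```python
# Hypothetical rows standing in for the new_id DataFrame; None marks a missing value.
rows = [
    {'Id': 1, 'Weight': 144.5, 'Height': 5.9},
    {'Id': 2, 'Weight': None,  'Height': 5.4},
    {'Id': 3, 'Weight': 133.2, 'Height': None},
]

# Per-row missing count -- the same logic the Spark snippet applies inside .rdd.map().
missing_per_row = [
    (r['Id'], sum(v is None for v in r.values()))
    for r in rows
]

# Per-column fraction of missing values -- the column-wise check described above.
columns = list(rows[0].keys())
missing_per_col = {
    c: sum(r[c] is None for r in rows) / len(rows)
    for c in columns
}

print(missing_per_row)   # e.g. [(1, 0), (2, 1), (3, 1)]
print(missing_per_col)
```

Rows with a high missing count are candidates for dropping; columns with a high missing fraction are candidates for dropping or imputation.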