In this chapter, we will explore Academy Awards demographics. You can download the data from https://www.crowdflower.com/wp-content/uploads/2016/03/Oscars-demographics-DFE.csv.
This dataset is based on the data provided at http://www.crowdflower.com/data-for-everyone. It contains demographic details such as race, birthplace, and age. With only around 400 rows, it can easily be processed on a simple home computer, so you can use it for a proof of concept (POC) of executing a data science project on Spark.
Start by downloading the file and inspecting the data. The data may look fine at first, but as you take a closer look, you will notice that it is not "clean". For example, the date of birth column does not follow a consistent format: some years are in two-digit format, whereas others are in four-digit format. The birthplace column omits the country for locations within the USA.
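To make these two inconsistencies concrete, here is a minimal sketch in plain Python of how such values might be normalized. It is an illustration under stated assumptions, not the cleaning procedure used in this chapter: the function names are hypothetical, the sample values are invented, two-digit years are assumed to belong to the 1900s, and a two-letter final component in a birthplace is assumed to be a US state code.

```python
import re


def normalize_birth_year(dob: str, century: int = 1900) -> str:
    """Expand a trailing two-digit year to four digits.

    Assumption: every two-digit year belongs to `century` (1900 by default).
    Values that already end in a four-digit year are returned unchanged.
    """
    text = dob.strip()
    # A non-digit followed by exactly two digits at the end of the string.
    match = re.fullmatch(r"(.*\D)(\d{2})", text)
    if match is None:
        return text
    prefix, two_digit_year = match.groups()
    return f"{prefix}{century + int(two_digit_year)}"


def add_missing_country(birthplace: str, default_country: str = "USA") -> str:
    """Append a country when the last component looks like a US state code.

    Assumption: US locations are written as "City, ST" with a two-letter
    state code, while other locations already end with a country name.
    """
    parts = [part.strip() for part in birthplace.split(",")]
    last = parts[-1]
    if len(parts) >= 2 and len(last) == 2 and last.isalpha():
        parts.append(default_country)
    return ", ".join(parts)


if __name__ == "__main__":
    # Invented sample values, used only to exercise the two helpers.
    for value in ["30-Sep-1895", "2-Apr-14"]:
        print(value, "->", normalize_birth_year(value))
    for value in ["Chicago, Il", "Glasgow, Scotland"]:
        print(value, "->", add_missing_country(value))
```

Both rules are deliberately simple heuristics. Check them against the actual values in the file before relying on them, since a fixed century or a two-letter test can misclassify some rows.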
Likewise, you will also notice that the data looks skewed, with more "white" race people from...