Anyone who has worked with open data will agree that cleaning the datasets takes a great deal of time, much of it spent dealing with inaccurate and incomplete data.
Another main task is to merge the datasets, because the open data arrives as separate datasets for crime, education, resource usage, request demand, and transportation. We also have datasets from other sources, including the census.
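To make the merging task concrete, here is a minimal sketch in plain Python. It assumes that every dataset can be keyed by a shared area identifier; the identifiers, column names, and values below are invented for illustration and are not taken from the project data. The helper `merge_on_key` is hypothetical as well. On Spark, the same idea would typically be expressed as a left outer `join` between DataFrames.

```python
# Toy per-area records keyed by a hypothetical area identifier.
# Every key, column name, and value here is made up for illustration.
census = {
    "A01": {"population": 15000},
    "A02": {"population": 9000},
    "A03": {"population": 22000},
}
crime = {"A01": {"crime_count": 120}, "A02": {"crime_count": 45}}
education = {"A01": {"graduation_rate": 0.82}, "A03": {"graduation_rate": 0.91}}


def merge_on_key(base, *others):
    """Left-outer merge: keep every key of `base`, fill absent columns with None."""
    merged = {key: dict(row) for key, row in base.items()}
    for other in others:
        # All columns this dataset can contribute, in a stable order.
        columns = sorted({col for row in other.values() for col in row})
        for key, out_row in merged.items():
            source_row = other.get(key, {})
            for col in columns:
                out_row[col] = source_row.get(col)
    return merged


merged = merge_on_key(census, crime, education)

# Columns that are still missing per area, which shows where incompleteness remains.
missing = {
    key: [col for col, value in row.items() if value is None]
    for key, row in merged.items()
}
```

Using the census as the base table is only one possible choice. It keeps every area that has a population figure and makes gaps in the other datasets visible as `None` values rather than silently dropping rows.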
In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation on Apache Spark. All the techniques discussed there can be applied to our data here.
Besides data merging, we will also need to spend a lot of time on feature development, because the models that produce insights for this project are built on these features.
Therefore, for this project, we need to conduct data merging first, and then feature development and selection...
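As a rough illustration of what feature development and selection can look like after merging, the sketch below derives per-1,000-resident rates and then drops features that are nearly constant. This is one simple approach chosen for illustration, not necessarily the method used in this project. The rows, column names, and the helpers `develop_features` and `select_features` are all hypothetical.

```python
from statistics import pvariance

# Hypothetical merged rows, one per area; all values are invented for illustration.
rows = [
    {"area_id": "A01", "population": 15000, "crime_count": 120, "bus_stops": 30, "region_code": 1},
    {"area_id": "A02", "population": 9000, "crime_count": 45, "bus_stops": 9, "region_code": 1},
    {"area_id": "A03", "population": 22000, "crime_count": 330, "bus_stops": 22, "region_code": 1},
]


def develop_features(row):
    """Derive rate features so that areas of different size are comparable."""
    population = row["population"]
    return {
        "area_id": row["area_id"],
        "crime_per_1000": row["crime_count"] * 1000.0 / population,
        "bus_stops_per_1000": row["bus_stops"] * 1000.0 / population,
        "region_code": float(row["region_code"]),
    }


def select_features(feature_rows, min_variance=1e-9):
    """Keep the features whose population variance exceeds `min_variance`."""
    names = [name for name in feature_rows[0] if name != "area_id"]
    return [
        name
        for name in names
        if pvariance([row[name] for row in feature_rows]) > min_variance
    ]


features = [develop_features(row) for row in rows]
selected = select_features(features)
```

In this toy data, `region_code` has the same value for every area, so the variance filter drops it while both rate features are kept. A near-zero variance threshold is a deliberately crude criterion; it only removes features that cannot distinguish one area from another.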