Summary
A data science project needs a defined process to succeed. As we have seen from the frameworks covered earlier, such a project has many moving parts: extracting timely data from diverse data sources, building and testing the models, and then deploying those models to aid in or to automate day-to-day decision-making processes. Without a process, the project can easily fall through the gaps between these stages, leaving the organization right where it started: data rich, information poor.
In this chapter, we have covered the CRISP-DM methodology and the TDSP methodology. Each of these methodologies has a clearly marked data preparation stage. To follow that sequence, we started with a focus on the data preparation stage, using the dplyr package in R. We cleaned some data and compared the results from the dirty data with those from the clean data.
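As a reminder of what that comparison can look like, here is a minimal sketch in R with dplyr. The data frame, its column names (region, sales), and the specific cleaning rules are hypothetical and chosen only for illustration; they are not the dataset used earlier, and treating repeated rows as duplicates is an assumption that would need checking against real data.

```r
library(dplyr)

# Hypothetical "dirty" data: inconsistent capitalization, trailing
# whitespace, a missing region, and a repeated row.
dirty <- data.frame(
  region = c("North", "north ", "South", "South", NA),
  sales  = c(100, 150, 200, 200, 50),
  stringsAsFactors = FALSE
)

# Assumed cleaning rules: drop rows with no region, standardize the
# region labels, and treat fully identical rows as duplicates.
clean <- dirty %>%
  filter(!is.na(region)) %>%
  mutate(region = tolower(trimws(region))) %>%
  distinct()

# The same summary computed on both versions, so they can be compared.
summarise_sales <- function(df) {
  df %>%
    group_by(region) %>%
    summarise(total_sales = sum(sales), .groups = "drop")
}

dirty_summary <- summarise_sales(dirty)
clean_summary <- summarise_sales(clean)

print(dirty_summary)
print(clean_summary)
```

Printing the two summaries side by side shows how the dirty data splits one region across several labels, while the cleaned data reports a single total per region.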