The best way to get started is to understand the bigger picture and gauge the magnitude of the work ahead of us. To that end, we have identified two broad tasks:
- Setting up the prerequisite software.
- Developing two pipelines, starting with data collection and building a workflow sequence that could end with predictions. Those pipelines are as follows:
  - A Random Forests pipeline
  - A logistic regression pipeline
We will talk about setting up the prerequisite software in the next section.
First, please refer back to the Setting up the prerequisite software section in Chapter 1, Predict the Class of a Flower from the Iris Dataset, to review your existing infrastructure. If need be, you might want to install everything again. The chances of you having to substantively change anything are slim.
However, here are the upgrades I recommend:
- JDK to 1.8.0_172, if you have not already upgraded
- Scala from 2.11.12 to an early stable version of 2.12
- Spark...