In this recipe, we explore how to use the Dataset API to build a multi-stage machine learning pipeline. Even though the Dataset (conceptually, a strongly typed DataFrame) is the way forward, you still need to interoperate with machine learning algorithms or code that returns or operates on RDDs, whether for legacy or implementation reasons. In this recipe, we also explore how to convert from a Dataset to an RDD and back.
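The round trip described above can be sketched as follows. This is a minimal, self-contained example, not the recipe's own code; the object name and sample values are assumptions for illustration. The `.rdd` accessor and `toDS()` (brought in by `spark.implicits._`) are the standard Spark conversions:

```scala
import org.apache.spark.sql.SparkSession

object DatasetRddRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("DatasetRddRoundTrip")
      .getOrCreate()
    import spark.implicits._

    // Start with a strongly typed Dataset[Int]
    val ds = spark.createDataset(Seq(1, 2, 3, 4, 5))

    // Dataset -> RDD: .rdd hands the data to legacy RDD-based code
    val doubled = ds.rdd.map(_ * 2)

    // RDD -> Dataset: toDS() restores compile-time type safety
    val ds2 = doubled.toDS()
    ds2.show()

    spark.stop()
  }
}
```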
- Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3
- Import the necessary packages so the Spark session can access the cluster, and `Log4j.Logger` to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
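The `Logger` import is typically used right after these imports to silence Spark's verbose INFO logging; a minimal sketch of the usual pattern:

```scala
import org.apache.log4j.{Level, Logger}

// Raise the log threshold so only errors from Spark's and Akka's
// internals are printed to the console
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
```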
- Define a Scala case class to model the data for processing:
case class Car...
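The source truncates the definition, so the exact fields are unknown; a plausible shape, with hypothetical field names and types chosen purely for illustration, might look like:

```scala
// Hypothetical fields; the original Car definition is truncated in the source
case class Car(make: String, model: String, price: Double)
```

Because `Car` is a case class, Spark can derive an `Encoder` for it via `spark.implicits._`, which is what lets you build a typed `Dataset[Car]` in the steps that follow.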