In this section, we set out our pipeline implementation objectives and document tangible results as we step through the individual implementation steps.
Before we implement the Iris pipeline, we want to understand what a pipeline is from both a conceptual and a practical perspective. We define a pipeline as a DataFrame processing workflow consisting of multiple stages that execute in a defined sequence.
A DataFrame is a Spark abstraction that provides an API for working with collections of objects. At a high level, it represents a distributed collection of rows of data, much like a relational database table. Each value in a row (for example, a Sepal-Width measurement) falls under a named column (in this case, Sepal-Width).
Each stage in a pipeline is an algorithm that is either a Transformer or an Estimator. As DataFrames flow through the pipeline, two types of stages (algorithms) come into play:
- Transformer stage: This involves a transformation action that transforms one DataFrame into another DataFrame
- Estimator stage: This involves a training (fitting) action on a DataFrame that produces a model, which is itself a Transformer
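A minimal Scala sketch of the distinction, using Spark's StringIndexer (an Estimator whose fitted model is a Transformer), is shown below; the irisDf DataFrame and its species column are hypothetical:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

def illustrateStages(irisDf: DataFrame): DataFrame = {
  // Estimator stage: fit() learns from a DataFrame and returns a model
  val indexer = new StringIndexer().setInputCol("species").setOutputCol("label")
  val indexerModel = indexer.fit(irisDf)
  // Transformer stage: transform() maps one DataFrame into another DataFrame
  indexerModel.transform(irisDf)
}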
In summary, a pipeline is a single unit comprising stages, together with parameters and one or more DataFrames. The entire pipeline structure is listed as follows:
- Transformer
- Estimator
- Parameters (hyper or otherwise)
- DataFrame
This is where Spark comes in. Its MLlib library provides a set of pipeline APIs that allow developers to access multiple algorithms and combine them into a single pipeline of ordered stages, much like a sequence of choreographed motions in a ballet. In this chapter, we will use the random forest classifier.
We have covered the essential pipeline concepts. With these practicalities in hand, we can move on to the next section, where we list our implementation objectives.
Before listing the implementation objectives, we will lay out an architecture for our pipeline. Shown below are two diagrams representing an ML workflow, that is, a pipeline.
The following diagrams together help in understanding the different components of this project. This pipeline involves training (fitting), transformation, and validation operations. More than one model is trained, and the best model (or mapping function) is selected to give us an accurate approximation for predicting the species of an Iris flower (based on measurements of those flowers):
Project block diagram
A breakdown of the project block diagram is as follows:
- Spark, which represents the Spark cluster and its ecosystem
- Training dataset
- Model
- Dataset attributes or feature measurements
- An inference process that produces a prediction column
The following diagram represents a more detailed description of the different phases in terms of the functions performed in each phase. Later, we will visualize the pipeline in terms of its constituent stages.
For now, the diagram depicts four stages, starting with a data pre-processing phase, which is deliberately kept separate from the numbered phases. Think of the pipeline as a two-step process:
- A data cleansing, or pre-processing, phase. This important phase could include an Exploratory Data Analysis (EDA) subphase (not explicitly depicted in the following diagram).
- A data analysis phase that begins with Feature Extraction, followed by Model Fitting, and Model validation, all the way to deployment of an Uber pipeline JAR into Spark:
Pipeline diagram
Referring to the preceding diagram, the first implementation objective is to set up Spark inside an SBT project. An SBT project is a self-contained application, which we can run on the command line to predict Iris labels. In the SBT project, dependencies are specified in a build.sbt file, and our application code will create its own SparkSession and SparkContext.
That brings us to the list of implementation objectives, which are as follows:
- Get the Iris dataset from the UCI Machine Learning Repository
- Conduct preliminary EDA in the Spark shell
- Create a new Scala project in IntelliJ, and carry out all implementation steps, until the evaluation of the Random Forest classifier
- Deploy the application to your local Spark cluster
Head over to the UCI Machine Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris and click on Download: Data Folder. Extract this folder someplace convenient and copy iris.csv into the root of your project folder.
You may refer back to the project overview for an in-depth description of the Iris dataset. We depict the contents of the iris.csv file here, as follows:
A snapshot of the Iris dataset with 150 rows
You may recall that the iris.csv file is a 150-row file with comma-separated values.
Now that we have the dataset, the first step will be to perform EDA on it. The Iris dataset is multivariate, meaning there is more than one (independent) variable, so we will carry out a basic multivariate EDA on it. But we need a DataFrame to let us do that. How we create a DataFrame as a prelude to EDA is the goal of the next section.
Before we get down to building the SBT pipeline project, we will conduct a preliminary EDA in spark-shell. The plan is to derive a DataFrame from the dataset and then calculate basic statistics on it.
We have three tasks at hand for spark-shell:
- Fire up spark-shell
- Load the iris.csv file and build a DataFrame
- Calculate the statistics
We will then port that code over to a Scala file inside our SBT project.
That said, let's get down to loading the iris.csv file (inputting the data source) before eventually building a DataFrame.
Fire up the Spark shell by issuing the following command on the command line:
spark-shell --master local[2]
In the next step, we start with the Spark session that spark-shell makes available as spark. This session is our entry point to programming with Spark, and it also holds the properties required to connect to our (local) Spark cluster. With this in place, our next goal is to load the iris.csv file and produce a DataFrame.
The first step in loading the iris.csv file is to invoke the read method on spark. The read method returns a DataFrameReader, which can be used to read our dataset:
val dfReader1 = spark.read
dfReader1: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6980d3b3
dfReader1 is of type org.apache.spark.sql.DataFrameReader. Calling the format method on dfReader1 with Spark's com.databricks.spark.csv CSV format-specifier string returns a DataFrameReader again:
val dfReader2 = dfReader1.format("com.databricks.spark.csv")
dfReader2: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6980d3b3
After all, iris.csv is a CSV file.
Needless to say, dfReader1 and dfReader2 are the same DataFrameReader instance.
At this point, the DataFrameReader needs an input data source option in the form of a key-value pair. Invoke the option method with two arguments: a key "header" of type String and its value true of type Boolean:
val dfReader3 = dfReader2.option("header", true)
In the next step, we invoke the option method again, this time with an inferSchema key and a true value:
val dfReader4 = dfReader3.option("inferSchema", true)
What is inferSchema doing here? We are simply telling Spark to guess the schema of our input data source for us.
Up until now, we have been preparing the DataFrameReader to load iris.csv. An external data source requires a path from which Spark loads the data for the DataFrameReader to process and turn into a DataFrame.
The time is now right to invoke the load method on the DataFrameReader dfReader4. Pass the path to the Iris dataset file into the load method. In this case, the file sits right under the root of the project folder:
val dFrame1 = dfReader4.load("iris.csv")
dFrame1: org.apache.spark.sql.DataFrame = [Id: int, SepalLengthCm: double ... 4 more fields]
That's it. We now have a DataFrame!
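To check that inferSchema did what we expected, we can, for example, print the inferred schema and peek at a few rows:
dFrame1.printSchema()
dFrame1.show(5)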
Invoking the describe method on this DataFrame causes Spark to perform a basic statistical analysis on each column of the DataFrame:
dFrame1.describe("Id","SepalLengthCm","SepalWidthCm","PetalLengthCm","PetalWidthCm","Species")
WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res16: org.apache.spark.sql.DataFrame = [summary: string, Id: string ... 5 more fields]
Let's fix the WARN Utils issue reported in the preceding code block. The fix is to locate the spark-defaults.conf.template file under SPARK_HOME/conf and save a copy of it as spark-defaults.conf.
At the bottom of this file, add an entry for spark.debug.maxToStringFields. The following screenshot illustrates this:
Fixing the WARN Utils problem in spark-defaults.conf
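In case the screenshot is hard to make out, the entry takes the usual key-value form used in that file; the value of 150 matches the configuration we confirm shortly:
spark.debug.maxToStringFields 150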
Save the file and restart spark-shell.
Now, inspect the updated Spark configuration again. We updated the value of spark.debug.maxToStringFields in the spark-defaults.conf file. This change is supposed to fix the truncation problem reported by Spark. We will confirm shortly that the change caused Spark to update its configuration as well. That is easily done by inspecting SparkConf.
As before, invoking getConf on the SparkContext (sc) returns the SparkConf instance that stores configuration values. Invoking getAll on that instance returns an Array of configuration values. One of those values is the updated value of spark.debug.maxToStringFields:
sc.getConf.getAll
res4: Array[(String, String)] = Array((spark.repl.class.outputDir,C:\Users\Ilango\AppData\Local\Temp\spark-10e24781-9aa8-495c-a8cc-afe121f8252a\repl-c8ccc3f3-62ee-46c7-a1f8-d458019fa05f), (spark.app.name,Spark shell), (spark.sql.catalogImplementation,hive), (spark.driver.port,58009), (spark.debug.maxToStringFields,150),
The updated value for spark.debug.maxToStringFields is now 150.
Invoke the describe method on the DataFrame again, passing the column names to it:
val dFrame2 = dFrame1.describe("Id","SepalLengthCm","SepalWidthCm","PetalLengthCm","PetalWidthCm","Species")
dFrame2: org.apache.spark.sql.DataFrame = [summary: string, Id: string ... 5 more fields]
This invocation of the describe method on the DataFrame dFrame1 results in a transformed DataFrame that we call dFrame2. On dFrame2, we invoke the show method to return a table of statistical results. This completes the first phase of a basic yet important EDA:
dFrame2.show
The results of the statistical analysis are shown in the following screenshot:
Results of statistical analysis
We did all that extra work simply to demonstrate the individual data reading, loading, and transformation stages. Next, we will wrap all of our previous work in one line of code:
val dfReader = spark.read.format("com.databricks.spark.csv").option("header",true).option("inferSchema",true).load("iris.csv")
dfReader: org.apache.spark.sql.DataFrame = [Id: int, SepalLengthCm: double ... 4 more fields]
That completes the EDA in spark-shell. In the next section, we undertake the steps to implement, build (using SBT), deploy (using spark-submit), and execute our Spark pipeline application. We start by creating a skeletal SBT project.
Lay out your SBT project in a folder of your choice and name it IrisPipeline, or any name that makes sense to you. This folder will hold all of the files needed to implement and run the pipeline on the Iris dataset.
The structure of our SBT project looks like the following:
Project structure
We will list dependencies in the build.sbt file. This is going to be an SBT project; hence, we will bring in the following key libraries:
- Spark Core
- Spark MLlib
- Spark SQL
The following screenshot illustrates the build.sbt file:
The build.sbt file with Spark dependencies
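If you prefer to type the file in, a minimal build.sbt along the following lines brings in the three libraries listed above. The version numbers here are assumptions; align them with your local Scala and Spark installation:
name := "IrisPipeline"
version := "0.1"
scalaVersion := "2.11.12"

// Spark version is an assumption; match it to your installed Spark
val sparkVersion = "2.3.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)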
The build.sbt file referenced in the preceding snapshot is readily available to you in the book's download bundle. Drill down to the Chapter01 code folder under ModernScalaProjects_Code and copy the folder over to a convenient location on your computer.
Drop the iris.csv file that you downloaded in Step 1 – getting the Iris dataset from the UCI Machine Learning Repository into the root folder of our new SBT project. Refer to the earlier screenshot that depicts the updated project structure with the iris.csv file inside of it.
Step 4 is broken down into the following steps:
- Create the Scala file iris.scala in the com.packt.modern.chapter1 package.
- Up until now, we relied on the SparkSession and SparkContext that spark-shell gave us. This time around, we need to create SparkSession ourselves, which will, in turn, give us SparkContext.
What follows is how the code is laid out in the iris.scala file.
In iris.scala, after the package statement, place the following import statement:
import org.apache.spark.sql.SparkSession
Create SparkSession inside a trait, which we shall call IrisWrapper:
lazy val session: SparkSession = SparkSession.builder().getOrCreate()
Just one SparkSession is made available to all classes extending from IrisWrapper. Create a val to hold the iris.csv file path:
val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv"
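A minimal sketch of what the IrisWrapper trait looks like at this point follows. The two column-name constants are assumptions added here for later use; their values match the column headers that appear in the DataFrame output further on:
package com.packt.modern.chapter1

import org.apache.spark.sql.SparkSession

trait IrisWrapper {

  lazy val session: SparkSession = SparkSession.builder().getOrCreate()

  // Replace with the actual location of your iris.csv file
  val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv"

  // Assumed column-name constants used when building the DataFrame later
  val featureVector = "iris-features-column"
  val speciesLabel = "iris-species-label-column"
}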
Create a method to build the DataFrame. This method takes in the complete path to the Iris dataset as a String and returns a DataFrame:
def buildDataFrame(dataSet: String): DataFrame = {
/*
The following is an example of a dataSet parameter string: "C:\\Your\\Path\\To\\iris.csv"
*/
Import the DataFrame class by updating the previous import statement for SparkSession:
import org.apache.spark.sql.{DataFrame, SparkSession}
Create a nested function inside buildDataFrame to process the raw dataset. Name this function getRows. getRows takes no parameters but returns Array[(Vector, String)]. The textFile method on the SparkContext variable reads iris.csv into an RDD[String]:
val result1: RDD[String] = session.sparkContext.textFile(<<path to iris.csv represented by the dataSetPath variable>>)
The resulting RDD contains two partitions. Each partition, in turn, contains rows of strings separated by a newline character, '\n'
. Each row in the RDD represents its original counterpart in the raw data.
In the next step, we will attempt several data transformation steps. We start by applying a flatMap operation over the RDD, culminating in the creation of a DataFrame. A DataFrame is a view over a Dataset, which happens to be the fundamental data abstraction unit in the Spark 2.0 line.
We will get started by invoking flatMap, passing a function block to it, and then applying the successive transformations listed as follows, eventually resulting in Array[(org.apache.spark.ml.linalg.Vector, String)]. A Vector represents a row of feature measurements.
The Scala code to give us Array[(org.apache.spark.ml.linalg.Vector, String)] is as follows:
// Each line in the RDD is a row in the Dataset, represented by a String, which we can 'split' along the newline character
val result2: RDD[String] = result1.flatMap { partition => partition.split("\n").toList }
// The second transformation splits each line of the dataset on the comma that separates its elements
val result3: RDD[Array[String]] = result2.map(_.split(","))
Next, drop the header row, but not before performing a collect that returns an Array[Array[String]]:
val result4: Array[Array[String]] = result3.collect.drop(1)
The header row is gone; now import the Vectors class:
import org.apache.spark.ml.linalg.Vectors
Now, transform Array[Array[String]] into Array[(Vector, String)]:
val result5 = result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble, row(3).toDouble, row(4).toDouble),row(5)))
The last remaining step is to create the final DataFrame. We invoke the createDataFrame method, passing in the result produced by getRows. This returns a DataFrame with the featureVector and speciesLabel columns (containing, for example, Iris-setosa):
val dataFrame = session.createDataFrame(result5).toDF(featureVector, speciesLabel)
Display the top 20 rows of the new DataFrame:
dataFrame.show
+--------------------+-------------------------+
|iris-features-column|iris-species-label-column|
+--------------------+-------------------------+
|   [5.1,3.5,1.4,0.2]|              Iris-setosa|
|   [4.9,3.0,1.4,0.2]|              Iris-setosa|
|   [4.7,3.2,1.3,0.2]|              Iris-setosa|
.....................
.....................
+--------------------+-------------------------+
only showing top 20 rows
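Pulling the preceding fragments together, buildDataFrame might look like the following sketch, placed inside the IrisWrapper trait sketched earlier. This is not the verbatim book listing; the featureVector and speciesLabel constants are the assumed column names shown in the output above:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def buildDataFrame(dataSet: String): DataFrame = {
  // Nested function that turns the raw CSV rows into (features, label) pairs
  def getRows: Array[(Vector, String)] = {
    val result1: RDD[String] = session.sparkContext.textFile(dataSet)
    val result2: RDD[String] = result1.flatMap(partition => partition.split("\n").toList)
    val result3: RDD[Array[String]] = result2.map(_.split(","))
    val result4: Array[Array[String]] = result3.collect.drop(1) // drop the header row
    // Columns 1 to 4 hold the measurements; column 5 holds the species label
    result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble,
      row(3).toDouble, row(4).toDouble), row(5)))
  }
  session.createDataFrame(getRows).toDF(featureVector, speciesLabel)
}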
We need to index the species label column by converting the strings Iris-setosa, Iris-virginica, and Iris-versicolor into doubles. We will use a StringIndexer to do that.
Now create a file called IrisPipeline.scala.
Create an object IrisPipeline that extends our IrisWrapper trait:
object IrisPipeline extends IrisWrapper {
Import the StringIndexer algorithm class:
import org.apache.spark.ml.feature.StringIndexer
Now create a StringIndexer algorithm instance. The StringIndexer will map our species label column to an indexed label column:
val indexer = new StringIndexer()
  .setInputCol(irisFeatures_CategoryOrSpecies_IndexedLabel._2)
  .setOutputCol(irisFeatures_CategoryOrSpecies_IndexedLabel._3)
Now, let's split our dataset in two by providing a random seed:
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = dataSet.randomSplit(Array(0.85, 0.15), 98765L)
Now our new splitDataSet contains two datasets:
- The train dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]
- The test dataset: A dataset containing Array[(Vector, iris-species-label-column: String)]
Confirm that the new dataset is of size 2:
splitDataSet.size
res48: Int = 2
Assign the training dataset to a variable, trainSet:
val trainSet = splitDataSet(0)
trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]
Assign the testing dataset to a variable, testSet:
val testSet = splitDataSet(1)
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string]
Count the number of rows in the training dataset:
trainSet.count
res9: Long = 136
Count the number of rows in the testing dataset:
testSet.count
res12: Long = 14
There are 150 rows in all.
In reference to Step 5 – DataFrame creation: the DataFrame dataFrame contains column names that correspond to the columns present in the DataFrame produced in that step.
The first step in creating a classifier is to pass (hyper)parameters into it. A fairly comprehensive list of parameters looks like this:
- From dataFrame, we need the features column name: iris-features-column
- From dataFrame, we also need the indexed label column name: iris-species-label-column
- The sqrt setting for featureSubsetStrategy
- The number of features to be considered per split (we have 150 observations and four features, which makes our max_features value 2)
- Impurity settings, whose values can be gini or entropy
- The number of trees to train (since the number of trees is greater than one, we also set a maximum tree depth)
- The required minimum number of feature measurements (sampled observations), also known as the minimum instances per node
Look at the IrisPipeline.scala file for the values of each of these parameters.
But this time, we will employ an exhaustive grid search-based model selection process based on combinations of parameters, where parameter value ranges are specified.
Create a randomForestClassifier instance. Set the features column and the featureSubsetStrategy:
val randomForestClassifier = new RandomForestClassifier()
  .setFeaturesCol(irisFeatures_CategoryOrSpecies_IndexedLabel._1)
  .setFeatureSubsetStrategy("sqrt")
Start building the Pipeline, which has two stages, an Indexer and a Classifier:
val irisPipeline = new Pipeline().setStages(Array[PipelineStage](indexer) ++ Array[PipelineStage](randomForestClassifier))
Next, set the hyperparameter num_trees (the number of trees) on the classifier to 15, a Max_Depth parameter, and an impurity with two possible values of gini and entropy.
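The gridBuilder3 used in the next step is not shown in this excerpt; a sketch of how it might be built with ParamGridBuilder follows. The value ranges (15 trees, a handful of depths, and gini and entropy) follow the settings just described, but are otherwise assumptions:
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.ParamGridBuilder

// Grid over the three hyperparameters discussed above; the value ranges are illustrative
val gridBuilder3 = new ParamGridBuilder()
  .addGrid(randomForestClassifier.numTrees, Array(15))
  .addGrid(randomForestClassifier.maxDepth, Array(2, 3, 4))
  .addGrid(randomForestClassifier.impurity, Array("gini", "entropy"))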
Build a parameter grid with all three hyperparameters:
val finalParamGrid: Array[ParamMap] = gridBuilder3.build()
Next, we want to split our training set into a validation set and a training set. For that, we create a TrainValidationSplit instance:
val trainValidationSplit = new TrainValidationSplit()
On this instance, set a seed, set EstimatorParamMaps, set the Estimator with irisPipeline, and set the training ratio to 0.8:
val trainValidationSplit = new TrainValidationSplit().setSeed(1234567L).setEstimator(irisPipeline)
Finally, do a fit and a transform with our training dataset and testing dataset. Great! Now the classifier is trained. In the next step, we will apply this classifier to the testing data.
The purpose of our validation set is to enable us to choose between models; we want an evaluation metric and hyperparameter tuning. The TrainValidationSplit validation estimator we created splits the training set into a validation set and a training set:
trainValidationSplit.setEvaluator(new MulticlassClassificationEvaluator())
Next, we fit this estimator over the training dataset to produce a model and a transformer that we will use to transform our testing dataset. Finally, we perform a validation for hyperparameter tuning by applying an evaluator for a metric.
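Pulling these fragments together, the validation step might look like the following sketch. The setEstimatorParamMaps, setTrainRatio, fit, and transform calls complete what the fragments above only describe, and validatedTestResults here names the transformed DataFrame shown next; treat this as an assumption-laden sketch rather than the verbatim book listing:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.TrainValidationSplit
import org.apache.spark.sql.DataFrame

// Configure the validation estimator as described above
val trainValidationSplit = new TrainValidationSplit()
  .setSeed(1234567L)
  .setEstimator(irisPipeline)
  .setEstimatorParamMaps(finalParamGrid)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setTrainRatio(0.8)

// Fit on the training set to select the best model, then transform the testing set
val tvsModel = trainValidationSplit.fit(trainSet)
val validatedTestResults: DataFrame = tvsModel.transform(testSet)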
The new validatedTestResults DataFrame should look something like this:
+--------------------+-------------------+-----+--------------+-------------+----------+
|iris-features-column|iris-species-column|label| rawPrediction|  probability|prediction|
+--------------------+-------------------+-----+--------------+-------------+----------+
|   [4.4,3.2,1.3,0.2]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
|   [5.4,3.9,1.3,0.4]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
|   [5.4,3.9,1.7,0.4]|        Iris-setosa|  0.0|[40.0,0.0,0.0]|[1.0,0.0,0.0]|       0.0|
Let's return a new dataset by passing in column expressions for prediction and label:
val validatedTestResultsDataset:DataFrame = validatedTestResults.select("prediction", "label")
In this line of code, we produced a new DataFrame with two columns:
- An input label
- A predicted label, which is compared with its corresponding value in the input label column
That brings us to the next step, an evaluation step. We want to know how well our model performed. That is the goal of the next step.
In this section, we will test the accuracy of the model. We want to know how well our model performed. Any ML process is incomplete without an evaluation of the classifier.
That said, we perform an evaluation as a two-step process:
- Evaluate the model output
- Configure the evaluator by passing in three parameters:
val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator()
Set the label column, a metric name, and the prediction column, and then invoke the evaluation with the validatedTestResults dataset.
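The chained call started above, once completed, might look like the following sketch; the column names label and prediction and the accuracy metric name are assumptions consistent with the DataFrame shown earlier:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Evaluate the accuracy of the predictions in the validated test results
val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("accuracy")
  .setPredictionCol("prediction")
  .evaluate(validatedTestResults)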
Note the accuracy of the model output results on the testing dataset from the modelOutputAccuracy variable.
The other metrics to evaluate concern how close the predicted label value in the prediction column is to the actual label value in the (indexed) label column.
Next, we want to extract the metrics:
val multiClassMetrics = new MulticlassMetrics(validatedRDD2)
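The validatedRDD2 value passed in above is not defined in this excerpt; one way to derive it from the validatedTestResultsDataset DataFrame is sketched below (this derivation is an assumption):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// (prediction, label) pairs as Doubles, which is what MulticlassMetrics expects
val validatedRDD2: RDD[(Double, Double)] = validatedTestResultsDataset.rdd.map {
  case Row(prediction: Double, label: Double) => (prediction, label)
}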
Our pipeline produced predictions, and as with any prediction, we need a healthy degree of skepticism. Naturally, we want a sense of how well our engineered prediction process performed. The algorithm did all the heavy lifting for us in this regard, and everything we did in this step was done for the purpose of evaluation. It is worth reiterating what is being evaluated: we want to know how close the predicted values are to the actual label values. To obtain that knowledge, we use the MulticlassMetrics class to evaluate metrics that give us a measure of the performance of the model via two methods:
- Accuracy
- Weighted precision
The following lines of code give us the values of accuracy and weighted precision. First, we create an accuracyMetrics tuple, which contains both values:
val accuracyMetrics = (multiClassMetrics.accuracy, multiClassMetrics.weightedPrecision)
Obtain the value of accuracy:
val accuracy = accuracyMetrics._1
Next, obtain the value of weighted precision:
val weightedPrecision = accuracyMetrics._2
These metrics represent evaluation results for our classifier or classification model. In the next step, we will run the application as a packaged SBT application.
At the root of your project folder, issue the sbt console command. In the Scala shell, import the IrisPipeline object, and then invoke the main method of IrisPipeline with the argument iris:
sbt console
scala> import com.packt.modern.chapter1.IrisPipeline
scala> IrisPipeline.main(Array("iris"))
Accuracy (precision) is 0.9285714285714286
Weighted Precision is: 0.9428571428571428
In the next section, we will show you how to package the application so that it is ready to be deployed into Spark as an Uber JAR.
In the root folder of your SBT application, run:
sbt package
When SBT is done packaging, the Uber JAR can be deployed into our cluster using spark-submit; but since we are in standalone deploy mode, it will be deployed into [local]:
The application JAR file
The package command created a JAR file that is available under the target folder. In the next section, we will deploy the application into Spark.
At the root of the application folder, issue the spark-submit command with the class and JAR file path arguments, respectively.
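For example, the command might look like the following; the JAR file name depends on your project and Scala version, so treat the path as a placeholder, while the class name matches the IrisPipeline object imported earlier:
spark-submit --class com.packt.modern.chapter1.IrisPipeline --master local[2] <path-to-your-application-jar> iris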
If everything went well, the application does the following:
- Loads up the data.
- Performs EDA.
- Creates training, testing, and validation datasets.
- Creates a Random Forest classifier model.
- Trains the model.
- Tests the accuracy of the model. This is the most important part—the ML classification task.
- To accomplish this, we apply our trained Random Forest classifier model to the test dataset. This dataset consists of Iris flower data not seen by the model so far. Unseen data is nothing but Iris flowers picked in the wild.
- Applying the model to the test dataset results in a prediction about the species of an unseen (new) flower.
- The last part is where the pipeline runs an evaluation process, which essentially is about checking if the model reports the correct species.
- Lastly, the pipeline reports back on how important a certain feature of the Iris flower turned out to be. As a matter of fact, petal width turns out to be more important than sepal width in carrying out the classification task.
That brings us to the last section of this chapter, where we will summarize what we have learned and give readers a glimpse of what they will learn in the next chapter.