Preparing data for machine learning
In this section, we introduce the process of preparing data before applying Spark MLlib algorithms. Spark MLlib classification algorithms typically need two columns, called label and features. We illustrate this with the following example:
We import the required classes for this section:
scala> import org.apache.spark.ml.Pipeline
scala> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
scala> import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
scala> import org.apache.spark.ml.linalg.Vectors
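To make the label and features layout concrete, the following is a minimal sketch of a DataFrame in the shape that MLlib classifiers expect by default. The sample values, the application name, and the local master setting are assumptions made for illustration and are not taken from the dataset used in this section.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
// The master and appName values here are illustrative assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("LabelFeaturesSketch")
  .getOrCreate()

// Each row pairs a Double label with a feature vector (made-up sample values).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

// Shows one double column (label) and one vector column (features).
training.printSchema()
```

The column names label and features match the default labelCol and featuresCol settings of the MLlib classifiers, so a DataFrame in this shape can be passed to an estimator such as RandomForestClassifier without renaming columns.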
Pre-processing data for machine learning
We define a set of UDFs used in this section. These include, for example, a UDF that checks whether a string contains a specific substring and returns 0.0 or 1.0 as the value for the label column. Another UDF is used to create...