-
Book Overview & Buying
-
Table Of Contents
Learning PySpark
By :
Therefore, it is time to create our final dataset that we will use to build our models. We will convert our DataFrame into an RDD of LabeledPoints.
A LabeledPoint is a MLlib structure that is used to train the machine learning models. It consists of two attributes: label and features.
The label is our target variable and features can be a NumPy array, list, pyspark.mllib.linalg.SparseVector, pyspark.mllib.linalg.DenseVector, or scipy.sparse column matrix.
Before we build our final dataset, we first need to deal with one final obstacle: our 'BIRTH_PLACE' feature is still a string. While any of the other categorical variables can be used as is (as they are now dummy variables), we will use a hashing trick to encode the 'BIRTH_PLACE' feature:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg
hashing = ft.HashingTF(7)
births_hashed = births_transformed \
.rdd \
.map(lambda row: [
list(hashing.transform...