We are now ready to train our TF-IDF NLP model and see whether we can classify these transactions as either `escalate` or `do_not_escalate`.
The following section walks through the steps to train the TF-IDF model.
- Create a new user-defined function, `udf`, to map the `label` column to numerical values using the following script:

  ```python
  label = F.udf(lambda x: 1.0 if x == 'escalate' else 0.0, FloatType())
  df = df.withColumn('label', label('label'))
  ```
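The UDF simply maps `escalate` to `1.0` and everything else to `0.0`. As a plain-Python illustration of that rule (independent of Spark, so the function name `to_label` is ours, not the book's):

```python
def to_label(x):
    # Same rule the UDF applies: 'escalate' becomes the positive class (1.0),
    # anything else (including 'do_not_escalate') becomes 0.0.
    return 1.0 if x == 'escalate' else 0.0

print(to_label('escalate'))         # 1.0
print(to_label('do_not_escalate'))  # 0.0
```

Note that the same mapping could also be expressed with Spark's built-in `F.when(...).otherwise(...)`, which avoids the Python serialization overhead of a UDF.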
- Execute the following script to set up the TF and IDF stages for vectorizing the words:

  ```python
  import pyspark.ml.feature as feat

  TF_ = feat.HashingTF(
      inputCol="words without stop",
      outputCol="rawFeatures",
      numFeatures=100000)
  IDF_ = feat.IDF(inputCol="rawFeatures", outputCol="features")
  ```
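To see what these two stages compute, here is a minimal pure-Python sketch of hashing-based term frequencies and the smoothed IDF formula `log((N + 1) / (df + 1))` that Spark's `IDF` estimator documents. The helper names and the tiny `num_features` value are illustrative only, and Python's built-in `hash` stands in for Spark's MurmurHash3:

```python
import math
from collections import Counter

def hashing_tf(tokens, num_features=16):
    # Map each token to a bucket via a hash, then count occurrences per bucket.
    # Collisions are possible by design; a large numFeatures makes them rare.
    return dict(Counter(hash(tok) % num_features for tok in tokens))

def idf_weights(docs_as_buckets, num_features=16):
    # Smoothed inverse document frequency: log((N + 1) / (df(t) + 1)),
    # where df(t) counts the documents containing bucket t.
    n = len(docs_as_buckets)
    df = Counter()
    for buckets in docs_as_buckets:
        df.update(set(buckets))
    return {b: math.log((n + 1) / (df[b] + 1)) for b in range(num_features)}

docs = [["spark", "tfidf"], ["spark", "pipeline"]]
tf_docs = [hashing_tf(d) for d in docs]
idf = idf_weights([set(tf) for tf in tf_docs])
# TF-IDF: raw bucket counts reweighted by each bucket's IDF.
tfidf = [{b: c * idf[b] for b, c in tf.items()} for tf in tf_docs]
```

Buckets that occur in every document get an IDF near zero, so common (uninformative) terms are down-weighted in the final feature vectors.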
- Set up a pipeline, `pipelineTFIDF`, to set the sequence of stages for `TF_` and `IDF_` using the following script:

  ```python
  from pyspark.ml import Pipeline

  pipelineTFIDF = Pipeline(stages=[TF_, IDF_])
  ```
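The pipeline's job is simply to run its stages in order: fitting estimates any data-dependent parameters (here, the IDF weights) and transforming applies each fitted stage in sequence. A simplified pure-Python sketch of that chaining follows; the classes are hypothetical stand-ins, not Spark's implementation, and for brevity every stage is treated as an estimator with `fit` and `transform`:

```python
class Pipeline:
    """Simplified sketch of fit/transform chaining in an ML pipeline."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        models = []
        for stage in self.stages:
            model = stage.fit(data)        # learn stage parameters from data
            data = model.transform(data)   # feed the result to the next stage
            models.append(model)
        return PipelineModel(models)

class PipelineModel:
    """The fitted pipeline: applies each fitted stage in the same order."""
    def __init__(self, models):
        self.models = models

    def transform(self, data):
        for model in self.models:
            data = model.transform(data)
        return data

# Toy stage to demonstrate the chaining: fitting is a no-op, transform adds 1.
class AddOneStage:
    def fit(self, data):
        return self

    def transform(self, data):
        return [x + 1 for x in data]

model = Pipeline(stages=[AddOneStage(), AddOneStage()]).fit([1, 2, 3])
print(model.transform([0]))  # [2] — both stages applied in sequence
```

In the book's pipeline, `pipelineTFIDF.fit(df)` would play the role of `Pipeline.fit` here, producing a fitted model whose `transform` adds the `rawFeatures` and `features` columns.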