In this recipe, we will learn how to process data and build two classification models that predict the forest cover type: a benchmark logistic regression model and a random forest classifier. The problem at hand is multinomial; that is, there are more than two classes into which we want to classify our observations.
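To make the term multinomial concrete, here is a minimal pure-Python sketch of how a multinomial logistic regression turns one linear score per class into class probabilities with the softmax function. The weights, intercepts, and feature values are made up for illustration, and the helper names (softmax, predict_class) are hypothetical; this is not Spark's implementation.

```python
import math


def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(features, weights, intercepts):
    """Score every class with its own linear model; return the most probable class."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, intercepts)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical toy model: 3 classes, 2 features (all values are made up).
weights = [[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]]
intercepts = [0.0, 0.1, -0.1]

label, probs = predict_class([2.0, 1.0], weights, intercepts)
print(label, [round(p, 3) for p in probs])
```

With two classes this reduces to ordinary (binomial) logistic regression; with more than two, as in the forest cover data, the model keeps one weight vector per class.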
To execute this recipe, you need a working Spark environment, and you should have already loaded the data into the forest DataFrame.
No other prerequisites are required.
Here's the code that will help us build the logistic regression model:
forest_train, forest_test = (
    forest
    .randomSplit([0.7, 0.3], seed=666)
)

vectorAssembler = feat.VectorAssembler(
    inputCols=forest.columns[0:-1]
    , outputCol='features'
)

selector = feat.ChiSqSelector(
    labelCol='CoverType'
    , numTopFeatures=10
    , outputCol='selected'
)

logReg_obj = cl.LogisticRegression(
    labelCol='CoverType'
    ...
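The randomSplit([0.7, 0.3], seed=666) call at the start of the listing can be easier to reason about with a small stand-alone sketch. The following pure-Python function is a simplified illustration, not Spark's code: it assumes each row gets an independent uniform draw and is assigned to a split by the cumulative, normalized weights. The function name random_split and the toy row range are hypothetical.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split according to normalized weights."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


# Toy stand-in for the forest DataFrame: 10,000 row ids.
train, test = random_split(range(10_000), [0.7, 0.3], seed=666)
print(len(train), len(test))
```

Because every row is assigned at random, the resulting sizes are only approximately 70% and 30% of the data. The seed matters for reproducibility: rerunning with the same seed on the same input yields the same training and test sets.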