Mastering Spark for Data Science
The remaining part of our application classifies the incoming data. As introduced earlier, the reason for using Twitter was to derive ground truth from an external resource. We will train a Naive Bayes classification model using Twitter data and use it to predict the categories of the GDELT URLs. A convenient aspect of the Kappa architecture approach is that we do not have to worry much about porting common pieces of code across different applications or environments. Even better, we do not have to export/import our model between a batch layer and a speed layer: both GDELT and Twitter, sharing the same Spark context, are part of the same physical layer. We could save our model to HDFS for auditing purposes, but at runtime we simply pass a reference to a Scala object between the two classes.
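Because both paths run inside the same Spark driver JVM, the trained model can be shared through a plain Scala singleton rather than serialized between layers. The sketch below illustrates this pattern with a hypothetical stub model type; in the real application the payload would be the MLlib Naive Bayes model trained on the Twitter data, and all names here (`StubModel`, `SharedClassifier`) are illustrative, not the book's actual API.

```scala
// Hypothetical stub standing in for a trained MLlib NaiveBayesModel;
// in the real pipeline this would be the fitted Spark model.
case class StubModel(labels: Map[String, Double]) {
  def predict(feature: String): Double = labels.getOrElse(feature, -1.0)
}

// Singleton shared between the "batch" (training) and "speed"
// (prediction) code paths running in the same Spark context / JVM.
object SharedClassifier {
  @volatile private var model: Option[StubModel] = None

  // Called by the training path once a model is (re)fitted.
  def update(m: StubModel): Unit = { model = Some(m) }

  // Called by the streaming path; returns None until a model exists.
  def classify(feature: String): Option[Double] =
    model.map(_.predict(feature))
}
```

The `@volatile` annotation ensures the streaming path sees the latest model reference after a retraining pass, without any export/import step between the two layers.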
We've already introduced both the concept of bootstrapping a Naive Bayes model using Stack Exchange datasets and the use of a Classifier object that builds...