Summary
Although we were impressed with the overall consistency of many of the model's predictions, we recognize that we certainly did not build the most accurate classification system ever. Crowdsourcing this task to millions of users was ambitious and far from the easiest way of getting clearly defined categories. However, this simple proof of concept shows us a few important things:
It technically validates our Spark Streaming architecture.
It validates our assumption of bootstrapping GDELT using an external dataset.
It made us lazy, impatient, and proud.
It learns without any supervision and gradually improves with each batch.
No data scientist can build a fully functional and highly accurate classification system in just a few weeks, especially not on dynamic data. A proper classifier needs to be evaluated, trained, re-evaluated, tuned, and retrained for at least the first few months, and then re-evaluated at least every six months. Our goal here was to describe the components involved...