Consuming data streams
Similar to a batch processing job, we create a new Spark application using a SparkConf object and a context. In a streaming application, the context is created with a batch size parameter that applies to every incoming stream (both the GDELT and Twitter layers belong to the same context and are therefore tied to the same batch size). Since GDELT data is published every 15 minutes and we want to predict categories on a pseudo real-time basis, our batch size will naturally be 15 minutes:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val sparkConf = new SparkConf().setAppName("GZET")
val ssc = new StreamingContext(sparkConf, Minutes(15))
val sc = ssc.sparkContext
Creating a GDELT data stream
There are many ways of publishing external data into a Spark Streaming application. One could open a simple socket and start publishing data over the netcat utility, or stream data through a Flume agent monitoring an external directory. Production systems usually use Kafka as a default broker for both its high throughput and...
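As a minimal sketch of the socket approach mentioned above, a raw text stream could be attached to the context created earlier. The hostname, port, and the netcat command are illustrative assumptions only, not part of the GZET application itself:

// Sketch only: attach a text stream from a local socket fed by "nc -lk 9999"
val socketStream = ssc.socketTextStream("localhost", 9999)

// Print a sample of each 15-minute batch as it arrives
socketStream.print()

// Receivers only run once the context is started; in a real application this
// happens after all streams have been declared
ssc.start()
ssc.awaitTermination()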