Let's start with a simple example of streaming in which in one terminal, we will type some text and the streaming application will capture it in another window.
- Start the Spark shell:
$ spark-shell
- Create a DataFrame to read what's coming on port
8585
:
scala> val lines = spark.readStream.format("socket").option("host","localhost").option("port",8585).load
- Cast the lines from DataFrame to Dataset with the
String
datatype and then flatten it:
scala> val words = lines.as[String].flatMap(_.split(" "))
- Do the word count:
scala> val wordCounts = words.groupBy("value").count()
- Start the
netcat
server in a separate window:
$ nc -lk 8585
- Come back to the previous terminal and print the complete set of counts to the console every time it is updated:
scala> val query = wordCounts.writeStream.outputMode("complete").format("console").start()
- Now go back to the terminal where you started netcat and enter different lines, such as
to be or not to be
...