Structured Streaming has come up in several places in this chapter, but let's use this recipe to look at it in more detail. Structured Streaming is essentially a stream-processing engine built on top of the Spark SQL engine.
An alternative way to look at streaming data is to think of it as an infinite (unbounded) table to which rows are continuously appended as new data arrives.
The four fundamental concepts in Structured Streaming are:
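- Input table: The unbounded table to which every arriving data item is appended as a new row
- Trigger: How often new data is checked for and the result table is updated
- Result table: The table produced by running the query on the input table, refreshed after every trigger
- Output: The part of the result table that is written to external storage after every trigger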
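As a rough mental model of how these concepts fit together, the following plain-Python sketch treats each loop iteration as one trigger. It does not use the Spark API; the names `batches`, `input_table`, and `result_tables`, the sample words, and the word-count query are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical micro-batches: the rows that arrive between consecutive triggers.
batches = [["spark", "kafka"], ["spark"], ["flink", "spark"]]

input_table = []    # stands in for the unbounded input table (rows are only appended)
result_tables = []  # snapshot of the result table after each trigger

for new_rows in batches:             # each iteration stands in for one trigger
    input_table.extend(new_rows)     # newly arrived rows are appended to the input table
    result = Counter(input_table)    # the query (a word count) produces the result table
    result_tables.append(dict(result))
    print(f"trigger {len(result_tables)}: {dict(result)}")
```

The sketch recomputes the count from the whole input on every trigger only to keep the model short. Structured Streaming itself does not materialize the entire input table; it processes the newly arrived data incrementally to update the result.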
A query may be interested in only the rows newly appended since the last trigger, in all rows that have changed since the last trigger (which includes the newly appended ones), or in the whole result table. This leads to the three output modes in Structured Streaming, as follows:
- Append
- Update
- Complete
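To make the difference between the three modes concrete, here is a minimal sketch that compares the result table before and after one trigger and returns the rows each mode would write. It is a simplified model in plain Python, not Spark's implementation; the function `rows_to_write` and the sample counts are hypothetical.

```python
def rows_to_write(previous, current, mode):
    """Return the result-table rows that the given output mode would write.

    previous and current are dicts (key -> value) holding the result table
    before and after a trigger. Simplified illustrative model only.
    """
    if mode == "complete":
        # The whole result table is written after every trigger.
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last trigger.
        return {k: v for k, v in current.items()
                if k not in previous or previous[k] != v}
    if mode == "append":
        # Only rows that were not in the result table at the last trigger.
        return {k: v for k, v in current.items() if k not in previous}
    raise ValueError(f"unknown output mode: {mode}")


# Hypothetical running word counts before and after one trigger.
before = {"spark": 1, "kafka": 1}
after = {"spark": 2, "kafka": 1, "flink": 1}

for mode in ("append", "update", "complete"):
    print(mode, rows_to_write(before, after, mode))
```

In Spark itself, the mode is chosen by passing `"append"`, `"update"`, or `"complete"` to `outputMode` on the stream writer. Note that Spark supports append mode only for queries whose existing result rows never change, so a running aggregation like this word count would need a watermark to run in append mode; the counts here serve only as sample data.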
The DataFrame/Dataset API that is used for bounded tables...