Samza's stream processing API
Like Kafka Streams, Samza provides client libraries for defining a graph of independently executing processing jobs over an incoming stream of events. Samza runs these jobs on Apache YARN, which we will cover in more detail shortly. In short, YARN (Yet Another Resource Negotiator) is Hadoop's next-generation resource scheduler. It lets you allocate a number of containers (processes) across a cluster of machines and execute arbitrary commands in them.
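To make the relationship with YARN concrete, a Samza job is typically described by a properties file that names the job factory, the task class, and the input streams. The fragment below is a minimal sketch modeled on the configuration keys used in Samza's hello-samza examples. The job name, task class, package path, topic name, and host addresses are hypothetical placeholders, and the exact keys (for example, the container-count key) vary between Samza versions.

```properties
# Hypothetical job definition: names, paths, and hosts are placeholders.
# YarnJobFactory asks Samza to submit the job to a YARN cluster.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=page-view-filter

# Location of the job package that YARN distributes to the containers.
yarn.package.path=hdfs://namenode:8020/samza/page-view-filter-dist.tar.gz

# Number of containers to request (older releases used yarn.container.count).
job.container.count=2

# The processing logic and the stream it consumes, written as <system>.<stream>.
task.class=com.example.PageViewFilterTask
task.inputs=kafka.page-views

# Definition of the "kafka" system referenced in task.inputs.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.bootstrap.servers=localhost:9092
```

In this sketch, the job factory setting is what ties the job to YARN, while the task and system settings describe which stage of the processing graph the job implements and which stream feeds it.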
Samza uses YARN to manage deployment, fault tolerance, logging, resource isolation, security, and locality. Together with Kafka and YARN, Samza provides a complete framework in which the overall execution is divided into stages, with each stage represented by a Samza job. This is how these components come together:
- Samza Client uses YARN to execute Samza jobs
- YARN starts and supervises one or more Samza containers
- Samza Job instances...