In this chapter, we'll walk through the process of downloading and running Apache Spark. We'll first see how to run it in local mode on a single computer, and then we'll run it in cluster mode. We'll also see Spark's core abstraction for data manipulation, the resilient distributed dataset (RDD). Finally, we'll dive into an abstraction built on RDDs called DStreams (or discretized streams), which are the basis of Spark Streaming, the core topic of this chapter.
This chapter is written for readers who are new to Spark, but it does not focus on Spark's data science capabilities; instead, it targets data engineering and data architecture.
In this chapter, we will learn:
- Spark in local mode
- Spark core concepts
- Resilient distributed datasets
- Spark in cluster mode
- Spark Streaming