The soul of Spark is the resilient distributed dataset (RDD). Spark has four design goals: keep data in memory (unlike Hadoop MapReduce, which writes intermediate results to disk), distribute data across a cluster, be fault tolerant, and be fast and efficient.
Fault tolerance is achieved, in part, by recording the lineage of operations applied to each chunk of data, so that lost partitions can be recomputed rather than restored from backups. Efficiency is achieved by parallelizing operations across all members of the cluster, and performance by minimizing data replication between them.
A fundamental concept in Spark is that there are only two types of operations we can do on an RDD:
- Transformations: A new RDD is created from the original, which is left unchanged; examples include map, filter, union, intersection, sortBy, join, and coalesce
- Actions: A plain value is returned to the driver instead of a new RDD; examples include count, collect, and first
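The distinction can be sketched with a toy RDD-like class. This is a minimal illustration, not real Spark (a real RDD is partitioned across a cluster and its transformations are evaluated lazily); the method names mirror Spark's RDD API, but the `ToyRDD` class itself is hypothetical:

```python
class ToyRDD:
    """Toy stand-in for an RDD: illustrative only, not the Spark API."""

    def __init__(self, data):
        self._data = list(data)

    # Transformations: each returns a NEW ToyRDD; the original is untouched.
    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, p):
        return ToyRDD(x for x in self._data if p(x))

    def union(self, other):
        return ToyRDD(self._data + other._data)

    # Actions: each returns a plain value rather than a new RDD.
    def count(self):
        return len(self._data)

    def collect(self):
        return list(self._data)

    def first(self):
        return self._data[0]


rdd = ToyRDD(range(1, 6))            # holds [1, 2, 3, 4, 5]
doubled = rdd.map(lambda x: x * 2)   # transformation: new RDD, original intact
big = doubled.filter(lambda x: x > 4)
print(big.collect())                 # action → [6, 8, 10]
print(rdd.count())                   # original RDD still has 5 elements → 5
```

Chaining transformations and finishing with an action, as above, is the typical shape of a Spark program.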
People are right when they say that computer science is mathematics in a costume. As we've already seen, in functional programming, functions are first-class citizens; the equivalent in mathematics is...