Spark is a distributed computing framework that uses in-memory primitives to process data from a data store. Because it keeps working datasets in memory across operations, it is well suited to iterative workloads such as machine learning algorithms. Spark connects easily to a variety of data stores, including HDFS, Cassandra, and Amazon S3.
There are several companies that use Spark for big data processing. The complete list of companies and their use cases is available at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
Spark has two components:
- SparkContext: a master service that connects to a cluster manager and acquires resources for the Executor services.
- Executors: services on the worker nodes that run the application's tasks.
For cluster management, Spark supports YARN, Apache Mesos, and a built-in standalone cluster manager.
In this section, we'll discuss how Spark integrates with YARN and how you can submit Spark applications to a Hadoop-YARN cluster.
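As a sketch of what such a submission looks like, the `spark-submit` script can target YARN directly. The configuration path, resource sizes, and the example jar location below are assumptions (the jar path in particular varies between Spark versions); `HADOOP_CONF_DIR` must point at the cluster's configuration so that `spark-submit` can locate the ResourceManager.

```shell
# Submit the bundled SparkPi example to a Hadoop-YARN cluster.
# HADOOP_CONF_DIR (or YARN_CONF_DIR) must reference the cluster config;
# the path shown here is an assumption for illustration.
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 1g \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
```

In `cluster` deploy mode the driver runs inside a YARN container; with `--deploy-mode client` the driver stays on the submitting machine, which is often more convenient for interactive debugging.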