Understanding data sources in Spark applications
Spark can connect to many different data sources, including files and both SQL and NoSQL databases. Some of the more popular data sources include files (CSV, JSON, Parquet, Avro), MySQL, MongoDB, HBase, and Cassandra.
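As a rough illustration of the file-based sources mentioned above, the following Scala sketch writes a tiny DataFrame out as CSV, JSON, and Parquet and then reads each copy back through spark.read. It assumes Spark 2.x or later with the spark-sql module on the classpath and a local master. The object name DataSourceSketch, the sample rows, and the temporary output directory are hypothetical and chosen only for this example.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical example object; not taken from the book's code bundle.
object DataSourceSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val spark = SparkSession.builder()
      .appName("DataSourceSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Temporary directory so the sketch does not depend on existing files.
    val outDir = Files.createTempDirectory("datasource-sketch").toString

    // Made-up sample data used only for illustration.
    val people = Seq(("Ann", 34), ("Raj", 41)).toDF("name", "age")

    // Write the same rows in three file formats.
    people.write.mode("overwrite").option("header", "true").csv(s"$outDir/people_csv")
    people.write.mode("overwrite").json(s"$outDir/people_json")
    people.write.mode("overwrite").parquet(s"$outDir/people_parquet")

    // Read each format back; only the reader method and options change.
    val fromCsv = spark.read
      .option("header", "true")      // first line holds column names
      .option("inferSchema", "true") // sample the data to guess column types
      .csv(s"$outDir/people_csv")
    val fromJson = spark.read.json(s"$outDir/people_json")
    val fromParquet = spark.read.parquet(s"$outDir/people_parquet")

    fromCsv.printSchema()
    fromJson.show()
    fromParquet.show()

    spark.stop()
  }
}
```

Database sources such as MySQL are typically reached in a similar way through spark.read.format("jdbc") with connection options, but that requires a running database and a JDBC driver, so it is left out of this sketch.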
In addition, Spark can connect to special-purpose engines and data sources, such as Elasticsearch, Apache Kafka, and Redis. These engines enable specific functionality in Spark applications, such as search, streaming, and caching. For example, Redis enables deployment of cached machine learning models in high-performance applications. We discuss Redis-based application deployment further in Chapter 12, Spark SQL in Large-Scale Application Architectures. Kafka is extremely popular in Spark streaming applications, and we cover Kafka-based streaming applications in more detail in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 12, Spark SQL in Large-Scale Application Architectures. The DataSource API enables connectivity...