Chapter 2. Using Spark SQL for Processing Structured and Semistructured Data
In this chapter, we will familiarize you with using Spark SQL with different types of data sources and data storage formats. Spark provides easy-to-use, standard structures (that is, RDDs and DataFrames/Datasets) for working with both structured and semistructured data. We cover some of the data sources most commonly used in big data applications, such as relational databases, NoSQL databases, and files (CSV, JSON, Parquet, and Avro). Spark also allows you to define and use custom data sources. A series of hands-on exercises in this chapter will enable you to use Spark with different types of data sources and data formats.
In this chapter, you will learn about the following topics:
- Understanding data sources in Spark applications
- Using JDBC to work with relational databases
- Using Spark with MongoDB (NoSQL database)
- Working with JSON data
- Using Spark with Avro and Parquet datasets