Chapter 3. Using Spark SQL for Data Exploration
In this chapter, we introduce Spark SQL as a tool for exploratory data analysis. We cover preliminary techniques for computing basic statistics, identifying outliers, and visualizing, sampling, and pivoting data. A series of hands-on exercises will enable you to use Spark SQL, along with tools such as Apache Zeppelin, to develop an intuition about your data.
We will look at the following topics:
- What is Exploratory Data Analysis (EDA)?
- Why is EDA important?
- Using Spark SQL for basic data analysis
- Visualizing data with Apache Zeppelin
- Sampling data with Spark SQL APIs
- Using Spark SQL for creating pivot tables