Chapter 4. Using Spark SQL for Data Munging
In this code-intensive chapter, we present key data munging techniques used to transform raw data into a format usable for analysis. We start with some general data munging steps that apply in a wide variety of scenarios. Then, we shift our focus to specific types of data, including time-series data and text, and to the data preprocessing steps for Spark MLlib-based machine learning pipelines. We will use several Datasets to illustrate these techniques.
In this chapter, we will cover the following topics:
- Introducing data munging
- Exploring data munging techniques
- Combining data using joins
- Munging textual data
- Munging time-series data
- Dealing with variable-length records
- Preparing data for machine learning pipelines