In the previous chapter, we covered the core data processing functionality of Spark. In this chapter, we focus on data science with Spark, applied to a real data problem. In this chapter, you will learn about the following topics:
How to share variables across a cluster's nodes
How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, which greatly simplifies preprocessing
How to handle missing data in the dataset
Which algorithms are available out of the box in Spark for feature engineering, and how to use them in a real-world scenario
Which learners are available and how to measure their performance in a distributed environment
How to run cross-validation for hyperparameter optimization on a cluster