In this chapter, we have covered Spark SQL, the DataFrame API, the Dataset API, the Catalyst optimizer, the nuances of SparkSession, and creating a DataFrame, manipulating a DataFrame, and converting a DataFrame to an RDD and vice versa, before providing examples of working with DataFrames. This is by no means a complete reference for Spark SQL, but it is a good starting point for people planning to embark on the journey into Spark via the SQL route. We have also seen that you can use whichever API you prefer without worrying about performance, as the Catalyst optimizer will choose an efficient execution plan regardless.
The next chapter covers one of my favorite topics - Spark MLlib. Spark provides a rich API for predictive modeling, and the use of Spark MLlib is growing every day. We'll look at the basics of machine learning before giving you an insight into how the Spark framework supports predictive analytics. We'll cover topics from building a machine-learning pipeline, feature...