In this chapter, we discussed the origin of DataFrames and how Spark SQL provides a SQL interface on top of them. DataFrame-based computations often run faster than the equivalent RDD-based computations, and the simple SQL-like interface makes that performance easy to use. We also looked at the various APIs for creating and manipulating DataFrames, and dug deeper into the more sophisticated aggregation features, including groupBy, Window, rollup, and cube. Finally, we looked at joining datasets and the types of joins available, such as inner, outer, and cross joins.
In Chapter 7, Real-Time Analytics with Apache Spark, we will explore the exciting world of real-time data processing and analytics.