Understanding DataFrame/Dataset APIs
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has a view called a DataFrame, which is not strongly typed and is essentially a Dataset of Row objects.
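The difference between the typed and untyped views can be sketched in Scala as follows. This is a minimal illustration, assuming a local Spark session; the Person case class, the sample rows, and the helper names typedAdults and untypedAdultNames are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical domain class, used only for this illustration.
case class Person(name: String, age: Int)

object DatasetVsDataFrame {

  // Functional operation on the typed view: the lambda receives a Person,
  // so a misspelled field name is rejected by the compiler.
  def typedAdults(people: Dataset[Person]): Dataset[Person] =
    people.filter(p => p.age >= 21)

  // Relational operation on the untyped view: columns are referenced by
  // name, so a misspelled column is only detected when the query is analyzed.
  def untypedAdultNames(people: DataFrame): DataFrame =
    people.filter(col("age") >= 21).select("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetVsDataFrame")
      .master("local[*]") // assumption: run locally for demonstration
      .getOrCreate()
    import spark.implicits._

    // Strongly typed: every element is a Person.
    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // Untyped view of the same data: DataFrame is an alias for Dataset[Row].
    val df: DataFrame = people.toDF()

    typedAdults(people).show()
    untypedAdultNames(df).show()

    spark.stop()
  }
}
```

Both helpers express the same filter. The typed version trades some flexibility for compile-time checking, while the untyped version works with any DataFrame that happens to have an age and a name column.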
Spark SQL applies structured views to data that comes from different source systems and is stored in different data formats. Structured APIs, such as the DataFrame/Dataset API, allow developers to write their programs against a high-level API. These APIs let them focus on the "what" rather than the "how" of the data processing required.
Even though applying a structure can limit what can be expressed, in practice, structured APIs can accommodate the vast majority of computations required in application development. Also, it is these very limitations (imposed by structured APIs) that present several of the main optimization opportunities.
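As a rough sketch of what this looks like in practice, the following Scala example states a query declaratively and then asks Spark to print the plans it derived. It assumes a local Spark session; the sales data, the column names, the threshold of 60, and the helper avgLargeSales are hypothetical stand-ins chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object DeclarativeQuery {

  // The "what": average amount per region, counting only amounts above 60.
  // Nothing here says how to scan, shuffle, or aggregate the data.
  def avgLargeSales(sales: DataFrame): DataFrame =
    sales
      .filter(col("amount") > 60)
      .groupBy("region")
      .agg(avg("amount").as("avg_amount"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeclarativeQuery")
      .master("local[*]") // assumption: run locally for demonstration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory rows standing in for an external data source.
    val sales = Seq(("east", 100.0), ("west", 250.0), ("east", 50.0))
      .toDF("region", "amount")

    val result = avgLargeSales(sales)

    // Print the logical and physical plans Spark derived from the query.
    result.explain(true)
    result.show()

    spark.stop()
  }
}
```

Because the filter and aggregation are written as column expressions rather than arbitrary functions, Spark can inspect them when building the plan. An opaque lambda would express the same logic but would give the optimizer much less to work with.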
In the next section, we will explore encoders and their role in efficient...