Chapter 1
Introduction to Spark Distributed Processing
Section 9
Introduction to SQL, Datasets, and DataFrames
Before we look at how each of these works with Spark, let us first define the terms. A Dataset is a distributed collection that carries additional metadata about the structure of the data it stores. A DataFrame is a Dataset that organizes its data into named columns. DataFrames can be built from different sources, such as JSON, XML, and databases. This section covers each of them in detail. For further information on the MovieLens datasets, see https://grouplens.org/datasets/movielens/