In this section, we'll look at the second schema-based format, Parquet. The following topics will be covered:
- Saving data in Parquet format
- Loading Parquet data
- Testing
Parquet is a columnar format: data is stored column-wise rather than row-wise, unlike the JSON, CSV, plain-text, and Avro formats we saw earlier.
Columnar storage matters for big data processing because queries that read only a few columns can skip the rest, reducing I/O and speeding up scans. In this section, we will focus on adding Parquet support to Spark, saving data to the filesystem, reloading it, and then testing the round trip. The API resembles Avro's in that it also gives you a parquet method, but the underlying implementation is slightly different.
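The save-reload-test workflow described above can be sketched as follows. This is a minimal illustration, not the book's exact code: the `User` case class, the output path `./parquet-users`, and the local-mode master are all assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object ParquetRoundTrip {
  // Hypothetical record type used only for this sketch.
  case class User(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-round-trip")
      .master("local[*]") // assumption: running locally for the example
      .getOrCreate()
    import spark.implicits._

    val users = Seq(User("alice", 30), User("bob", 25)).toDS()

    // Save in Parquet format; on disk the data is laid out column-wise.
    users.write.mode("overwrite").parquet("./parquet-users")

    // Reload the same path and verify the round trip preserved the records.
    val reloaded = spark.read.parquet("./parquet-users").as[User]
    assert(reloaded.collect().toSet == users.collect().toSet)

    spark.stop()
  }
}
```

Note that `write.parquet` produces a directory of part files rather than a single file; `read.parquet` accepts that directory path directly.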
In the build.sbt file, the Avro format required adding an external dependency, but Parquet support already ships with Spark itself. So, Parquet is the way to...
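To make the dependency contrast concrete, a build.sbt might look like the sketch below. The module names are the commonly used ones, but the version numbers are illustrative assumptions, not taken from the book:

```scala
// build.sbt sketch; version numbers are an assumption for illustration.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.5.1", // Parquet support is built in
  "org.apache.spark" %% "spark-avro" % "3.5.1"  // external module, needed only for Avro
)
```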