Using Spark with Parquet files
Apache Parquet is a columnar storage format used in many big data applications in the Hadoop ecosystem. Parquet supports very efficient compression and encoding schemes that can significantly boost the performance of such applications. In this section, we show how easily you can read Parquet files directly into a standard Spark SQL DataFrame.
Here, we take the reviewsDF created previously from the JSON-formatted Amazon reviews and write it out in the Parquet format. We use coalesce(1) to produce a single output file:
scala> reviewsDF.filter("overall < 3").coalesce(1).write.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet")
In the next step, we create a DataFrame from the Parquet file using just one statement:
scala> val reviewsParquetDF = spark.read.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet/part-00000-3b512935-ec11-48fa-8720-e52a6a29416b...
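Note that the reader also accepts the output directory itself, so you do not need to know the exact part-file name. The following sketch, assuming the same directory path used above, reads the directory and sanity-checks the result:
scala> val reviewsParquetDF = spark.read.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet")
scala> reviewsParquetDF.printSchema()
scala> reviewsParquetDF.count()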