Apache Parquet is a columnar storage format originally designed for the Hadoop ecosystem. Traditional row-based storage formats are optimized to process one record at a time, which can make them slow for analytical workloads. Parquet instead serializes and stores data by column, enabling better storage efficiency, compression, predicate pushdown, and bulk sequential access across large datasets - exactly the kind of workload that suits Spark!
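To make the row- versus column-oriented distinction concrete, here is a minimal pure-Python sketch (this is an illustration of the layout idea only, not Parquet's actual on-disk format): reading a single column from a columnar layout is one contiguous access, whereas a row layout forces you to visit every record.

```python
# Row-oriented layout: each record is stored together, so extracting one
# column means touching every record in turn.
rows = [("cat", 1), ("dog", 2), ("bird", 3)]
ages_from_rows = [r[1] for r in rows]  # must visit every row tuple

# Column-oriented layout: each column is stored contiguously, so reading
# one column is a single sequential pass and the other columns are
# never touched at all.
columns = {"name": ["cat", "dog", "bird"], "age": [1, 2, 3]}
ages_from_columns = columns["age"]  # one contiguous read

assert ages_from_rows == ages_from_columns
```

This is why a columnar store can answer "scan only the `age` column" queries without paying the I/O cost of the columns it does not need.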
Because Parquet applies compression and encoding on a per-column basis, it is particularly well suited to datasets like this CSV data, especially those with low-cardinality fields, and file sizes can see huge reductions even when compared to Avro.
+--------------------------+--------------+
|                 File Type|  Size (bytes)|
+--------------------------+--------------+
|    20160101020000.gkg.csv|      20326266|
|   20160101020000.gkg.avro|      13557119|
|20160101020000.gkg.parquet|       6567110|
|20160101020000.gkg.csv.bz2|       4028862|
...