Choosing the right file types for analytical queries
In the previous section, we discussed the three file formats—Avro, Parquet, and ORC—in detail. Analytical workloads are read-heavy, so a format optimized for fast reads is the better fit. Naturally, column-oriented formats such as Parquet and ORC fit the bill, since a query that touches only a few columns can skip the rest of the data entirely.
Based on the five core areas we compared earlier (read performance, write performance, compression, schema evolution, and splittability for parallel processing) and your choice of processing technology, such as Hive or Spark, you can select either ORC or Parquet.
For example, consider the following:
- If you have Hive- or Presto-based workloads, go with ORC.
- If you have Spark- or Drill-based workloads, go with Parquet.
Now that you understand the different types of file formats available and the ones to use for analytical workloads, let's move on to the next topic: designing for efficient querying.