Storage benefits of different file types
Storage formats are a way to define how data is stored in a file. Hadoop doesn't have a default file format, but it supports multiple file formats for storing data. Some of the common storage formats for Hadoop are as follows:
- Text files
- Sequence files
- Parquet files
- Record columnar (RC) files
- Optimized row columnar (ORC) files
- Avro files
Choosing the right file format provides significant advantages, such as the following:
- Optimized performance while reading and writing data
- Schema evolution support (allows us to change the attributes in a dataset over time)
- Higher compression, resulting in less storage space being required
- Splittable files (files can be read in parts)
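To make the last point concrete, here is a minimal sketch of what "read in parts" can mean for a newline-delimited text file. It uses only the Python standard library and is not Hadoop code: the helper names `make_splits` and `read_split`, the 16-byte split size, and the sample records are all assumptions made for illustration. Each reader is handed a byte range; a reader that does not start at byte 0 skips ahead to the next line boundary, and every reader finishes the line that crosses the end of its range. Under that convention each record should be read exactly once, even though the byte ranges cut through the middle of lines.

```python
import os
import tempfile


def make_splits(file_size, split_size):
    """Cut a file of file_size bytes into (start, end) byte ranges."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Return the complete lines owned by the byte range starting at `start`."""
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start != 0:
            # Discard up to the next newline: the previous split's reader
            # is responsible for the line that crosses this boundary.
            f.readline()
        # Read every line that begins at or before `end`, even if it runs past it.
        while f.tell() <= end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line.rstrip(b"\n").decode("utf-8"))
    return lines


# Hypothetical sample data: ten small comma-separated records.
records = [f"{i},user_{i},{i * 10}" for i in range(10)]

with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("\n".join(records).encode("utf-8") + b"\n")
    path = tmp.name

try:
    # A deliberately tiny split size so that boundaries fall inside lines.
    splits = make_splits(os.path.getsize(path), split_size=16)
    per_split = [read_split(path, start, end) for start, end in splits]
finally:
    os.remove(path)

# Concatenating the output of every split should give back the original records.
recovered = [line for chunk in per_split for line in chunk]
print(len(splits), "splits;", "all records recovered:", recovered == records)
```

Because each call to `read_split` needs only the file path and a byte range, the splits could be handed to separate workers and processed in parallel. A file compressed as a single stream with a non-splittable codec does not allow this kind of independent access, which is one reason splittability matters when choosing a format.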
Let's focus on columnar storage formats, which are widely used in big data applications because of the way they store data and the way SQL engines can query it. The columnar format is very useful when a subset of data...