Big data processing involves data representation either in storage or in transit over the network. Compact representation, fast transformations, extensibility, and backward compatibility are desired properties of a data representation. Some key takeaways from this chapter related to data representation are as follows:
Hadoop provides inbuilt serialization/deserialization mechanisms through the Writable interface. Writable classes are serialized more compactly than objects serialized with standard Java serialization.
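The compactness point can be sketched without Hadoop on the classpath: a Writable writes only its raw field values through a DataOutput, whereas Java's ObjectOutputStream also records a class descriptor with every stream. The IntPair-style pair of ints below is a hypothetical example type, not a Hadoop class, and the writable-style method only mimics the Writable write(DataOutput) contract.

```java
import java.io.*;

// Sketch: why Writable-style serialization is compact. A Writable writes
// only its field values as raw bytes; Java serialization adds metadata.
public class WritableSketch {

    // Mimics the Writable.write(DataOutput) contract for a pair of ints.
    static byte[] writableBytes(int first, int second) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(first);   // 4 bytes
        out.writeInt(second);  // 4 bytes
        out.flush();
        return buf.toByteArray(); // 8 bytes total, no class metadata
    }

    // Same two ints through standard Java serialization.
    static byte[] javaSerializedBytes(int first, int second) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(new int[] { first, second }); // stream header + class descriptor + data
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("writable-style:     " + writableBytes(1, 2).length + " bytes");
        System.out.println("java serialization: " + javaSerializedBytes(1, 2).length + " bytes");
    }
}
```

Running this shows the writable-style encoding at exactly 8 bytes, while the Java-serialized form is several times larger; the gap is per-record, which is why it matters at MapReduce scale.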
Avro is a flexible and extensible data serialization framework. It serializes data in binary and is supported by Hadoop, MapReduce, Pig, and Hive.
Avro provides dynamic typing, eliminating the need for code generation. The schema can be stored with the data and read by any subsystem.
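The idea of storing the schema with the data can be illustrated with a small self-contained sketch. This is not the real Avro API (Avro uses a JSON schema and its own binary encoding, read through classes such as GenericRecord); the comma-separated field list and long-only fields here are simplifying assumptions made purely for illustration.

```java
import java.io.*;
import java.util.*;

// Conceptual sketch of Avro's key idea: the schema travels with the
// binary data, so a reader that has never seen the record type can
// still decode it field by field, with no generated code.
public class SchemaWithData {

    // Encode: schema first (a comma-separated field list here; Avro
    // uses JSON), then the field values in schema order.
    static byte[] encode(String schema, Map<String, Long> record) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(schema);
        for (String field : schema.split(","))
            out.writeLong(record.get(field));
        out.flush();
        return buf.toByteArray();
    }

    // Decode generically into a map, mirroring Avro's dynamic typing:
    // the reader learns the field names from the embedded schema.
    static Map<String, Long> decode(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String schema = in.readUTF();
        Map<String, Long> record = new LinkedHashMap<>();
        for (String field : schema.split(","))
            record.put(field, in.readLong());
        return record;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> rec = new LinkedHashMap<>();
        rec.put("id", 42L);
        rec.put("ts", 1700000000L);
        byte[] bytes = encode("id,ts", rec);
        System.out.println(decode(bytes));
    }
}
```

Because the schema is embedded, any subsystem that receives the bytes can reconstruct the record, which is the property that lets Hadoop, Pig, and Hive all read the same Avro files.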
Compression techniques trade off speed against storage savings. Hadoop supports many compression codecs along this tradeoff spectrum, from fast codecs with modest ratios to slower codecs with better ratios. Compression is a very important optimization parameter for big data processing.
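The tradeoff can be demonstrated with the JDK's own Deflater (the algorithm behind Hadoop's DeflateCodec and GzipCodec) by compressing the same data at its fastest and strongest levels; the sample data below is an arbitrary choice for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch of the speed/ratio tradeoff Hadoop codecs expose: the same
// input compressed at Deflater.BEST_SPEED vs Deflater.BEST_COMPRESSION.
public class CompressionTradeoff {

    // Compress data at the given level and return the compressed size.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] tmp = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(tmp);
            buf.write(tmp, 0, n);
        }
        deflater.end();
        return buf.size();
    }

    public static void main(String[] args) {
        // Repetitive sample data so the levels produce visibly different sizes.
        byte[] data = "hadoop data representation ".repeat(10_000).getBytes();
        System.out.println("input:            " + data.length + " bytes");
        System.out.println("BEST_SPEED:       " + compressedSize(data, Deflater.BEST_SPEED) + " bytes");
        System.out.println("BEST_COMPRESSION: " + compressedSize(data, Deflater.BEST_COMPRESSION) + " bytes");
    }
}
```

The stronger level yields output no larger than the fast level but costs more CPU per byte, which is exactly the dial Hadoop's codec choice turns: fast codecs suit intermediate map output, stronger ones suit cold archival data.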