Book Image

Learning Hadoop 2

By : Gerald Turkington, GABRIELE MODENA
Book Image

Learning Hadoop 2

By: Gerald Turkington, GABRIELE MODENA

Overview of this book

Table of Contents (18 chapters)
Learning Hadoop 2
About the Authors
About the Reviewers

Storing data

Until now, we introduced the architecture of HDFS and how to programmatically store and retrieve data using the command-line tools and the Java API. In the examples seen until now, we have implicitly assumed that our data was stored as a text file. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created to address both the requirements of MapReduce processing—for instance, we want data to be splittable—and to satisfy the need to model both structured and unstructured data. Currently, a lot of focus has been dedicated to better capture the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.

Serialization and Containers

When talking about file formats, we are assuming two types of scenarios, which are as follows:

  • Serialization: we want to encode data structures...