This chapter has given a whistle-stop tour of storage on a Hadoop cluster. In particular, we covered:
The high-level architecture of HDFS, the main filesystem used in Hadoop
How HDFS works under the covers and, in particular, its approach to reliability
How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
An overview of the command-line tools used to access HDFS
The API for filesystems in Hadoop and how, at the code level, HDFS is just one implementation of a more flexible filesystem abstraction
How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
The file formats in which data is most frequently stored in Hadoop and some of their particular use cases
In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process...