The parallelism and scalability of the MapReduce computing paradigm are greatly influenced by the underlying filesystem. HDFS is the default filesystem in most Hadoop distributions. It automatically splits files into fixed-size blocks and stores them, replicated, across the cluster. This block-placement information is exposed to the MapReduce engine, which can then schedule tasks close to the data, minimizing data movement over the network.
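As a rough illustration of the chunking step (a sketch only, not Hadoop's actual implementation; the `BlockSplitter` class and `blockLengths` method are hypothetical names), a file can be modeled as a sequence of fixed-size blocks plus a short tail block, using HDFS's current default block size of 128 MB:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: models how a filesystem like HDFS chunks a file
// into fixed-size blocks. The class and method names are made up for
// this example; HDFS's real block management is far more involved.
public class BlockSplitter {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default

    // Returns the length of each block for a file of the given size.
    // Every block is full-size except possibly the last one.
    static List<Long> blockLengths(long fileSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long len = Math.min(BLOCK_SIZE, remaining);
            blocks.add(len);
            remaining -= len;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
        List<Long> blocks = blockLengths(300L * 1024 * 1024);
        System.out.println(blocks.size() + " blocks");
    }
}
```

Each such block is then replicated (three copies by default) to independent nodes, and it is the per-block host list that lets the scheduler co-locate map tasks with their input.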
However, there are many use cases where HDFS may not be ideal. In this chapter, we will look at the following topics:
- The strengths and drawbacks of HDFS when compared to POSIX filesystems.
- Hadoop's support for other filesystems, including Amazon's cloud storage service, Simple Storage Service (S3). Hadoop can read files from and write files to S3 directly.
- The extensibility features of HDFS. Extending the framework can be of two kinds: by providing...
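To give a flavor of the S3 support mentioned above: with the s3a connector shipped in recent Hadoop releases (property names vary across Hadoop versions and connectors, so treat this as a sketch), credentials can be supplied in `core-site.xml`:

```xml
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With this in place, MapReduce jobs can address objects with URIs such as `s3a://my-bucket/path/to/file` (the bucket name here is illustrative), in the same way they would address `hdfs://` paths.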