Summary
HDFS suits MapReduce workloads well, but its sequential access pattern and lack of POSIX compliance make it tedious to work with in some situations. Hadoop lets its users extend HDFS or substitute drop-in replacements. The key takeaways from this chapter are as follows:
Several implementations extend HDFS or act as drop-in replacements for it. CephFS, MapRFS, GPFS from IBM, and the Cassandra-based filesystem from DataStax are some examples.
An interface to the Amazon S3 storage service is available out of the box in Hadoop, in two forms: a native S3 filesystem interface and a block-storage filesystem interface.
Extending Hadoop to incorporate other filesystems is done by extending the FileSystem abstract base class. The FSDataInputStream and FSDataOutputStream objects are used to wrap the input and output streams of the underlying filesystem, respectively.
The security and access control mechanisms of the underlying filesystem can...
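The wrapping pattern described for FileSystem, FSDataInputStream, and FSDataOutputStream can be sketched without a Hadoop dependency. The sketch below is a simplified stand-in, not the Hadoop API: SketchFileSystem, InMemoryFileSystem, and FileSystemSketchDemo are hypothetical names, paths are plain strings, and an in-memory map plays the role of the underlying filesystem. It relies only on the fact that Hadoop's FSDataInputStream and FSDataOutputStream build on java.io's DataInputStream and DataOutputStream, which is why those classes stand in for them here. A real implementation would also have to supply seekable streams and the remaining abstract methods of FileSystem.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's FileSystem abstract base class.
abstract class SketchFileSystem {
    // Plays the role of open(): hands back a wrapped input stream.
    abstract DataInputStream open(String path) throws IOException;

    // Plays the role of create(): hands back a wrapped output stream.
    abstract DataOutputStream create(String path) throws IOException;
}

// Hypothetical backend: a map from path to bytes acts as the underlying filesystem.
class InMemoryFileSystem extends SketchFileSystem {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    DataInputStream open(String path) throws IOException {
        byte[] data = files.get(path);
        if (data == null) {
            throw new FileNotFoundException(path);
        }
        // Wrap the backend's raw input stream, as FSDataInputStream would.
        return new DataInputStream(new ByteArrayInputStream(data));
    }

    @Override
    DataOutputStream create(String path) {
        // The raw stream publishes its bytes to the map when it is closed.
        ByteArrayOutputStream raw = new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                files.put(path, toByteArray());
            }
        };
        // Wrap the backend's raw output stream, as FSDataOutputStream would.
        return new DataOutputStream(raw);
    }
}

class FileSystemSketchDemo {
    // Writes text through the wrapped output stream, then reads it back.
    static String roundTrip(SketchFileSystem fs, String path, String text) throws IOException {
        try (DataOutputStream out = fs.create(path)) {
            out.writeUTF(text);
        }
        try (DataInputStream in = fs.open(path)) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(new InMemoryFileSystem(), "/demo/greeting.txt", "hello"));
    }
}
```

Client code in this sketch depends only on the abstract class, so a different backend can be substituted without changing roundTrip. That is the property that lets Hadoop jobs run against filesystems other than HDFS.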