What are Hadoop and HDFS?
The Hadoop Distributed File System (HDFS) is a block-based distributed filesystem designed to store huge amounts of data reliably on a cluster of commodity hardware. HDFS works by creating a virtual layer on top of the native filesystem of each machine, a layer that spans every node in the cluster. It stores files by splitting them into coarse-grained blocks (for example, 128 MB). Because Hadoop is meant to handle huge amounts of data, HDFS is optimized for large files: large blocks keep metadata overhead low and allow data to be read in long sequential streams rather than many small requests, so HDFS performs best when files are large. At write time, HDFS partitions large files into blocks and distributes those blocks across the nodes of the cluster. This enables parallel reads with very high aggregate throughput. Multiple copies of each block are also stored to provide reliability and fault tolerance...
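To make the block layout concrete, the following is a minimal sketch using Hadoop's Java FileSystem API: it writes a file to HDFS and then asks the NameNode how that file was split into blocks and where the replicas live. The NameNode address (hdfs://namenode:9000) and the file path are assumptions for illustration, not values from this text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");        // hypothetical file path

        // Write some data; HDFS transparently splits it into blocks as it is written.
        try (FSDataOutputStream out = fs.create(file, true)) {
            for (int i = 0; i < 1_000_000; i++) {
                out.writeBytes("line " + i + "\n");
            }
        }

        // Ask the NameNode how the file was laid out across the cluster.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());

        // Each BlockLocation describes one block and the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}

Running a program like this against a real cluster would show the file divided into blocks of the configured size (for example, 128 MB), each reported with the list of DataNodes that hold its replicas, which is exactly the partitioning and replication behavior described above.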