Apache Hadoop is an open source platform for developing and deploying big data applications. It was initially developed at Yahoo! based on the MapReduce and Google File System papers published by Google. Over the past few years, Hadoop has become the flagship big data platform.
In this section, we will discuss the key components of a Hadoop cluster.
Hadoop Common is the base library on which the other Hadoop modules are built. It provides an abstraction over OS and filesystem operations so that Hadoop can be deployed on a variety of platforms.
Commonly known as HDFS, the Hadoop Distributed File System is a scalable, distributed, fault-tolerant filesystem. HDFS acts as the storage layer of the Hadoop ecosystem. It allows the sharing and storage of data and application code among the various nodes in a Hadoop cluster.
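As a rough sketch of how HDFS is used in practice, the `hdfs dfs` command-line client exposes familiar filesystem operations against the cluster. The paths and filenames below are illustrative, not taken from a real deployment:

```shell
# Create a directory in HDFS (path is illustrative)
hdfs dfs -mkdir -p /user/alice/logs

# Copy a local file into HDFS; it is split into blocks and
# replicated across DataNodes automatically
hdfs dfs -put access.log /user/alice/logs/

# List and read the file back from any node in the cluster
hdfs dfs -ls /user/alice/logs
hdfs dfs -cat /user/alice/logs/access.log
```

Because replication is handled by HDFS itself, application code and tools on any node see the same namespace and data without coordinating copies manually.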
The following key assumptions were made while designing HDFS:
- It should be deployable on a cluster of commodity hardware.
- Hardware...