The Hadoop Distributed File System (HDFS) is a subproject of the Apache Hadoop project. It is designed to store large data files reliably in a distributed manner. HDFS uses a master-slave architecture and is designed to run on low-cost hardware. It is a distributed file system that provides high-throughput data access across a distributed network, and it also provides APIs to manage its file system. To handle node failures, HDFS replicates file blocks across multiple Hadoop cluster nodes, thereby avoiding data loss when a node fails. HDFS stores its metadata and application data separately. Let's understand its architecture.
HDFS, being a distributed file system, must satisfy the following major objectives to be effective:
Handling large chunks of data
High availability, and handling hardware failures seamlessly
Streaming access to its data
Scalability to perform better with addition of hardware
Durability with no loss of data in spite of failures
Portability across different types of hardware/software
Data partitioning across multiple nodes in a cluster
HDFS satisfies most of these goals effectively. The following diagram depicts the system architecture of HDFS. Let's understand each of the components in detail.
All the metadata related to HDFS is stored on the NameNode. Besides storing metadata, the NameNode is the master node that coordinates activities among DataNodes, such as data replication across DataNodes and the naming system (filenames, their disk locations, and so on). The NameNode stores the mapping of blocks to DataNodes. In a Hadoop cluster, there can be only one active NameNode. The NameNode regulates access to its file system through HDFS-based APIs to create, open, edit, and delete HDFS files. The data structure for storing file information is inspired by a UNIX-like file system. Each block is indexed, and its index node (inode) mapping is kept in memory (RAM) for faster access. The NameNode is a multithreaded process and can serve multiple clients at a time.
Any transaction is first recorded in the journal, and the journal file is flushed to disk before the response is sent back to the client. If there is an error while flushing the journal to a disk, the NameNode simply excludes that storage directory and moves on with the others. The NameNode shuts itself down if no storage directory is available.
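The journaling behavior described above can be sketched as follows. This is an illustrative model only, with hypothetical names; it is not Hadoop's actual implementation:

```python
# Sketch of write-ahead journaling: a transaction is flushed to every
# journal storage directory before the client is acknowledged; a directory
# that fails to flush is excluded, and the NameNode shuts down when none
# remain. (Hypothetical names, not Hadoop source code.)

class JournalStorageError(Exception):
    pass

class NameNodeJournal:
    def __init__(self, storage_dirs):
        self.storage_dirs = list(storage_dirs)  # healthy journal directories

    def record(self, transaction, flush):
        """Flush `transaction` to each directory; drop any that fail."""
        for d in list(self.storage_dirs):
            try:
                flush(d, transaction)
            except JournalStorageError:
                self.storage_dirs.remove(d)  # exclude the failed storage
        if not self.storage_dirs:
            raise RuntimeError("no journal storage available; shutting down")
        return "ack"  # the client is answered only after the flush
```

Note that the acknowledgment is returned only after the surviving directories have flushed, which is what makes the journal a reliable record of every transaction.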
Note
Safe mode: When a cluster starts, the NameNode enables its complete functionality only when the configured minimum percentage of blocks satisfies the minimum replication. Otherwise, it goes into safe mode. While the NameNode is in safe mode, it does not allow any modification to its file system. Safe mode can be turned off manually by running the following command:
$ hadoop dfsadmin -safemode leave
DataNodes are the slaves, deployed on all the nodes in a Hadoop cluster, and are responsible for storing the application's data. Each file uploaded to HDFS is split into multiple blocks, and these data blocks are stored on different DataNodes. The default file block size in HDFS is 64 MB. Each HDFS block is mapped to two files on the DataNode: one file holds the block data, while the other holds its checksum.
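The block layout described above can be sketched with a small calculation. The 64 MB default block size and the data/checksum file pairing come from the text; the function itself is illustrative, not Hadoop code:

```python
# How a file maps onto HDFS blocks, using the default 64 MB block size.

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size (64 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 200 MB upload becomes four blocks: 64 + 64 + 64 + 8 MB.
blocks = split_into_blocks(200 * 1024 * 1024)

# On a DataNode, each block is stored as two files: the block data and its
# checksum (file names here are illustrative).
files_per_block = [("blk_%d" % i, "blk_%d.meta" % i) for i, _ in blocks]
```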
When Hadoop starts, each DataNode connects to the NameNode, informing it of its availability to serve requests. At startup, the NameNode verifies the namespace ID and software version, and the DataNode sends a block report describing the data blocks it holds. During runtime, each DataNode periodically sends the NameNode a heartbeat signal confirming its availability; the default interval between two heartbeats is 3 seconds. The NameNode assumes a DataNode is unavailable if it does not receive a heartbeat within 10 minutes (by default); in that case, the NameNode replicates that DataNode's blocks to other DataNodes. A heartbeat carries information about available disk space, in-use space, data transfer load, and so on. The heartbeat provides the primary handshake between the NameNode and DataNodes; based on heartbeat information, the NameNode chooses the next block storage preference, thus balancing the load in the cluster. The NameNode uses heartbeat replies to send instructions to a DataNode, such as replicating blocks to other DataNodes, removing blocks, or requesting block reports.
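The heartbeat timing described above can be sketched as follows, using the 3-second interval and 10-minute timeout from the text. Class and method names are hypothetical, not Hadoop's actual classes:

```python
# DataNodes report every 3 seconds; the NameNode treats a node as dead if
# no heartbeat arrives for 10 minutes, and then re-replicates its blocks.

HEARTBEAT_INTERVAL = 3        # seconds between heartbeats (default)
DEAD_NODE_TIMEOUT = 10 * 60   # seconds of silence before a node is presumed dead

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # DataNode id -> timestamp of its last heartbeat

    def heartbeat(self, node_id, now):
        # A real heartbeat also carries free space, in-use space, load, etc.
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """DataNodes whose blocks must be re-replicated elsewhere."""
        return [n for n, t in self.last_seen.items()
                if now - t > DEAD_NODE_TIMEOUT]

monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=598)
# At t=601, dn1 has been silent for 601 s (> 600 s) and is presumed dead.
```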
Hadoop runs with a single NameNode, which makes it a single point of failure for the cluster. To address this issue and to provide a backup for the primary NameNode, the concept of the secondary NameNode was introduced in the Hadoop framework. While the NameNode is busy serving requests from various clients, the secondary NameNode maintains an up-to-date snapshot of the NameNode's in-memory state. These snapshots are also called checkpoints.
The secondary NameNode usually runs on a node other than the NameNode; this improves the durability of the NameNode. In addition to the secondary NameNode, Hadoop also supports the CheckpointNode, which creates periodic checkpoints instead of running a continuous sync of memory with the NameNode. If the NameNode fails, recovery is possible up to the last checkpoint snapshot taken by the CheckpointNode.
The Hadoop Distributed File System supports a traditional hierarchical file system (such as UNIX), where users can create their own home directories and subdirectories and store files in them. It allows users to create, rename, move, and delete files as well as directories. There is a root directory, denoted with a slash (/), and all subdirectories are created under this root directory, for example, /user/foo.
Note
The default data replication factor in HDFS is three; however, this can be changed by modifying the HDFS configuration files.
Data is organized in multiple data blocks, each 64 MB in size by default. Any new file created on HDFS first goes through a staging phase, where the data is cached on local storage until it reaches the size of one block; the client then sends a request to the NameNode. The NameNode, considering the load on its DataNodes, sends the client the destination block location and node ID, and the client flushes the data from the local file to the targeted DataNode. If the file is closed while some data remains unflushed, that data is likewise sent to a DataNode for storage. The data is then replicated to multiple nodes through a replication pipeline.
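The write path above can be sketched as follows, using the default block size and replication factor of three mentioned in this chapter. The function and node names are illustrative only; in a real pipeline each DataNode streams the block onward to the next rather than the client writing to all three:

```python
# Sketch of the replication pipeline: the client buffers data locally until
# one block (64 MB) accumulates, the NameNode picks target DataNodes, and
# the block is propagated to each target in turn.

BLOCK_SIZE = 64 * 1024 * 1024
REPLICATION = 3  # default replication factor

def replication_pipeline(block, targets, store):
    """Propagate `block` along the pipeline so every target holds a replica."""
    for node in targets:  # dn1 -> dn2 -> dn3, flattened into a loop here
        store.setdefault(node, []).append(block)
    return len(targets)

store = {}                                      # DataNode id -> blocks held
targets = ["dn1", "dn2", "dn3"][:REPLICATION]   # chosen by the NameNode
replicas = replication_pipeline("blk_0", targets, store)
```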
HDFS can be accessed in the following different ways:
Java APIs
Hadoop command line APIs (FS shell)
C/C++ language wrapper APIs
WebDAV (work in progress)
DFSAdmin (command set for administration)
RESTful APIs for HDFS
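The RESTful interface (WebHDFS) maps file system operations onto HTTP calls of the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The helper below only builds such a URL as an illustration; a real client would then issue the HTTP request, and the host and port shown are placeholders:

```python
# Minimal sketch of constructing a WebHDFS request URL. Only URL building
# is shown; no network call is made.

def webhdfs_url(host, port, path, op, **params):
    query = "&".join(["op=%s" % op] +
                     ["%s=%s" % (k, v) for k, v in sorted(params.items())])
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

# Read a file over REST (host name is a placeholder).
url = webhdfs_url("namenode.example.com", 50070, "/user/foo/data.txt", "OPEN")
```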
Similarly, to expose HDFS APIs to the rest of the language stacks, there is a separate project called HDFS-APIs (http://wiki.apache.org/hadoop/HDFS-APIs), based on the Thrift framework, which allows scalable cross-language service APIs for Perl, Python, Ruby, and PHP. Let's look at the operations supported by HDFS.
The syntax and examples below follow the standard Hadoop FS shell (`hadoop fs`):

| Hadoop operation | Syntax | Example |
|---|---|---|
| Importing a file from the local file store | `hadoop fs -put <localsrc> <dst>` | `hadoop fs -put data.txt /user/foo/data.txt` |
| Exporting a file to the local file store | `hadoop fs -get <src> <localdst>` | `hadoop fs -get /user/foo/data.txt data.txt` |
| Opening and reading a file | `hadoop fs -cat <path>` | `hadoop fs -cat /user/foo/data.txt` |
| Copying files in Hadoop | `hadoop fs -cp <src> <dst>` | `hadoop fs -cp /user/foo/data.txt /user/bar/` |
| Moving or renaming a file or directory | `hadoop fs -mv <src> <dst>` | `hadoop fs -mv /user/foo/data.txt /user/foo/old.txt` |
| Deleting a file or directory; recursive delete | `hadoop fs -rm <path>`, `hadoop fs -rmr <path>` | `hadoop fs -rmr /user/foo/tmp` |
| Getting the status of a file or directory, its size, and other information | `hadoop fs -stat [format] <path>` | `hadoop fs -stat /user/foo/data.txt` |
| Listing a file or directory | `hadoop fs -ls <path>` | `hadoop fs -ls /user/foo` |
| Testing different attributes of a file/directory | `hadoop fs -test -[ezd] <path>` | `hadoop fs -test -e /user/foo/data.txt` |
| Setting the owner of a file/directory | `hadoop fs -chown [-R] <owner>[:<group>] <path>` | `hadoop fs -chown foo /user/foo/data.txt` |
| Setting the replication factor | `hadoop fs -setrep [-R] <rep> <path>` | `hadoop fs -setrep -w 3 /user/foo/data.txt` |
| Changing the group permissions of a file | `hadoop fs -chgrp [-R] <group> <path>` | `hadoop fs -chgrp staff /user/foo/data.txt` |
| Getting the count of files and directories | `hadoop fs -count <path>` | `hadoop fs -count /user/foo` |