HDFS is a block-structured distributed filesystem designed to reliably store petabytes of data on clusters of commodity hardware. HDFS supports storing massive amounts of data and provides high-throughput access to that data. HDFS stores file data across multiple nodes with redundancy to ensure fault tolerance and high aggregate bandwidth.
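As an illustration, the degree of redundancy is controlled by the dfs.replication property, which can be set in the hdfs-site.xml configuration file; the value of 3 shown below is the HDFS default:
<property>
<name>dfs.replication</name>
<!-- Number of replicas kept for each data block -->
<value>3</value>
</property>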
HDFS is the default distributed filesystem used by Hadoop MapReduce computations. Hadoop supports data-locality-aware processing of the data stored in HDFS. The HDFS architecture consists mainly of a centralized NameNode that handles the filesystem metadata and DataNodes that store the real data blocks. HDFS data blocks are typically coarser grained than the blocks of a local filesystem and perform better with large streaming reads.
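For example, once the cluster set up later in this recipe is running, you can inspect how a file maps to blocks and where the block replicas are stored using the fsck utility; /user/foo/largefile.csv is only a placeholder path:
$ $HADOOP_HOME/bin/hdfs fsck /user/foo/largefile.csv -files -blocks -locations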
To set up HDFS, we first need to configure a NameNode and DataNodes, and then specify the DataNodes in the slaves file. When we start the NameNode, the startup script will start the DataNodes.
Installing HDFS directly using the Hadoop release artifacts, as described in this recipe, is recommended for development testing and advanced use cases only. For regular production clusters, we recommend using a packaged Hadoop distribution, as mentioned in the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe. Packaged Hadoop distributions make it much easier to install, configure, maintain, and update the components of the Hadoop ecosystem.
You can follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.
Set the JAVA_HOME environment variable to point to the Java installation.
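For example, on a Linux machine with an OpenJDK installation (adjust the path to match your own JDK location), this can be done as follows:
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64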
Now let's set up HDFS in the distributed mode:
Check whether you can log in to the local machine over SSH without being prompted for a password. If you are using multiple machines, also verify that you can log in to the master node in the same way, replacing <IPaddress> with its IP address:
$ ssh localhost
$ ssh <IPaddress>
Configuring password-less SSH
If the preceding commands return an error or ask for a password, create SSH keys by executing the following command (you may have to manually enable SSH beforehand, depending on your OS):
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster (an scp example is shown at the end of this section). Then add the SSH key to the ~/.ssh/authorized_keys file in each node. If the authorized_keys file does not exist, first create it with the correct permissions by running the following command; otherwise, skip to the cat command:
$ touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
Now with permissions set, add your key to the ~/.ssh/authorized_keys file:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Then you should be able to execute the following command successfully, without providing a password:
$ ssh localhost
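When setting up a multi-node cluster, one way to copy the public key to the other nodes, as mentioned above, is to use scp; <IPaddress> is a placeholder for each node's address. After copying, append the key to the ~/.ssh/authorized_keys file on that node as described above:
$ scp ~/.ssh/id_dsa.pub <IPaddress>:~/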
In each node, create a directory to store the HDFS data; we will refer to this directory as {HADOOP_DATA_DIR}. Create two subdirectories inside the data directory as {HADOOP_DATA_DIR}/data and {HADOOP_DATA_DIR}/name. Change the directory permissions to 755 by running the following command for each directory:
$ chmod -R 755 <HADOOP_DATA_DIR>
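For example, assuming /opt/hadoop_data is chosen as the data directory (a placeholder path), the directories could be created as follows:
$ mkdir -p /opt/hadoop_data/data /opt/hadoop_data/name
$ chmod -R 755 /opt/hadoop_data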
In the master node, add the IP addresses or hostnames of all the DataNodes, one entry per line, to the {HADOOP_HOME}/etc/hadoop/slaves file. When we start the NameNode, it will use this slaves file to start the DataNodes.
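For instance, a slaves file for a cluster with three DataNodes might look like the following; these IP addresses are placeholders:
192.168.1.101
192.168.1.102
192.168.1.103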
Add the following configuration to the {HADOOP_HOME}/etc/hadoop/core-site.xml file. Before adding the configuration, replace the {NAMENODE} string with the IP address of the master node:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://{NAMENODE}:9000/</value>
</property>
</configuration>
Add the following configuration to the hdfs-site.xml file in the {HADOOP_HOME}/etc/hadoop directory. Before adding the configuration, replace {HADOOP_DATA_DIR} with the directory you created earlier. Once both files have been edited, replicate the modified core-site.xml and hdfs-site.xml files to all the nodes in the cluster.
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<!-- Path to store namespace and transaction logs -->
<value>{HADOOP_DATA_DIR}/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<!-- Path to store data blocks in datanode -->
<value>{HADOOP_DATA_DIR}/data</value>
</property>
</configuration>
Format the HDFS NameNode by running the following command from the master node:
$ $HADOOP_HOME/bin/hdfs namenode -format
You will see the following line in the output after the successful completion of the previous command:
… 13/04/09 08:44:51 INFO common.Storage: Storage directory /…/dfs/name has been successfully formatted. ….
Start HDFS by running the following command from the master node:
$ $HADOOP_HOME/sbin/start-dfs.sh
This command will first start a NameNode in the master node. Then it will start the DataNode services in the machines mentioned in the slaves file. Finally, it'll start the secondary NameNode.
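If you want to confirm that the daemons started successfully, one option is to run the jps command, which ships with the JDK and lists the running Java processes; on the master node you should see NameNode and SecondaryNameNode processes, and on each slave node a DataNode process:
$ jps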
You can access the HDFS monitoring console at http://{NAMENODE}:50070/. Visit the monitoring console and verify whether you can see the HDFS startup page. Here, replace {NAMENODE} with the IP address of the node running the HDFS NameNode.
You can also get a report on the status of HDFS, including the registered DataNodes, by running the following command:
$ $HADOOP_HOME/bin/hadoop dfsadmin -report
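As an additional quick check, you can try a couple of basic filesystem operations against the new cluster; the /test directory used here is only an example:
$ $HADOOP_HOME/bin/hdfs dfs -mkdir /test
$ $HADOOP_HOME/bin/hdfs dfs -ls /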
Finally, you can shut down the HDFS cluster by running the following command from the master node:
$ $HADOOP_HOME/sbin/stop-dfs.sh