Setting up a Hadoop cluster is a step-by-step process. It is recommended to start with a single node setup and then extend it to the cluster mode. Apache Hadoop can be installed with three different types of setup:
Single node setup: In this mode, Hadoop can be set up on a single standalone machine. This mode is used by developers for evaluation, testing, basic development, and so on.
Pseudo distributed setup: Apache Hadoop can be set up on a single machine with a distributed configuration. In this setup, Apache Hadoop can run with multiple Hadoop processes (daemons) on the same machine. Using this mode, developers can do the testing for a distributed setup on a single machine.
Fully distributed setup: In this mode, Apache Hadoop is set up on a cluster of nodes, in a fully distributed manner. Typically, production-level setups use this mode for actively using the Hadoop computing capabilities.
Note
In Linux, Apache Hadoop can be set up through the root user, which makes it globally available, or as a separate user, which makes it available to only that user (Hadoop user), and the access can later be extended for other users. It is better to use a separate user with limited privileges to ensure that the Hadoop runtime does not have any impact on the running system.
Before setting up a Hadoop cluster, it is important to ensure that all prerequisites are addressed. Hadoop runs on the following operating systems:
All Linux flavors are supported for development as well as production.
In the case of Windows, Microsoft Windows 2008 onwards is supported. Apache Hadoop supports Windows from version 2.2 onwards; older versions of Hadoop have limited support through Cygwin.
Apache Hadoop requires the following software:
Java 1.6 and later versions are supported; however, some Java releases have known compatibility issues, so it is best to check Hadoop's Java compatibility wiki page at http://wiki.apache.org/hadoop/HadoopJavaVersions.
Secure shell (ssh) is needed to run start, stop, status, or other such scripts across a cluster. You may also consider using parallel-ssh (more information is available at https://code.google.com/p/parallel-ssh/) for connectivity.
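Before going further, it can help to confirm that both prerequisites are actually on the PATH. The following is a minimal sketch in POSIX shell; the function names have_command and check_prerequisites are hypothetical helpers introduced here for illustration, not part of Hadoop.

```shell
#!/bin/sh
# Return 0 if the given command is on the PATH, non-zero otherwise.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per command and return the number of
# commands that were not found (0 means everything is present).
check_prerequisites() {
    missing=0
    for cmd in "$@"; do
        if have_command "$cmd"; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example call (not executed here):
#   check_prerequisites java ssh
```

A return value of 0 means every listed command was found. To see which Java release is installed, run java -version and compare it with the compatibility page mentioned above.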
Apache Hadoop can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common/. Choose the release that fits your needs: the current stable release, the latest alpha/beta release, or a legacy stable version. You can download the binary package, or download the source, compile it on your OS, and then install it. Alternatively, install the Hadoop package with your operating system's package installer: apt-get/dpkg for Ubuntu/Debian, or rpm for Red Hat/Oracle Linux, from the respective sites. In the case of a cluster setup, this software should be installed on all the machines.
Apache Hadoop uses ssh to run its scripts on different nodes, so it is important that this ssh login works without prompting for a password. If you already have a key generated, then you can skip this step. To make ssh work without a password, run the following command:
$ ssh-keygen -t dsa
You can also use the RSA algorithm (see http://en.wikipedia.org/wiki/RSA_%28cryptosystem%29) instead of DSA (Digital Signature Algorithm) to create your ssh authorization key. (For more information about the differences between these two algorithms, visit http://security.stackexchange.com/questions/5096/rsa-vs-dsa-for-ssh-authentication-keys.) Keep the default file for saving the key, and do not enter a passphrase. Once the key has been generated successfully, the next step is to authorize the key by running the following command:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
This step adds your public key to the list of keys that ssh accepts for login, so the password prompt is bypassed, as shown in the following screenshot:
Once this step is complete, you can run ssh localhost to connect to your instance without a password. If you already have a key generated, you will get a prompt to overwrite it; in such a case, you can choose to overwrite it, or you can use the existing key and put it in the authorized_keys file.
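Reusing an existing key is easier when the authorization step is safe to repeat. The sketch below appends a public key to authorized_keys only if it is not already listed. The function name authorize_key is a hypothetical helper, and the file names in the closing comment assume an RSA key in the default location; recent OpenSSH releases disable DSA keys by default, so RSA is likely the safer choice there.

```shell
#!/bin/sh
# Append a public key to an authorized_keys file unless it is already there.
# $1: path to the public key file (for example ~/.ssh/id_rsa.pub)
# $2: path to the authorized_keys file (for example ~/.ssh/authorized_keys)
authorize_key() {
    pub_file=$1
    auth_file=$2

    # With its default StrictModes setting, sshd rejects key files that
    # other users can write to, so keep the directory and file private.
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # -x matches whole lines, -F treats the key as a fixed string,
    # -f reads the pattern (the key itself) from the public key file.
    if grep -qxF -f "$pub_file" "$auth_file"; then
        echo "key already authorized"
    else
        cat "$pub_file" >> "$auth_file"
        echo "key added"
    fi
}

# Example calls (not executed here); generate a key only if none exists:
#   [ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Running the function twice leaves a single copy of the key in the file, which avoids the duplicate entries that repeated use of cat with >> would create.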
Most of the Hadoop configuration is specified in the following configuration files, kept in the $HADOOP_HOME/etc/hadoop folder of the installation. $HADOOP_HOME is the place where Apache Hadoop has been installed. If you have installed the software by using the pre-built package installer as the root user, the configuration can be found at /etc/hadoop.
The file names marked in pink italicized letters will be modified while setting up your basic Hadoop cluster.
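Because the configuration folder depends on how Hadoop was installed, a small lookup can save some guesswork. This is a sketch only: find_hadoop_conf_dir is a hypothetical helper, and it assumes the lowercase directory names etc/hadoop and /etc/hadoop used by stock Apache Hadoop 2.x packages.

```shell
#!/bin/sh
# Print the Hadoop configuration directory.
# $1: installation directory (defaults to $HADOOP_HOME)
# $2: system-wide fallback used by package installs (defaults to /etc/hadoop)
find_hadoop_conf_dir() {
    home_dir=${1:-${HADOOP_HOME:-}}
    system_dir=${2:-/etc/hadoop}

    if [ -n "$home_dir" ] && [ -d "$home_dir/etc/hadoop" ]; then
        echo "$home_dir/etc/hadoop"
    elif [ -d "$system_dir" ]; then
        echo "$system_dir"
    else
        echo "no Hadoop configuration directory found" >&2
        return 1
    fi
}

# Example call (not executed here):
#   conf_dir=$(find_hadoop_conf_dir) && ls "$conf_dir"
```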
Now, let's start with the configuration of these files for the first Hadoop run. Open core-site.xml, and add the following entry in it:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
This snippet sets the default file system to HDFS, with the NameNode accepting client connections on port 9000 of localhost. Next, edit hdfs-site.xml and add the following entries:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
This sets the replication factor of the distributed file system to 1. Later, when you run Hadoop in the cluster configuration, you can change this replication count. The choice of replication factor varies from case to case, but if you are not sure about it, it is better to keep it at 3. This means that HDFS stores three copies of each block of a file.
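The replication factor directly multiplies the raw disk space a data set occupies. As a rough sketch, ignoring block rounding and other overheads, the arithmetic looks like this; raw_storage_needed is a hypothetical helper name, and the 100 GB figure is purely illustrative.

```shell
#!/bin/sh
# Raw disk space consumed by a data set once HDFS has replicated it.
# $1: logical data size (any unit, for example GB)
# $2: replication factor (the dfs.replication value)
raw_storage_needed() {
    echo $(( $1 * $2 ))
}

raw_storage_needed 100 1   # prints 100
raw_storage_needed 100 3   # prints 300
```

With a factor of 1, as in the single-machine configuration, the data takes no extra space but also has no redundancy.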
Let's start looking at the MapReduce configuration. Some applications such as Apache HBase use only HDFS for storage, and they do not rely on the MapReduce framework. This means that all they require is the HDFS configuration, and the next configuration can be skipped.
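For an HDFS-only setup of this kind, the two files edited so far could also be generated from the shell. The following is a minimal sketch: write_site_file is a hypothetical helper, and the files are written to a temporary scratch directory so that nothing in a real installation is overwritten. Copy them into your configuration folder only after reviewing them.

```shell
#!/bin/sh
# Write a Hadoop site file containing a single property.
# $1: target file, $2: property name, $3: property value
write_site_file() {
    cat > "$1" <<EOF
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Scratch directory (an assumption of this sketch, not a Hadoop location).
conf_dir=$(mktemp -d)

# The same values as in the snippets shown earlier.
write_site_file "$conf_dir/core-site.xml" fs.defaultFS hdfs://localhost:9000
write_site_file "$conf_dir/hdfs-site.xml" dfs.replication 1

echo "wrote configuration sketch to $conf_dir"
```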
Now, edit mapred-site.xml and add the following entries:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
This entry tells MapReduce to run its jobs on YARN. Further, modify yarn-site.xml with the following entries:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
This entry enables YARN to use the ShuffleHandler service with the NodeManager. Once the configuration is complete, we are ready to start Hadoop. Here are the default ports used by Apache Hadoop:
Particular | Default Port |
---|---|
HDFS Port | 9000/8020 |
NameNode – Web Application | 50070 |
Data Node | 50075 |
Secondary NameNode | 50090 |
Resource Manager Web Application | 8088 |
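These ports are what you type into a browser once the daemons are running. As an illustrative sketch, the lookup below maps a service to its default port and builds the matching web address. The short service names and the functions default_port and web_ui_url are hypothetical, and the ports are only the defaults listed above, which a site may override in its configuration.

```shell
#!/bin/sh
# Default ports from the table above, keyed by hypothetical short names.
default_port() {
    case "$1" in
        hdfs)                echo 9000 ;;   # 8020 is the other common default
        namenode-web)        echo 50070 ;;
        datanode)            echo 50075 ;;
        secondary-namenode)  echo 50090 ;;
        resourcemanager-web) echo 8088 ;;
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Build the URL of a web interface.
# $1: host name, $2: service name understood by default_port
web_ui_url() {
    port=$(default_port "$2") || return 1
    echo "http://$1:$port/"
}

web_ui_url localhost namenode-web          # prints http://localhost:50070/
web_ui_url localhost resourcemanager-web   # prints http://localhost:8088/
```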