Scaling Big Data with Hadoop and Solr, Second Edition

Setting up a Hadoop cluster is a step-by-step process. It is recommended to start with a single node setup and then extend it to the cluster mode. Apache Hadoop can be installed with three different types of setup:

Single node setup: In this mode, Hadoop can be set up on a single standalone machine. This mode is used by developers for evaluation, testing, basic development, and so on.
Pseudo distributed setup: Apache Hadoop can be set up on a single machine with a distributed configuration. In this setup, Apache Hadoop can run with multiple Hadoop processes (daemons) on the same machine. Using this mode, developers can do the testing for a distributed setup on a single machine.
Fully distributed setup: In this mode, Apache Hadoop is set up on a cluster of nodes, in a fully distributed manner. Typically, production-level setups use this mode for actively using the Hadoop computing capabilities.

Note

In Linux, Apache Hadoop can be set up through the root user, which makes it globally available, or as a separate user, which makes it available to only that user (Hadoop user), and the access can later be extended for other users. It is better to use a separate user with limited privileges to ensure that the Hadoop runtime does not have any impact on the running system.

Prerequisites

Before setting up a Hadoop cluster, it is important to ensure that all prerequisites are addressed. Hadoop runs on the following operating systems:

All Linux Flavors are supported for development as well as production.
In the case of Windows, Microsoft Windows 2008 onwards are supported. Apache Hadoop version 2.2 onwards support Windows. The older versions of Hadoop have limited support through Cygwin.

Apache Hadoop requires the following software:

Java 1.6 onwards are all supported; however, there are compatibility issues, so it is best to look at Hadoop's Java compatibility wiki page at http://wiki.apache.org/hadoop/HadoopJavaVersions.
Secure shell (ssh) is needed to run start, stop, status, or other such scripts across a cluster. You may also consider using parallel-ssh (more information is available at https://code.google.com/p/parallel-ssh/) for connectivity.

Apache Hadoop can be downloaded from http://www.apache.org/dyn/closer.cgi/Hadoop/common/. Make sure that you download and choose the correct release from different releases, that is, one that is a stable release, the latest beta/alpha release, or a legacy stable version. You can choose to download the package or download the source, compile it on your OS, and then install it. Using operating system package installer, install the Hadoop package. This software can be installed directly by using apt-get/dpkg for Ubuntu/Debian or rpm for Red Hat/Oracle Linux from the respective sites. In the case of a cluster setup, this software should be installed on all the machines.

Setting up ssh without passphrase

Apache Hadoop uses ssh to run its scripts on different nodes, it is important to make this ssh login happen without any prompt for password. If you already have a key generated, then you can skip this step. To make ssh work without a password, run the following commands:

$ ssh-keygen -t dsa

You can also use RSA-based encryption algorithm (link to know about RSA: http://en.wikipedia.org/wiki/RSA_%28cryptosystem%29) instead of DSA (Digital Signature Algorithm) for your ssh authorization key creation. (For more information about differences between these two algorithms, visit http://security.stackexchange.com/questions/5096/rsa-vs-dsa-for-ssh-authentication-keys. Keep the default file for saving the key, and do not enter a passphrase. Once the key generation is successfully complete, the next step is to authorize the key by running the following command:

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

This step will actually create an authorization key with ssh, bypassing the passphrase check as shown in the following screenshot:

Once this step is complete, you can ssh localhost to connect to your instance without password. If you already have a key generated, you will get a prompt to overwrite it; in such a case, you can choose to overwrite it or you can use the existing key and put it in the authorized_keys file.

Configuring Hadoop

Most of the Hadoop configuration is specified in the following configuration files, kept in the $HADOOP_HOME/etc/Hadoop folder of the installation. $HADOOP_HOME is the place where Apache Hadoop has been installed. If you have installed the software by using the pre-build package installer as the root user, the configuration can be found at /etc/Hadoop.

File Name	Description
`core-site.xml`	In this file, you can modify the default properties of Hadoop. This covers setting up different protocols for interaction, working directories, log management, security, buffers and blocks, temporary files, and so on.
`hdfs-site.xml`	This file stores the entire configuration related to HDFS. So, properties like DFS site address, data directory, replication factors, and so on are covered in these files.
`mapred-site.xml`	This file is responsible for handling the entire configuration related to the MapReduce framework. This covers the configuration for JobTracker and TaskTracker properties for Job.
`yarn-site.xml`	This file is required for managing YARN-related configuration. This configuration typically contains security/access information, proxy configuration, resource manager configuration, and so on.
`httpfs-site.xml`	Hadoop supports REST-based data transfer between clusters through an HttpFS server. This file is responsible for storing configuration related to the HttpFS server.
`fair-scheduler.xml`	This file contains information about user allocations and pooling information for the fair scheduler. It is currently under development.
`capacity-scheduler.xml`	This file is mainly used by the RM in Hadoop for setting up the scheduling parameters of job queues.
`Hadoop-env.sh` or `Hadoop-env.cmd`	All the environment variables are defined in this file; you can change any of the environments: namely the Java location, Hadoop configuration directory, and so on.
`mapred-env.sh` or `mapred-env.cmd`	This file contains the environment variables used by Hadoop while running MapReduce.
`yarn-env.sh or yarn-env.cmd`	This file contains the environment variables used by the YARN daemon that starts/stops the node manager and the RM.
`httpfs-env.sh` or `httpfs-env.cmd`	This file contains environment variables required by the HttpFS server.
`Hadoop-policy.xml`	This file is used to define various access control lists for Hadoop services. It controls who can use the Hadoop cluster for execution.
`Masters/slaves`	In this file, you can define the hostname for the masters and the slaves. The masters file lists all the masters, and the slaves file lists the slave nodes. To run Hadoop in the cluster mode, you need to modify these files to point to the respective master and slaves on all nodes.
`log4j.properties`	You can define various log levels for your instance; this is helpful while developing or debugging Hadoop programs. You can define levels for logging.
`common-logging.properties`	This file specifies the default logger used by Hadoop; you can override it to use your logger.

The file names marked in pink italicized letters will be modified while setting up your basic Hadoop cluster.

Now, let's start with the configuration of these files for the first Hadoop run. Open core-sites.xml, and add the following entry in it:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

This snippet tells the Hadoop framework to run inter-process communication on port 9000. Next, edit hdfs-site.xml and add the following entries:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

This tells HDFS to have the distributed file system's replication factor as 1. Later when you run Hadoop in the cluster configuration, you can change this replication count. The choice of replication factor varies from case to case, but if you are not sure about it, it is better to keep it as 3. This means that each document will have a replication of factor of 3.

Let's start looking at the MapReduce configuration. Some applications such as Apache HBase use only HDFS for storage, and they do not rely on the MapReduce framework. This means that all they require is the HDFS configuration, and the next configuration can be skipped.

Now, edit mapred-site.xml and add the following entries:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

This entry points to YARN as the MapReduce framework used. Further, modify yarn-site.xml with the following entries:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

This entry enables YARN to use the ShuffleHandler service with nodemanager. Once the configuration is complete, we are good to start the Hadoop. Here are the default ports used by Apache Hadoop:

Particular	Default Port
HDFS Port	9000/8020
NameNode – Web Application	50070
Data Node	50075
Secondary NameNode	50090
Resource Manager Web Application	8088

Scaling Big Data with Hadoop and Solr, Second Edition

By : Hrishikesh Vijay Karambelkar

Scaling Big Data with Hadoop and Solr, Second Edition

By: Hrishikesh Vijay Karambelkar

Overview of this book

Related Content you might be interested in

Current Title:

Scaling Big Data with Hadoop and Solr, Second Edition

Configuring Apache Hadoop

Note

Prerequisites

Setting up ssh without passphrase

Configuring Hadoop