For our exploration of Hadoop and Elasticsearch, we will use an Ubuntu-based host. However, you may opt to run any other Linux OS and set up Hadoop and Elasticsearch.
If you are already a Hadoop user and have Hadoop set up on your local machine, you may jump directly to the section, Setting up Elasticsearch.
Hadoop supports three cluster modes: the stand-alone mode, the pseudo-distributed mode, and the fully-distributed mode. For walking through the examples in this book, we will use the pseudo-distributed mode on a Linux operating system. This mode ensures that, without getting into the complexity of setting up many nodes, we mirror the components in such a way that they behave no differently from a real production environment. In the pseudo-distributed mode, each component runs in its own JVM process.
The examples in this book are developed and tested against Oracle Java 1.8. These examples should run fine with other distributions of Java 8 as well.
In order to set up Oracle Java 8, open the terminal and execute the following steps:
First, add the repository for Java 8 and update the package index with the following commands:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
Next, install Java 8 and configure the environment variables, as shown in the following command:
$ sudo apt-get install oracle-java8-set-default
Now, verify the installation as follows:
$ java -version
This should show output similar to the following; the details may vary a bit based on the exact version:
java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
To ensure that our ES-Hadoop environment is clean and isolated from the rest of the applications and to be able to manage security and permissions easily, we will set up a dedicated user. Perform the following steps:
First, add the hadoop group with the following command:
$ sudo addgroup hadoop
Then, add the eshadoop user to the hadoop group, as shown in the following command:
$ sudo adduser eshadoop hadoop
Finally, add the eshadoop user to the sudoers list by adding the user to the sudo group as follows:
$ sudo adduser eshadoop sudo
Now, you need to relogin as the eshadoop user to execute the remaining steps.
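If you prefer not to log out of your current session, switching to the new user with su achieves the same effect:
$ su - eshadoop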
In order to manage its nodes, Hadoop requires SSH access, so let's install and run SSH. Perform the following steps:
First, install ssh with the following command:
$ sudo apt-get install ssh
Then, generate a new SSH key pair using the ssh-keygen utility by using the following command:
$ ssh-keygen -t rsa -P '' -C [email protected]
Now, confirm the key generation by issuing the following command. It should display at least two files, id_rsa and id_rsa.pub. We just created an RSA key pair with an empty password so that Hadoop can interact with the nodes without the need to enter a passphrase:
$ ls -l ~/.ssh
To enable SSH access to your local machine, you need to add the newly generated public key to the list of authorized keys using the following command:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Finally, do not forget to test the password-less ssh using the following command:
$ ssh localhost
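If the login succeeds without prompting for a password, the key-based setup works. Keep in mind that ssh localhost opens a nested shell session; type the following to return to your original session before continuing:
$ exit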
Next, download Hadoop and extract it to /usr/local so that it is available to other users as well. Perform the following steps:
First, download the Hadoop tarball by running the following command:
$ wget http://ftp.wayne.edu/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Next, extract the tarball to the /usr/local directory with the following command:
$ sudo tar vxzf hadoop-2.6.0.tar.gz -C /usr/local
Now, rename the Hadoop directory using the following command:
$ cd /usr/local
$ sudo mv hadoop-2.6.0 hadoop
Finally, change the owner of all the files to the eshadoop user and the hadoop group with the following command:
$ sudo chown -R eshadoop:hadoop hadoop
The next step is to set up environment variables. You can do so by exporting the required variables to the .bashrc file for the user.
Open the .bashrc file using any editor of your choice, then add the following export declarations to set up our environment variables:
#Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
#Set Hadoop-related environment variables
export HADOOP_INSTALL=/usr/local/hadoop
#Add the bin and sbin directories to PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
#Set a few more Hadoop-related environment variables
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
Once you have saved the .bashrc file, you can relogin to have your new environment variables visible, or you can source the .bashrc file using the following command:
$ source ~/.bashrc
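To quickly confirm that the variables are in effect, echo one of them; it should print the Hadoop installation path we just configured:
$ echo $HADOOP_INSTALL
/usr/local/hadoop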
Now, we need to set up the JAVA_HOME environment variable in the hadoop-env.sh file that is used by Hadoop. You can find it in $HADOOP_INSTALL/etc/hadoop.
Next, change the JAVA_HOME path to reflect your Java installation directory. On my machine, it looks similar to the following:
$ export JAVA_HOME=/usr/lib/jvm/java-8-oracle
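If you prefer to make this change from the command line instead of an editor, a sed one-liner such as the following should work (a sketch; adjust the path if your Java installation directory differs):
$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh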
Now, let's relogin and confirm the configuration using the following command:
$ hadoop version
As you know, we will set up our Hadoop environment in a pseudo-distributed mode. In this mode, each Hadoop daemon runs in a separate Java process. The next step is to configure these daemons. So, let's switch to the following folder that contains all the Hadoop configuration files:
$ cd $HADOOP_INSTALL/etc/hadoop
The configuration of core-site.xml sets up the temporary directory for Hadoop and the default filesystem. In our case, the default filesystem refers to the NameNode (we use fs.defaultFS, the Hadoop 2.x name for the older fs.default.name property). Let's change the content of the <configuration> section of core-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/eshadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
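Since we pointed hadoop.tmp.dir at a directory under the eshadoop user's home, it is worth creating that directory up front so that Hadoop can write to it (an optional convenience step; Hadoop will usually create it on demand):
$ mkdir -p /home/eshadoop/hdfs/tmp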
Now, we will configure the replication factor for HDFS files. To set the replication to 1, change the content of the <configuration> section of hdfs-site.xml so that it looks similar to the following code:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Note
We will run Hadoop in the pseudo-distributed mode. In order to do this, we need to configure the YARN resource manager. YARN handles the resource management and scheduling responsibilities in the Hadoop cluster so that the data processing and data storage components can focus on their respective tasks.
Next, edit yarn-site.xml to configure the auxiliary service name and class, as shown in the following code:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Hadoop provides mapred-site.xml.template, which you can copy to mapred-site.xml (the copy command is shown after the following snippet). Change the content of its <configuration> section to the following code; the mapreduce.framework.name property ensures that the MapReduce jobs run on YARN as opposed to running in-process locally:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
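If you haven't created mapred-site.xml yet, you can copy it from the template that ships with Hadoop; assuming you are still in the configuration directory from the earlier step:
$ cp mapred-site.xml.template mapred-site.xml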
We have now configured all the Hadoop daemons, including HDFS, YARN, and MapReduce. You may already be aware that HDFS relies on a NameNode and DataNodes. The NameNode holds the storage-related metadata, whereas the DataNodes store the actual data in the form of blocks. When you set up a Hadoop cluster, you must format the NameNode before you can start using HDFS. We can do so with the following command:
$ hdfs namenode -format
Note
If you are already using HDFS with data on your DataNodes, do not format the NameNode unless you know what you are doing. When you format the NameNode, you lose all the storage metadata, such as the information about how the blocks are distributed among the DataNodes. This means that although you haven't physically removed the data from the DataNodes, that data becomes inaccessible to you. Therefore, it is always good to remove the data on the DataNodes as well when you format the NameNode.
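In our pseudo-distributed setup, the DataNode blocks live under the hadoop.tmp.dir directory we configured earlier, so a clean re-format could look similar to the following (a sketch that assumes the default directory layout under /home/eshadoop/hdfs/tmp; double-check the path before deleting anything):
$ rm -rf /home/eshadoop/hdfs/tmp/*
$ hdfs namenode -format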
Now, we have all the prerequisites in place and all the Hadoop daemons configured. In order to run our first MapReduce job, we need all the required Hadoop daemons running.
Let's start with HDFS using the following command. This command starts the NameNode, SecondaryNameNode, and DataNode daemons:
$ start-dfs.sh
The next step is to start the YARN resource manager using the following command (YARN will start the ResourceManager and NodeManager daemons):
$ start-yarn.sh
If the preceding two commands successfully started HDFS and YARN, you should be able to check the running daemons using the jps tool (this tool lists the running JVM processes on your machine):
$ jps
If everything worked successfully, you should see the following services running:
13386 SecondaryNameNode
13059 NameNode
13179 DataNode
17490 Jps
13649 NodeManager
13528 ResourceManager
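As an additional check, you can open the web interfaces that these daemons expose. With the default Hadoop 2.x ports (assuming you haven't overridden them in your configuration), the NameNode UI is available at http://localhost:50070 and the ResourceManager UI at http://localhost:8088.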