Setting up Hadoop for Elasticsearch


For our exploration of Hadoop and Elasticsearch, we will use an Ubuntu-based host. However, you may use any other Linux OS to set up Hadoop and Elasticsearch.

If you are already a Hadoop user and have Hadoop set up on your local machine, you may jump directly to the section, Setting up Elasticsearch.

Hadoop supports three cluster modes: the standalone mode, the pseudo-distributed mode, and the fully-distributed mode. The pseudo-distributed mode on a Linux operating system is sufficient to walk through the examples of this book. Without the complexity of setting up many nodes, this mode mirrors a real production environment closely: each component runs in its own JVM process on a single machine.

Setting up Java

The examples in this book are developed and tested against Oracle Java 1.8. These examples should run fine with other distributions of Java 8 as well.

In order to set up Oracle Java 8, open the terminal and execute the following steps:

  1. First, add the repository for Java 8 and update the package lists with the following commands:

    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    
  2. Next, install Java 8 and configure the environment variables, as shown in the following command:

    $ sudo apt-get install oracle-java8-set-default
    
  3. Now, verify the installation as follows:

    $ java -version
    

    This should show output similar to the following; it may vary a bit based on the exact Java version:

    java version "1.8.0_60"
    Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
    Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
    

Setting up a dedicated user

To ensure that our ES-Hadoop environment stays clean and isolated from other applications, and to make it easy to manage security and permissions, we will set up a dedicated user. Perform the following steps:

  1. First, add the hadoop group with the following command:

    $ sudo addgroup hadoop
    
  2. Then, add the eshadoop user to the hadoop group, as shown in the following command:

    $ sudo adduser eshadoop hadoop
    
  3. Finally, add the eshadoop user to the sudoers list by adding the user to the sudo group as follows:

    $ sudo adduser eshadoop sudo
    

Now, you need to log in again as the eshadoop user to execute the remaining steps.
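
If you prefer to stay in the same terminal, switching to the new user works just as well; this is a minimal sketch assuming the eshadoop user was created as shown above:

$ su - eshadoop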

Installing SSH and setting up the certificate

In order to manage nodes, Hadoop requires SSH access, so let's install and run SSH. Perform the following steps:

  1. First, install ssh with the following command:

    $ sudo apt-get install ssh 
    
  2. Then, generate a new SSH key pair using the ssh-keygen utility, as shown in the following command:

    $ ssh-keygen -t rsa -P ''  -C [email protected]
    

    Note

    Accept the default setting when asked to Enter file in which to save the key. By default, the key pair is generated under the /home/eshadoop/.ssh folder.

  3. Now, confirm the key generation by issuing the following command. It should list at least two files, id_rsa and id_rsa.pub. We just created an RSA key pair with an empty passphrase so that Hadoop can interact with the nodes without prompting for a passphrase:

    $ ls -l ~/.ssh
    
  4. To enable SSH access to your local machine, add the newly generated public key to the list of authorized keys using the following command:

    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    
  5. Finally, do not forget to test the passwordless SSH login using the following command:

    $ ssh localhost
    

Downloading Hadoop

Download Hadoop and extract it to /usr/local so that it is available to other users as well. Perform the following steps:

  1. First, download the Hadoop tarball by running the following command:

    $ wget http://ftp.wayne.edu/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    
  2. Next, extract the tarball to the /usr/local directory with the following command:

    $ sudo tar vxzf hadoop-2.6.0.tar.gz -C /usr/local
    

    Note

    Note that extracting the archive to /usr/local makes it available to other users as well, provided that appropriate permissions are set on the directory.

  3. Now, rename the Hadoop directory using the following command:

    $ cd /usr/local
    $ sudo mv hadoop-2.6.0 hadoop
    
  4. Finally, change the owner of all the files to the eshadoop user and the hadoop group with the following command:

    $ sudo chown -R eshadoop:hadoop hadoop
    

Setting up environment variables

The next step is to set up the environment variables. You can do so by adding the required export statements to the user's .bashrc file.

Open the .bashrc file in any editor of your choice, then add the following export declarations to set up our environment variables:

#Set JAVA_HOME 
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

#Set Hadoop-related environment variable
export HADOOP_INSTALL=/usr/local/hadoop

#Add the bin and sbin directories to PATH
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin

#Set a few more Hadoop-related environment variables
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Once you have saved the .bashrc file, you can log in again to make your new environment variables visible, or you can source the .bashrc file using the following command:

$ source ~/.bashrc
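
To confirm that the new variables are visible in the current shell, you can echo one of them and check that the hadoop binary is on the PATH; this is only a sanity check based on the exports above, not a required step:

$ echo $HADOOP_INSTALL
/usr/local/hadoop
$ which hadoop
/usr/local/hadoop/bin/hadoop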

Configuring Hadoop

Now, we need to set up the JAVA_HOME environment variable in the hadoop-env.sh file used by Hadoop. You can find this file in $HADOOP_INSTALL/etc/hadoop.

Next, change the JAVA_HOME path to reflect your Java installation directory. On my machine, it looks similar to the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
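
If you prefer to make this change from the command line, a sed one-liner such as the following also works; it assumes the Oracle Java path shown above and the HADOOP_INSTALL variable set earlier:

$ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-8-oracle|' $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh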

Now, let's log in again and confirm the configuration using the following command:

$ hadoop version
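
The first line of the output should report the Hadoop release you extracted, similar to the line below; the remaining lines show build details such as the source revision and checksum:

Hadoop 2.6.0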

As mentioned earlier, we will set up our Hadoop environment in the pseudo-distributed mode, in which each Hadoop daemon runs in a separate Java process. The next step is to configure these daemons. Let's switch to the folder that contains all the Hadoop configuration files:

$ cd $HADOOP_INSTALL/etc/hadoop

Configuring core-site.xml

The configuration of core-site.xml will set up the temporary directory for Hadoop and the default filesystem. In our case, the default filesystem refers to the NameNode. Let's change the content of the <configuration> section of core-site.xml so that it looks similar to the following code:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/eshadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
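
The temporary directory configured above does not exist on a fresh system, and Hadoop must be able to write to it. Assuming the path from the snippet, you can create it beforehand as follows:

$ mkdir -p /home/eshadoop/hdfs/tmp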

Configuring hdfs-site.xml

Now, we will configure the replication factor for HDFS files. To set the replication to 1, change the content of the <configuration> section of hdfs-site.xml so that it looks similar to the following code:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Note

We will run Hadoop in the pseudo-distributed mode. In order to do this, we need to configure the YARN resource manager. YARN handles the resource management and scheduling responsibilities in the Hadoop cluster so that the data processing and data storage components can focus on their respective tasks.

Configuring yarn-site.xml

Edit yarn-site.xml to set the auxiliary service name and class, as shown in the following code:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Configuring mapred-site.xml

Hadoop provides mapred-site.xml.template, which you can copy or rename to mapred-site.xml (an example command follows the snippet below). Change the content of the <configuration> section to the following code; this will ensure that MapReduce jobs run on YARN as opposed to running in-process locally:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
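
If you have not created mapred-site.xml yet, copying the template is the simplest way to do so (a copy keeps the original template intact); the path assumes the installation directory used earlier:

$ cp $HADOOP_INSTALL/etc/hadoop/mapred-site.xml.template $HADOOP_INSTALL/etc/hadoop/mapred-site.xml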

Formatting the distributed filesystem

We have already configured all the Hadoop daemons, covering HDFS, YARN, and MapReduce. You may already be aware that HDFS relies on a NameNode and DataNodes. The NameNode holds the storage-related metadata, whereas the DataNodes store the real data in the form of blocks. When you set up a Hadoop cluster, you must format the NameNode before you can start using HDFS. We can do so with the following command:

$ hadoop namenode -format

Note

If you are already using the DataNodes of an existing HDFS cluster, do not format the NameNode unless you know what you are doing. When you format the NameNode, you lose all the storage metadata, including the information about how blocks are distributed among the DataNodes. This means that although the data is not physically removed from the DataNodes, it becomes inaccessible to you. Therefore, it is good practice to remove the data on the DataNodes whenever you format the NameNode.
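
With the single-node configuration used in this chapter, the DataNode stores its blocks under the hadoop.tmp.dir set in core-site.xml, so clearing that directory before reformatting removes the stale blocks. This sketch assumes the /home/eshadoop/hdfs/tmp path configured earlier; adjust it if you changed that setting:

$ rm -rf /home/eshadoop/hdfs/tmp/*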

Starting Hadoop daemons

Now, we have all the prerequisites in place and all the Hadoop daemons configured. In order to run our first MapReduce job, we need these daemons running.

Let's start with HDFS using the following command. This command starts the NameNode, SecondaryNameNode, and DataNode daemons:

$ start-dfs.sh

The next step is to start YARN using the following command (this starts the ResourceManager and NodeManager daemons):

$ start-yarn.sh

If the preceding two commands successfully started HDFS and YARN, you should be able to check the running daemons using the jps tool (this tool lists the running JVM processes on your machine):

$ jps

If everything worked successfully, you should see the following services running:

13386 SecondaryNameNode
13059 NameNode
13179 DataNode
17490 Jps
13649 NodeManager
13528 ResourceManager