Book Image

Big Data Analytics with Hadoop 3

By : Sridhar Alla
Book Image

Big Data Analytics with Hadoop 3

By: Sridhar Alla

Overview of this book

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples. Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with the open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you get acquainted with all this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. In addition to this, you will understand how to use Hadoop to build analytics solutions on the cloud and an end-to-end pipeline to perform big data analysis using practical use cases. By the end of this book, you will be well-versed with the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and get insight effortlessly.
Table of Contents (18 chapters)
Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface
4
Scientific Computing and Big Data Analysis with Python and Hadoop
Index

Installing Hadoop 3 


In this section, we shall see how to install a single-node Hadoop 3 cluster on your local machine. In order to do this, we will be following the documentation given at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html.

This document gives us a detailed description of how to install and configure a single-node Hadoop setup in order to carry out simple operations using Hadoop MapReduce and the HDFS quickly.

Prerequisites

Java 8 must be installed for Hadoop to be run. If Java 8 does not exist on your machine, then you can download and install Java 8: https://www.java.com/en/download/.

The following will appear on your screen when you open the download link in the browser:

Downloading

Download the Hadoop 3.1 version using the following link: http://apache.spinellicreations.com/hadoop/common/hadoop-3.1.0/.

The following screenshot is the page shown when the download link is opened in the browser:

When you get this page in your browser, simply download the hadoop-3.1.0.tar.gz file to your local machine.

Installation

Perform the following steps to install a single-node Hadoop cluster on your machine:

  1. Extract the downloaded file using the following command:
tar -xvzf hadoop-3.1.0.tar.gz
  1. Once you have extracted the Hadoop binaries, just run the following commands to test the Hadoop binaries and make sure the binaries works on our local machine:
cd hadoop-3.1.0

mkdir input

cp etc/hadoop/*.xml input

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep input output 'dfs[a-z.]+'

cat output/*

If everything runs as expected, you will see an output directory showing some output, which shows that the sample command worked.

Note

A typical error at this point will be missing Java. You might want to check and see if you have Java installed on your machine and the JAVA_HOME environment variable set correctly.

Setup password-less ssh

Now check if you can ssh to the localhost without a passphrase by running a simple command, shown as follows:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Setting up the NameNode

Make the following changes to the configuration file etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Make the following changes to the configuration file etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
        <name>dfs.name.dir</name>
        <value><YOURDIRECTORY>/hadoop-3.1.0/dfs/name</value>
    </property>
</configuration>

Starting HDFS

Follow these steps as shown to start HDFS (NameNode and DataNode):

  1. Format the filesystem:
$ ./bin/hdfs namenode -format
  1. Start the NameNode daemon and the DataNode daemon:
$ ./sbin/start-dfs.sh

The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

  1. Browse the web interface for the NameNode; by default it is available at http://localhost:9870/.
  2. Make the HDFS directories required to execute MapReduce jobs:
$ ./bin/hdfs dfs -mkdir /user 
$ ./bin/hdfs dfs -mkdir /user/<username>
  1. When you're done, stop the daemons with the following:
$ ./sbin/stop-dfs.sh
  1. Open a browser to check your local Hadoop, which can be launched in the browser as http://localhost:9870/. The following is what the HDFS installation looks like:
  1. Clicking on the Datanodes tab shows the nodes as shown in the following screenshot:

Figure: Screenshot showing the nodes in the Datanodes tab

  1. Clicking on the logs will show the various logs in your cluster, as shown in the following screenshot:
  1. As shown in the following screenshot, you can also look at the various JVM metrics of your cluster components:
  1. As shown in the following screenshot, you can also check the configuration. This is a good place to look at the entire configuration and all the default settings:
  1. You can also browse the filesystem of your newly installed cluster, as shown in the following screenshot:

Figure: Screenshot showing the Browse Directory and how you can browse the filesystem in you newly installed cluster

At this point, we should all be able to see and use a basic HDFS cluster. But this is just a HDFS filesystem with some directories and files. We also need a job/task scheduling service to actually use the cluster for computational needs rather than just storage.

Setting up the YARN service

In this section, we will set up a YARN service and start the components needed to run and operate a YARN cluster:

  1. Start the ResourceManager daemon and the NodeManager daemon:

$ sbin/start-yarn.sh
  1. Browse the web interface for the ResourceManager; by default it is available at: http://localhost:8088/

  2. Run a MapReduce job

  3. When you're done, stop the daemons with the following:

$ sbin/stop-yarn.sh

The following is the YARN ResourceManager, which you can view by putting the URL http://localhost:8088/ into the browser:

Figure: Screenshot of YARN ResouceManager

The following is a view showing the queues of resources in the cluster, along with any applications running. This is also the place where you can see and monitor the running jobs:

Figure: Screenshot of queues of resources in the cluster

At this time, we should be able to see the running YARN service in our local cluster running Hadoop 3.1.0. Next, we will look at some new features in Hadoop 3.x.

Erasure Coding

EC is a key change in Hadoop 3.x promising a significant improvement in HDFS utilization efficiencies as compared to earlier versions where replication factor of 3 for instance caused immense wastage of precious cluster file system for all kinds of data no matter what the relative importance was to the tasks at hand. 

EC can be setup using policies and assigning the policies to directories in HDFS. For this, HDFS provides an ec subcommand to perform administrative commands related to EC:

hdfs ec [generic options]
    [-setPolicy -path <path> [-policy <policyName>] [-replicate]]
    [-getPolicy -path <path>]
    [-unsetPolicy -path <path>]
    [-listPolicies]
    [-addPolicies -policyFile <file>]
    [-listCodecs]
    [-enablePolicy -policy <policyName>]
    [-disablePolicy -policy <policyName>]
    [-help [cmd ...]]

The following are the details of each command:

  • [-setPolicy -path <path> [-policy <policyName>] [-replicate]]: Sets an EC policy on a directory at the specified path.
    • path: An directory in HDFS. This is a mandatory parameter. Setting a policy only affects newly created files, and does not affect existing files.
    • policyName: The EC policy to be used for files under this directory. This parameter can be omitted if a  dfs.namenode.ec.system.default.policy configuration is set. The EC policy of the path will be set with the default value in configuration.
    • -replicate: Apply the special REPLICATION policy on the directory, force the directory to adopt 3x replication scheme.
    • -replicate and -policy <policyName>: These are optional arguments. They cannot be specified at the same time.
  • [-getPolicy -path <path>]: Get details of the EC policy of a file or directory at the specified path.
  • [-unsetPolicy -path <path>]: Unset an EC policy set by a previous call to setPolicy on a directory. If the directory inherits the EC policy from an ancestor directory, unsetPolicy is a no-op. Unsetting the policy on a directory which doesn't have an explicit policy set will not return an error.
  • [-listPolicies]: Lists all (enabled, disabled and removed) EC policies registered in HDFS. Only the enabled policies are suitable for use with the setPolicy command.
  • [-addPolicies -policyFile <file>]: Add a list of EC policies. Please referetc/hadoop/user_ec_policies.xml.template for the example policy file. The maximum cell size is defined in propertydfs.namenode.ec.policies.max.cellsize with the default value 4 MB. Currently HDFS allows the user to add 64 policies in total, and the added policy ID is in range of 64 to 127. Adding policy will fail if there are already 64 policies added.
  • [-listCodecs]: Get the list of supported EC codecs and coders in system. A coder is an implementation of a codec. A codec can have different implementations, thus different coders. The coders for a codec are listed in a fall back order.
  • [-removePolicy -policy <policyName>]: It removes an EC policy
  • [-enablePolicy -policy <policyName>]: It enables an EC policy
  • [-disablePolicy -policy <policyName>]: It disables an EC policy

By using -listPolicies, you can list all the EC policies currently setup in your cluster along with the state of such policies whether they are ENABLED or DISABLED:

Lets test out EC in our cluster. First we will create directories in the HDFS shown as follows:

./bin/hdfs dfs -mkdir /user/normal
./bin/hdfs dfs -mkdir /user/ec

Once the two directories are created then you can set the policy on any path:

./bin/hdfs ec -setPolicy -path /user/ec -policy RS-6-3-1024k
Set RS-6-3-1024k erasure coding policy on /user/ec

Now copying any content into the /user/ec folder falls into the newly set policy.

Type the command shown as follows to test this:

./bin/hdfs dfs -copyFromLocal ~/Documents/OnlineRetail.csv /user/ec

The following screenshot shows the result of the copying, as expected the system complains as we don't really have a cluster on our local system enough to implement EC. But this should give us an idea of what is needed and how it would look:

Intra-DataNode balancer

While HDFS always had a great feature of balancing the data between the data nodes in the cluster, often this resulted in skewed disks within data nodes. For instance, if you have four disks, two disks might take the bulk of the data and the other two might be under-utilized. Given that physical disks (say 7,200 or 10,000 rpm) are slow to read/write, this kind of skewing of data results in poor performance. Using an intra-node balancer, we can rebalance the data amongst the disks.

Run the command shown in the following example to invoke disk balancing on a DataNode:

./bin/hdfs diskbalancer -plan 10.0.0.103

The following is the output of the disk balancer command:

Installing YARN timeline service v.2

As stated in the YARN timeline service v.2 section, v.2 always selects Apache HBase as the primary backing storage, since Apache HBase scales well even to larger clusters and continues to maintain a good read and write response time.

There are a few steps that need to be performed to prepare the storage for timeline service v.2:

  1. Set up the HBase cluster
  2. Enable the co-processor
  3. Create the schema for timeline service v.2

Each step is explained in more detail in the following sections.

Setting up the HBase cluster

The first step involves picking an Apache HBase cluster to use as the storage cluster. The version of Apache HBase that is supported with the timeline service v.2 is 1.2.6. The 1.0.x versions no longer work with timeline service v.2. Later versions of HBase have not been tested yet with the timeline service.

Simple deployment for HBase

If you are intent on a simple deploy profile for the Apache HBase cluster where the data loading is light but the data needs to persist across node comings and goings, you could consider the Standalone HBase over HDFS deploy mode. 

http://mirror.cogentco.com/pub/apache/hbase/1.2.6/

The following screenshot is the download link to HBase 1.2.6:

Download hbase-1.2.6-bin.tar.gz to your local machine. Then extract the HBase binaries:

tar -xvzf hbase-1.2.6-bin.tar.gz

The following is the content of the extracted HBase:

This is a useful variation on the standalone HBase setup and has all HBase daemons running inside one JVM but rather than persisting to the local filesystem, it persists to an HDFS instance. Writing to HDFS where data is replicated ensures that data is persisted across node comings and goings. To configure this standalone variant, edit your hbasesite.xml setting the hbase.rootdir to point at a directory in your HDFS instance but then set hbase.cluster.distributed to false

The following is the hbase-site.xml with the hdfs port 9000 for the local cluster we have installed mentioned as a property. If you leave this out there wont be a HBase cluster installed.

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
    </property>
</configuration>

Next step is to start HBase. We will do this by using start-hbase.sh script:

./bin/start-hbase.sh

The following screenshot shows the HBase cluster we just installed:

The following screenshot shows are more attributes of the HBase cluster setup showing versions, of various components:

Figure: Screenshot of attributes of the HBase cluster setup and the versions of different components

Once you have an Apache HBase cluster ready to use, perform the steps in the following  section.

Enabling the co-processor

In this version, the co-processor is loaded dynamically.

Copy the timeline service .jar to HDFS from where HBase can load it. It is needed for the flowrun table creation in the schema creator. The default HDFS location is /hbase/coprocessor.

For example:

hadoop fs -mkdir /hbase/coprocessor hadoop fs -put hadoop-yarn-server-timelineservice-hbase-3.0.0-alpha1-SNAPSHOT.jar /hbase/coprocessor/hadoop-yarn-server-timelineservice.jar

To place the JAR at a different location on HDFS, there also exists a YARN configuration setting called yarn.timeline-service.hbase.coprocessor.jar.hdfs.location, shown as follows:

<property>
  <name>yarn.timeline-service.hbase.coprocessor.jar.hdfs.location</name>
  <value>/custom/hdfs/path/jarName</value>
</property>

Create the timeline service schema using the schema creator tool. For this to happen, we also need to make sure the JARs are all found correctly:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/Users/sridharalla/hbase-1.2.6/lib/:/Users/sridharalla/hadoop-3.1.0/share/hadoop/yarn/timelineservice/

Once we have the classpath corrected, we can create the HBase schema/tables using a simple command, shown as follows:

./bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create -skipExistingTable

The following is the HBase schema created as a result of the preceding command:

Enabling timeline service v.2

The following are the basic configurations to start timeline service v.2:

<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,timeline_collector</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.timeline_collector.class</name>
  <value>org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService</value>
</property>

<property>
  <description> This setting indicates if the yarn system metrics is published by RM and NM by on the timeline service. </description>
  <name>yarn.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>

<property>
  <description>This setting is to indicate if the yarn container events are published by RM to the timeline service or not. This configuration is for ATS V2. </description>
  <name>yarn.rm.system-metrics-publisher.emit-container-events</name>
  <value>true</value>
</property>

Also, add the hbase-site.xml configuration file to the client Hadoop cluster configuration so that it can write data to the Apache HBase cluster you are using, or set yarn.timeline-service.hbase.configuration.file to the file URL pointing to hbase-site.xml for the same purpose of writing the data to HBase, for example:

<property>
  <description>This is an Optional URL to an hbase-site.xml configuration file. It is to be used to connect to the timeline-service hbase cluster. If it is empty or not specified, the HBase configuration will be loaded from the classpath. Else, they will override those from the ones present on the classpath. </description>
  <name>yarn.timeline-service.hbase.configuration.file</name>
  <value>file:/etc/hbase/hbase-site.xml</value>
</property>
Running timeline service v.2

Restart the ResourceManager as well as the node managers to pick up the new configuration. The collectors start within the resource manager and the node managers in an embedded manner.

The timeline service reader is a separate YARN daemon, and it can be started using the following syntax:

$ yarn-daemon.sh start timelinereader
Enabling MapReduce to write to timeline service v.2

To write MapReduce framework data to timeline service v.2, enable the following configuration in mapred-site.xml:

<property>
  <name>mapreduce.job.emit-timeline-data</name>
  <value>true</value>
</property>

The timeline service is still evolving so you should try it out only to test out the features and not in production, and wait for the more widely adopted version, which should be available sometime soon.