Hadoop Real-World Solutions Cookbook - Second Edition

By: Tanmay Deshpande

Overview of this book

Big data is now a core requirement: most organizations produce huge amounts of data every day. With the arrival of Hadoop-like tools, it has become easier to solve big data problems efficiently and at minimal cost. Grasping machine learning techniques will help you greatly in building predictive models and using this data to make the right decisions for your organization. Hadoop Real-World Solutions Cookbook gives readers insights into learning and mastering big data via recipes. The book not only clarifies most of the big data tools on the market but also provides best practices for using them. The recipes are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many more ecosystem tools. This real-world-solutions cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. The book covers the latest technologies, such as YARN and Apache Spark, in detail, and on completing it you will be able to consider yourself a big data expert. This guide is an invaluable tutorial if you are planning to implement a big data warehouse for your business.

Installing a single-node Hadoop Cluster


In this recipe, we are going to learn how to install a single-node Hadoop cluster, which can be used for development and testing.

Getting ready

To install Hadoop, you need a machine with a UNIX-like operating system installed on it. You can choose any well-known distribution, such as Red Hat, CentOS, Ubuntu, Fedora, or Amazon Linux (the last applies if you are using Amazon Web Services instances).

Here, we will be using the Ubuntu distribution for demonstration purposes.

How to do it...

Let's start installing Hadoop:

  1. First of all, you need to download the required installers from the Internet. Here, we need to download Java and Hadoop installers. The following are the links to do this:

    For the Java download, choose the latest version of the available JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

    You can also use OpenJDK instead of the Oracle JDK.

    For the Hadoop 2.7 download, go to

    http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz.
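
    If your machine has direct Internet access and wget is installed, you can also fetch the Hadoop archive from the command line, for example:

    wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz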

  2. We will first install Java. Here, I am using /usr/local as the installation directory and the root user for all installations. You can choose a directory of your choice.

    Extract the downloaded tar.gz file (the file name will vary depending on the JDK version you downloaded) like this:

    tar -xzf java-7-oracle.tar.gz
    

    Rename the extracted folder to the shorter name java instead of java-7-oracle; this makes the folder name easier to remember and type.
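
    Assuming the archive was unpacked in /usr/local and created a folder named java-7-oracle, the rename could look like this:

    cd /usr/local
    mv java-7-oracle java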

    Alternatively, you can install Java using the apt-get package manager if your machine is connected to the Internet:

    sudo apt-get update
    sudo apt-get install openjdk-7-jdk
    
  3. Similarly, we will extract and configure Hadoop. We will also rename the extracted folder for easier access. Here, we will extract Hadoop to the path /usr/local:

    tar -xzf hadoop-2.7.0.tar.gz
    mv hadoop-2.7.0 hadoop
    
  4. Next, in order to use Java and Hadoop from any folder, we would need to add these paths to the ~/.bashrc file. The contents of the file get executed every time a user logs in:

    cd ~
    vi .bashrc
    

    Once the file is open, append the following environment variable settings to it. These variables are used by Java and Hadoop at runtime:

    export JAVA_HOME=/usr/local/java
    export PATH=$PATH:$JAVA_HOME/bin
    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
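
    If you would rather not close and reopen the terminal right away, you can also reload the file in the current session:

    source ~/.bashrc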
  5. In order to verify that the installation is correct, close the terminal and restart it. Then, check whether the Java and Hadoop versions can be seen:

    $ java -version
    java version "1.7.0_45"
    Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
    Java HotSpot(TM) Server VM (build 24.45-b08, mixed mode)
    
    $ hadoop version
    Hadoop 2.7.0
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
    Compiled by jenkins on 2015-04-10T18:40Z
    Compiled with protoc 2.5.0
    From source with checksum a9e90912c37a35c3195d23951fd18f

    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.0.jar.

  6. Now that Hadoop and Java are installed and verified, we need to install ssh (Secure Shell) if it's not already available by default. If you are connected to the Internet, execute the following commands. SSH is used to secure data transfers between nodes:

    sudo apt-get install openssh-client
    sudo apt-get install openssh-server
    
  7. Once the ssh installation is done, we need to configure ssh to allow passwordless access to remote hosts. Note that even though we are installing Hadoop on a single node, we need to perform this ssh configuration in order to securely access the localhost.

    First of all, we need to generate public and private keys by executing the following command:

    ssh-keygen -t rsa -P ""
    

    This will generate the private and public keys, by default in the $HOME/.ssh folder. In order to provide passwordless access, we need to append the public key to the authorized_keys file:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
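
    On some systems, ssh requires the key files to have restrictive permissions before it will accept them; if the connection test below still prompts for a password, you may need to run something like this:

    chmod 700 $HOME/.ssh
    chmod 600 $HOME/.ssh/authorized_keys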
    

    Let's check whether the ssh configuration is okay. To test it, connect to the localhost like this:

    ssh localhost
    

    This will prompt you to confirm whether to add this connection to the known_hosts file. Type yes, and you should be connected over ssh without being prompted for a password.

  8. Once the ssh configuration is done and verified, we need to configure Hadoop. The Hadoop configuration begins with adding various configuration parameters to the following default files:

    • hadoop-env.sh: This is where we need to perform the Java environment variable configuration.

    • core-site.xml: This is where we need to perform NameNode-related configurations.

    • yarn-site.xml: This is where we need to perform configurations related to Yet Another Resource Negotiator (YARN).

    • mapred-site.xml: This is where we need to set the MapReduce processing engine as YARN.

    • hdfs-site.xml: This is where we need to perform configurations related to Hadoop Distributed File System (HDFS).

    These configuration files can be found in the /usr/local/hadoop/etc/hadoop folder. If you install Hadoop as the root user, you will have access to edit these files, but if not, you will first need to get access to this folder before editing.
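
    One way to get that access (this assumes you want your current, non-root user to own the Hadoop installation directory; adjust the path if you installed Hadoop elsewhere) is to change the directory's ownership, for example:

    sudo chown -R $USER /usr/local/hadoop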

So, let's take a look at the configurations one by one.

  1. Configure hadoop-env.sh, and update the Java path like this:

    export JAVA_HOME=/usr/local/java

  2. Edit core-site.xml, and add the host and port on which you wish to run NameNode. Since this is a single-node installation, we will use localhost:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000/</value>
      </property>
    </configuration>
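
    Note that fs.default.name is a deprecated alias in Hadoop 2.x; if you prefer, you can use the current property name instead, which has the same effect:

      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000/</value>
      </property>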
  3. Edit yarn-site.xml, and add the following properties to it:

    <configuration>
       <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
       </property>
       <property>
          <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
       </property>
    </configuration>

    The yarn.nodemanager.aux-services property tells NodeManager that an auxiliary service named mapreduce_shuffle is present and needs to be started. The second property tells NodeManager which class to use to implement that shuffle auxiliary service. This specific configuration is needed because MapReduce jobs involve the shuffling of key-value pairs.

  4. Next, edit mapred-site.xml to set the MapReduce processing framework to YARN. If mapred-site.xml does not exist yet, you can create it by copying the mapred-site.xml.template file that ships in the same folder:

    <configuration>
       <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
       </property>
    </configuration>
  5. Edit hdfs-site.xml to set the folder paths that can be used by NameNode and DataNode:

    <configuration>
       <property>
          <name>dfs.replication</name>
          <value>1</value>
       </property>
       <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/store/hdfs/namenode</value>
       </property>
       <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/store/hdfs/datanode</value>
       </property>
    </configuration>
  6. I am also setting the HDFS block replication factor to 1 as this is a single node cluster installation.

    We also need to make sure that we create the previously mentioned folders and change their ownership to suit the current user. You can choose folder paths of your own, as long as they match the configuration:

    sudo mkdir -p /usr/local/store/hdfs/namenode
    sudo mkdir -p /usr/local/store/hdfs/datanode
    sudo chown -R root:root /usr/local/store
    
  7. Now, it's time to format namenode so that it creates the required folder structure by default:

    hadoop namenode -format
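
    Note that in Hadoop 2.x the hdfs command is the preferred entry point for this operation; the older form shown above still works but prints a deprecation warning. The equivalent command is:

    hdfs namenode -format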
    
  8. The final step involves starting the Hadoop daemons. Here, we will execute two scripts: the first starts the HDFS daemons, and the second starts the YARN daemons:

    /usr/local/hadoop/sbin/start-dfs.sh
    

This will start the NameNode, secondary NameNode, and DataNode daemons. Next, start the YARN daemons:

/usr/local/hadoop/sbin/start-yarn.sh

This will start NodeManager and ResourceManager. You can execute the jps command to take a look at the running daemons:

$ jps
2184 DataNode
2765 NodeManager
2835 Jps
2403 SecondaryNameNode
2025 NameNode
2606 ResourceManager

We can also access the web portals for HDFS and YARN by accessing the following URLs:

  • For HDFS: http://<hostname>:50070/

  • For YARN: http://<hostname>:8088/

How it works...

Hadoop 2.0 was significantly reworked to address issues of scalability and high availability. In Hadoop 1.0, MapReduce was the only means of processing the data stored in HDFS; with the advent of YARN, MapReduce is now just one of several ways of processing data on Hadoop. The key difference between Hadoop 1.x and Hadoop 2.x is that, in 1.x, MapReduce handled both cluster resource management and data processing, whereas in 2.x, YARN takes over resource management and MapReduce runs as one application framework on top of it.

Now, let's try to understand how HDFS and YARN work.

Hadoop Distributed File System (HDFS)

HDFS is redundant, reliable storage for Hadoop. It consists of three important parts: NameNode, the secondary NameNode, and DataNodes. When a file needs to be processed on Hadoop, it first needs to be saved on HDFS. HDFS splits the file into data blocks of 64/128 MB and distributes them across the DataNodes. The blocks are replicated across DataNodes for reliability. NameNode stores the metadata about the blocks and their replicas. After a certain period of time, the metadata is checkpointed to the secondary NameNode; the default check interval is 60 seconds. We can modify this by setting the dfs.namenode.checkpoint.check.period property in hdfs-site.xml.
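
The property takes a value in seconds; for example, to make the default explicit, you could add the following to hdfs-site.xml:

<property>
  <name>dfs.namenode.checkpoint.check.period</name>
  <value>60</value>
</property>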

Yet Another Resource Negotiator (YARN)

YARN was developed to address scalability issues and to provide better management of jobs in Hadoop, and so far it has proved to be an effective solution. It is responsible for the management of the resources available in the cluster. It consists of two important components: ResourceManager (master) and NodeManager (worker). NodeManager provides a node-level view of the cluster, while ResourceManager provides a cluster-level view. When an application is submitted by an application client, the following things happen:

  • The application client talks to ResourceManager and provides details about the application.

  • ResourceManager makes a container request on behalf of the application to one of the worker nodes, and ApplicationMaster starts running within that container.

  • ApplicationMaster then makes subsequent requests for containers in which to execute tasks on other nodes.

  • These tasks run in their containers and report their progress back to ApplicationMaster. Once all the tasks are complete, the containers are deallocated and ApplicationMaster exits.

  • After this, the application client also exits.
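
Once the cluster is up, you can observe this from the command line; for example, the following YARN client command lists the applications that ResourceManager is currently tracking:

yarn application -list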

There's more...

Now that your single-node Hadoop cluster is up and running, you can try some HDFS file operations on it, such as creating a directory, copying a file from a local machine to HDFS, and so on. Here are some sample commands.

To list all the files in the HDFS root directory, take a look at this:

hadoop fs -ls /

To create a new directory, take a look at this:

hadoop fs -mkdir /input

To copy a file from the local machine to HDFS, take a look at this:

hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /input

In order to access all the command options that are available, go to https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.