Scaling Big Data with Hadoop and Solr

By: Hrishikesh Vijay Karambelkar

Overview of this book

As data grows exponentially day by day, extracting information becomes a tedious activity in itself. Technologies like Hadoop are trying to address some of the concerns, while Solr provides high-speed faceted search. Bringing these two technologies together is helping organizations resolve the problem of information extraction from Big Data by providing excellent distributed faceted search capabilities.

Scaling Big Data with Hadoop and Solr is a step-by-step guide that helps you build high performance enterprise search engines while scaling data. Starting with the basics of Apache Hadoop and Solr, this book then dives into advanced topics of optimizing search with some interesting real-world use cases and sample Java code.

Scaling Big Data with Hadoop and Solr starts by teaching you the basics of Big Data technologies including Hadoop and its ecosystem and Apache Solr. It explains the different approaches of scaling Big Data with Hadoop and Solr, with discussion regarding the applicability, benefits, and drawbacks of each approach. It then walks readers through how sharding and indexing can be performed on Big Data, followed by the performance optimization of Big Data search. Finally, it covers some real-world use cases for Big Data scaling.

With this book, you will learn everything you need to know to build a distributed enterprise search platform, as well as how to optimize this search to a greater extent, resulting in maximum utilization of available resources.

Installing and running Hadoop


Installing Hadoop with the default setup is a straightforward job, but it gets more difficult as you go on customizing the cluster. Apache Hadoop can be installed in three different setups: standalone mode, single-node (pseudo-distributed) setup, and fully distributed setup. The local standalone setup is meant for installation on a single machine and is very useful for debugging purposes. The other two types of setup are shown in the following diagram:

In the pseudo-distributed setup, Hadoop is installed on a single machine, mainly for development purposes; each Hadoop daemon runs as a separate Java process. A real production installation would be on multiple nodes, that is, a full cluster. Let's look at installing Hadoop and running a simple program on it.

Prerequisites

Hadoop runs on the following operating systems:

  • All Linux flavors: Hadoop supports these for development as well as production

  • Win32: Hadoop has limited support (development only) through Cygwin

Hadoop requires the following software:

  • Java 1.6 onwards

  • SSH (Secure Shell) to run the start/stop/status and other such scripts across the cluster

  • Cygwin, which is applicable only in the case of Windows

This software can be installed directly using apt-get on Ubuntu, dpkg on Debian, or rpm on Red Hat/Oracle Linux, from the respective repositories. In the case of a cluster setup, the software should be installed on all the machines.
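
For example, on Ubuntu the prerequisites can typically be installed with a single command (the exact Java package name depends on your distribution and the JDK you prefer; openjdk-6-jdk is shown purely as an illustration):

$ sudo apt-get install openjdk-6-jdk ssh rsync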

Setting up SSH without passphrases

Since Hadoop uses SSH to run its scripts on different nodes, it is important that this SSH login happens without any prompt for a password. You can test this simply by running the ssh command, as shown in the following code snippet:

$ ssh localhost
  Welcome to Ubuntu (11.04)

hduser@node1:~/$

If you get a prompt for password, you should perform the following steps on your machine:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

These commands create a DSA key pair with an empty passphrase and register the public key as an authorized key, so that the SSH login no longer prompts for a passphrase. Once this step is complete, you are good to go.
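
If SSH still prompts for a password after this, the usual cause is the permissions on the .ssh directory; tightening them as follows (a general SSH fix, not specific to Hadoop) usually resolves it:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys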

Installing Hadoop on machines

Hadoop can be downloaded from the Apache Hadoop website (http://hadoop.apache.org). Make sure that you choose the correct release from the available ones: the current stable release, the latest beta/alpha release, or a legacy stable version. You can either download a binary package, or download the source, compile it on your OS, and then install it. If you use an operating system package, install it with the operating system's package installer.
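
For example, a release tarball can be fetched from the Apache archive and unpacked under /usr/local (the version number below is only illustrative; use whichever release you chose):

$ wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ sudo tar -xzf hadoop-1.2.1.tar.gz -C /usr/local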

To set up a single-node (pseudo-distributed) cluster, you can simply run the following script provided by Apache:

$ hadoop-setup-single-node.sh

Say Yes to all the options. This will set up a single node on your machine with the default configuration; you do not need to change anything further. You can test it by running any of the Hadoop commands discussed in the HDFS section of this chapter.
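
For instance, assuming the hadoop binary is on your PATH, a quick sanity check is to list the root of the distributed filesystem:

$ hadoop dfs -ls /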

For a cluster setup, passphraseless SSH should be configured from the master to all the nodes, so that TaskTrackers/DataNodes on all the slaves can be started and stopped from the master without password prompts. You need to install Hadoop on all the machines that are going to participate in the Hadoop cluster. You also need to understand the Hadoop configuration file structure and make modifications to it.
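
One common way to set this up (assuming the same hduser account exists on every node; the hostname slave1 is just a placeholder) is to copy the master's public key to each slave with ssh-copy-id:

$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@slave1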

Hadoop configuration

The major Hadoop configuration is specified in the following configuration files, kept in the $HADOOP_HOME/conf folder of the installation:

  • core-site.xml: In this file, you can modify the default properties of Hadoop. This covers setting up different protocols for interaction, working directories, log management, security, buffers and blocks, temporary files, and so on.

  • hdfs-site.xml: This file stores the entire configuration related to HDFS, so properties such as the DFS site address, the data directory, replication factors, and so on are covered in this file.

  • mapred-site.xml: This file is responsible for the entire configuration related to the MapReduce framework. It covers the configuration for JobTracker and TaskTracker, as well as properties for jobs.

  • commons-logging.properties: This file specifies the default logger used by Hadoop; you can override it to use your own logger.

  • capacity-scheduler.xml: This file is mainly used by the resource manager in Hadoop for setting up the scheduling parameters of job queues.

  • fair-scheduler.xml: This file contains information about user allocations and pooling information for the fair scheduler. It is currently under development.

  • hadoop-env.sh: All the environment variables are defined in this file; you can change any of them, for example, the Java location, the Hadoop configuration directory, and so on.

  • hadoop-policy.xml: This file is used to define various access control lists for Hadoop services, and controls who can use the Hadoop cluster for execution.

  • masters and slaves: In these files, you define the hostnames of the master and slave nodes: the masters file lists all the masters, and the slaves file lists the slave nodes. To run Hadoop in cluster mode, you need to modify these files on all nodes to point to the respective masters and slaves.

  • log4j.properties: In this file, you can define various log levels for your instance, which is helpful while developing or debugging Hadoop programs.

Of these, core-site.xml, hdfs-site.xml, mapred-site.xml, and the masters and slaves files are the ones you will definitely modify to set up your basic Hadoop cluster.
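
As a minimal sketch of what such changes look like for a pseudo-distributed setup (the property names are from Hadoop 1.x; the host, port, and replication values are only illustrative), the three main files could contain the following:

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>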

Running a program on Hadoop

You can start your cluster with the following command; once started, you will see output similar to the following:

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

  starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
  localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
  localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
  starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
  localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
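
Before running anything on the cluster, you can verify that all five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker) have started by using the jps tool that ships with the JDK:

$ jps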

Now we can test the functioning of this cluster by running the sample examples shipped with the Hadoop installation. First, copy some files from your local directory to HDFS by running the following command:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /home/myuser/data /user/myuser/data

Run hadoop dfs -ls on your Hadoop instance to check whether the files have been loaded into HDFS. Now, you can run the simple word count program to count the number of words in all these files.

bin/hadoop jar hadoop*examples*.jar wordcount /user/myuser/data /user/myuser/data-output

You will typically find the hadoop-examples JAR in /usr/share/hadoop or in $HADOOP_HOME. Once the job completes, you can run hadoop dfs -cat on the data-output directory to list the output.
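
For example (the exact part file name depends on the Hadoop version and the number of reducers used):

$ bin/hadoop dfs -cat /user/myuser/data-output/part-r-00000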