Scaling Big Data with Hadoop and Solr

By: Hrishikesh Vijay Karambelkar

Overview of this book

As data grows exponentially day by day, extracting information becomes a tedious activity in itself. Technologies like Hadoop are trying to address some of the concerns, while Solr provides high-speed faceted search. Bringing these two technologies together is helping organizations resolve the problem of information extraction from Big Data by providing excellent distributed faceted search capabilities.

Scaling Big Data with Hadoop and Solr is a step-by-step guide that helps you build high performance enterprise search engines while scaling data. Starting with the basics of Apache Hadoop and Solr, this book then dives into advanced topics of optimizing search with some interesting real-world use cases and sample Java code.

Scaling Big Data with Hadoop and Solr starts by teaching you the basics of Big Data technologies including Hadoop and its ecosystem and Apache Solr. It explains the different approaches of scaling Big Data with Hadoop and Solr, with discussion regarding the applicability, benefits, and drawbacks of each approach. It then walks readers through how sharding and indexing can be performed on Big Data, followed by the performance optimization of Big Data search. Finally, it covers some real-world use cases for Big Data scaling.

With this book, you will learn everything you need to know to build a distributed enterprise search platform, as well as how to optimize this search to a greater extent, resulting in maximum utilization of available resources.

Installing and running Hadoop


Installing Hadoop with the default setup is a straightforward job, but it gets more difficult as you go on customizing the cluster. Apache Hadoop can be installed in three different setups: standalone mode, single-node (pseudo-distributed) setup, and fully distributed setup. The local standalone setup is meant for installation on a single machine and is very useful for debugging purposes. The other two types of setup are shown in the following diagram:

In the pseudo-distributed setup, Hadoop is installed on a single machine, mainly for development purposes; each Hadoop daemon runs as a separate Java process. A real production installation would be on multiple nodes, that is, a full cluster. Let's look at installing Hadoop and running a simple program on it.

Prerequisites

Hadoop runs on the following operating systems:

  • All Linux flavors: Hadoop supports these for development as well as production

  • Win32: Hadoop has limited support (development only) through Cygwin

Hadoop requires the following software:

  • Java 1.6 onwards

  • SSH (Secure Shell) to run the start/stop/status and other such scripts across the cluster

  • Cygwin, which is applicable only in the case of Windows

This software can be installed directly using apt-get on Ubuntu, dpkg on Debian, or rpm on Red Hat/Oracle Linux, from the respective repositories. In the case of a cluster setup, the software should be installed on all the machines.
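
For example, on Ubuntu the prerequisites can typically be installed with a single command (the exact Java package name depends on your distribution and the JDK you prefer; openjdk-6-jdk is shown purely as an illustration):

$ sudo apt-get install openjdk-6-jdk ssh rsync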

Setting up SSH without passphrases

Since Hadoop uses SSH to run its scripts on different nodes, it is important that this SSH login happens without any prompt for a password. You can test this simply by running the ssh command, as shown in the following code snippet:

$ ssh localhost
  Welcome to Ubuntu (11.04)

hduser@node1:~/$

If you get a prompt for password, you should perform the following steps on your machine:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

These commands create a DSA key pair with an empty passphrase and register the public key as an authorized key, so that the SSH login no longer prompts for a passphrase. Once this step is complete, you are good to go.
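
If SSH still prompts for a password after this, the usual cause is the permissions on the .ssh directory; tightening them as follows (a general SSH fix, not specific to Hadoop) usually resolves it:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys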

Installing Hadoop on machines

Hadoop can be downloaded from the Apache Hadoop website (http://hadoop.apache.org). Make sure that you choose the correct release from the available ones: the current stable release, the latest beta/alpha release, or a legacy stable version. You can either download a binary package, or download the source, compile it on your OS, and then install it. If you use an operating system package, install it with the operating system's package installer.
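
For example, a release tarball can be fetched from the Apache archive and unpacked under /usr/local (the version number below is only illustrative; use whichever release you chose):

$ wget http://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ sudo tar -xzf hadoop-1.2.1.tar.gz -C /usr/local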

To set up a single-node (pseudo-distributed) cluster, you can simply run the following script provided by Apache:

$ hadoop-setup-single-node.sh

Say Yes to all the options. This will set up a single node on your machine with the default configuration; you do not need to change anything further. You can test it by running any of the Hadoop commands discussed in the HDFS section of this chapter.
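
For instance, assuming the hadoop binary is on your PATH, a quick sanity check is to list the root of the distributed filesystem:

$ hadoop dfs -ls /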

For a cluster setup, passphraseless SSH should be configured from the master to all the nodes, so that TaskTrackers/DataNodes on all the slaves can be started and stopped from the master without password prompts. You need to install Hadoop on all the machines that are going to participate in the Hadoop cluster. You also need to understand the Hadoop configuration file structure and make modifications to it.
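
One common way to set this up (assuming the same hduser account exists on every node; the hostname slave1 is just a placeholder) is to copy the master's public key to each slave with ssh-copy-id:

$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@slave1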

Hadoop configuration

The major Hadoop configuration is specified in the following configuration files, kept in the $HADOOP_HOME/conf folder of the installation:

  • core-site.xml: In this file, you can modify the default properties of Hadoop. This covers setting up different protocols for interaction, working directories, log management, security, buffers and blocks, temporary files, and so on.

  • hdfs-site.xml: This file stores the entire configuration related to HDFS, so properties such as the DFS site address, the data directory, replication factors, and so on are covered in this file.

  • mapred-site.xml: This file is responsible for the entire configuration related to the MapReduce framework. It covers the configuration for JobTracker and TaskTracker, as well as properties for jobs.

  • commons-logging.properties: This file specifies the default logger used by Hadoop; you can override it to use your own logger.

  • capacity-scheduler.xml: This file is mainly used by the resource manager in Hadoop for setting up the scheduling parameters of job queues.

  • fair-scheduler.xml: This file contains information about user allocations and pooling information for the fair scheduler. It is currently under development.

  • hadoop-env.sh: All the environment variables are defined in this file; you can change any of them, for example, the Java location, the Hadoop configuration directory, and so on.

  • hadoop-policy.xml: This file is used to define various access control lists for Hadoop services, and controls who can use the Hadoop cluster for execution.

  • masters and slaves: In these files, you define the hostnames of the master and slave nodes: the masters file lists all the masters, and the slaves file lists the slave nodes. To run Hadoop in cluster mode, you need to modify these files on all nodes to point to the respective masters and slaves.

  • log4j.properties: In this file, you can define various log levels for your instance, which is helpful while developing or debugging Hadoop programs.

Of these, core-site.xml, hdfs-site.xml, mapred-site.xml, and the masters and slaves files are the ones you will definitely modify to set up your basic Hadoop cluster.
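
As a minimal sketch of what such changes look like for a pseudo-distributed setup (the property names are from Hadoop 1.x; the host, port, and replication values are only illustrative), the three main files could contain the following:

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>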

Running a program on Hadoop

You can start your cluster with the following command; once started, you will see output similar to the following:

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

  starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
  localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
  localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
  starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
  localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
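
Before running anything on the cluster, you can verify that all five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker) have started by using the jps tool that ships with the JDK:

$ jps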

Now we can test the functioning of this cluster by running the sample examples shipped with the Hadoop installation. First, copy some files from your local directory to HDFS by running the following command:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /home/myuser/data /user/myuser/data

Run hadoop dfs -ls on your Hadoop instance to check whether the files have been loaded into HDFS. Now, you can run the simple word count program to count the number of words in all these files.

bin/hadoop jar hadoop*examples*.jar wordcount /user/myuser/data /user/myuser/data-output

You will typically find the hadoop-examples JAR in /usr/share/hadoop or in $HADOOP_HOME. Once the job completes, you can run hadoop dfs -cat on the data-output directory to list the output.
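
For example (the exact part file name depends on the Hadoop version and the number of reducers used):

$ bin/hadoop dfs -cat /user/myuser/data-output/part-r-00000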