

Storing large data in HDFS


The Hadoop Distributed File System (HDFS) is a subproject of Apache Hadoop. It is designed to store large data/files reliably in a distributed manner. HDFS uses a master-slave architecture and is designed to run on low-cost hardware. It is a distributed file system that provides high-speed data access across a distributed network, and it also provides APIs to manage its file system. To handle node failures, HDFS replicates file blocks across multiple Hadoop cluster nodes, thereby avoiding any data loss during node failures. HDFS stores its metadata and application data separately. Let's understand its architecture.

HDFS architecture

HDFS, being a distributed file system, has to satisfy the following major objectives to be effective:

  • Handling large chunks of data

  • High availability, and handling hardware failures seamlessly

  • Streaming access to its data

  • Scalability to perform better with addition of hardware

  • Durability with no loss of data in spite of failures

  • Portability across different types of hardware/software

  • Data partitioning across multiple nodes in a cluster

HDFS satisfies most of these goals effectively. Its architecture is built around a NameNode, a set of DataNodes, and a secondary NameNode. Let's understand each of these components in detail.

NameNode

All the metadata related to HDFS is stored on the NameNode. Besides storing metadata, the NameNode is the master node that coordinates activities among DataNodes, such as data replication across DataNodes and naming-system bookkeeping such as filenames, their disk locations, and so on. The NameNode stores the mapping of blocks to DataNodes. In a Hadoop cluster, there can be only one active NameNode. The NameNode regulates access to its file system through HDFS-based APIs to create, open, edit, and delete HDFS files. The data structure for storing file information is inspired by a UNIX-like file system. Each block is indexed, and its index node (inode) mapping is kept in memory (RAM) for faster access. The NameNode is a multithreaded process and can serve multiple clients at a time.

Any transaction is first recorded in the journal; once the journal file has been flushed to disk, a response is sent back to the client. If there is an error while flushing the journal to one of its storage directories, the NameNode simply excludes that storage directory and moves on with another. The NameNode shuts itself down if no storage directory is available.
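These NameNode-mediated metadata operations can be invoked from a client through the Hadoop Java API. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath, that fs.default.name in the loaded configuration points at the NameNode, and that the paths used are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataOps {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Each of these calls is a metadata operation served by the NameNode
        fs.mkdirs(new Path("/users/abc"));                  // hypothetical directory
        fs.rename(new Path("/users/abc/info.txt"),          // hypothetical files
                  new Path("/users/abc/info-old.txt"));
        fs.delete(new Path("/users/abc/tmp"), true);        // recursive delete
        fs.close();
    }
}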

Note

Safe mode: When a cluster is started, the NameNode enables its complete functionality only after the configured minimum percentage of blocks satisfies the minimum replication. Otherwise, it goes into safe mode. While the NameNode is in the safe mode state, it does not allow any modifications to its file system. Safe mode can be turned off manually by running the following command:

$ hadoop dfsadmin -safemode leave

DataNode

DataNodes are the slaves, deployed on all the nodes of a Hadoop cluster. A DataNode is responsible for storing the application's data. Each file uploaded to HDFS is split into multiple blocks, and these data blocks are stored on different DataNodes. The default file block size in HDFS is 64 MB. Each Hadoop file block is mapped to two files on the DataNode: one file holds the block data, while the other holds its checksum.
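The block mapping of a stored file can be inspected through the Java API. The following sketch prints the block size of a file and the DataNode hosts holding each block, assuming the illustrative file /users/abc/info.txt already exists:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/users/abc/info.txt"));
        System.out.println("Block size: " + status.getBlockSize());

        // One BlockLocation per block; each lists the DataNodes holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                + " hosted on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}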

When Hadoop is started, each DataNode connects to the NameNode, informing it of its availability to serve requests. At startup, the namespace ID and software version are verified by the NameNode, and the DataNode sends a block report describing all the data blocks it holds. During runtime, each DataNode periodically sends the NameNode a heartbeat signal confirming its availability; the default interval between two heartbeats is 3 seconds. The NameNode assumes a DataNode is unavailable if it does not receive a heartbeat within 10 minutes (by default); in that case, the NameNode replicates the data blocks of that DataNode to other DataNodes. A heartbeat carries information about available disk space, in-use space, data transfer load, and so on. The heartbeat provides the primary handshake between the NameNode and a DataNode; based on heartbeat information, the NameNode chooses the next block storage preference, thus balancing the load in the cluster. The NameNode uses heartbeat replies to instruct a DataNode to replicate blocks to other DataNodes, remove blocks, send block reports, and so on.

Secondary NameNode

Hadoop runs with a single NameNode, which in turn makes it a single point of failure for the cluster. To mitigate this issue and to create a backup for the primary NameNode, the concept of the secondary NameNode was introduced in the Hadoop framework. While the NameNode is busy serving requests from various clients, the secondary NameNode maintains an up-to-date copy of the NameNode's in-memory snapshot. These copies are also called checkpoints.

The secondary NameNode usually runs on a node other than the NameNode; this improves the durability of the NameNode's metadata. In addition to the secondary NameNode, Hadoop also supports a CheckpointNode, which creates periodic checkpoints instead of running a sync of memory with the NameNode. In case of a NameNode failure, recovery is possible up to the last checkpoint snapshot taken by the CheckpointNode.

Organizing data

The Hadoop Distributed File System supports a traditional hierarchical file system (such as UNIX), where users can create their own home directories and subdirectories and store files in them. It allows users to create, rename, move, and delete files as well as directories. There is a root directory denoted by a slash (/), and all subdirectories are created under this root directory, for example, /user/foo.
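As a small illustration of this hierarchical namespace, the following Java sketch lists the contents of a directory (the path is illustrative), producing roughly what hadoop dfs -ls prints:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Print a type flag, the size, and the full path of each entry
        for (FileStatus status : fs.listStatus(new Path("/user/foo"))) {
            String type = status.isDir() ? "d" : "-";
            System.out.println(type + " " + status.getLen() + " " + status.getPath());
        }
        fs.close();
    }
}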

Note

The default data replication factor in HDFS is three; however, one can change this by modifying the HDFS configuration files.

Data is organized in multiple data blocks, each of 64 MB by default. Any new file created in HDFS first goes through a staging phase: the file data is cached on local storage until it reaches the size of one block, and then the client sends a request to the NameNode. The NameNode, considering the load on the DataNodes, replies to the client with the destination block location and node ID, and the client then flushes the data from the local file to the targeted DataNode. If the client closes the file while some data is still unflushed, that remaining data is sent to a DataNode for storage. The data is then replicated to multiple nodes through a replication pipeline.
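The replication factor and block size can also be set per file when it is created through the Java API. The following sketch (paths and values are illustrative) writes a small file with an explicit replication factor and block size, and then changes the replication factor of the stored file, much like hadoop dfs -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/users/abc/sample.txt");   // hypothetical path

        // Overwrite flag, buffer size, replication factor, and block size;
        // cluster-wide defaults come from dfs.replication and dfs.block.size
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64 * 1024 * 1024L);
        out.writeBytes("sample record\n");
        out.close();   // closing flushes the remaining data through the replication pipeline

        // Change the replication factor of the existing file
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}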

Accessing HDFS

HDFS can be accessed in the following ways:

  • Java APIs

  • Hadoop command line APIs (FS shell)

  • C/C++ language wrapper APIs

  • WebDAV (work in progress)

  • DFSAdmin (command set for administration)

  • RESTful APIs for HDFS

Similarly, to expose HDFS APIs to the rest of the language stacks, there is a separate project called HDFS-APIs (http://wiki.apache.org/hadoop/HDFS-APIs), based on the Thrift framework, which provides scalable cross-language service APIs for Perl, Python, Ruby, and PHP.
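As an example of the first access path, the Java API, the following sketch reads a file from HDFS and writes it to standard output, the equivalent of hadoop dfs -cat (the file path is illustrative):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream in = null;
        try {
            in = fs.open(new Path("/users/abc/info.txt"));   // hypothetical file
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}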

Let's look at the operations supported through the FS shell, each listed with its syntax and an example:

Creating a directory
Syntax: hadoop dfs -mkdir URI
Example: hadoop dfs -mkdir /users/abc

Importing a file from the local file system
Syntax: hadoop dfs -copyFromLocal <localsrc> URI
Example: hadoop dfs -copyFromLocal /home/user1/info.txt /users/abc

Exporting a file to the local file system
Syntax: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example: hadoop dfs -copyToLocal /users/abc/info.txt /home/user1

Opening and reading a file
Syntax: hadoop dfs -cat URI [URI …]
Example: hadoop dfs -cat /users/abc/info.txt

Copying files within HDFS
Syntax: hadoop dfs -cp URI [URI …] <dest>
Example: hadoop dfs -cp /users/abc/* /users/bcd/

Moving or renaming a file or directory
Syntax: hadoop dfs -mv URI [URI …] <dest>
Example: hadoop dfs -mv /users/abc/output /users/bcd/

Deleting a file or directory (use hadoop dfs -rmr for a recursive delete)
Syntax: hadoop dfs -rm [-skipTrash] URI [URI …]
Example: hadoop dfs -rm /users/abc/info.txt

Showing the size of a file or of a directory's contents
Syntax: hadoop dfs -du <args>
Example: hadoop dfs -du /users/abc/info.txt

Listing a file or directory
Syntax: hadoop dfs -ls <args>
Example: hadoop dfs -ls /users/abc

Getting different attributes of a file/directory
Syntax: hadoop dfs -stat URI
Example: hadoop dfs -stat /users/abc

Changing permissions (single/recursive) of a file or directory
Syntax: hadoop dfs -chmod [-R] MODE URI [URI …]
Example: hadoop dfs -chmod 755 /users/abc

Setting the owner of a file/directory
Syntax: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI
Example: hadoop dfs -chown -R hrishi /users/hrishi/home

Setting the replication factor
Syntax: hadoop dfs -setrep [-R] [-w] <rep> <path>
Example: hadoop dfs -setrep -w 3 -R /user/hadoop/dir1

Changing the group of a file or directory
Syntax: hadoop dfs -chgrp [-R] GROUP URI [URI …]
Example: hadoop dfs -chgrp -R abc /users/abc

Getting the count of files and directories
Syntax: hadoop dfs -count [-q] <paths>
Example: hadoop dfs -count /users/abc