

Storing large data in HDFS


The Hadoop Distributed File System (HDFS) is a subproject of Apache Hadoop. It is designed to store large data/files reliably in a distributed manner. HDFS uses a master-slave architecture and is designed to run on low-cost hardware. It is a distributed file system that provides high-speed data access across a distributed network, and it also provides APIs to manage its file system. To handle node failures, HDFS replicates file blocks across multiple Hadoop cluster nodes, thereby avoiding any data loss during node failures. HDFS stores its metadata and application data separately. Let's understand its architecture.

HDFS architecture

HDFS, being a distributed file system, has to satisfy the following major objectives to be effective:

  • Handling large chunks of data

  • High availability, and handling hardware failures seamlessly

  • Streaming access to its data

  • Scalability to perform better with addition of hardware

  • Durability with no loss of data in spite of failures

  • Portability across different types of hardware/software

  • Data partitioning across multiple nodes in a cluster

HDFS satisfies most of these goals effectively. Its architecture is built around a NameNode, a set of DataNodes, and a secondary NameNode. Let's understand each of these components in detail.

NameNode

All the metadata related to HDFS is stored on the NameNode. Besides storing metadata, the NameNode is the master node that coordinates activities among DataNodes, such as data replication across DataNodes and naming-system bookkeeping such as filenames, their disk locations, and so on. The NameNode stores the mapping of blocks to DataNodes. In a Hadoop cluster, there can be only one active NameNode. The NameNode regulates access to its file system through HDFS-based APIs to create, open, edit, and delete HDFS files. The data structure for storing file information is inspired by a UNIX-like file system. Each block is indexed, and its index node (inode) mapping is kept in memory (RAM) for faster access. The NameNode is a multithreaded process and can serve multiple clients at a time.

Any transaction is first recorded in the journal; once the journal file has been flushed to disk, a response is sent back to the client. If there is an error while flushing the journal to one of its storage directories, the NameNode simply excludes that storage directory and moves on with another. The NameNode shuts itself down if no storage directory is available.
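These NameNode-mediated metadata operations can be invoked from a client through the Hadoop Java API. The following is a minimal sketch, assuming the Hadoop client libraries are on the classpath, that fs.default.name in the loaded configuration points at the NameNode, and that the paths used are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataOps {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Each of these calls is a metadata operation served by the NameNode
        fs.mkdirs(new Path("/users/abc"));                  // hypothetical directory
        fs.rename(new Path("/users/abc/info.txt"),          // hypothetical files
                  new Path("/users/abc/info-old.txt"));
        fs.delete(new Path("/users/abc/tmp"), true);        // recursive delete
        fs.close();
    }
}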

Note

Safe mode: When a cluster is started, the NameNode enables its complete functionality only after the configured minimum percentage of blocks satisfies the minimum replication. Otherwise, it goes into safe mode. While the NameNode is in the safe mode state, it does not allow any modifications to its file system. Safe mode can be turned off manually by running the following command:

$ hadoop dfsadmin -safemode leave

DataNode

DataNodes are the slaves, deployed on all the nodes of a Hadoop cluster. A DataNode is responsible for storing the application's data. Each file uploaded to HDFS is split into multiple blocks, and these data blocks are stored on different DataNodes. The default file block size in HDFS is 64 MB. Each Hadoop file block is mapped to two files on the DataNode: one file holds the block data, while the other holds its checksum.
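The block mapping of a stored file can be inspected through the Java API. The following sketch prints the block size of a file and the DataNode hosts holding each block, assuming the illustrative file /users/abc/info.txt already exists:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/users/abc/info.txt"));
        System.out.println("Block size: " + status.getBlockSize());

        // One BlockLocation per block; each lists the DataNodes holding a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                + " hosted on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}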

When Hadoop is started, each DataNode connects to the NameNode, informing it of its availability to serve requests. At startup, the namespace ID and software version are verified by the NameNode, and the DataNode sends a block report describing all the data blocks it holds. During runtime, each DataNode periodically sends the NameNode a heartbeat signal confirming its availability; the default interval between two heartbeats is 3 seconds. The NameNode assumes a DataNode is unavailable if it does not receive a heartbeat within 10 minutes (by default); in that case, the NameNode replicates the data blocks of that DataNode to other DataNodes. A heartbeat carries information about available disk space, in-use space, data transfer load, and so on. The heartbeat provides the primary handshake between the NameNode and a DataNode; based on heartbeat information, the NameNode chooses the next block storage preference, thus balancing the load in the cluster. The NameNode uses heartbeat replies to instruct a DataNode to replicate blocks to other DataNodes, remove blocks, send block reports, and so on.

Secondary NameNode

Hadoop runs with a single NameNode, which in turn makes it a single point of failure for the cluster. To mitigate this issue and to create a backup for the primary NameNode, the concept of the secondary NameNode was introduced in the Hadoop framework. While the NameNode is busy serving requests from various clients, the secondary NameNode maintains an up-to-date copy of the NameNode's in-memory snapshot. These copies are also called checkpoints.

The secondary NameNode usually runs on a node other than the NameNode; this improves the durability of the NameNode's metadata. In addition to the secondary NameNode, Hadoop also supports a CheckpointNode, which creates periodic checkpoints instead of running a sync of memory with the NameNode. In case of a NameNode failure, recovery is possible up to the last checkpoint snapshot taken by the CheckpointNode.

Organizing data

The Hadoop Distributed File System supports a traditional hierarchical file system (such as UNIX), where users can create their own home directories and subdirectories and store files in them. It allows users to create, rename, move, and delete files as well as directories. There is a root directory denoted by a slash (/), and all subdirectories are created under this root directory, for example, /user/foo.
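As a small illustration of this hierarchical namespace, the following Java sketch lists the contents of a directory (the path is illustrative), producing roughly what hadoop dfs -ls prints:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Print a type flag, the size, and the full path of each entry
        for (FileStatus status : fs.listStatus(new Path("/user/foo"))) {
            String type = status.isDir() ? "d" : "-";
            System.out.println(type + " " + status.getLen() + " " + status.getPath());
        }
        fs.close();
    }
}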

Note

The default data replication factor in HDFS is three; however, one can change this by modifying the HDFS configuration files.

Data is organized in multiple data blocks, each of 64 MB by default. Any new file created in HDFS first goes through a staging phase: the file data is cached on local storage until it reaches the size of one block, and then the client sends a request to the NameNode. The NameNode, considering the load on the DataNodes, replies to the client with the destination block location and node ID, and the client then flushes the data from the local file to the targeted DataNode. If the client closes the file while some data is still unflushed, that remaining data is sent to a DataNode for storage. The data is then replicated to multiple nodes through a replication pipeline.
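The replication factor and block size can also be set per file when it is created through the Java API. The following sketch (paths and values are illustrative) writes a small file with an explicit replication factor and block size, and then changes the replication factor of the stored file, much like hadoop dfs -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/users/abc/sample.txt");   // hypothetical path

        // Overwrite flag, buffer size, replication factor, and block size;
        // cluster-wide defaults come from dfs.replication and dfs.block.size
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64 * 1024 * 1024L);
        out.writeBytes("sample record\n");
        out.close();   // closing flushes the remaining data through the replication pipeline

        // Change the replication factor of the existing file
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}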

Accessing HDFS

HDFS can be accessed in the following ways:

  • Java APIs

  • Hadoop command line APIs (FS shell)

  • C/C++ language wrapper APIs

  • WebDAV (work in progress)

  • DFSAdmin (command set for administration)

  • RESTful APIs for HDFS

Similarly, to expose HDFS APIs to the rest of the language stacks, there is a separate project called HDFS-APIs (http://wiki.apache.org/hadoop/HDFS-APIs), based on the Thrift framework, which provides scalable cross-language service APIs for Perl, Python, Ruby, and PHP.
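As an example of the first access path, the Java API, the following sketch reads a file from HDFS and writes it to standard output, the equivalent of hadoop dfs -cat (the file path is illustrative):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream in = null;
        try {
            in = fs.open(new Path("/users/abc/info.txt"));   // hypothetical file
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}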

Let's look at the operations supported through the FS shell, each listed with its syntax and an example:

Creating a directory
Syntax: hadoop dfs -mkdir URI
Example: hadoop dfs -mkdir /users/abc

Importing a file from the local file system
Syntax: hadoop dfs -copyFromLocal <localsrc> URI
Example: hadoop dfs -copyFromLocal /home/user1/info.txt /users/abc

Exporting a file to the local file system
Syntax: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example: hadoop dfs -copyToLocal /users/abc/info.txt /home/user1

Opening and reading a file
Syntax: hadoop dfs -cat URI [URI …]
Example: hadoop dfs -cat /users/abc/info.txt

Copying files within HDFS
Syntax: hadoop dfs -cp URI [URI …] <dest>
Example: hadoop dfs -cp /users/abc/* /users/bcd/

Moving or renaming a file or directory
Syntax: hadoop dfs -mv URI [URI …] <dest>
Example: hadoop dfs -mv /users/abc/output /users/bcd/

Deleting a file or directory (use hadoop dfs -rmr for a recursive delete)
Syntax: hadoop dfs -rm [-skipTrash] URI [URI …]
Example: hadoop dfs -rm /users/abc/info.txt

Showing the size of a file or of a directory's contents
Syntax: hadoop dfs -du <args>
Example: hadoop dfs -du /users/abc/info.txt

Listing a file or directory
Syntax: hadoop dfs -ls <args>
Example: hadoop dfs -ls /users/abc

Getting different attributes of a file/directory
Syntax: hadoop dfs -stat URI
Example: hadoop dfs -stat /users/abc

Changing permissions (single/recursive) of a file or directory
Syntax: hadoop dfs -chmod [-R] MODE URI [URI …]
Example: hadoop dfs -chmod 755 /users/abc

Setting the owner of a file/directory
Syntax: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI
Example: hadoop dfs -chown -R hrishi /users/hrishi/home

Setting the replication factor
Syntax: hadoop dfs -setrep [-R] [-w] <rep> <path>
Example: hadoop dfs -setrep -w 3 -R /user/hadoop/dir1

Changing the group of a file or directory
Syntax: hadoop dfs -chgrp [-R] GROUP URI [URI …]
Example: hadoop dfs -chgrp -R abc /users/abc

Getting the count of files and directories
Syntax: hadoop dfs -count [-q] <paths>
Example: hadoop dfs -count /users/abc