
Understanding Apache Hadoop and its ecosystem


Google faced the problem of storing and processing big data, and they came up with the MapReduce approach, which is basically a divide-and-conquer strategy for distributed data processing.

Note

A programming task that is divided into multiple identical subtasks and distributed among multiple machines for processing is called a map task. The results of these map tasks are combined into one or more reduce tasks. Overall, this approach to computing tasks is called the MapReduce approach.

MapReduce is widely used by many organizations to run their Big Data computations. Apache Hadoop is the most popular open source, Apache-licensed implementation of MapReduce. Apache Hadoop is based on the work done by Google in the early 2000s, more specifically on the paper describing the Google File System, published in 2003, and the paper on MapReduce, published in 2004. Apache Hadoop enables distributed processing of large datasets across clusters of commodity servers. It is designed to scale up from a single server to thousands of commodity machines, each offering part of the computation and data storage.

Apache Hadoop mainly consists of two major components:

  • The Hadoop Distributed File System (HDFS)

  • The MapReduce software framework

HDFS is responsible for storing the data in a distributed manner across multiple Hadoop cluster nodes. The MapReduce framework provides rich computational APIs for developers to code against; the resulting programs eventually run as map and reduce tasks on the Hadoop cluster, as in the sketch below.
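To make this concrete, here is a minimal word count sketch written against the Hadoop MapReduce Java API. It assumes a Hadoop 2.x client library on the classpath; the input and output HDFS paths are passed as command-line arguments and are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map task: emit (word, 1) for every word in the input split
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce task: sum the counts produced by all map tasks for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map tasks tokenize their input splits in parallel, and the reduce tasks sum the per-word counts, mirroring the divide-and-conquer description given earlier.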

The ecosystem of Apache Hadoop

Understanding the Apache Hadoop ecosystem enables us to apply the concepts of the MapReduce paradigm effectively to different requirements. It also provides end-to-end solutions to various problems that we face every day.

The Apache Hadoop ecosystem is vast. It has grown drastically over time as different organizations have contributed to this open source initiative, and as a result it meets the needs of many organizations for high-performance analytics. To understand the ecosystem, let's look at its major components:

Note

Apache Hadoop ecosystem consists of the following major components:

  • Core Hadoop framework: HDFS and MapReduce

  • Metadata management: HCatalog

  • Data storage and querying: HBase, Hive, and Pig

  • Data import/export: Flume, Sqoop

  • Analytics and machine learning: Mahout

  • Distributed coordination: Zookeeper

  • Cluster management: Ambari

  • Data storage and serialization: Avro

Apache HBase

HDFS is an append-only file system; it does not allow data modification. Apache HBase is a distributed, random-access, column-oriented database. HBase runs directly on top of HDFS, and it allows application developers to read/write the HDFS data directly. HBase does not support SQL; hence, it is also called a NoSQL database. However, it provides a command-line interface, as well as a rich set of APIs to update the data. The data in HBase gets stored as key-value pairs in HDFS.
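As an illustration, here is a minimal read/write sketch using the older-style HBase Java client API. The "users" table, its "info" column family, the row key, and the cell value are all assumptions; the table must already exist in the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWrite {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath (ZooKeeper quorum, and so on)
        Configuration conf = HBaseConfiguration.create();

        // An existing "users" table with an "info" column family is assumed
        HTable table = new HTable(conf, "users");

        // Write: store a cell under row key "row1"
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Read the same cell back by key
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }

Newer HBase releases replace HTable with a Connection/Table style API, but the underlying key-value storage model on HDFS stays the same.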

Apache Pig

Apache Pig provides another abstraction layer on top of MapReduce. It offers a high-level programming language called Pig Latin, and Pig translates Pig Latin scripts into MapReduce programs. Pig Latin lets developers write high-level programs for analyzing data; the generated code runs as parallel execution tasks and therefore uses the distributed Hadoop cluster effectively. Pig was initially developed at Yahoo! Research to enable developers to create ad hoc MapReduce jobs for Hadoop. Since then, many big organizations such as eBay, LinkedIn, and Twitter have started using Apache Pig.
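As a sketch of what this looks like, the following Java snippet embeds a Pig Latin word count using Pig's PigServer class; the input and output HDFS paths are placeholders.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
      public static void main(String[] args) throws Exception {
        // Runs the Pig Latin statements as MapReduce jobs on the cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("lines = LOAD '/data/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // STORE triggers the actual execution and writes the result to HDFS
        pig.store("counts", "/data/wordcount-output");
      }
    }

The same statements can be saved in a .pig script and run with the pig command line; embedding them through PigServer is just one convenient option for Java applications.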

Apache Hive

Apache Hive provides data warehouse capabilities on top of Big Data. Hive runs on top of Apache Hadoop and uses HDFS for storing its data. The Apache Hadoop framework is difficult to understand, and it requires a different approach from traditional programming to write MapReduce-based programs. With Hive, developers do not write MapReduce code at all. Hive provides a SQL-like query language called HiveQL to application developers, enabling them to quickly write ad hoc queries similar to RDBMS SQL queries.
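For example, a Java application can submit HiveQL through Hive's JDBC driver. The following is a minimal sketch that assumes a HiveServer2 instance on localhost and a hypothetical logs table; the host, port, credentials, and table are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQLExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; connection details are placeholders
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // A familiar SQL-like query; Hive compiles it into MapReduce jobs over HDFS data
        ResultSet rs = stmt.executeQuery(
            "SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level");
        while (rs.next()) {
          System.out.println(rs.getString("level") + "\t" + rs.getLong("cnt"));
        }
        rs.close();
        stmt.close();
        con.close();
      }
    }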

Apache ZooKeeper

Apache Hadoop nodes communicate with each other through Apache ZooKeeper, which forms a mandatory part of the Apache Hadoop ecosystem. Apache ZooKeeper is responsible for maintaining coordination among various nodes. Besides coordinating nodes, it also maintains configuration information and provides group services to the distributed system. Unlike other components of the ecosystem, Apache ZooKeeper can be used independently of Hadoop. Because it manages its information in memory, it offers distributed coordination at high speed.
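Here is a minimal sketch of storing and reading a piece of shared configuration with the ZooKeeper Java client. The ensemble address, znode path, and value are assumptions, and the znode is assumed not to exist yet.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperConfigExample {
      public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (the address is an assumption)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
          public void process(WatchedEvent event) {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
              connected.countDown();
            }
          }
        });
        connected.await();

        // Publish a small piece of configuration as a znode under the root ...
        zk.create("/demo-config", "batch.size=128".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // ... and read it back; every client connected to the ensemble sees the same value
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
      }
    }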

Apache Mahout

Apache Mahout is an open source machine learning software library that can effectively empower Hadoop users with analytical capabilities, such as clustering, data mining, and so on, over a distributed Hadoop cluster. Mahout is highly effective over large datasets; the algorithms it provides are highly optimized to run as MapReduce jobs over HDFS.

Apache HCatalog

Apache HCatalog provides metadata management services on top of Apache Hadoop. This means that all the software that runs on Hadoop can effectively use HCatalog to store its schemas in HDFS. HCatalog helps any third-party software to create, edit, and expose (using REST APIs) the generated metadata or table definitions. So, any user or script can run Hadoop effectively without actually knowing where the data is physically stored on HDFS. HCatalog provides DDL (Data Definition Language) commands with which the requested MapReduce, Pig, and Hive jobs can be queued for execution and later monitored for progress as and when required.

Apache Ambari

Apache Ambari provides a set of tools to monitor an Apache Hadoop cluster while hiding the complexities of the Hadoop framework. It offers features such as an installation wizard, system alerts and metrics, provisioning and management of the Hadoop cluster, job performance monitoring, and so on. Ambari exposes RESTful APIs for administrators to allow integration with any other software.

Apache Avro

Since Hadoop deals with large datasets, it becomes very important to process them optimally and store them effectively on disk. Such large data should be efficiently organized so that different programming languages can read it; Apache Avro helps you do exactly that. Avro provides data compression and storage at the various nodes of Apache Hadoop. Avro-based stores can easily be read using scripting languages as well as Java. Avro provides dynamic access to data: because the schema is stored along with the data, software can read arbitrary datasets without knowing their structure in advance. Avro can be effectively used in the Apache Hadoop MapReduce framework for data serialization.
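As a small sketch, the snippet below defines an inline schema, writes one record to an Avro container file, and reads it back using the Avro Java library; the file name and field names are arbitrary choices for illustration.

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // A simple record schema, defined inline as JSON
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Write a compact, schema-tagged binary container file
        File file = new File("users.avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(user);
        writer.close();

        // Read it back; the schema travels with the data, so no external definition is needed
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
          System.out.println(reader.next());
        }
        reader.close();
      }
    }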

Apache Sqoop

Apache Sqoop is a tool designed to load large datasets into Hadoop efficiently. Apache Sqoop allows application developers to easily import/export data from specific data sources such as relational databases, enterprise data warehouses, and custom applications. Apache Sqoop internally uses map tasks to perform the data import/export efficiently on the Hadoop cluster; each mapper loads/unloads a slice of data between HDFS and the data source. Apache Sqoop establishes connectivity between non-Hadoop data sources and HDFS.
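Sqoop is normally driven from the command line, but the same import can be sketched from Java through Sqoop's tool runner. The JDBC URL, credentials, table name, and target directory below are placeholders, and the sketch assumes a Sqoop 1.x client on the classpath.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
      public static void main(String[] args) {
        // Equivalent to running "sqoop import ..." from the command line;
        // connection details and table name are placeholders
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--num-mappers", "4"   // four parallel map tasks, each loading a slice of the table
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
      }
    }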

Apache Flume

Apache Flume provides a framework to populate Hadoop with data from non-conventional data sources. A typical use of Apache Flume is log aggregation. Apache Flume is a distributed data collection service that receives flows of data from their sources, aggregates them, and puts them into HDFS. Most of the time, Apache Flume is used as an ETL (Extract-Transform-Load) utility in various implementations of the Hadoop cluster.

We have gone through the complete ecosystem of Apache Hadoop. Together, these components make Hadoop one of the most powerful distributed computing platforms available today. Many companies offer commercial implementations of and support for Hadoop. Among them is Cloudera, a company that provides an open source distribution of Apache Hadoop called CDH (Cloudera's Distribution including Apache Hadoop), which enables organizations to run a commercially supported Hadoop setup. Similarly, companies such as IBM, Microsoft, MapR, and Talend provide implementation and support for the Hadoop framework.