Scaling Big Data with Hadoop and Solr

By: Hrishikesh Vijay Karambelkar

Overview of this book

As data grows exponentially day by day, extracting information from it becomes a tedious activity in itself. Technologies like Hadoop address some of these concerns, while Solr provides high-speed faceted search. Bringing the two together helps organizations extract information from Big Data by providing distributed faceted search capabilities.

Scaling Big Data with Hadoop and Solr is a step-by-step guide that helps you build high-performance enterprise search engines while scaling data. It starts with the basics of Big Data technologies, including Apache Hadoop and its ecosystem and Apache Solr. It then explains the different approaches to scaling Big Data with Hadoop and Solr, discussing the applicability, benefits, and drawbacks of each. Next, it walks through how sharding and indexing can be performed on Big Data, followed by performance optimization of Big Data search. Finally, it covers real-world use cases for Big Data scaling, with sample Java code.

With this book, you will learn what you need to build a distributed enterprise search platform and how to optimize search to make the most of available resources.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Processing Big Data Using Hadoop and MapReduce

Understanding Apache Hadoop and its ecosystem

Storing large data in HDFS

Creating MapReduce to analyze Hadoop data

Installing and running Hadoop

Managing a Hadoop cluster

Summary

Understanding Solr

Installing Solr

Apache Solr architecture

Configuring Apache Solr search

Loading your data for search

Summary

Making Big Data Work for Hadoop and Solr

The problem

Understanding data-processing workflows

Using Solr 1045 patch – map-side indexing

Using Solr 1301 patch – reduce-side indexing

Using SolrCloud for distributed search

Using Katta for Big Data search (Solr-1395 patch)

Summary

Using Big Data to Build Your Large Indexing

Understanding the concept of NOSQL

The CAP theorem

Understanding the concepts of distributed search

Lily – running Solr and Hadoop together

Deep dive – shards and indexing data of Apache Solr

Configuring SolrCloud to work with large indexes

Summary

Improving Performance of Search while Scaling with Big Data

Understanding the limits

Optimizing the search schema

Index optimization

Optimizing the search runtime

Monitoring the Solr instance

Summary

Use Cases for Big Data Search

E-commerce websites

Log management for banking

Creating Enterprise Search Using Apache Solr

Sample MapReduce Programs to Build the Solr Indexes

The Solr-1045 patch – map program

The Solr-1301 patch – reduce-side indexing

Katta

Index

Chapter 1. Processing Big Data Using Hadoop and MapReduce

Traditionally, computation has been processor driven. As data grew, the industry focused on increasing processor speed and memory to get better computational performance, and this gave birth to distributed systems. Today, applications create hundreds or thousands of gigabytes of data every day. This data comes from disparate sources such as application software, sensors, social media, mobile devices, logs, and so on. Data at this scale is difficult to process with standard data-processing software, mainly because its size grows exponentially with time. Traditional distributed systems were not sufficient to manage it, and there was a need for modern systems that could handle heavy data loads with scalability and high availability. Data of this kind is called Big Data.

Big data is usually associated with high-volume, rapidly growing data with unpredictable content. A video gaming company, for example, needs to predict the performance of over 500 GB of data structure and analyze over 4 TB of operational logs every day; many gaming companies use Big Data technologies to do so. The IT advisory firm Gartner defines big data using 3Vs (high volume of data, high velocity of processing, and high variety of information). IBM added a fourth V (high veracity) to its definition, to make sure the data is accurate and helps you make business decisions.

While the potential benefits of big data are real and significant, many challenges remain. Organizations that deal with such high volumes of data face the following problems:

Data acquisition: A lot of raw data is generated by various data sources. The challenge is to filter and compress this data, and to extract information from it once it is cleaned.
Information storage and organization: Once information has been captured from the raw data, a data model is created and stored on a storage device. Traditional relational systems stop being effective for storing datasets at such a high scale. A new breed of non-relational databases, called NOSQL databases, is mainly used to work with big data.
Information search and analytics: Storing data is only one part of building a warehouse. Data is useful only when it is computed upon. Big data is often noisy, dynamic, and heterogeneous, and it has to be searched, mined, and analyzed for behavioral modeling.
Data security and privacy: When bringing in linked data from multiple sources, organizations need to pay close attention to data security and privacy.
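
The data acquisition step above (filter, compress, then extract) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that raw records arrive as one JSON object per line; the function names `clean_and_compress` and `extract_field` and the sample sensor records are hypothetical and do not come from Hadoop or Solr.

```python
import gzip
import json


def clean_and_compress(raw_lines):
    """Drop blank or malformed records, then gzip the surviving ones."""
    cleaned = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # filter: skip empty records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # filter: skip records that are not valid JSON
        if not isinstance(record, dict):
            continue  # filter: keep only JSON objects
        cleaned.append(json.dumps(record, sort_keys=True))
    payload = "\n".join(cleaned).encode("utf-8")
    return cleaned, gzip.compress(payload)


def extract_field(compressed, field):
    """Decompress the cleaned records and pull one field out of each."""
    lines = gzip.decompress(compressed).decode("utf-8").splitlines()
    records = [json.loads(line) for line in lines]
    return [record[field] for record in records if field in record]


if __name__ == "__main__":
    # Hypothetical raw input: two valid records, one blank, one malformed.
    raw = [
        '{"sensor": "a", "value": 1}',
        "",
        "not json",
        '{"sensor": "b", "value": 2}',
    ]
    cleaned, blob = clean_and_compress(raw)
    print(len(cleaned), extract_field(blob, "sensor"))
```

At real big data volumes this work would be spread across many machines rather than run in a single process; the sketch only shows the order of the steps.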

Big data poses many challenges to the technologies in use today. It requires processing large quantities of data within a finite timeframe, which calls for technologies such as massively parallel processing (MPP) and distributed file systems.
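
The idea behind parallel processing can be sketched in a few lines of Python: split the data into chunks, process each chunk independently, and merge the partial results. This is only an illustration under simplifying assumptions. Threads on one machine stand in for the nodes of a cluster, the function names are hypothetical, and none of this is Hadoop's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per unit of work."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Process one chunk independently: count the words it contains."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, chunk_size=2, workers=4):
    """Count words chunk by chunk in a worker pool, then merge the results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # merge step: add up the per-chunk counts
    return total


if __name__ == "__main__":
    # Hypothetical log lines used only to exercise the sketch.
    logs = [
        "error disk full",
        "info started",
        "error network down",
        "info stopped",
        "error disk full",
    ]
    print(parallel_word_count(logs).most_common(3))
```

Because each chunk is processed without reference to the others, the same structure carries over when the chunks live on different machines and the merge step runs after the per-chunk work is complete.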

Big data is attracting more and more attention from organizations, and many of them have already started exploring it. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report which reveals that Big Data and analytics are among the top 10 business priorities for CIOs. Similarly, analytics and BI rank as the top technology priority for CIOs. In this chapter, we will try to understand Apache Hadoop. We will cover the following:

Understanding Apache Hadoop and its ecosystem
Storing large data in HDFS
Creating MapReduce to analyze the Hadoop data
Installing and running Hadoop
Managing and viewing a Hadoop cluster
Administration tools
