Scaling Big Data with Hadoop and Solr

By: Hrishikesh Vijay Karambelkar

Overview of this book

As data grows exponentially day by day, extracting information from it becomes a tedious activity in itself. Technologies like Hadoop address some of these concerns, while Solr provides high-speed faceted search. Bringing the two together helps organizations extract information from Big Data through distributed faceted search.

Scaling Big Data with Hadoop and Solr is a step-by-step guide to building high-performance enterprise search engines while scaling data. Starting with the basics of Apache Hadoop and Solr, it moves on to advanced search-optimization topics, illustrated with real-world use cases and sample Java code. The book first teaches the basics of Big Data technologies, including Hadoop and its ecosystem and Apache Solr. It then explains the different approaches to scaling Big Data with Hadoop and Solr, discussing the applicability, benefits, and drawbacks of each. Next, it walks readers through sharding and indexing Big Data, followed by performance optimization of Big Data search. Finally, it covers some real-world use cases for Big Data scaling.

With this book, you will learn how to build a distributed enterprise search platform and how to optimize its search so that it makes the best use of available resources.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Processing Big Data Using Hadoop and MapReduce

Understanding Apache Hadoop and its ecosystem

Storing large data in HDFS

Creating MapReduce to analyze Hadoop data

Installing and running Hadoop

Managing a Hadoop cluster

Summary

Understanding Solr

Installing Solr

Apache Solr architecture

Configuring Apache Solr search

Loading your data for search

Summary

Making Big Data Work for Hadoop and Solr

The problem

Understanding data-processing workflows

Using Solr 1045 patch – map-side indexing

Using Solr 1301 patch – reduce-side indexing

Using SolrCloud for distributed search

Using Katta for Big Data search (Solr-1395 patch)

Summary

Using Big Data to Build Your Large Indexing

Understanding the concept of NOSQL

The CAP theorem

Understanding the concepts of distributed search

Lily – running Solr and Hadoop together

Deep dive – shards and indexing data of Apache Solr

Configuring SolrCloud to work with large indexes

Summary

Improving Performance of Search while Scaling with Big Data

Understanding the limits

Optimizing the search schema

Index optimization

Optimizing the search runtime

Monitoring the Solr instance

Summary

Use Cases for Big Data Search

E-commerce websites

Log management for banking

Creating Enterprise Search Using Apache Solr

Sample MapReduce Programs to Build the Solr Indexes

The Solr-1045 patch – map program

The Solr-1301 patch – reduce-side indexing

Katta

Index

Index

A

Apache Ambari
- about / Apache Ambari
Apache Avro
- about / Apache Avro, The architecture
Apache Flume
- about / Apache Flume
Apache Hadoop
- about / Understanding Apache Hadoop and its ecosystem, Distributed search architecture
- components / Understanding Apache Hadoop and its ecosystem
- ecosystem / The ecosystem of Apache Hadoop
Apache HBase
- about / Apache HBase
Apache HCatalog
- about / Apache HCatalog
Apache Hive
- about / Apache Hive
Apache Lucene
- about / Understanding the limits
Apache Mahout
- about / Apache Mahout
Apache Pig
- about / Apache Pig
Apache Solr
- about / The problem
- benefits / The problem
- issues / The problem
Apache Solr instance
- setting up / Setting up the Apache Solr instance
Apache Solr search
- configuring / Configuring Apache Solr search
- schema, defining for instance / Defining a Schema for your instance
- Solr instance, configuring / Configuring a Solr instance
- request handlers / Request handlers and search components
- search components / Request handlers and search components
- facets / Facet
- MoreLikeThis component / MoreLikeThis
- highlight search component / Highlight
- SpellCheck component / SpellCheck
- metadata management / Metadata management
Apache Sqoop
- about / Apache Sqoop
Apache Tika / The query parser
Apache ZooKeeper
- about / Apache ZooKeeper
AP system
- about / The CAP theorem
architecture, distributed search / Distributed search architecture
architecture, HDFS / HDFS architecture
- NameNode / NameNode
- DataNode / DataNode
- Secondary NameNode / Secondary NameNode
architecture, Katta / Katta architecture
architecture, Lily / The architecture
- Write-Ahead Log (WAL) / Write-ahead Logging
- message queue / The message queue
- querying / Querying using Lily
- records, updating / Updating records using Lily
architecture, Map-Reduce
- about / MapReduce architecture
- JobTracker / MapReduce architecture, JobTracker
- TaskTracker / MapReduce architecture, TaskTracker
architecture, Solr
- about / Apache Solr architecture
- storage / Storage
architecture, SolrCloud / SolrCloud architecture
autoCommit directive / Configuration files

B

Big Data storage
- Solr, using for / How Solr can be used for Big Data storage?
Brewer's theorem
- about / The CAP theorem

C

Cache Autowarming
- about / Optimizing the Solr cache
capacity-scheduler.xml / Hadoop configuration
CAP theorem
- about / The CAP theorem
- NOSQL database / What is a NOSQL database?
CA system
- about / The CAP theorem
CDH
- about / Apache Flume
checkpoints
- about / Secondary NameNode
client APIs, Solr engine / Client APIs and SolrJ client
Cloudera
- about / Apache Flume
collection
- about / Using SolrCloud for distributed search
collections
- creating, in SolrCloud / Creating shards, collections, and replicas in SolrCloud
column store, NOSQL database / The key-value store or column store
commit console, SolrMeter / Using SolrMeter
commit operation
- about / When to commit changes?
- performing / When to commit changes?
common-logging.properties / Hadoop configuration
components, Apache Hadoop
- HDFS / Understanding Apache Hadoop and its ecosystem
- MapReduce framework / Understanding Apache Hadoop and its ecosystem
- Apache HBase / Apache HBase
- Apache Pig / Apache Pig
- Apache Hive / Apache Hive
- Apache ZooKeeper / Apache ZooKeeper
- Apache Mahout / Apache Mahout
- Apache HCatalog / Apache HCatalog
- Apache Ambari / Apache Ambari
- Apache Avro / Apache Avro
- Apache Sqoop / Apache Sqoop
- Apache Flume / Apache Flume
concurrent clients
- optimizing / Optimizing concurrent clients
configuration, Apache Solr search / Configuring Apache Solr search
configuration, Katta cluster / Configuring Katta cluster
configuration, search schema fields / Configuring search schema fields
configuration, SolrCloud / Configuring SolrCloud
configuration, Solr instance / Configuring a Solr instance
configuration files, Solr
- solrconfig.xml / Storage
- schema.xml / Storage
- solr.xml / Storage
- about / Configuration files
container
- optimizing / Optimizing the container
core-site.xml / Hadoop configuration
CP system
- about / The CAP theorem
CSVDocumentConverter class
- about / Using Solr 1301 patch – reduce-side indexing
CSVIndexer class
- about / Using Solr 1301 patch – reduce-side indexing
CSVMapper class
- about / Using Solr 1301 patch – reduce-side indexing
CSVReducer class
- about / Using Solr 1301 patch – reduce-side indexing
curl utility / Installing Solr
currency.txt / Metadata management
custom partitioning
- about / Deep dive – shards and indexing data of Apache Solr

D

data
- organizing / Organizing data
- loading, for search / Loading your data for search
dataDir directive / Configuration files
Data Import Handler (DIH) / The query parser, Loading your data for search
DataNode
- about / DataNode
data processing workflows
- about / Understanding data-processing workflows
- standalone machine / The standalone machine
- distributed setup / Distributed setup
- replicated mode / The replicated mode
- sharded mode / The sharded mode
DDL (Data Definition Language)
- about / Apache HCatalog
default search field
- specifying / Specifying the default search field
DisMaxQueryParser
- about / SolrJ
DisMaxRequestHandler / The query parser
distributed deadlock
- about / Understanding the limits
distributed search
- SolrCloud, using for / Using SolrCloud for distributed search
- about / Understanding the concepts of distributed search
- architecture / Distributed search architecture
- scenarios / Distributed search scenarios
distributed search, Apache Solr
- limitations / Understanding the limits
distributed setup, data processing workflows / Distributed setup
distributed shard
- document, adding to / Adding a document to the distributed shard
document
- about / The document-oriented store
- adding, to distributed shard / Adding a document to the distributed shard
document-oriented store, NOSQL database / The document-oriented store
document cache, Solr cache optimization / The document cache

E

e-commerce websites
- about / E-commerce websites
- benefits / E-commerce websites
elevate.txt / Metadata management
Ephemeral node
- about / The sharding algorithm
ETL (Extract-Transform-Load)
- about / Apache Flume
ExtendedDisMaxQueryParser
- about / SolrJ

F

faceted browsing / The query parser
facets, Apache Solr search / Facet
Fair-scheduler.xml / Hadoop configuration
field value cache, Solr cache optimization / The field value cache
filter cache, Solr cache optimization / The filter cache
filter directive / Configuration files
filter queries
- search runtime, optimizing / Filter queries

G

graph database, NOSQL database / The graph database

H

Hadoop
- operations / Accessing HDFS
- installing / Installing and running Hadoop
- running / Installing and running Hadoop
- prerequisites / Prerequisites
- installing, on machines / Installing Hadoop on machines
- URL / Installing Hadoop on machines
- program, running / Running a program on Hadoop
- search, optimizing / Optimizing search on Hadoop
Hadoop-env.sh / Hadoop configuration
Hadoop-policy.xml / Hadoop configuration
Hadoop cluster
- managing / Managing a Hadoop cluster
Hadoop configuration
- about / Hadoop configuration
- core-site.xml / Hadoop configuration
- hdfs-site.xml / Hadoop configuration
- mapred-site.xml / Hadoop configuration
- common-logging.properties / Hadoop configuration
- capacity-scheduler.xml / Hadoop configuration
- Fair-scheduler.xml / Hadoop configuration
- Hadoop-env.sh / Hadoop configuration
- Hadoop-policy.xml / Hadoop configuration
- Masters/slaves / Hadoop configuration
- Log4j.properties / Hadoop configuration
Hadoop data analysis
- MapReduce, creating for / Creating MapReduce to analyze Hadoop data
HBase / The architecture
HDFS
- large data, storing / Storing large data in HDFS
- architecture / HDFS architecture
- objectives / HDFS architecture
- accessing / Accessing HDFS
HDFS-APIs
- about / Accessing HDFS
hdfs-site.xml / Hadoop configuration
highlight search component, Apache Solr search / Highlight
Hunspell algorithm
- about / Stemming

I

indexConfig directive / Configuration files
indexes
- creating, for Katta / Katta
index handler / The query parser
indexing / Storage
indexing buffer size
- limiting / Limiting the indexing buffer size
index merge
- optimizing / Optimizing the index merge
index optimization
- about / Index optimization
- indexing buffer size, limiting / Limiting the indexing buffer size
- commit operation, performing / When to commit changes?
- index merge, optimizing / Optimizing the index merge
- optimize option, for index merging / Optimize an option for index merging
- container, optimizing / Optimizing the container
- concurrent clients, optimizing / Optimizing concurrent clients
- Java Virtual Machine (JVM), optimizing / Optimizing the Java virtual memory
index partitioning, Apache Solr
- simple partitioning / Deep dive – shards and indexing data of Apache Solr
- prefix-based partitioning / Deep dive – shards and indexing data of Apache Solr
- custom partitioning / Deep dive – shards and indexing data of Apache Solr
index reader / The query parser
installation, Hadoop / Installing and running Hadoop
installation, Lily / Installing and running Lily
installation, Solr / Installing Solr
interaction, Solr engine / Interaction
interfaces, Solr engine / Other interfaces

J

Java Virtual Machine (JVM)
- optimizing / Optimizing the Java virtual memory
JConsole / Monitoring the Solr instance
JCR (Java Content Repository)
- about / The architecture
Jmx directive / Configuration files
JobTracker
- about / JobTracker
JVisualVM / Monitoring the Solr instance

K

Katta
- about / Using Katta for Big Data search (Solr-1395 patch), Katta
- architecture / Katta architecture
- benefits / Benefits
- drawbacks / Drawbacks
- indexes, creating for / Katta
Katta cluster
- configuring / Configuring Katta cluster
Katta indexes
- creating / Creating Katta indexes
key-value store, NOSQL database / The key-value store or column store
KStem algorithm
- about / Stemming

L

laggard problem
- about / Understanding the limits
large data
- storing, in HDFS / Storing large data in HDFS
lazy field loading, Solr cache optimization / Lazy field loading
lib directive / Configuration files
Lily
- about / Lily – running Solr and Hadoop together
- architecture / The architecture
- used, for running user query / Querying using Lily
- used, for updating records / Updating records using Lily
- installing / Installing and running Lily
- running / Installing and running Lily
Lily Data Repository (Lily DR)
- about / The architecture
Listener directive / Configuration files
lockType directive / Configuration files
Log4j.properties / Hadoop configuration
log management, for banking
- about / Log management for banking
- issues / The problem
- issues, tackling / How can it be tackled?
- high-level design / High-level design
luceneMatchVersion directive / Configuration files
LucidWorks
- URL / Installing Solr

M

Map-Reduce
- architecture / MapReduce architecture
map-side indexing / Using Solr 1045 patch – map-side indexing
mapred-site.xml / Hadoop configuration
MapReduce
- about / Understanding Apache Hadoop and its ecosystem
- creating, for Hadoop data analysis / Creating MapReduce to analyze Hadoop data
MapReduce approach
- about / Understanding Apache Hadoop and its ecosystem
MapReduce program
- Solr-1045 patch / The Solr-1045 patch – map program
- Solr-1301 / The Solr-1301 patch – reduce-side indexing
Map Task
- about / Understanding Apache Hadoop and its ecosystem
Masters/slaves / Hadoop configuration
maxBufferedDocs directive / Configuration files
maxIndexingThreads directive / Configuration files
message queue / The message queue
metadata management, Apache Solr search / Metadata management
MongoDB / How Solr can be used for Big Data storage?
MoreLikeThis component, Apache Solr search / MoreLikeThis
multi-core Solr search
- using, on SolrCloud / Using multicore Solr search on SolrCloud

N

NameNode
- about / NameNode
NOSQL database
- key-value store / The key-value store or column store
- column store / The key-value store or column store
- document-oriented store / The document-oriented store
- graph database / The graph database
NOSQL databases
- about / Understanding the concept of NOSQL, What is a NOSQL database?
- need for / Why NOSQL databases for Big Data?

O

OCR
- about / ExtractingRequestHandler/Solr Cell
optimize console, SolrMeter / Using SolrMeter
optimize option
- for index merging / Optimize an option for index merging

P

Pig Latin
- about / Apache Pig
pipeline-based workflow
- about / Understanding data-processing workflows
- advantages / Understanding data-processing workflows
Porter algorithm
- about / Stemming
prefix-based partitioning
- about / Deep dive – shards and indexing data of Apache Solr
program
- running, on Hadoop / Running a program on Hadoop
protwords.txt / Metadata management, protwords.txt

Q

query console, SolrMeter / Using SolrMeter
Query directive / Configuration files
query parser, Solr engine / The query parser
queryParser directive / Configuration files
queryResponseWriter directive / Configuration files
query result cache, Solr cache optimization / The query result cache

R

ramBufferSizeMB directive / Configuration files
records
- updating, Lily used / Updating records using Lily
RecordWriter
- about / The Solr-1301 patch – reduce-side indexing
Reduce Tasks
- about / Understanding Apache Hadoop and its ecosystem
replicas
- creating, in SolrCloud / Creating shards, collections, and replicas in SolrCloud
replicated mode, data processing workflows / The replicated mode
requestDispatcher directive / Configuration files
requestHandler directive / Configuration files
request handlers, Apache Solr search / Request handlers and search components
Response Writer / The query parser

S

schema.xml / Storage, schema.xml
search
- data, loading for / Loading your data for search
- optimizing, on Hadoop / Optimizing search on Hadoop
searchComponent directive / Configuration files
search components, Apache Solr / Request handlers and search components
search query
- search runtime, optimizing / Optimizing through search queries
search runtime
- optimizing / Optimizing the search runtime
- optimizing, through search query / Optimizing through search queries
- optimizing, through filter queries / Filter queries
search schema
- optimizing / Optimizing the search schema
search schema fields
- configuring / Configuring search schema fields
search schema optimization
- default search field, specifying / Specifying the default search field
- search schema fields, configuring / Configuring search schema fields
- stop words / Stop words
- stemming / Stemming
Secondary NameNode
- about / Secondary NameNode
sharded mode, data processing workflows / The sharded mode
sharding
- about / Understanding data-processing workflows, Deep dive – shards and indexing data of Apache Solr
Sharding algorithm
- about / The sharding algorithm
shards
- about / Understanding data-processing workflows
- creating, in SolrCloud / Creating shards, collections, and replicas in SolrCloud
simple partitioning
- about / Deep dive – shards and indexing data of Apache Solr
Snowball algorithm
- about / Stemming
Solr
- installing / Installing Solr
- architecture / Apache Solr architecture
- using, for Big Data storage / How Solr can be used for Big Data storage?
Solr-1045 patch
- about / Using Solr 1045 patch – map-side indexing, The Solr-1045 patch – map program
- using / Using Solr 1045 patch – map-side indexing
- URL, for downloading / Using Solr 1045 patch – map-side indexing
- benefits / Benefits
- drawbacks / Drawbacks
Solr-1301
- about / The Solr-1301 patch – reduce-side indexing
- used, for reduce-side indexing / The Solr-1301 patch – reduce-side indexing
solr.war / Installing Solr
solr.xml / Storage
solr.xml file / Configuration files
Solr 1301 patch
- using / Using Solr 1301 patch – reduce-side indexing
- running / Using Solr 1301 patch – reduce-side indexing
- benefits / Benefits
- drawbacks / Drawbacks
Solr cache
- optimizing / Optimizing the Solr cache
Solr cache optimization
- about / Optimizing the Solr cache
- filter cache / The filter cache
- query result cache / The query result cache
- document cache / The document cache
- field value cache / The field value cache
- lazy field loading / Lazy field loading
Solr Cell
- about / ExtractingRequestHandler/Solr Cell
SolrCloud
- about / Using SolrCloud for distributed search
- using, for distributed search / Using SolrCloud for distributed search
- architecture / SolrCloud architecture
- configuring / Configuring SolrCloud
- multi-core Solr search, using on / Using multicore Solr search on SolrCloud
- benefits / Benefits
- drawbacks / Drawbacks
- configuring, for large indexes / Configuring SolrCloud to work with large indexes
- shards, creating / Creating shards, collections, and replicas in SolrCloud
- collections, creating / Creating shards, collections, and replicas in SolrCloud
- replicas, creating / Creating shards, collections, and replicas in SolrCloud
solrconfig.xml / Storage, solrconfig.xml
solrconfig.xml file / Configuration files
SolrDocumentConverter class
- about / Using Solr 1301 patch – reduce-side indexing
Solr engine
- about / Solr engine
- query parser / The query parser
- interaction / Interaction
- client APIs / Client APIs and SolrJ client
- SolrJ client / Client APIs and SolrJ client
- interfaces / Other interfaces
SolrIndexUpdateMapper class / Using Solr 1045 patch – map-side indexing
SolrIndexUpdater class / Using Solr 1045 patch – map-side indexing
Solr instance
- configuring / Configuring a Solr instance
- monitoring / Monitoring the Solr instance
SolrJ
- about / SolrJ
SolrJ client, Solr engine / Client APIs and SolrJ client
SolrMeter
- about / Using SolrMeter
- using / Using SolrMeter
- query console / Using SolrMeter
- update console / Using SolrMeter
- commit console / Using SolrMeter
- optimize console / Using SolrMeter
SolrOutputFormat class
- about / Using Solr 1301 patch – reduce-side indexing
SolrRecordWriter class
- about / Using Solr 1301 patch – reduce-side indexing
SolrXMLDocRecordReader class / Using Solr 1045 patch – map-side indexing
spellcheck component, Apache Solr search / SpellCheck
spellings.txt / Metadata management, spellings.txt
ssh
- setting up, without passphrase / Setting up SSH without passphrases
standalone machine, data processing workflows / The standalone machine
stemming
- about / Stemming
stemming algorithms
- Porter / Stemming
- KStem / Stemming
- Snowball / Stemming
- Hunspell / Stemming
stop words
- about / Stop words
stopwords.txt / Metadata management, stopwords.txt
storage, Apache Solr / Storage
synonyms.txt / Metadata management, synonyms.txt

T

TaskTracker
- about / TaskTracker

U

unlockOnStartup directive / Configuration files
update console, SolrMeter / Using SolrMeter
updateHandler directive / Configuration files
updateLog directive / Configuration files
updateRequestProcessor chain / Configuration files
user query
- running, Lily used / Querying using Lily

W

Write-Ahead Log (WAL)
- about / Write-ahead Logging
writeLockTimeout directive / Configuration files

Z

znodes
- about / The sharding algorithm
ZooKeeper ensemble
- setting up / Setting up the ZooKeeper ensemble
