Scaling Apache Solr

By Hrishikesh Vijay Karambelkar

Apache Solr architecture


In the previous section, we went through the various key features supported by Apache Solr. In this section, we will look at the architecture of Apache Solr. Apache Solr is a J2EE-based application that internally uses Apache Lucene libraries to generate the indexes as well as to provide a user-friendly search. Let's look at the Solr architecture diagram as follows:

An Apache Solr instance can run as a single core or multicore, following a client-server model. In the case of multicore, however, the search access pattern can differ; we are going to look into this in the next chapter. Earlier, Apache Solr had a single core, which limited consumers to running Solr for one application through a single schema and configuration file. Later, support for creating multiple cores was added. With this support, one can run a single Solr instance for multiple schemas and configurations with unified administration. For high availability and scalability requirements, Apache Solr can run in a distributed mode; we are going to look at it in Chapter 6, Distributed Search Using Apache Solr. The overall Solr functionality can be divided into four logical layers: the storage layer is responsible for the management of indexes and configuration metadata; the container is the J2EE container on which the instance runs; the Solr engine is the application package that runs on top of the container; and, finally, the interaction layer covers how clients/browsers can interact with the Apache Solr server. Let's look at each of these components in detail in the upcoming sections.

Storage

The storage of Apache Solr is mainly used for storing metadata and the actual index information. It is typically a file store, configured locally in the Apache Solr configuration. The default Solr installation package ships with a Jetty servlet container and HTTP server; the respective configuration can be found in the $solr.home/conf folder of the Solr installation. An index contains a sequence of documents. Additionally, external storage devices, such as databases or Big Data storage systems, can be configured with Apache Solr. An index is composed of the following components:

  • A document is a collection of fields

  • A field is a named sequence of terms

  • A term is a string

The same string in two different fields is considered a different term. The index stores statistics about terms in order to make term-based search more efficient. Lucene's index falls into the family of indexes known as inverted indexes, because it can list, for a term, the documents that contain it. The index of Apache Solr (and the underlying Lucene) is a specially designed data structure, stored in the filesystem as a set of index files. The index format is designed specifically to maximize query performance.

Note

An inverted index is an index data structure that stores a mapping from content, such as words and numbers, to its locations on the storage disk. Consider the following strings:

Str[1] = "This is a game of team"
Str[2]="I do not like a game of cricket"
Str[3]="People play games everyday"

We have the following (partial) inverted file index, assuming terms are lowercased and games in Str[3] is stemmed to game:

this {1}
game {1, 2, 3}
of {1, 2}
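
As a minimal illustration in plain Java (a sketch for this note, not Solr or Lucene code), an inverted index can be built as a map from each term to the set of documents that contain it. Note that without stemming, games in Str[3] remains a separate term, so this sketch prints game [1, 2]:

import java.util.*;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        String[] docs = {
            "This is a game of team",
            "I do not like a game of cricket",
            "People play games everyday"
        };
        // Map each lowercased term to the set of document numbers that contain it
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.length; i++) {
            for (String term : docs[i].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(i + 1);
            }
        }
        // Prints, among other entries: game [1, 2] and of [1, 2]
        index.forEach((term, postings) -> System.out.println(term + " " + postings));
    }
}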

Solr application

There are two major functions that Solr supports: indexing and searching. Initially, data is uploaded to Apache Solr through various means; there are handlers to handle data of specific categories (XML, CSV, PDF, database, and so on). Once the data is uploaded, it goes through a cleanup stage called the update processor chain. In this chain, an initial de-duplication phase can remove duplicates in the data to prevent them from appearing in the index unnecessarily. Each update handler can have its own update processor chain that can do document-level operations prior to indexing, or even redirect indexing to a different server or create multiple documents (or zero) from a single one. The data is then transformed depending upon its type.
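
To make the client side of this concrete, the following is a minimal SolrJ sketch that uploads a single document for indexing; the URL, core name (collection1), and field names are assumptions for illustration, and the document passes through the configured update handler and its processor chain on the server:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexingDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr instance with a core named "collection1" on localhost
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Scaling Apache Solr");
        client.add(doc);   // sent through the update handler and its processor chain
        client.commit();   // make the document visible to searchers
        client.close();
    }
}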

Apache Solr can run in a master-slave mode. The index replicator is responsible for distributing indexes across multiple slaves. The master server maintains the index updates, and the slaves are responsible for talking to the master to get the indexes replicated for high availability. The Apache Lucene core is packaged as a library with the Apache Solr application. It provides core functionality for Solr such as indexing, query processing, searching data, ranking matched results, and returning them back.

Apache Lucene comes with a variety of query implementations. The query parser is responsible for parsing the queries passed by the end user as a search string. Lucene provides TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, RangeQuery, MultiTermQuery, FilteredQuery, SpanQuery, and so on as query implementations.
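
As a rough sketch of how these query objects compose (assuming a recent Lucene version; Solr normally builds such objects for you through its query parsers):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryDemo {
    public static void main(String[] args) {
        // A TermQuery matches documents that contain a single term in a field
        Query game = new TermQuery(new Term("body", "game"));
        Query cricket = new TermQuery(new Term("body", "cricket"));
        // A BooleanQuery combines clauses with MUST, SHOULD, and MUST_NOT occurrences
        Query combined = new BooleanQuery.Builder()
                .add(game, BooleanClause.Occur.MUST)
                .add(cricket, BooleanClause.Occur.MUST_NOT)
                .build();
        System.out.println(combined); // prints: +body:game -body:cricket
    }
}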

The index searcher is a basic component of Solr search, with a default base searcher class. This class is responsible for returning ordered matched results for the searched keyword, ranked as per the computed score. The index reader provides access to the indexes stored in the filesystem. It can be used to search over an index. Similar to the index searcher, an index writer allows you to create and maintain indexes in Apache Lucene.

The analyzer is responsible for examining the fields and generating tokens. The tokenizer breaks the field data into lexical units, or tokens. The filter examines the stream of tokens from the tokenizer and either keeps them, transforms them, or discards them and creates new ones. Tokenizers and filters together form a chain, or pipeline, of analysis; there can only be one tokenizer per analyzer, and the output of one stage is fed to the next. The analysis process is used by Solr at both index time and query time. Analyzers play an important role in speeding up queries as well as indexing, and they also reduce the amount of data that these operations generate. You can define your own custom analyzers depending upon your use case.

In addition to the analyzer, Apache Solr allows administrators to make the search experience more effective by taking out common words such as is, and, and are through the stopwords feature. Solr supports synonyms, thereby not limiting search to pure text matches. Through the process of stemming, words such as played, playing, and play can all be reduced to their base form. Similarly, a user can search with multiple forms of a single word (for example, play, played, and playing). We are going to look at these features in the coming chapters and the appendix. When a user fires a search query at Solr, it actually gets passed on to a request handler. By default, Apache Solr provides DisMaxRequestHandler. You can visit http://wiki.apache.org/solr/DisMaxRequestHandler to find more details about this handler. Based on the request, the request handler calls the query parser. You can see an example of the filter in the following figure:
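
In the same vein, an analysis chain can be exercised in code directly through Lucene's StandardAnalyzer, which combines a tokenizer with lowercasing (and, depending on the version, stopword removal). This is an illustrative sketch assuming a recent Lucene version; in Solr, the chain is normally declared in the schema:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Run the field text through the tokenizer/filter pipeline
        TokenStream stream = analyzer.tokenStream("title", "People PLAY games everyday");
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // e.g. people, play, games, everyday
        }
        stream.end();
        stream.close();
        analyzer.close();
    }
}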

The query parser is responsible for parsing the queries and converting them into Lucene query objects. There are different types of parsers available (Lucene, DisMax, eDisMax, and so on). Each parser offers different functionality, and it can be chosen based on the requirements. Once a query is parsed, the parser hands it over to the index searcher, which runs the query on the index store and gathers the results for the response writer.

The response writer is responsible for responding back to the client; it formats the query response based on the search outcomes from the Lucene engine. The following figure displays the complete process flow when a search is fired from the client:

Apache Solr ships with an example search interface that runs using Apache Velocity. Apache Velocity is a fast, open source template engine that can quickly generate an HTML-based frontend. Users can customize these templates as per their requirements, although in many cases they are not used in production.

Index handlers are a type of update handler, handling the tasks of adding, updating, and deleting documents for indexing. Apache Solr supports updates through the index handler in the JSON, XML, and text formats.

Data Import Handler (DIH) provides a mechanism for integrating different data sources with Apache Solr for indexing. The data sources could be relational databases or web-based sources (for example, RSS, ATOM feeds, and e-mails).

Tip

Although DIH is a part of Solr development, the default installation does not include it in the Solr application; it needs to be included in the application explicitly.

Apache Tika, a project in itself, extends the capabilities of Apache Solr to work with different types of files. When a document is passed to Tika, it automatically determines the type of file (that is, Word, Excel, or PDF) and extracts the content. Tika also extracts document metadata such as the author, title, and creation date, which, if provided for in the schema, go in as text fields in Apache Solr. These can later be used as facets for the search interface.
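
The following is a minimal sketch of this extraction using the Tika facade directly; the input file name is hypothetical, and within Solr this capability is normally exposed through the ExtractingRequestHandler (Solr Cell):

import java.io.File;
import org.apache.tika.Tika;

public class TikaDemo {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("report.pdf"); // hypothetical input file
        // Detect the file type, then extract its textual content
        System.out.println("Type: " + tika.detect(file));
        System.out.println("Text: " + tika.parseToString(file));
    }
}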

Integration

Apache Solr, although a web-based application, can be integrated with different technologies. So, if a company has a Drupal-based e-commerce site, it can integrate the Apache Solr application and provide its rich faceted search to the users. Solr can also support advanced searches using range queries.

Client APIs and SolrJ client

The Apache Solr client provides different ways of talking to the Apache Solr web application. This enables Solr to be easily integrated with any application. Using the client APIs, consumers can run searches and perform different operations on indexes. The Solr Java (SolrJ) client is an interface to Apache Solr from Java. The SolrJ client enables any Java application to talk directly to Solr through its extensive library of APIs. Apache SolrJ is a part of the Apache Solr package.
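
A minimal SolrJ search sketch follows; the URL, core name, and field names are assumptions for illustration:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();
        SolrQuery query = new SolrQuery("title:solr"); // parsed by the configured query parser
        query.setRows(10);
        QueryResponse response = client.query(query);
        // Iterate over the ranked, matched documents
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        client.close();
    }
}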

Other interfaces

Apache Solr can be integrated with various other technologies using its API library and standards-based interfacing. JavaScript-based clients can talk to Solr directly using JSON-based messaging. Similarly, other technologies can simply connect to a running Apache Solr instance through HTTP and consume its services in either JSON, XML, or text formats. Since Solr can be interacted with in standard ways, clients can always build their own pretty user interface to interact with the Solr server.
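
For instance, any client that can issue HTTP requests can consume Solr's services. The following sketch uses Java 11's built-in HTTP client against an assumed local core to fetch JSON results:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpJsonDemo {
    public static void main(String[] args) throws Exception {
        // Query the (assumed) collection1 core over plain HTTP, asking for JSON output
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/collection1/select?q=*:*&wt=json"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // raw JSON search results
    }
}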