Apache Solr Enterprise Search Server - Third Edition

By: David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation features, which are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information to be just as applicable due to Solr's high regard for backward compatibility. The book includes some useful information specific to Solr 5.

An introduction to Solr


Solr is an open source enterprise search server. It is a mature product powering search for public sites such as CNET, Yelp, Zappos, and Netflix, as well as countless other government and corporate intranet sites. It is written in Java, and that language is used to further extend and modify Solr through various extension points. However, because Solr is a server that communicates using standards such as HTTP, XML, and JSON, knowledge of Java is useful but not a requirement. In addition to the standard ability to return a list of search results based on a full text search, Solr has numerous other features such as result highlighting, faceted navigation (as seen on most e-commerce sites), query spellcheck, query completion, and a "more-like-this" feature for finding similar documents.

Note

You will see many references in this book to the term faceting, also known as faceted navigation. It's a killer feature of Solr that most people have experienced at major e-commerce sites without realizing it. Faceting enhances search results with aggregated information over all of the documents found in the search. Faceting information is typically used as dynamic navigational filters, such as a product category, date and price groupings, and so on. Faceting can also be used to power analytics. Chapter 7, Faceting, is dedicated to this technology.
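
To make the idea of faceting concrete, here is a minimal sketch in plain Python. It does not use Solr at all; the sample documents, the field names, and the helper functions field_facet and range_facet are hypothetical and exist only for illustration. It shows the essence of the feature: counts aggregated over all of the documents found by a search, grouped by a field value or by a numeric range.

```python
from collections import Counter

# Hypothetical search results: each document is a flat dict of fields.
results = [
    {"id": 1, "category": "laptops", "price": 899.0},
    {"id": 2, "category": "laptops", "price": 1499.0},
    {"id": 3, "category": "tablets", "price": 329.0},
    {"id": 4, "category": "phones", "price": 699.0},
    {"id": 5, "category": "laptops", "price": 450.0},
]


def field_facet(docs, field):
    """Count how many matching documents carry each value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)


def range_facet(docs, field, start, end, gap):
    """Count documents per [lower, lower + gap) bucket between start and end."""
    buckets = {lower: 0 for lower in range(start, end, gap)}
    for doc in docs:
        value = doc.get(field)
        if value is None or not (start <= value < end):
            continue  # documents outside the range are simply not counted
        lower = start + int((value - start) // gap) * gap
        buckets[lower] += 1
    return buckets


category_counts = field_facet(results, "category")
price_counts = range_facet(results, "price", 0, 2000, 500)
```

A user interface would typically render category_counts and price_counts as clickable filters next to the result list. Solr computes such counts inside the server over the whole result set, not only over the page of documents returned to the client.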

Lucene – the underlying engine

Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. It is the most widely deployed search technology today. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files.

In order to use Lucene, you write your own search code using its API, starting with indexing documents that you supply to it. A document in Lucene is merely a collection of fields, which are name-value pairs containing text or numbers. You configure Lucene with a text analyzer that will tokenize a field's text from a single string into a series of tokens (words) and further transform them by reducing them to their stems (called stemming), substituting synonyms, and/or performing other processing. The final indexed tokens are said to be the terms. The aforementioned process starting with the analyzer is referred to as text analysis. Lucene indexes each document into its index, which is stored on disk. The index is an inverted index, which means it stores a mapping of a field's terms to associated documents, along with the ordinal word position from the original text. Finally, you search for documents with a user-provided query string that Lucene parses according to its syntax. Lucene assigns a numeric relevancy score to each matching document and only the top scoring documents are returned.
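
The analysis and indexing steps just described can be illustrated with a toy sketch in plain Python. This is not Lucene's API or implementation: the crude suffix-stripping stem function, the one-entry SYNONYMS table, and the build_index and lookup helpers are all assumptions made for illustration. The sketch tokenizes text, stems the tokens, substitutes synonyms, and stores the resulting terms in an inverted index that maps each field's terms to documents and ordinal word positions.

```python
import re
from collections import defaultdict

# Hypothetical synonym table; real analyzers are configured with far richer data.
SYNONYMS = {"laptop": "notebook"}


def stem(token):
    """Deliberately crude stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize and lowercase the text, then stem and map synonyms to get terms."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        stemmed = stem(token)
        terms.append(SYNONYMS.get(stemmed, stemmed))
    return terms


def build_index(docs):
    """Build field -> term -> doc id -> list of ordinal word positions."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for position, term in enumerate(analyze(text)):
                index[field][term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, field, query):
    """Return the ids of documents that contain every analyzed query term."""
    postings = [set(index[field].get(term, {})) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: {"title": "Searching laptops"},
    2: {"title": "Notebook search tips"},
    3: {"title": "Indexing documents"},
}
index = build_index(docs)
matches = lookup(index, "title", "laptop searches")
```

Because the query string passes through the same analyze function as the indexed text, the query "laptop searches" finds documents 1 and 2 even though neither contains those exact words. That symmetry between index-time and query-time analysis is the main point of the sketch.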

Note

The Lucene internals briefly described here are what make Solr work at its core. You will see these important vocabulary words throughout this book; they will be explained further at appropriate times.

Lucene's major features are:

  • An inverted index for efficient retrieval of documents by indexed terms. The same technology supports numeric data with range- and time-based queries too.

  • A rich set of chainable text analysis components, such as tokenizers and language-specific stemmers that transform a text string into a series of terms (words).

  • A query syntax with a parser and a variety of query types, from a simple term lookup to exotic fuzzy matching.

  • A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the best matches first, with flexible means to affect the scoring.

  • Search enhancing features. There are many, but here are some notable ones:

    • A highlighter feature to show matching query terms found in context.

    • A query spellchecker based on indexed content or a supplied dictionary.

    • Multiple suggesters for completing query strings.

    • Analysis components for various languages, faceting, spatial-search, and grouping and joining queries too.

    Note

    To learn more about Lucene, read Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher, and Otis Gospodnetić, Manning Publications.
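
The notion of scoring each match and returning only the best ones can be sketched as follows. The weighting used here is a simple TF-IDF-style sum chosen for illustration; it stands in for Lucene's real scoring and should not be read as its actual formula. The pre-analyzed sample documents and the score and top_k helpers are likewise hypothetical.

```python
import heapq
import math
from collections import Counter

# Hypothetical pre-analyzed documents: doc id -> list of terms.
docs = {
    1: ["solr", "search", "server"],
    2: ["lucene", "search", "library", "search"],
    3: ["database", "server"],
}


def score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum a simple tf * idf weight over the query terms found in the document."""
    counts = Counter(doc_terms)
    total = 0.0
    for term in query_terms:
        if counts[term]:
            # Terms that occur in fewer documents contribute more weight.
            idf = math.log(num_docs / doc_freq[term]) + 1.0
            total += counts[term] * idf
    return total


def top_k(query_terms, docs, k):
    """Return the k best (score, doc_id) pairs, highest score first."""
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    scored = (
        (score(query_terms, terms, doc_freq, len(docs)), doc_id)
        for doc_id, terms in docs.items()
    )
    matching = [(s, doc_id) for s, doc_id in scored if s > 0]
    return heapq.nlargest(k, matching)


best = top_k(["lucene", "search"], docs, k=2)
```

Document 2 ranks first because it contains the rarer term lucene and mentions search twice; document 3 matches neither term and is never returned. Using a heap to keep only the top k results reflects the fact that a search engine rarely needs to sort every match.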

Solr – a Lucene-based search server

Apache Solr is an enterprise search server that is based on Lucene. Lucene is such a big part of what defines Solr that you'll see many references to Lucene directly throughout this book. Developing a high-performance, feature-rich application that uses Lucene directly is difficult and it's limited to Java applications. Solr solves this by exposing the wealth of power in Lucene via configuration files and HTTP parameters, while adding some features of its own. Some of Solr's most notable features beyond Lucene are as follows:

  • A server that communicates over HTTP via multiple formats, including XML and JSON

  • Configuration files, most notably for the index's schema, which defines the fields and configuration of their text analysis

  • Several types of caches for faster search responses

  • A web-based administrative interface, including the following:

    • Runtime search and cache performance statistics

    • A schema browser with index statistics on each field

    • A diagnostic tool for debugging text analysis

    • Support for dynamic core (indices) administration

  • Faceting of search results (note: distinct from Lucene's faceting)

  • A query parser called eDisMax that is more usable for parsing end user queries than Lucene's native query parser

  • Distributed search support, index replication, and fail-over for scaling Solr

  • Cluster configuration and coordination using ZooKeeper

  • Solritas—a sample generic web search UI for prototyping and demonstrating many of Solr's search features

Also, there are two contrib modules that ship with Solr that really stand out, which are as follows:

  • DataImportHandler (DIH): A database, e-mail, and file crawling data import capability. It includes a debugger tool.

  • Solr Cell: An adapter to the Apache Tika open source project, which can extract text from numerous file types.

As of the 3.1 release, there is a tight relationship between the Solr and Lucene projects. The source code repository, committers, and developer mailing list are the same, and they are released together using the same version number. Since Solr is always based on the latest version of Lucene, most improvements in Lucene are available in Solr immediately.
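
Since Solr is driven by HTTP parameters, a search request is ultimately just a URL. The following sketch builds one with Python's standard library and decodes the facet section of a JSON response. The host and port, the core name products, and the field names are assumptions for illustration, and sample_response is a hand-written stand-in for what a server might return, not captured output. The parameter names themselves (q, defType, qf, facet, facet.field, hl, rows, and wt) are standard Solr request parameters.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, params):
    """Build a URL for a core's /select handler from (name, value) pairs."""
    return f"{base_url}/{core}/select?{urlencode(params)}"


# A list of pairs rather than a dict, because parameters such as
# facet.field may legitimately be repeated.
params = [
    ("q", "laptop"),
    ("defType", "edismax"),       # use the eDisMax query parser
    ("qf", "name description"),   # hypothetical fields to search
    ("facet", "true"),
    ("facet.field", "category"),  # hypothetical facet fields
    ("facet.field", "brand"),
    ("hl", "true"),
    ("rows", "10"),
    ("wt", "json"),
]
url = build_select_url("http://localhost:8983/solr", "products", params)

# Hand-written illustration of a response body. By default, Solr's JSON
# output lists field facets as a flat array of alternating values and counts.
sample_response = json.loads("""
{
  "response": {"numFound": 4, "start": 0,
               "docs": [{"id": "1", "name": "Laptop A"}]},
  "facet_counts": {"facet_fields": {"category": ["laptops", 3, "tablets", 1]}}
}
""")


def facet_pairs(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


category_counts = facet_pairs(sample_response, "category")
```

Nothing here is specific to Python. Any language or tool that can issue an HTTP GET and parse JSON or XML can talk to Solr in the same way, which is why knowledge of Java is not required.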

Comparison to database technology

There's a good chance that you are unfamiliar with Lucene or Solr, and you might be wondering what the fundamental differences are between them and a database. You might also wonder whether you still need a database if you use Solr.

The most important comparison to make is with respect to the data model, that is, the organizational structure of the data. The most popular category of databases is the relational database (RDBMS). A defining characteristic of relational databases is a data model based on multiple tables with lookup keys between them and a join capability for querying across them. That approach has proven to be versatile, being able to satisfy nearly any information-retrieval task in one query.

However, it is hard and expensive to scale relational databases to meet the requirements of a typical search application consisting of many millions of documents and low-latency response. Instead, Lucene has a much more limited document-oriented data model, which is analogous to a single table. Document-oriented databases such as MongoDB are similar in this respect, but their documents can be nested, similar to XML or JSON. Lucene's document structure is flat like a table, but it does support multivalued fields, that is, fields with an array of values. It can also be very sparse such that the actual fields used from one document to the next vary; there is no storage cost or other penalty for a document that does not use a field.
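
The difference between a nested document and Lucene's flat model can be sketched in a few lines of Python. The flatten helper, the underscore naming scheme for flattened fields, and the sample record are hypothetical choices made for illustration; this is only one of several ways you might map nested data onto flat, multivalued fields.

```python
def _add(flat, name, value):
    """Append a value (or list of values) to a field, creating it if needed."""
    values = value if isinstance(value, list) else [value]
    flat.setdefault(name, []).extend(values)


def flatten(record, prefix=""):
    """Collapse nested dicts and lists into one flat dict of fields.

    Repeated values become a multivalued field (a list); fields whose
    value is None are omitted, so the resulting document is sparse.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            for sub_name, sub_value in flatten(value, f"{name}_").items():
                _add(flat, sub_name, sub_value)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    for sub_name, sub_value in flatten(item, f"{name}_").items():
                        _add(flat, sub_name, sub_value)
                else:
                    _add(flat, name, item)
        elif value is not None:
            _add(flat, name, value)
    # Single-valued fields are stored as plain values rather than lists.
    return {k: v[0] if len(v) == 1 else v for k, v in flat.items()}


nested = {
    "id": "book-1",
    "title": "Apache Solr Enterprise Search Server",
    "authors": [{"name": "David Smiley"}, {"name": "Eric Pugh"}],
    "publisher": {"name": "Packt"},
    "subtitle": None,
}
flat_doc = flatten(nested)
```

The flattened document has a multivalued authors_name field and no subtitle field at all. Note what is lost: if each author also had, say, an affiliation, the flat form would hold two parallel lists and would no longer record which affiliation belongs to which author, which is one reason the flat model is more limited than a nested one.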

Note

Lucene and Solr have limited support for join queries, but these are used sparingly because they significantly reduce the scalability of Lucene and Solr.

Taking a look at the Solr feature list naturally reveals plenty of search-oriented technology that databases generally either don't have, or don't do well. The notable features are relevancy score ordering, result highlighting, query spellcheck, and query-completion. These features are what drew you to Solr, no doubt. And let's not forget faceting. This is possible with a database, but it's hard to figure out how, and it's difficult to scale. Solr, on the other hand, makes it incredibly easy, and it does scale.

Can Solr be a substitute for your database? You can add data to it and get it back out efficiently with indexes; so on the surface, it seems plausible. The answer is that you are almost always better off using Solr in addition to a database. Databases, particularly RDBMSes, generally excel at ACID transactions, insert/update efficiency, in-place schema changes, multiuser access control, bulk data retrieval, and they have second-to-none integration with application software stacks and reporting tools. And let's not forget that they have a versatile data model. Solr falls short in these areas.

Note

For more on this subject, see our article, Text Search, your Database or Solr, at http://bit.ly/uwF1ps. Although it is slightly outdated now, it is a clear and useful explanation of the issues. If you want to use Solr as a document-oriented or key-value NoSQL database, Chapter 4, Indexing Data, will tell you how and when it's appropriate.