Apache Solr Enterprise Search Server

Apache Solr Enterprise Search Server - Third Edition

By : David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

<p>Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation, features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, relevancy tuning, geospatial searches, and much more.</p> <p>This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information to be just as applicable due to Solr's high regard for backward compatibility. The book includes some useful information specific to Solr 5.</p>

Apache Solr Enterprise Search Server Third Edition

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Quick Starting Solr

An introduction to Solr

A few differences between Solr 4 and Solr 5

Resources outside this book

Summary

Schema Design

Is Solr schemaless?

MusicBrainz.org

One combined index or separate indices

Schema design

The schema.xml file

Summary

Text Analysis

Configuring field types

Character filters

Tokenization

Filtering

The multilingual search

Summary

Indexing Data

Communicating with Solr

Solr's Update-XML format

Commit, optimize, and rollback the transaction log

Atomic updates and optimistic concurrency

Sending CSV-formatted data to Solr

The DataImportHandler framework

Indexing documents with Solr Cell

Update request processors

Summary

Searching

Your first search – a walk-through

Solr's generic XML structured data representation

Solr's XML response format

Understanding request handlers

Query parameters

Query parsers and local-params

Query syntax (the lucene query parser)

The DisMax query parser – part 1

Filtering

Sorting

Joining

Spatial search

Summary

Search Relevancy

Scoring

The DisMax query parser – part 2

Functions and function queries

Summary

Faceting

A quick example – faceting release types

Field requirements

Types of faceting

Faceting field values

Faceting numeric and date ranges

Facet queries

Building a filter query from a facet

Pivot faceting

Excluding filters – multiselect faceting

Summary

Search Components

About components

The highlight component

The SpellCheck component

Query complete/suggest

The QueryElevation component

The MoreLikeThis component

The Stats component

The Clustering component

Collapsing and expanding

The TermVector component

Summary

Integrating Solr

Working with the included examples

Solritas – the integrated search UI

SolrJ – Solr's Java client API

Using JavaScript/AJAX with Solr

Using XSLT to transform XML search results

Accessing Solr from PHP applications

Ruby on Rails integrations

Nutch for crawling web pages

Solr and Hadoop

ManifoldCF – a connector framework

Document-level security

Summary

Scaling Solr

Tuning complex systems is hard

Use SolrMeter to test Solr performance

Optimizing a single Solr server – scale up

Configuring Solr for near real-time search

Use SolrCloud to go big – scale wide

Summary

Deployment

Deployment methodology for Solr

Installing Solr into a Servlet container

Configuring logging

A RequestHandler per search interface

Leveraging Solr cores

Setting up ZooKeeper for SolrCloud

Monitoring Solr performance

Securing Solr from prying eyes

Summary

Quick Reference

Core search

Diagnostic

The Lucene query parser

The DisMax query parser

The Lucene query syntax

Faceting

Highlighting

Spell checking

Miscellaneous nonsearch

Index

An introduction to Solr

Solr is an open source enterprise search server. It is a mature product powering search for public sites such as CNET, Yelp, Zappos, and Netflix, as well as countless other government and corporate intranet sites. It is written in Java, and that language is used to further extend and modify Solr through various extension points. However, because Solr is a server that communicates using standards such as HTTP, XML, and JSON, knowledge of Java is useful but not a requirement. In addition to the standard ability to return a list of search results based on a full text search, Solr has numerous other features such as result highlighting, faceted navigation (as seen on most e-commerce sites), query spellcheck, query completion, and a "more-like-this" feature for finding similar documents.

Note

You will see many references in this book to the term faceting, also known as faceted navigation. It's a killer feature of Solr that most people have experienced at major e-commerce sites without realizing it. Faceting enhances search results with aggregated information over all of the documents found in the search. Faceting information is typically used as dynamic navigational filters, such as a product category, date and price groupings, and so on. Faceting can also be used to power analytics. Chapter 7, Faceting, is dedicated to this technology.
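To make the idea of faceting concrete, here is a minimal sketch in plain Python. It does not use Solr at all: the product documents, the field names, and the helper functions `search` and `facet_counts` are all hypothetical and exist only to show how counts are aggregated over the documents that matched a search.

```python
from collections import Counter

# Hypothetical product documents; the field names are illustrative only.
DOCS = [
    {"id": 1, "name": "red running shoe", "category": "shoes", "price": 45},
    {"id": 2, "name": "blue running shirt", "category": "apparel", "price": 20},
    {"id": 3, "name": "trail running shoe", "category": "shoes", "price": 80},
    {"id": 4, "name": "leather wallet", "category": "accessories", "price": 30},
]


def search(docs, word):
    """Return the documents whose name contains the given word."""
    return [doc for doc in docs if word in doc["name"].split()]


def facet_counts(results, field):
    """Count how many matching documents carry each value of a field."""
    return Counter(doc[field] for doc in results)


results = search(DOCS, "running")
category_facet = facet_counts(results, "category")
print(category_facet)
```

A user interface would typically render each facet value with its count as a clickable filter that narrows the current results.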

Lucene – the underlying engine

Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. It is the most widely deployed search technology today. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files.

In order to use Lucene, you write your own search code using its API, starting with indexing documents that you supply to it. A document in Lucene is merely a collection of fields, which are name-value pairs containing text or numbers. You configure Lucene with a text analyzer that will tokenize a field's text from a single string into a series of tokens (words) and further transform them by reducing them to their stems (called stemming), substituting synonyms, and/or performing other processing. The final indexed tokens are said to be the terms. The aforementioned process, starting with the analyzer, is referred to as text analysis. Lucene indexes each document into its index stored on disk. The index is an inverted index, which means it stores a mapping of a field's terms to associated documents, along with the ordinal word position from the original text. Finally, you search for documents with a user-provided query string that Lucene parses according to its syntax. Lucene assigns a numeric relevancy score to each matching document and only the top scoring documents are returned.

Note

This brief description of Lucene internals is what makes Solr work at its core. You will see these important vocabulary words throughout this book—they will be explained further at appropriate times.
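The vocabulary above (analysis, terms, inverted index, positions) can be illustrated with a toy sketch in plain Python. This is not Lucene's API or its implementation. The `analyze`, `build_index`, and `search` functions are hypothetical, and the suffix-stripping "stemmer" is a deliberately crude assumption that only stands in for real stemming.

```python
from collections import defaultdict


def analyze(text):
    """Tokenize on whitespace, lowercase, and apply a crude suffix-stripping stemmer."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,!?")
        # Toy stemming rule (an assumption for illustration, not a real stemmer).
        for suffix in ("ing", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


def build_index(docs):
    """Build an inverted index: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    postings = [set(index.get(term, {})) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []


DOCS = {
    1: "Searching documents with Lucene",
    2: "Lucene indexes documents",
}
index = build_index(DOCS)
print(index["document"])
print(search(index, "searching Lucene"))
```

Note that the query goes through the same `analyze` function as the indexed text, which is why a query for "searching" can match the indexed term "search".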

Lucene's major features are:

An inverted index for efficient retrieval of documents by indexed terms. The same technology supports numeric data with range- and time-based queries too.
A rich set of chainable text analysis components, such as tokenizers and language-specific stemmers that transform a text string into a series of terms (words).
A query syntax with a parser and a variety of query types, from a simple term lookup to exotic fuzzy matching.
A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the best matches first, with flexible means to affect the scoring.
Search enhancing features. There are many, but here are some notable ones:
- A highlighter feature to show matching query terms found in context.
- A query spellchecker based on indexed content or a supplied dictionary.
- Multiple suggesters for completing query strings.
- Analysis components for various languages, faceting, spatial-search, and grouping and joining queries too.

Note

To learn more about Lucene, read Lucene in Action, Second Edition, by Michael McCandless, Erik Hatcher, and Otis Gospodnetić (Manning Publications).
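As a rough illustration of what "scoring based on IR principles" means, the sketch below ranks documents with a simplified TF-IDF weighting. The formula, the sample documents, and the `rank` function are assumptions chosen for illustration; Lucene's real scoring is more elaborate and configurable.

```python
import math
from collections import Counter

# Hypothetical documents keyed by id.
DOCS = {
    "a": "solr search server",
    "b": "search search engine library",
    "c": "database server",
}


def rank(query, docs, top_n=10):
    """Rank documents by a simplified TF-IDF score, best match first."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = 0.0
        for term in query.split():
            tf = counts[term]
            if tf == 0:
                continue
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            idf = math.log(n_docs / df) + 1.0
            # Rarer terms and repeated occurrences both raise the score.
            total += math.sqrt(tf) * idf
        if total > 0:
            scores[doc_id] = total
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]


for doc_id, score in rank("search server", DOCS):
    print(doc_id, round(score, 3))
```

Only documents that match at least one query term receive a score, and `top_n` mirrors the idea that only the top scoring documents are returned.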

Solr – a Lucene-based search server

Apache Solr is an enterprise search server that is based on Lucene. Lucene is such a big part of what defines Solr that you'll see many references to Lucene directly throughout this book. Developing a high-performance, feature-rich application that uses Lucene directly is difficult and it's limited to Java applications. Solr solves this by exposing the wealth of power in Lucene via configuration files and HTTP parameters, while adding some features of its own. Some of Solr's most notable features beyond Lucene are as follows:

A server that communicates over HTTP via multiple formats, including XML and JSON
Configuration files, most notably for the index's schema, which defines the fields and configuration of their text analysis
Several types of caches for faster search responses
A web-based administrative interface, including the following:
- Runtime search and cache performance statistics
- A schema browser with index statistics on each field
- A diagnostic tool for debugging text analysis
- Support for dynamic core (indices) administration
Faceting of search results (note: distinct from Lucene's faceting)
A query parser called eDisMax that is more usable for parsing end user queries than Lucene's native query parser
Distributed search support, index replication, and fail-over for scaling Solr
Cluster configuration and coordination using ZooKeeper
Solritas—a sample generic web search UI for prototyping and demonstrating many of Solr's search features
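Because Solr is driven by HTTP parameters, a search request is essentially a URL. The sketch below only builds such a URL and does not contact a server. The host, the core name `mycore`, the field names, and the `build_select_url` helper are assumptions for illustration; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field=None, rows=10):
    """Build a search URL for a Solr core, asking for a JSON response."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to return facet counts for one field alongside the results.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named "mycore" with "name" and "category" fields.
url = build_select_url(
    "http://localhost:8983/solr/mycore", "name:shoe", facet_field="category"
)
print(url)
```

Any HTTP client, in any programming language, could then fetch this URL and parse the response, which is why Java knowledge is optional for Solr users.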

Also, there are two contrib modules that ship with Solr that really stand out, which are as follows:

DataImportHandler (DIH): A database, e-mail, and file crawling data import capability. It includes a debugger tool.
Solr Cell: An adapter to the Apache Tika open source project, which can extract text from numerous file types.

As of the 3.1 release, there is a tight relationship between the Solr and Lucene projects. The source code repository, committers, and developer mailing list are the same, and they are released together using the same version number. Since Solr is always based on the latest version of Lucene, most improvements in Lucene are available in Solr immediately.

Comparison to database technology

There's a good chance that you are unfamiliar with Lucene or Solr and you might be wondering what the fundamental differences are between them and a database. You might also wonder whether, if you use Solr, you still need a database.

The most important comparison to make is with respect to the data model, that is, the organizational structure of the data. The most popular category of databases is relational databases (RDBMSes). A defining characteristic of relational databases is a data model based on multiple tables with lookup keys between them and a join capability for querying across them. That approach has proven to be versatile, being able to satisfy nearly any information-retrieval task in one query.

However, it is hard and expensive to scale relational databases to meet the requirements of a typical search application consisting of many millions of documents and low-latency response. Instead, Lucene has a much more limiting document-oriented data model, which is analogous to a single table. Document-oriented databases such as MongoDB are similar in this respect, but their documents can be nested, similar to XML or JSON. Lucene's document structure is flat like a table, but it does support multivalued fields, that is, fields holding an array of values. It can also be very sparse such that the actual fields used from one document to the next vary; there is no space cost or other penalty for a document that does not use a field.
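The flat, multivalued, and sparse document model can be pictured with ordinary Python dictionaries. This is only a sketch of the data model, not of how Solr stores or ingests data. The records, the field names, and the `flatten` helper (including its underscore naming scheme) are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level; lists are kept as multivalued fields."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested structure becomes prefixed top-level fields.
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# A nested record, as a document-oriented database might hold it.
nested = {
    "id": "artist-1",
    "name": "The Example Band",
    "origin": {"country": "US", "city": "Boston"},
    "genres": ["rock", "indie"],  # multivalued field
}
flat = flatten(nested)

# A sparse document: it simply omits the fields it does not use.
sparse = {"id": "artist-2", "name": "Solo Artist"}

print(flat)
print(sorted(set(flat) | set(sparse)))
```

The second document carries no placeholder for `genres` or the origin fields, which mirrors the point that unused fields cost nothing.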

Note

Lucene and Solr have limited support for join queries, but joins are used sparingly because they significantly reduce the scalability of Lucene and Solr.

Taking a look at the Solr feature list naturally reveals plenty of search-oriented technology that databases generally either don't have, or don't do well. The notable features are relevancy score ordering, result highlighting, query spellcheck, and query-completion. These features are what drew you to Solr, no doubt. And let's not forget faceting. This is possible with a database, but it's hard to figure out how, and it's difficult to scale. Solr, on the other hand, makes it incredibly easy, and it does scale.

Can Solr be a substitute for your database? You can add data to it and get it back out efficiently with indexes; so on the surface, it seems plausible. The answer is that you are almost always better off using Solr in addition to a database. Databases, particularly RDBMSes, generally excel at ACID transactions, insert/update efficiency, in-place schema changes, multiuser access control, and bulk data retrieval, and they have second-to-none integration with application software stacks and reporting tools. And let's not forget that they have a versatile data model. Solr falls short in these areas.

Note

For more on this subject, see our article, Text Search, your Database or Solr, at http://bit.ly/uwF1ps, which, although slightly outdated now, is a clear and useful explanation of the issues. If you want to use Solr as a document-oriented or key-value NoSQL database, Chapter 4, Indexing Data, will tell you how and when it's appropriate.
