Apache Solr Enterprise Search Server - Third Edition

By: David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation features: features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information just as applicable due to Solr's high regard for backward compatibility. The book also includes some useful information specific to Solr 5.

What's next?


You now have an excellent, broad overview of Solr! The numerous features of this tool will no doubt bring the process of implementing a world-class search engine closer to reality. But creating a real, production-ready search solution is a big task. So, where do you begin? As you're getting to know Solr, it might help to think about the main process in three phases: indexing, searching, and application integration.

Schema design and indexing

In what ways do you need your data to be searched? Will you need faceted navigation, spelling suggestions, or more-like-this capabilities? Knowing your requirements up front, and understanding how to implement the features they call for, is key to producing a solid search solution. A well-designed schema lays the foundation for a successful Solr implementation.

However, during the development cycle, having the flexibility to try different field types without changing the schema and restarting Solr can be very handy. The dynamic fields feature allows you to assign field types by using field name conventions during indexing. Solr provides many useful predefined dynamic fields. Chapter 2, Schema Design, will cover this in depth.

You can get started right now, though. Take a look at the stock dynamic fields in /server/solr/configsets/sample_techproducts_configs/conf/schema.xml. The dynamicField XML tags represent what is available. For example, the dynamic field named *_b lets you store and index Boolean values; a field named admin_b would match it.

For the stock dynamic fields, here is a subset of what's available in the schema.xml file (the matching definitions are sketched after this list):

  • *_i: An indexed and stored integer

  • *_ss: A stored and indexed, multivalued string

  • *_dt: An indexed and stored date

  • *_p: An indexed and stored lat/lon location type
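These suffixes are declared in schema.xml with dynamicField entries. As a sketch (the exact type names and attributes vary a little between Solr releases), the matching definitions look roughly like this:

<dynamicField name="*_i"  type="int"      indexed="true" stored="true"/>
<dynamicField name="*_ss" type="string"   indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_dt" type="date"     indexed="true" stored="true"/>
<dynamicField name="*_p"  type="location" indexed="true" stored="true"/>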

To make use of these fields, you simply name your fields with those suffixes; example/exampledocs/ipod_other.xml makes good use of the *_dt type with its manufacturedate_dt field. Copying an example file, adding your own data, changing the suffixes, and indexing (via the SimplePost tool) is all as simple as it sounds. Give it a try!
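For instance, here is a small, hypothetical document; the field names other than id are made up, but each suffix matches one of the stock dynamic fields:

<add>
  <doc>
    <field name="id">widget-1</field>
    <field name="name">Super Widget</field>
    <field name="released_dt">2014-06-01T00:00:00Z</field>
    <field name="in_stock_b">true</field>
    <field name="popularity_i">7</field>
  </doc>
</add>

Saved as widget.xml, it can be indexed with the post tool that ships with Solr 5 (earlier releases use java -jar post.jar from example/exampledocs):

bin/post -c techproducts widget.xml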

Text analysis

It's probably a good time to talk a little more about text analysis. When considering field types, it's important to understand how your data is processed. For each field, you'll need to know its data type and whether the value should be stored and/or indexed. For text fields, you'll also need to think about how their values are analyzed.

Simply put, text analysis is the process of extracting useful information from a text field. This process normally includes two steps: tokenization and filtering. Analyzers encapsulate this entire process, and Solr provides a way to mix and match analyzer behaviors by configuration.

Tokenizers split up text into smaller chunks called tokens. There are many different kinds of tokenizers in Solr, the most common of which splits text on word boundaries, or whitespace. Others split on regular expressions, or even word prefixes. The tokenizer produces a stream of tokens, which can be fed to an optional series of filters.

Filters, as you may have guessed, commonly remove noise, such as punctuation and duplicate words. Filters can also lowercase or uppercase tokens, and even inject synonyms.
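To make this concrete, here is a minimal sketch of a custom field type in schema.xml that wires a tokenizer and two filters together; the type name text_basic is hypothetical, but the factory classes are standard Solr analysis components:

<fieldType name="text_basic" class="solr.TextField">
  <analyzer>
    <!-- split the text on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- normalize tokens to lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject synonyms listed in synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>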

Once the tokens pass through the analyzer chain, they are added to the Lucene index. Chapter 2, Schema Design, covers this process in detail.

Searching

The next step is, naturally, searching. For most applications that process user queries, you will want to use the [e]dismax query parser, set with defType=edismax. It is not the default, though in our opinion it arguably should be; [e]dismax handles end-user queries very well. It needs a few more configuration parameters, described in Chapter 5, Searching, and sketched just below.
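As a taste of that configuration, here is a sketch of a request handler definition in solrconfig.xml that makes edismax the default for the /select handler; the qf value is just an illustration for the techproducts data:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- query these fields by default, boosting matches on name -->
    <str name="qf">name^3 manu cat</str>
  </lst>
</requestHandler>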

Here are a few example queries to get you thinking.

Tip

Be sure to start up Solr and index the sample data by following the instructions in the previous section.

Find all the documents that have the phrase hard drive in their cat field:

http://localhost:8983/solr/techproducts/select?q=cat:"hard+drive"

Find all the documents that are in stock and have a popularity of 6 or greater:

http://localhost:8983/solr/techproducts/select?q=+inStock:true+AND+popularity:[6+TO+*]

Here's an example using the eDisMax query parser:

http://localhost:8983/solr/techproducts/select?q=ipod&defType=edismax&qf=name^3+manu+cat&fl=*,score

This returns documents where the user query in q matches the name, manu, or cat fields. The ^3 after the name field tells Solr to boost the document's relevancy score when the name field matches. The fl parameter tells Solr which fields to return; the * means return all stored fields, and score is a number representing how well the document matched the input query.

Faceting and statistics can be seen in this example:

http://localhost:8983/solr/techproducts/select?q=ipod&defType=dismax&qf=name^3+manu+cat&fl=*,score&rows=0&facet=true&facet.field=manu_id_s&facet.field=cat&stats=true&stats.field=price&stats.field=weight

This builds on the previous example but uses the simpler dismax parser. Instead of returning documents (rows=0), Solr returns facet counts for the manu_id_s and cat fields, plus statistics on price and weight.
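With wt=json, the relevant sections of the response have the following shape (the terms, counts, and statistics here are invented purely to illustrate the structure):

"facet_counts": {
  "facet_fields": {
    "manu_id_s": ["apple", 3, "belkin", 2],
    "cat": ["electronics", 5, "music", 3]
  }
},
"stats": {
  "stats_fields": {
    "price":  {"min": 11.5, "max": 399.0, "count": 5, "mean": 157.1},
    "weight": {"min": 1.2,  "max": 10.0,  "count": 4, "mean": 4.7}
  }
}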

For detailed information on searching, see Chapter 5, Searching.

Integration

If the previous tips on indexing and searching are enough to get you started, then you must be wondering how to integrate Solr with your application. By far the most common approach is to communicate with Solr via HTTP, using one of the many HTTP client libraries available. Here's a small example using the Ruby library RSolr:

require "rsolr"

# Connect to the techproducts core queried in the earlier examples
client = RSolr.connect(:url => "http://localhost:8983/solr/techproducts")
# The same dismax query as before: search name (boosted), manu, and cat
params = {:q => "ipod", :defType => "dismax", :qf => "name^3 manu cat", :fl => "*,score"}
# Send a GET request to the /select handler
result = client.get("select", :params => params)
# Each matching document comes back as a Ruby hash
result["response"]["docs"].each do |doc|
  puts doc.inspect
end

Using one of the previous sample queries, this script prints each document matching the query ipod.

There are many client implementations, and the right one for you depends on the programming language your application is written in. Chapter 9, Integrating Solr, covers this in depth and will surely set you in the right direction.