Solr 1.4 Enterprise Search Server

By: David Smiley, Eric Pugh

Overview of this book

If you are a developer building a high-traffic web site, you need to have a terrific search engine. Sites like Netflix.com and Zappos.com employ Solr, an open source enterprise search server, which uses and extends the Lucene search library. This is the first book on the market about Solr, and it will show you how to optimize your web site for high-volume web traffic with full-text search capabilities along with loads of customization options. So, let your users gain a terrific search experience.

This book is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate it with other languages and frameworks.

This book first gives you a quick overview of Solr, and then gradually takes you from basic to advanced features that enhance your search. It starts off by discussing Solr and helping you understand how it fits into your architecture: where databases and document/web crawlers fall short, and Solr shines. The main part of the book is a thorough exploration of nearly every feature that Solr offers. To keep this interesting and realistic, we use a large open source set of metadata about artists, releases, and tracks courtesy of the MusicBrainz.org project. Using this data as a testing ground for Solr, you will learn how to import this data in various ways, from CSV to XML to database access. You will then learn how to search this data in a myriad of ways, including Solr's rich query syntax, "boosting" match scores based on record data and other means, searching across multiple fields with different boosts, getting facets on the results, auto-completing user queries, spell-correcting searches, highlighting queried text in search results, and so on.

After this thorough tour, we'll demonstrate working examples of integrating a variety of technologies with Solr such as Java, JavaScript, Drupal, Ruby, XSLT, PHP, and Python.

Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

The schema and configuration files


Solr's configuration files are extremely well documented. We're not going to go over the details here but this should give you a sense of what is where.

The schema (defined in schema.xml) contains field type definitions (within the <types> tag) and lists the fields that make up your schema (within the <fields> tag), each of which references a type. The schema contains other information too, such as the primary key (the field that uniquely identifies each document, a constraint that Solr enforces) and the default search field. The sample schema in Solr uses the field named text as the default search field; confusingly, there is a field type named text too. But remember that the monitor.xml document we reviewed earlier had no field named text, right? It is common for the schema to call for certain fields to be copied to other fields, particularly fields not in input documents. So, even though the input documents don't have a field named text, there are <copyField> tags in the schema, which call for the fields named cat, name, manu, features, and includes to be copied to text. This is a popular technique to speed up queries, so that queries can search over a small number of fields rather than a long list of them. Fields used this way are rarely stored, as they are needed only for querying and so only have to be indexed. There is a lot more we could talk about in the schema, but we're going to move on for now.
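
As a rough sketch of the pieces just described, the relevant parts of schema.xml might look something like the following. This is an abridged illustration, not a verbatim copy of the sample schema: the five source field names come from the text above, while the schema attributes, the key field name id, and the attribute values on the text field are assumptions that you should check against your own copy of the file.

```xml
<!-- Abridged sketch of schema.xml; the name and version attributes are assumed -->
<schema name="example" version="1.2">
  <types>
    <!-- field type definitions go here, including the type named text -->
  </types>

  <fields>
    <!-- assumed key field; most other fields are omitted for brevity -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <!-- catch-all search field: indexed for querying but not stored (assumed attributes) -->
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>

  <!-- primary key: the field that uniquely identifies each document (assumed to be id) -->
  <uniqueKey>id</uniqueKey>

  <!-- field that is searched when a query does not name one -->
  <defaultSearchField>text</defaultSearchField>

  <!-- copy the content of each source field into text at index time -->
  <copyField source="cat" dest="text"/>
  <copyField source="name" dest="text"/>
  <copyField source="manu" dest="text"/>
  <copyField source="features" dest="text"/>
  <copyField source="includes" dest="text"/>
</schema>
```

Note that the field named text and the field type named text are distinct: the first appears in the name attribute and the second in the type attribute.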

Solr's solrconfig.xml file contains lots of parameters that can be tweaked. At the moment, we're just going to take a peek at the request handlers, which are defined with <requestHandler> tags. They make up about half of the file. In our first query, we didn't specify any request handler, so we got the default one. It's defined here:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!-- 
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    -->
  </lst>
</requestHandler>

When you POST commands to Solr (such as to index a document) or query Solr (HTTP GET), the request goes through a particular request handler. Handlers can be registered against certain URL paths. When we uploaded the documents earlier, the request went to the handler defined like this:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

The request handlers oriented to querying using the class solr.SearchHandler are much more interesting.

Note

The important thing to realize about request handlers is that they are nearly completely configurable through URL parameters or POSTed form parameters. These parameters can also be specified in solrconfig.xml within lst blocks named defaults, appends, or invariants, which serve to establish defaults. More on this is in Chapter 4. This arrangement allows you to set up a request handler for a particular application that will be querying Solr without forcing the application to specify all of its query options.
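
To make the note above more concrete, here is a minimal sketch of a request handler set up for one application. The handler name /storefront and every parameter value are hypothetical, chosen only for illustration. The comments reflect the usual understanding of the three blocks: defaults apply when a request omits the parameter, appends are added to whatever the request supplies, and invariants cannot be overridden by the request.

```xml
<!-- Hypothetical handler; the name and all values below are assumptions -->
<requestHandler name="/storefront" class="solr.SearchHandler">
  <!-- used only when the request does not supply the parameter itself -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">20</int>
    <str name="fl">id,name,score</str>
  </lst>
  <!-- added to the request's own parameters rather than replacing them -->
  <lst name="appends">
    <str name="fq">cat:electronics</str>
  </lst>
  <!-- fixed values that a request cannot change -->
  <lst name="invariants">
    <str name="facet">false</str>
  </lst>
</requestHandler>
```

With a handler like this in place, the application would only need to send its query text; the page size, returned fields, and filter would all come from the configuration.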

The standard request handler defined previously doesn't really define any defaults other than the parameters that are to be echoed in the response. Remember their presence at the top of the XML output? By changing explicit to none, you can have them omitted; use all and you'll potentially see more parameters, if other defaults happen to be configured in the request handler. This parameter can alternatively be specified in the URL through echoParams=none. Remember to separate URL parameters with ampersands.
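
As a small illustration of the change just described, the standard handler could be edited as follows so that all parameters in effect are echoed, not only the ones sent explicitly. This is a sketch of one possible edit, not a recommended setting; the uncommented rows default is added here only so that all has something extra to show.

```xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- was "explicit"; use "none" instead to omit the echoed parameters -->
    <str name="echoParams">all</str>
    <!-- illustrative default that would now appear in the echoed parameters -->
    <int name="rows">10</int>
  </lst>
</requestHandler>
```

To override the setting for a single request instead, echoParams=none would be appended to the query URL after an ampersand, for example q=monitor&echoParams=none, where the query term monitor is only a placeholder.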