Apache Solr Enterprise Search Server - Third Edition

By : David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information to be just as applicable due to Solr's high regard for backward compatibility. The book includes some useful information specific to Solr 5.

A quick tour of Solr


Point your browser to Solr's administrative interface at http://localhost:8983/solr/. The admin site is a single-page application that provides access to some of the more important aspects of a running Solr instance.

Tip

The administrative interface is currently being completely revamped, and the interface shown below may be deprecated.

This tour will help you get your bearings in navigating around Solr.

In the preceding screenshot, the navigation is on the left while the main content is on the right. The left navigation is present on every page of the admin site and is divided into two sections. The primary section contains choices related to higher-level Solr and Java features, while the secondary section lists all of the running Solr cores.

The default page for the admin site is Dashboard. This gives you a snapshot of some basic configuration settings and stats for Solr, the JVM, and the server. The Dashboard page is divided into the following subareas:

  • Instance: This area displays when Solr started.

  • Versions: This area displays various Lucene and Solr version numbers.

  • JVM: This area displays the Java implementation, version, and processor count. The various Java system properties are also listed here.

  • System: This area displays the overview of memory settings and usage; this is essential information for debugging and optimizing memory settings.

  • JVM-Memory: This meter shows the allocation of JVM memory, and is key to understanding if garbage collection is happening properly. If the dark gray band occupies the entire meter, you will see all sorts of memory related exceptions!

The rest of the primary navigation choices include the following:

  • Logging: This page is a real-time view of logging, showing the time, level, logger, and message. This section also allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty, as we're running it, this output goes to the console and nowhere else. See Chapter 11, Deployment, for more information on configuring logging.

  • Core Admin: This section is for information and controls for managing Solr cores. Here, you can unload, reload, rename, swap, and optimize the selected core. There is also an option for adding a new core.

  • Java Properties: This lists Java system properties, which are basically Java-oriented global environment variables, including the command used to start the Solr Java process.

  • Thread Dump: This displays a Java thread dump, useful for experienced Java developers in diagnosing problems.

Below the primary navigation is a list of running Solr cores. Click on the Core Selector drop-down menu and select the techproducts link. You should see something very similar to the following screenshot:

The default page for each core, labeled Overview, shows core statistics, information about replication, and an Admin Extra area. Some other options, such as details about Healthcheck, are grayed out and become visible only if the feature is enabled.

You probably noticed the subchoice menu that appeared below techproducts. Here is an overview of what those subchoices provide:

  • Analysis: This is used for diagnosing query and indexing problems related to text analysis. This is an advanced screen and will be discussed later.

  • Data Import: Provides information about the DataImport handler (the DIH). Like replication, it is only useful when DIH is enabled. The DataImport handler will be discussed in more detail in Chapter 4, Indexing Data.

  • Documents: Provides a simple interface for creating a document to index into Solr via the browser. This includes a Document Builder that walks you through adding individual fields of data.

  • Files: Exposes all the files that make up the core's configuration, from core files such as schema.xml and solrconfig.xml to stopwords.txt.

  • Ping: Clicking on this sends a ping request to Solr, displaying the latency. The primary purpose of the ping response is to provide a health status to other services, such as a load balancer. The ping response is a formatted status document and it is designed to fail if Solr can't perform a search query that you provide.

  • Plugins / Stats: Here you will find statistics such as timing and cache hit ratios. In Chapter 10, Scaling Solr, we will visit this screen to evaluate Solr's performance.

  • Query: This brings you to a search form with many options. With or without this search form, you will soon end up directly manipulating the URL using this book as a reference. There's no data in Solr yet, so there's no point in using the form right now.

  • Replication: This contains index replication status information and the controls for disabling replication. It is only useful when replication is enabled. More information on this is available in Chapter 10, Scaling Solr.

  • Schema Browser: This is an analytical view of the schema that reflects various statistics of the actual data in the index. We'll come back to this later.

  • Segments Info: Segments are the underlying files that make up the Lucene data structure. As you index information, they expand and compress. This page allows you to monitor them; it was newly added in Solr 5.

    Tip

    You can partially customize the admin view by editing a few templates that are provided. The template filenames are prefixed with admin-extra, and are located in the conf directory.
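
As a rough illustration of how another service, such as a load balancer health check, might consume the ping response described above, here is a minimal Python sketch. It assumes the ping handler is exposed at /admin/ping under the core and that a healthy JSON response carries a top-level status of "OK"; both are assumptions about the default example configuration, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # base URL used throughout this chapter


def ping_url(core, base=SOLR_BASE):
    """Build the URL of a core's ping handler (assumed path: /admin/ping)."""
    return f"{base}/{core}/admin/ping?" + urlencode({"wt": "json"})


def is_healthy(ping_body):
    """Return True only if the body is a JSON object whose status is "OK"."""
    try:
        data = json.loads(ping_body)
    except ValueError:
        # Not JSON at all (for example, an HTML error page).
        return False
    return isinstance(data, dict) and data.get("status") == "OK"
```

Fetching the URL (for example, with urllib.request.urlopen) requires a running Solr instance; the parsing step is kept separate so it can be checked without one.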

Loading sample data

Solr comes with some sample data found at example/exampledocs. We saw this data loaded as part of creating the techproducts Solr core when we started Solr. We're going to use that for the remainder of this chapter so that we can explore Solr more, without getting into schema design and deeper data loading options. For the rest of the book, we'll base the examples on the digital supplement to the book—more on that later.

We're going to re-index the example data by using the post.jar Java program, officially called SimplePostTool. Most JAR files aren't executable, but this one is. This simple program takes a Java system property to specify the collection (-Dc=techproducts), iterates over a list of Solr-formatted XML input files, and HTTP posts them to Solr running on the current machine at http://localhost:8983/solr/techproducts/update. Finally, it sends a commit command, which causes the documents posted prior to the commit to be saved and made visible. Solr must be running for this to work. Here is the command and its output:

>> cd example/exampledocs
>> java -Dc=techproducts -jar post.jar *.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update using content-type application/xml...
POSTing file gb18030-example.xml
POSTing file hd.xml
etc.
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...

If you are using a Unix-like environment, you have the alternative of using the bin/post shell script, which wraps SimplePostTool.

Note

The post.sh and post.jar programs could be used in a production scenario, but they are intended just as a demonstration of the technology with the example data.
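
To make the two HTTP steps performed by the post tool more concrete (posting XML to the update URL, then committing), here is a minimal Python sketch that only builds the requests. It is not how SimplePostTool is implemented; the function names are hypothetical, and it assumes the update handler accepts an XML <commit/> message posted to the same URL.

```python
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"


def build_update_request(core, xml_bytes, base=SOLR_BASE):
    """Build (but do not send) an HTTP POST of Solr-formatted XML to /update."""
    return urllib.request.Request(
        f"{base}/{core}/update",
        data=xml_bytes,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def build_commit_request(core, base=SOLR_BASE):
    """Build the follow-up request that makes posted documents visible."""
    return build_update_request(core, b"<commit/>", base)
```

With Solr running, each request could be sent with urllib.request.urlopen(request). Building the request separately from sending it keeps the sketch easy to inspect.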

Let's take a look at one of these XML files we just posted to Solr, monitor.xml:

<add>
  <doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <field name="manu">Dell, Inc.</field>
    <!-- Join -->
    <field name="manu_id_s">dell</field>
    <field name="cat">electronics</field>
    <field name="cat">monitor</field>
    <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
    <field name="includes">USB cable</field>
    <field name="weight">401.6</field>
    <field name="price">2199</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
    <!-- Buffalo store -->
    <field name="store">43.17614,-90.57341</field>
  </doc>
</add>

The XML schema for files that can be posted to Solr is very simple. This file doesn't demonstrate all of the elements and attributes, but it shows the essentials. Multiple documents, represented by the <doc> tag, can be present in series within the <add> tag, which is recommended for bulk data loading scenarios. This subset may very well be all that you use. More about these options and other data loading choices will be discussed in Chapter 4, Indexing Data.
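
The structure just described (an <add> element wrapping one or more <doc> elements, each holding <field> elements, with multivalued fields repeated) can be generated programmatically. The following Python sketch uses only the standard library; the function name build_add_xml is hypothetical, and it covers only the subset of the format shown in monitor.xml.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts as a Solr <add> message.

    A list or tuple value becomes one <field> element per item, which is
    how multivalued fields such as "cat" appear in monitor.xml.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                # Write booleans in lowercase, as in the sample file.
                if isinstance(item, bool):
                    field.text = "true" if item else "false"
                else:
                    field.text = str(item)
    return ET.tostring(add, encoding="unicode")
```

For example, build_add_xml([{"id": "3007WFP", "cat": ["electronics", "monitor"]}]) produces a single <doc> with one id field and two cat fields.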

A simple query

Point your browser to http://localhost:8983/solr/#/techproducts/query; this is the query form described in the previous section. The search box is labeled q. This form is a standard HTML form, albeit enhanced by JavaScript. When the form is submitted, the form inputs become URL parameters to an HTTP GET request to Solr. That URL and Solr's search response are displayed to the right. It is convenient to use the form as a starting point for developing a search, and then to refine the URL directly in the browser instead of returning to the form.

Run a query by replacing the *:* in the q field with the word lcd, then clicking on the Execute Query button. At the top of the main content area, you will see a URL like this: http://localhost:8983/solr/techproducts/select?q=lcd&wt=json&indent=true. The URL specifies that you want to query for the word lcd, and that the output should be in indented JSON format.

Below this URL, you will see the search result; this result is the response of that URL.
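
Since you will soon be manipulating such URLs directly, it can help to see how the form inputs map onto URL parameters. The following Python sketch builds the same kind of select URL; the function name select_url is hypothetical, and the extra keyword parameters are simply passed through as additional URL parameters.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"


def select_url(core, q, base=SOLR_BASE, **params):
    """Build a /select URL requesting indented JSON, as the query form does."""
    query = {"q": q, "wt": "json", "indent": "true"}
    query.update(params)  # any further request parameters
    # urlencode percent-escapes characters such as * and : in the values.
    return f"{base}/{core}/select?" + urlencode(query)


if __name__ == "__main__":
    print(select_url("techproducts", "lcd"))
```

Note that urlencode escapes the match-all query *:* as %2A%3A%2A, which is equivalent to what the browser sends.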

By default, Solr responds in XML; however, the query interface specifies JSON by default. Most modern browsers, such as Firefox, provide a good JSON view with syntax coloring and hierarchical controls. All response formats have the same basic structure as the JSON you're about to see. More information on these formats can be found in Chapter 4, Indexing Data.

The JSON response consists of two main elements: responseHeader and response. Here is what the header element looks like:

"responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "lcd",
      "indent": "true",
      "wt": "json"
    }
  }
…

The following are the elements from the preceding code snippet:

  • status: This is always zero, unless there was a serious problem.

  • QTime: This is the duration of time in milliseconds that Solr took to process the search. It does not include streaming back the response. Due to multiple layers of caching, you will find that your searches will often complete in a millisecond or less if you've run the query before.

  • params: This lists the request parameters. By default, it only lists parameters explicitly in the URL; there are usually more parameters specified in a <requestHandler/> in solrconfig.xml. You can see all of the applied parameters in the response by setting the echoParams parameter to all.

    Note

    More information on these parameters and many more is available in Chapter 5, Searching.
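
Client code typically checks the header before touching the results. Here is a minimal Python sketch of that check, using the header shown above as sample input; the function name read_header and the choice to raise RuntimeError on a nonzero status are illustrative assumptions, not a Solr convention.

```python
import json

# The header from the text, wrapped in a complete (abbreviated) response.
SAMPLE_BODY = """
{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {"q": "lcd", "indent": "true", "wt": "json"}
  },
  "response": {"numFound": 5, "start": 0, "docs": []}
}
"""


def read_header(body):
    """Return (QTime in ms, echoed params); raise if status is nonzero."""
    header = json.loads(body)["responseHeader"]
    if header["status"] != 0:
        raise RuntimeError(f"Solr reported status {header['status']}")
    # params may be absent if parameter echoing is turned off.
    return header["QTime"], header.get("params", {})
```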

Next up is the most important part, the results:

"response": {
    "numFound": 5,
    "start": 0,

The numFound value is the number of documents matching the query in the entire index. The start parameter is the index offset into those matching (ordered) documents that are returned in the response below.
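
Together, numFound and start are what a client needs to page through results. The sketch below computes the start offset of each page; it assumes the page size is controlled by Solr's rows request parameter, which has not been introduced in this chapter, and the function name is hypothetical.

```python
def page_starts(num_found, rows):
    """Return the start offset of every page of `rows` results.

    `rows` is the page size (sent to Solr as the rows parameter);
    each returned value would be sent as the start parameter.
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    return list(range(0, num_found, rows))
```

With the five matches reported above and a page size of two, the offsets would be 0, 2, and 4.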

Often, you'll want to see the score of each matching document. The document score is a number that represents how relevant the document is to the search query. This search response doesn't include scores because they must be explicitly requested in the fl parameter, which is a comma-separated field list. A search that requests the score via fl=*,score will have a maxScore attribute in the "response" element, which is the maximum score of all documents that matched the search. It's independent of the sort order or result paging parameters.
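
One use of maxScore is to express each document's score relative to the best match. The following Python sketch shows the request parameters and that calculation; the function names are hypothetical, and any score values passed in for experimentation would be your own, since the text does not show real scores.

```python
import json
from urllib.parse import urlencode


def scored_query_string(q):
    """URL parameters that request all stored fields plus the score."""
    return urlencode({"q": q, "fl": "*,score", "wt": "json"})


def relative_scores(body):
    """Return (id, score / maxScore) for each document in a JSON response."""
    response = json.loads(body)["response"]
    max_score = response["maxScore"]
    if not max_score:
        # Avoid dividing by zero when nothing scored above 0.
        return [(doc["id"], 0.0) for doc in response["docs"]]
    return [(doc["id"], doc["score"] / max_score) for doc in response["docs"]]
```

Relative scores are only meaningful within a single response; scores from different queries are not comparable.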

The docs element within response is a list of documents that matched the query. The default sort is by descending score. Later, we'll do some sorting by specified fields.

{
        "id": "9885A004",
        "name": "Canon PowerShot SD500",
        "manu": "Canon Inc.",
        "manu_id_s": "canon",
        "cat": [
          "electronics",
          "camera"
        ],
        "features": [
          "3x zoop, 7.1 megapixel Digital ELPH",
          "movie clips up to 640x480 @30 fps",
          "2.0\" TFT LCD, 118,000 pixels",
          "built in flash, red-eye reduction"
        ],
        "includes": "32MB SD card, USB cable, AV cable, battery",
        "weight": 6.4,
        "price": 329.95,
        "price_c": "329.95,USD",
        "popularity": 7,
        "inStock": true,
        "manufacturedate_dt": "2006-02-13T15:26:37Z",
        "store": "45.19614,-93.90341",
        "_version_": 1500358264225792000
      },
...

The document list is pretty straightforward. By default, Solr will list all of the stored fields. Not all of the fields are necessarily stored—that is, you can query on them but not retrieve their value—an optimization choice. Notice that it uses the basic data types of strings, integers, floats, and Booleans. Also note that certain fields, such as features and cat are multivalued, as indicated by the use of [] to denote an array in JSON.
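
Because a field may come back either as a single value or as an array, client code often normalizes field access. Here is a small Python sketch of one way to do that; the helper name field_values is hypothetical, and the sample document is an abbreviated copy of the one shown above.

```python
def field_values(doc, field):
    """Return a field's values as a list, whether or not it is multivalued.

    Fields that are missing (for example, because they are not stored)
    yield an empty list.
    """
    value = doc.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Abbreviated copy of the document from the response above.
doc = {
    "id": "9885A004",
    "cat": ["electronics", "camera"],
    "price": 329.95,
    "inStock": True,
}
```

Here, field_values(doc, "cat") returns both categories, field_values(doc, "price") returns a one-element list, and a field absent from the response returns an empty list.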

This was a basic keyword search. As you start using more search features such as faceting and highlighting, you will see additional information following the response element.

Some statistics

Let's take a look at the statistics available via the Plugins / Stats page. This page provides details on all the components of Solr. Browse to CORE and then pick a Searcher. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should be 32.

Now take a look at the update handler stats by clicking on UPDATEHANDLER, and then expand the stats for the update handler by clicking on the updateHandler toggle link on the right-hand side of the screen. Notice that the /update request handler has some stats too:

If you think of Solr as a RESTful server, then the various public endpoints are exposed under the QUERYHANDLER menu. Solr isn't exactly REST-based, but it is very similar. Look at /update to see the indexing performance, and at /select for query performance.

Note

These statistics are accumulated from the time Solr was started or reloaded, and they are not stored to disk. As such, you cannot use them for long-term statistics. There are third-party SaaS solutions referenced in Chapter 11, Deployment, which capture more statistics and persist them long-term.
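
If you only need a rough history of a few numbers, one simple approach is to record periodic snapshots yourself. The following Python sketch appends timestamped snapshots to a file, one JSON object per line. How the stats dictionary is obtained is left open, and the function names and file layout are hypothetical; this is not a substitute for the monitoring solutions mentioned above.

```python
import json
import time


def snapshot_line(stats, timestamp):
    """Format one snapshot as a single line of JSON."""
    return json.dumps({"time": timestamp, "stats": stats}, sort_keys=True)


def append_snapshot(path, stats, timestamp=None):
    """Append a timestamped snapshot of `stats` to the file at `path`."""
    if timestamp is None:
        timestamp = time.time()  # seconds since the epoch
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(snapshot_line(stats, timestamp) + "\n")
```

Because the counters reset when Solr is restarted or a core is reloaded, a drop in a value between two snapshots usually indicates such a reset rather than lost data.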

The sample browse interface

The final stop on our quick Solr tour is the so-called browse interface, available at http://localhost:8983/solr/techproducts/browse. It demonstrates various Solr features:

  • Standard keyword search: Here, you can experiment with Solr's syntax.

  • Query debugging: Here, you can toggle display of the parsed query and document score "explain" information.

  • Query-suggest: Here, you can start typing a word like enco and suddenly "encoded" will be suggested to you.

  • Highlighting: Here, the highlighting of query words in search results is in bold, which might not be obvious.

  • More-like-this: This returns documents with similar words.

  • Faceting: This includes field value facets, query facets, numeric range facets, and date range facets.

  • Clustering: This shows how the search results cluster together based on certain words. You must first start Solr as the instructions describe in the lower left-hand corner of the screen.

  • Query boosting: This influences the scores by product price.

  • Geospatial search: Here, you can filter by distance. Click on the spatial link at the top-left to enable this.

This is also a demonstration of Solritas, which formats Solr requests using templates that are based on Apache Velocity. The templates are VM files in example/techproducts/solr/techproducts/conf/velocity. Solritas is primarily for search UI prototyping. It is not recommended for building anything substantial. See Chapter 9, Integrating Solr, for more information.

Note

The browse UI as supplied assumes the default example Solr schema. It will not work out of the box against another schema without modification.

Here is a screenshot of the browse interface; not all of it is captured in this image: