Administrating Solr

By: Surendra Mohan

Overview of this book

Implementing different search engines on web products is a mandate these days. Apache Solr is a robust search engine, but simply implementing Apache Solr and forgetting about it is not a good idea, especially when you have to fight for the search ranking of your web product. In such a scenario, you need to keep monitoring, administrating, and optimizing your Solr to retain your ranking. "Administrating Solr" is a practical, hands-on guide. This book will provide you with a number of clear, step-by-step exercises and some advanced concepts which will help you administrate, monitor, and optimize Solr using Drupal and associated scripts. Administrating Solr will also provide you with a solid grounding on how you can use Apache Solr with Drupal. "Administrating Solr" starts with an overview of Apache Solr and the installation process to get you familiar with Solr. It then gradually moves on to discuss the mysteries that make Solr flexible enough to render appropriate search results in different scenarios. This book will take you through clear and practical concepts that will help you monitor, administrate, and optimize your Solr appropriately using both scripts and tools. This book will also teach you ways to query your search and methods to keep your Solr healthy and well maintained. With this book, you will learn how to effectively implement and optimize Solr using Drupal.

Faceted search


One of Solr's advantages is its ability to group results on the basis of a field's contents. This grouping is called faceting, and it helps with several everyday tasks: from counting the documents that share a value in a field (such as the companies located in the same city), through grouping by values and ranges, to autocomplete features built on faceting. In this section, I will show you how to handle some of the important and common tasks when using faceting.

Search based on the same value range

Suppose you have an application that allows users to search for companies in Europe, and your customer wants to see, for each city in which the matching companies are located, how many of those companies it contains. Running a separate query for every city would be frustrating. Don't panic: with faceting, Solr returns these counts together with the search results in a single request. Let me show you how to do it.

Let us assume that we have the following index structure, which we have added to the field definition section of our schema.xml file; we will use the city field to do the faceting:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
<field name="city" type="string" indexed="true" stored="true" />

And our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">Company 1</field> 
<field name="city">New York</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="name">Company 2</field> 
<field name="city">California</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="name">Company 3</field> 
<field name="city">New York</field> 
</doc> 
</add>

Let us suppose that a user searches for the word company. The query will look like this:

http://localhost:8080/solr/select?q=name:company&facet=true&facet.field=city

The result produced by this query looks like:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="facet">true</str> 
<str name="facet.field">city</str> 
<str name="q">name:company</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0"> 
<doc> 
<str name="city">New York</str> 
<str name="id">1</str> 
<str name="name">Company 1</str> 
</doc> 
<doc> 
<str name="city">California</str> 
<str name="id">2</str> 
<str name="name">Company 2</str> 
</doc> 
<doc> 
<str name="city">New York</str> 
<str name="id">3</str> 
<str name="name">Company 3</str> 
</doc> 
</result> 
<lst name="facet_counts"> 
<lst name="facet_queries"/> 
<lst name="facet_fields"> 
<lst name="city"> 
<int name="New York">2</int> 
<int name="California">1</int> 
</lst> 
</lst> 
<lst name="facet_dates"/>
</lst> 
</response>

Note

Notice that, besides the normal results list, we got the faceting results with the numbers that we wanted.

The index structure and data are quite simple. The field we focus on is city: for each distinct value of this field, we want the number of matching companies that have it.

We query Solr and tell the query parser that we want the documents that have the word company in the name field, and we enable faceting with the facet=true parameter. The facet.field parameter tells Solr which field to use to calculate the faceting numbers.

Note

You can specify the facet.field parameter multiple times to get the faceting numbers for different fields in the same query.

As you can see in the results, all types of faceting are grouped in the list with the name="facet_counts" attribute. Field-based faceting is grouped under the list with the name="facet_fields" attribute. Every field that you specified using the facet.field parameter has its own list, whose name attribute is the same as the value of the parameter in the query (in our case, city). Finally, we see the results that we are interested in: pairs made up of a field value (the name attribute) and the number of documents that have that value in the specified field.
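
The request and the response structure described above can be sketched in Python. This is only an illustration under a few assumptions: the base URL is the one from the example query, the function names are hypothetical, and the sample XML is a trimmed copy of the response shown earlier, so the block runs without a Solr server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint, copied from the example query in the text.
SOLR_SELECT = "http://localhost:8080/solr/select"


def build_facet_url(query, facet_fields):
    """Build a select URL with one facet.field parameter per field."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    return SOLR_SELECT + "?" + urlencode(params)


def parse_facet_fields(xml_text):
    """Return {field: {value: count}} read from the facet_fields list."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    counts = {}
    for field in root.findall(path):
        counts[field.get("name")] = {
            item.get("name"): int(item.text) for item in field.findall("int")
        }
    return counts


# Trimmed copy of the response shown in the text.
SAMPLE = """<response>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="city">
<int name="New York">2</int>
<int name="California">1</int>
</lst>
</lst>
</lst>
</response>"""

print(build_facet_url("name:company", ["city"]))
print(parse_facet_fields(SAMPLE))
```

Passing more than one field name to build_facet_url repeats the facet.field parameter, which is how several fields are faceted in the same query.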

Filter your facet results

Imagine that you need to search for books in your eStore or library. If that were all, the search would be very simple. Now add the requirement of showing how many of the books fall within specific price ranges. Can Solr handle such a situation? Yes, and here is how.

Suppose that we have the following index structure, which has been added to the field definition section of our schema.xml file; we will use the price field to do the faceting:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
<field name="price" type="float" indexed="true" stored="true" />

Here is our example data:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">Book 1</field> 
<field name="price">70</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="name">Book 2</field> 
<field name="price">100</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="name">Book 3</field> 
<field name="price">210.95</field> 
</doc> 
<doc> 
<field name="id">4</field> 
<field name="name">Book 4</field> 
<field name="price">99.90</field> 
</doc> 
</add>

Let us assume that the user searches for a book and wishes to fetch the document count within the price range of 60 to 100 or 200 to 250.

Our query will look like this:

http://localhost:8080/solr/select?q=name:book&facet=true&facet.query=price:[60 TO 100]&facet.query=price:[200 TO 250]

The result list of our query would look like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="facet">true</str> 
<arr name="facet.query"> 
<str>price:[60 TO 100]</str> 
<str>price:[200 TO 250]</str> 
</arr> 
<str name="q">name:book</str> 
</lst> 
</lst> 
<result name="response" numFound="4" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="name">Book 1</str> 
<float name="price">70.0</float> 
</doc> 
<doc> 
<str name="id">2</str> 
<str name="name">Book 2</str> 
<float name="price">100.0</float> 
</doc> 
<doc> 
<str name="id">3</str> 
<str name="name">Book 3</str> 
<float name="price">210.95</float> 
</doc> 
<doc> 
<str name="id">4</str>
<str name="name">Book 4</str> 
<float name="price">99.9</float> 
</doc> 
</result> 
<lst name="facet_counts"> 
<lst name="facet_queries"> 
<int name="price:[60 TO 100]">3</int> 
<int name="price:[200 TO 250]">1</int> 
</lst> 
<lst name="facet_fields"/> 
<lst name="facet_dates"/> 
</lst> 
</response>

As you can see, the index structure is quite simple and resembles the one we discussed earlier, so we will not go over it again.

The query deserves special attention. It starts as a standard query that asks Solr for all the documents that have the word book in the name field (the q=name:book parameter). We then enable faceting by adding the facet=true parameter. With faceting enabled, we can pass additional queries to the faceting mechanism, and for each of them Solr returns the number of documents in the result set that match it. In our case, we want counts for two price ranges: 60 to 100 and 200 to 250.

We achieved this by adding the facet.query parameter once for each range. The first price range is defined as a standard range query (price:[60 TO 100]). The second is the same kind of query with different values (price:[200 TO 250]). The square brackets make both ends of a range inclusive, which is why the book priced at exactly 100 is counted in the first range.
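
As a quick sanity check of these counts, the range logic can be reproduced locally. This is a minimal sketch, assuming the square-bracket range syntax is inclusive at both ends; the function name is hypothetical and the prices are copied from the example data.

```python
# Prices copied from the example data in the text.
PRICES = [70.0, 100.0, 210.95, 99.90]


def count_in_range(prices, low, high):
    """Count prices with low <= price <= high, like price:[low TO high]."""
    return sum(1 for price in prices if low <= price <= high)


print(count_in_range(PRICES, 60, 100))   # 70, 100 and 99.90 fall in this range
print(count_in_range(PRICES, 200, 250))  # only 210.95 falls in this range
```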

Note

The value passed to the facet.query parameter should be a Lucene query written using the default query syntax.

As you can see in the result list, the query faceting results are grouped under the <lst name="facet_queries"> XML tag, with names exactly matching the queries that were passed. Solr calculated the number of books in each price range correctly, which is exactly what we set out to do.
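
The following Python sketch shows one way to build such a request and read the counts back. It rests on assumptions: the function names are hypothetical, the base URL is passed in by the caller, and the sample XML is a trimmed copy of the response shown above. Note that urlencode escapes the spaces, colons, and brackets that appear unescaped in the example URL.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_facet_query_url(base_url, query, facet_queries):
    """Build a select URL that repeats facet.query once per query."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.query", fq) for fq in facet_queries]
    return base_url + "?" + urlencode(params)


def parse_facet_queries(xml_text):
    """Return {facet query: count} read from the facet_queries list."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_queries']/int"
    return {node.get("name"): int(node.text) for node in root.findall(path)}


# Trimmed copy of the response shown in the text.
SAMPLE = """<response>
<lst name="facet_counts">
<lst name="facet_queries">
<int name="price:[60 TO 100]">3</int>
<int name="price:[200 TO 250]">1</int>
</lst>
<lst name="facet_fields"/>
<lst name="facet_dates"/>
</lst>
</response>"""

url = build_facet_query_url(
    "http://localhost:8080/solr/select",
    "name:book",
    ["price:[60 TO 100]", "price:[200 TO 250]"],
)
print(url)
print(parse_facet_queries(SAMPLE))
```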

Autosuggest feature using faceting

Imagine that a user types a keyword to search for a book title in your web-based library, and suggestions based on the typed text pop up to help him/her choose an appropriate search keyword. Most of the well-known search engines implement such autocomplete or autosuggest features. Why shouldn't you? The next example will guide you through implementing this feature.

Let us consider the following index structure, which needs to be added to the field definition section of our schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" /> 
<field name="title_autocomplete" type="lowercase" indexed="true" stored="true" />

We also wish to add some field copying to automate some of the operations. To do so, we will add the following after the field definition section in our schema.xml file:

<copyField source="title" dest="title_autocomplete" />

We will then add the lowercase field type definition to the types definition section of our schema.xml file, which will look like this:

<fieldType name="lowercase" class="solr.TextField"> 
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/> 
<filter class="solr.LowerCaseFilterFactory" /> 
</analyzer> 
</fieldType>

Our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Lucene or Solr ?</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">My Solr and the rest of the world</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">Solr recipes</field> 
</doc> 
<doc> 
<field name="id">4</field> 
<field name="title">Solr cookbook</field> 
</doc> 
</add>

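Now, let us assume that the user typed the letters so in the search box, and we wish to give him/her the first 10 suggestions with the highest counts. We also wish to suggest whole titles instead of just single words. By default, Solr sorts facet values by count and returns up to 100 of them; adding facet.limit=10 would cap the list at 10, which makes no difference for our four documents. To do so, send the following query to Solr: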

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=title_autocomplete&facet.prefix=so

The result of this query looks like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">16</int> 
<lst name="params"> 
<str name="facet">true</str> 
<str name="q">*:*</str> 
<str name="facet.prefix">so</str> 
<str name="facet.field">title_autocomplete</str> 
<str name="rows">0</str>
</lst> 
</lst> 
<result name="response" numFound="4" start="0"/> 
<lst name="facet_counts"> 
<lst name="facet_queries"/> 
<lst name="facet_fields"> 
<lst name="title_autocomplete"> 
<int name="solr cookbook">1</int> 
<int name="solr recipes">1</int> 
</lst> 
</lst> 
<lst name="facet_dates"/> 
</lst> 
</response>

You can see that our index structure resembles the ones we have been using, except for the additional title_autocomplete field, which is used to provide the autosuggest feature.

We have the copy field section to automatically copy the contents of the title field to the title_autocomplete field.

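We used the lowercase field type so that each title is indexed as a single lowercased term. This lets the autocomplete feature work regardless of the case of the letters typed by the user (lower or upper), provided the typed text is lowercased before it is sent, because facet.prefix is compared against the indexed terms.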
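
To see why whole lowercased titles come back as suggestions, the effect of the lowercase field type and of prefix faceting can be imitated locally. This is a rough sketch and not Solr's implementation: the function names are hypothetical, str.lower() stands in for the keyword tokenizer plus lowercase filter, and ties between equal counts are broken alphabetically as an assumption.

```python
from collections import Counter

# Titles copied from the example data in the text.
TITLES = [
    "Lucene or Solr ?",
    "My Solr and the rest of the world",
    "Solr recipes",
    "Solr cookbook",
]


def analyze_lowercase_keyword(value):
    """Imitate the lowercase type: the whole value is one lowercased term."""
    return value.lower()


def prefix_facet(titles, typed, limit=10):
    """Count the indexed terms that start with the lowercased typed text."""
    prefix = typed.lower()  # lowercase the input so it can match the terms
    counts = Counter(analyze_lowercase_keyword(title) for title in titles)
    matches = [(term, n) for term, n in counts.items() if term.startswith(prefix)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))  # highest count first
    return matches[:limit]


print(prefix_facet(TITLES, "So"))
```

Because each title is a single term, titles that only contain the word Solr somewhere in the middle do not match the prefix.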

Now it is time to analyze the query. We search the whole index (the q=*:* parameter), but we are not interested in the search results themselves (the rows=0 parameter). We tell Solr to use the faceting mechanism (the facet=true parameter) and to facet on the title_autocomplete field (the facet.field=title_autocomplete parameter). The last parameter, facet.prefix, may be new to you. It tells Solr to return only those facet values that begin with the given prefix, which in our case is so. This lets us show only the suggestions relevant to what the user has typed, and the results confirm it: the two titles that start with so are returned, while the titles that merely contain the word Solr are not.
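
A client could assemble this request and extract the suggestions roughly as follows. This is a sketch under assumptions: the function names are hypothetical, facet.limit is added to cap the list at 10 even though the example query omits it, and the sample XML is a trimmed copy of the response shown above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_suggest_url(base_url, field, typed, limit=10):
    """Build a facet-only request for values of `field` starting with `typed`."""
    params = [
        ("q", "*:*"),                     # match every document
        ("rows", "0"),                    # no search results, facets only
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", typed.lower()),  # lowercase to match indexed terms
        ("facet.limit", str(limit)),      # assumed cap, not in the example URL
    ]
    return base_url + "?" + urlencode(params)


def parse_suggestions(xml_text, field):
    """Return the facet values for `field` in the order Solr listed them."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for node in root.findall(path):
        if node.get("name") == field:
            return [item.get("name") for item in node.findall("int")]
    return []


# Trimmed copy of the response shown in the text.
SAMPLE = """<response>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="title_autocomplete">
<int name="solr cookbook">1</int>
<int name="solr recipes">1</int>
</lst>
</lst>
</lst>
</response>"""

url = build_suggest_url("http://localhost:8080/solr/select", "title_autocomplete", "So")
print(url)
print(parse_suggestions(SAMPLE, "title_autocomplete"))
```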

Tip

It is recommended not to use heavily analyzed text (for example, stemmed text) in the field used for suggestions, so that analysis does not change the words shown to the user.