Apache Solr 4 Cookbook

By: Rafał Kuć

Overview of this book

Apache Solr is a blazing fast, scalable, open source Enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, and relevancy tuning, amongst other numerous features.

"Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance as well as index and analyze your data to provide better, more precise, and useful search data.

"Apache Solr 4 Cookbook" will make your search better, more accurate and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration.

With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4 including the awesome capabilities of SolrCloud.

How to implement a product's autocomplete functionality


Autocomplete is now a standard feature: you can find it on most e-commerce sites, on Google, Bing, and so on. It helps your users or clients find what they want quickly. In most cases, autocomplete also increases the relevance of your search by pointing to the right author, title, and so on, right away, before the user even looks at the search results. What's more, sites that use autocomplete report higher revenue after deploying it than before. It seems like a win-win situation, both for you and your clients. So, let's look at how we can implement a product's autocomplete functionality in Solr.

How to do it...

Let's assume that we want to show the full product name whenever our users enter the beginning of any word in that product name. In addition to this, we want to show the number of documents that share each name.

  1. Let's start with some example data that is going to be indexed:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="name">First Solr 4.0 CookBook</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="name">Second Solr 4.0 CookBook</field>
      </doc>
    </add>
  2. We will need two main fields in the index: one for the document identifier and one for the name. We will also need two additional fields: one for autocomplete and one for faceting. So, our index structure will look similar to the following code snippet (we should add it to the schema.xml fields section):

    <field name="id" type="string" indexed="true" 
      stored="true" required="true" />
    <field name="name" type="text" indexed="true" 
      stored="true" />
    <field name="name_autocomplete" type="text_autocomplete" 
      indexed="true" stored="false" />
    <field name="name_show" type="string" indexed="true" 
      stored="false" />
  3. In addition to this, we want Solr to automatically copy the data from the name field to the name_autocomplete and name_show fields. So, we should add the following copy fields section to the schema.xml file:

    <copyField source="name" dest="name_autocomplete"/>
    <copyField source="name" dest="name_show"/>
  4. Now, the final piece of the schema.xml file is the text_autocomplete field type. It should look similar to the following code snippet (place it in the types section of the schema.xml file):

    <fieldType name="text_autocomplete" 
      class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" 
          minGramSize="1" maxGramSize="25" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  5. That's all. Now, if we would like to show our users all the products whose names contain a word starting with sol, we would send the following query:

    curl 'http://localhost:8983/solr/select?q=name_autocomplete:sol&q.op=AND&rows=0&facet=true&facet.field=name_show&facet.mincount=1&facet.limit=5'

    The response returned by Solr would be as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
          <str name="facet">true</str>
          <str name="fl">name</str>
          <str name="facet.mincount">1</str>
          <str name="q">name_autocomplete:sol</str>
          <str name="facet.limit">5</str>
          <str name="q.op">AND</str>
          <str name="facet.field">name_show</str>
          <str name="rows">0</str>
        </lst>
      </lst>
      <result name="response" numFound="2" start="0">
      </result>
      <lst name="facet_counts">
        <lst name="facet_queries"/>
        <lst name="facet_fields">
          <lst name="name_show">
            <int name="First Solr 4.0 CookBook">1</int>
            <int name="Second Solr 4.0 CookBook">1</int>
          </lst>
        </lst>
        <lst name="facet_dates"/>
        <lst name="facet_ranges"/>
      </lst>
    </response>

    As you can see, the faceting results returned by Solr are exactly what we were looking for. So now, let's see how it works.

How it works...

Our example documents are pretty simple: each consists only of an identifier and a name that we will use for autocomplete. The index structure is where things get interesting. The first two fields are the ones you would expect; they hold the identifier of the document and its name. However, we have two additional fields: the name_autocomplete field, which will be used for querying, and the name_show field, which will be used for faceting. The name_show field is based on the string type because we want a single token per name when faceting.

With the copy field definitions, we let Solr automatically copy the value of the field named by the source attribute to the field named by the dest attribute. Copying is done before any analysis, so each destination field receives the original text and applies its own analysis.

The name_autocomplete field is based on the text_autocomplete field type, which is defined differently for indexing and querying. At query time, we split the entered query on whitespace characters using solr.WhitespaceTokenizerFactory and lowercase the tokens using solr.LowerCaseFilterFactory. This is all the processing we want at query time. At index time, we use the same tokenizer and filter, and we add solr.EdgeNGramFilterFactory. We do this because we want our users to be able to search for prefixes efficiently: when someone enters sol, we would like to show all the products that have a word starting with that prefix, and solr.EdgeNGramFilterFactory allows us to do that. For the word solr, it will produce the tokens s, so, sol, and solr.
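
To make the two analysis chains more concrete, here is a rough Python emulation of them. It is only an illustrative sketch and not Solr's implementation: the function names are hypothetical, and the defaults simply mirror the minGramSize and maxGramSize values from the field type above.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace, as a whitespace tokenizer would."""
    return text.split()


def edge_ngrams(token, min_gram=1, max_gram=25):
    """Return the front-anchored prefixes of a token, shortest first."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def index_time_tokens(text, min_gram=1, max_gram=25):
    """Emulate the index-time chain: tokenize, lowercase, edge n-grams."""
    tokens = []
    for token in whitespace_tokenize(text):
        tokens.extend(edge_ngrams(token.lower(), min_gram, max_gram))
    return tokens


def query_time_tokens(text):
    """Emulate the query-time chain: tokenize and lowercase only."""
    return [token.lower() for token in whitespace_tokenize(text)]


def matches(query, name):
    """True if every query token is among the indexed grams (like q.op=AND)."""
    indexed = set(index_time_tokens(name))
    return all(token in indexed for token in query_time_tokens(query))


print(edge_ngrams("solr"))                        # ['s', 'so', 'sol', 'solr']
print(matches("sol", "First Solr 4.0 CookBook"))  # True
print(matches("olr", "First Solr 4.0 CookBook"))  # False
```

The last call shows why only prefixes match: olr is not the beginning of any word in the name, so no indexed gram equals it.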

We've also said that we are interested in grams starting from a single character (the minGramSize attribute) and that the maximum gram size allowed is 25 (the maxGramSize attribute).

Now comes the query. As you can see, we've sent the prefix that the user has entered to the name_autocomplete field (q=name_autocomplete:sol). In addition to this, we've said that we want the words in our query to be connected with the logical AND operator (the q.op parameter), and that we are not interested in the search results themselves (the rows=0 parameter). As we said, we use faceting for our autocomplete functionality because we need the number of documents with the same name, so we've turned faceting on (the facet=true parameter) and asked Solr to calculate it on our name_show field (the facet.field=name_show parameter). We are also only interested in facet values that have at least one matching document (facet.mincount=1), and we want only the top five of them (facet.limit=5).
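
If the request is built in application code rather than typed by hand, the same parameters can be assembled programmatically. The following Python sketch is one possible way to do it; the build_autocomplete_url helper is hypothetical, the base URL is taken from the example query, and the code assumes a single-word prefix that contains no Solr query syntax characters (real user input would need escaping first).

```python
from urllib.parse import urlencode


def build_autocomplete_url(prefix,
                           base_url="http://localhost:8983/solr/select",
                           limit=5):
    """Build the faceting request used for autocomplete (hypothetical helper)."""
    params = [
        ("q", "name_autocomplete:" + prefix),  # match on the edge n-gram field
        ("q.op", "AND"),                       # all query words must match
        ("rows", 0),                           # we do not need the documents
        ("facet", "true"),                     # turn faceting on
        ("facet.field", "name_show"),          # facet on the untokenized name
        ("facet.mincount", 1),                 # skip names with no matches
        ("facet.limit", limit),                # top N suggestions only
    ]
    # urlencode percent-encodes reserved characters such as the colon in q.
    return base_url + "?" + urlencode(params)


print(build_autocomplete_url("sol"))
```

Keeping the parameters in a list makes it easy to see that each one appears exactly once and that the separators between them are generated rather than typed.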

As you can see, we've got two distinct values in the faceting results, each with a single matching document, which is consistent with our sample data.
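
On the client side, the facet values still have to be turned into a list of suggestions. The following Python sketch shows one way this might be done with the standard library; the parse_suggestions helper is hypothetical, and the sample document is a trimmed copy of the facet_counts part of the response shown earlier.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (facet_counts section only).
SAMPLE_RESPONSE = """
<response>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="name_show">
        <int name="First Solr 4.0 CookBook">1</int>
        <int name="Second Solr 4.0 CookBook">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def parse_suggestions(xml_text, facet_field="name_show"):
    """Return (name, document count) pairs from a Solr XML facet response."""
    root = ET.fromstring(xml_text.strip())
    path = ("./lst[@name='facet_counts']/lst[@name='facet_fields']"
            "/lst[@name='%s']/int" % facet_field)
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]


for name, count in parse_suggestions(SAMPLE_RESPONSE):
    print("%s (%d)" % (name, count))
```

Each pair carries both the full product name to display and the number of documents that share it, which are the two pieces of information the recipe set out to show.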