Apache Solr 4 Cookbook

By: Rafał Kuć

Overview of this book

Apache Solr is a blazing fast, scalable, open source Enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, and relevancy tuning, amongst other numerous features. "Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance as well as index and analyze your data to provide better, more precise, and useful search data. "Apache Solr 4 Cookbook" will make your search better, more accurate and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration. With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4 including the awesome capabilities of SolrCloud.

How to implement a product's autocomplete functionality

The autocomplete functionality is very popular now. You can find it on most e-commerce sites, on Google, Bing, and so on. It helps your users or clients find what they want quickly. In most cases, autocomplete also increases the relevance of your search by pointing to the right author, title, and so on right away, without the user having to look through the search results. What's more, sites that use autocomplete report higher revenue after deploying it than before. It seems like a win-win situation, both for you and your clients. So, let's look at how we can implement a product's autocomplete functionality in Solr.

How to do it...

Let's assume that we want to show the full product name whenever our users enter the beginning of any word in that name. In addition to this, we want to show the number of documents with the same name.

Let's start with some example data that is going to be indexed:

```
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">First Solr 4.0 CookBook</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Second Solr 4.0 CookBook</field>
  </doc>
</add>
```

We will need two main fields in the index: one for the document identifier and one for the name. We will also need two additional fields: one for autocomplete queries and one for faceting. So, our index structure will look similar to the following code snippet (we should add it to the fields section of the schema.xml file):
```
<field name="id" type="string" indexed="true" 
 stored="true" required="true" />
<field name="name" type="text" indexed="true" 
 stored="true" />
<field name="name_autocomplete" type="text_autocomplete" 
 indexed="true" stored="false" />
<field name="name_show" type="string" indexed="true" 
 stored="false" />
```
In addition to this, we want Solr to automatically copy the data from the name field to the name_autocomplete and name_show fields. So, we should add the following copy fields section to the schema.xml file:
```
<copyField source="name" dest="name_autocomplete"/>
<copyField source="name" dest="name_show"/>
```

Finally, we need to define the text_autocomplete field type in the schema.xml file. It should look similar to the following code snippet (place it in the types section of the schema.xml file):

```
<fieldType name="text_autocomplete" 
  class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" 
      minGramSize="1" maxGramSize="25" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

That's all. Now, if we would like to show our users all the products whose names contain a word starting with sol, we would send the following query:

```
curl 'http://localhost:8983/solr/select?q=name_autocomplete:sol&q.op=AND&rows=0&facet=true&facet.field=name_show&facet.mincount=1&facet.limit=5'
```

The response returned by Solr would be as follows:

```
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="fl">name</str>
      <str name="facet.mincount">1</str>
      <str name="q">name_autocomplete:sol</str>
      <str name="facet.limit">5</str>
      <str name="q.op">AND</str>
      <str name="facet.field">name_show</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="name_show">
        <int name="First Solr 4.0 CookBook">1</int>
        <int name="Second Solr 4.0 CookBook">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
```

As you can see, the faceting results returned by Solr are exactly what we were looking for. So now, let's see how it works.

How it works...

Our example documents are pretty simple: they consist only of an identifier and a name that we will use for autocomplete. The index structure is where things get interesting. The first two fields are the ones that you would expect: they hold the identifier of the document and its name. However, we have two additional fields: the name_autocomplete field, which will be used for querying, and the name_show field, which will be used for faceting. The name_show field is based on a string type, because we want to have a single token per name when using faceting.

With the use of the copy field sections, we can let Solr automatically copy the value of the field defined by the source attribute to the field defined by the dest attribute. Copying is done before any analysis, so each destination field receives the original value and then applies its own analysis.
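To illustrate why the faceting field is a string type, here is a minimal Python sketch that imitates the behavior outside Solr. It is not Solr code: the helper names (string_tokens, whitespace_lowercase_tokens, facet_counts) are hypothetical, and the analysis is only a rough approximation of what Solr does.

```python
from collections import Counter

# Hypothetical stand-ins for Solr behavior; these are not Solr APIs.
def string_tokens(value):
    # A string field is not analyzed: the whole value is a single token.
    return [value]

def whitespace_lowercase_tokens(value):
    # Rough approximation of a whitespace tokenizer plus a lowercase filter.
    return value.lower().split()

def facet_counts(values, tokenize):
    # Count the number of documents that contain each token.
    counts = Counter()
    for value in values:
        counts.update(set(tokenize(value)))
    return counts

names = ["First Solr 4.0 CookBook", "Second Solr 4.0 CookBook"]

# Both destination fields receive the same raw value from the name field;
# each one then applies its own analysis.
show_facets = facet_counts(names, string_tokens)
text_facets = facet_counts(names, whitespace_lowercase_tokens)

print(show_facets)          # one facet value per full product name
print(text_facets["solr"])  # a tokenized field would facet on single words
```

With the string-like analysis, each facet value is a complete product name, which is what we want to show to the user. With the tokenized analysis, the facet values would be individual words.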

The name_autocomplete field is based on the text_autocomplete field type, which is defined differently for indexing and querying. During query time, we divide the entered query on the basis of whitespace characters using solr.WhitespaceTokenizerFactory, and we lowercase the tokens with the use of solr.LowerCaseFilterFactory. For query time, this is all we want, because we don't need any more processing. For index time, we use the same tokenizer and filter, and we also add solr.EdgeNGramFilterFactory. This is because we want to allow our users to efficiently search for prefixes: when someone enters the word sol, we would like to show all the products that have a word starting with that prefix, and solr.EdgeNGramFilterFactory allows us to do that. For the word solr, it will produce the tokens s, so, sol, and solr.

We've also said that we are interested in grams starting from a single character (the minGramSize property) and that the maximum size of grams allowed is 25 (the maxGramSize property).
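The index-time and query-time chains of the text_autocomplete type can be imitated in a few lines of Python. The following is only a sketch under simplifying assumptions: the function names (edge_ngrams, index_tokens, query_tokens, matches) are hypothetical, whitespace splitting stands in for the tokenizer, and scoring is ignored.

```python
def edge_ngrams(token, min_gram=1, max_gram=25):
    # Prefixes of the token with lengths from min_gram up to max_gram.
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]

def index_tokens(text, min_gram=1, max_gram=25):
    # Index-time chain: split on whitespace, lowercase, build edge n-grams.
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return tokens

def query_tokens(text):
    # Query-time chain: split on whitespace and lowercase only.
    return text.lower().split()

def matches(query, name):
    # With q.op=AND, every query token has to be among the indexed grams.
    indexed = set(index_tokens(name))
    return all(token in indexed for token in query_tokens(query))

print(edge_ngrams("solr"))
print(matches("sol", "First Solr 4.0 CookBook"))
print(matches("olr", "First Solr 4.0 CookBook"))
```

The first call lists the grams for the word solr, the second shows that the prefix sol matches the product name, and the third shows that olr does not match, because only grams anchored at the beginning of a word are indexed.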

Now comes the query. As you can see, we've sent the prefix that the user has entered against the name_autocomplete field (q=name_autocomplete:sol). In addition to this, we've said that we want the words in our query to be connected with the logical AND operator (the q.op parameter), and that we are not interested in the search results themselves (the rows=0 parameter). As we said, we use faceting for our autocomplete functionality because we need the number of documents with the same name, so we've turned faceting on (the facet=true parameter) and asked for it to be calculated on our name_show field (the facet.field=name_show parameter). We are also only interested in facet values that have at least one matching document (facet.mincount=1), and we want the top five results (facet.limit=5).
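If the query is sent from application code instead of curl, the same parameters can be assembled programmatically. Here is a minimal Python sketch that uses only the standard library. The autocomplete_url function and the SOLR_SELECT constant are hypothetical names, the address is the one from the curl example, and the sketch assumes a single-word prefix without characters that are special in the query syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Address taken from the curl example; adjust it to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"

def autocomplete_url(prefix, limit=5):
    # Assumes a single-word prefix; no query syntax escaping is done here.
    params = [
        ("q", "name_autocomplete:" + prefix),
        ("q.op", "AND"),
        ("rows", 0),                   # we don't need the documents themselves
        ("facet", "true"),
        ("facet.field", "name_show"),
        ("facet.mincount", 1),         # skip names with no matching documents
        ("facet.limit", limit),        # number of suggestions to return
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = autocomplete_url("sol")
print(url)
# Decode the query string again to check what Solr would receive.
print(parse_qs(urlsplit(url).query)["q"])
```

Building the parameters as a list and letting urlencode escape them avoids hand-editing the URL, where a stray or duplicated ampersand is easy to introduce.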

As you can see, we've got two distinct values in the faceting results, each with a single document, which matches our sample data.
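To turn such a response into a list of suggestions, the facet section has to be read on the client side. The following Python sketch parses a trimmed copy of the XML response shown earlier with the standard library. The facet_suggestions function and the SAMPLE constant are hypothetical names, and the sketch assumes the XML response format used in this recipe.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from this recipe (header and result omitted).
SAMPLE = """<response>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="name_show">
        <int name="First Solr 4.0 CookBook">1</int>
        <int name="Second Solr 4.0 CookBook">1</int>
      </lst>
    </lst>
  </lst>
</response>"""

def facet_suggestions(xml_text, field="name_show"):
    # Return (name, document count) pairs in the order Solr sent them.
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']"
        "/lst[@name='facet_fields']"
        "/lst[@name='%s']/int" % field
    )
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]

for name, count in facet_suggestions(SAMPLE):
    print(name, count)
```

Each pair holds the full product name to display and the number of documents that share that name.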
