Apache Solr 4 Cookbook

By: Rafał Kuć

Overview of this book

Apache Solr is a blazing fast, scalable, open source Enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, and relevancy tuning, amongst other numerous features. "Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance as well as index and analyze your data to provide better, more precise, and useful search data. "Apache Solr 4 Cookbook" will make your search better, more accurate and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration. With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4 including the awesome capabilities of SolrCloud.

How to get the documents with all the query words to the top of the results set

One of the most common problems that users struggle with when using Apache Solr is how to improve the relevancy of their results. Of course, relevancy tuning is, in most cases, connected to your business needs, but one of the common problems is to have documents that have all the query words in their fields at the top of the results list. You can imagine a situation where you search for all the documents that match at least a single query word, but you would like to show the ones with all the query words first. This recipe will show you how to achieve that.

How to do it...

This recipe will show how we can get the documents with all the query words to the top of the results set.

Let's start with the following index structure (add it to the field section in your schema.xml file):

<field name="id" type="string" indexed="true" 
  stored="true" required="true" />
<field name="name" type="text" indexed="true" 
  stored="true" />
<field name="description" type="text" indexed="true" 
  stored="true" />

The second step is to index the following sample data:

<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Solr and all the others</field>
    <field name="description">This is about Solr</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Lucene and all the others</field>
    <field name="description">
      This is a book about Solr and Lucene
    </field>
  </doc>
</add>

Let's assume that our usual queries look similar to the following:

http://localhost:8983/solr/select?q=solr book&defType=edismax&mm=1&qf=name^10000+description
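The query above contains a space and a caret, which should be percent-encoded before the request is sent. As a minimal sketch, assuming a Python client (the helper name build_select_url is hypothetical and not part of the recipe), the same URL could be assembled like this:

```python
from urllib.parse import urlencode

# Base URL taken from the recipe; adjust host, port, and path for your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(user_query: str) -> str:
    """Return the recipe's edismax query URL with all parameters encoded.

    build_select_url is a hypothetical helper used only for illustration.
    """
    params = {
        "q": user_query,
        "defType": "edismax",
        "mm": "1",
        "qf": "name^10000 description",
    }
    # urlencode turns spaces into '+' and '^' into '%5E'.
    return SOLR_SELECT + "?" + urlencode(params)


url = build_select_url("solr book")
print(url)
```

The parameters and their values are exactly those of the query shown above; only the encoding differs.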

Nothing complicated; however, the results of such a query don't satisfy us, because they look similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="qf">name^10000 description</str>
      <str name="mm">1</str>
      <str name="q">solr book</str>
      <str name="defType">edismax</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">Solr and all the others</str>
      <str name="description">This is about Solr</str>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="name">Lucene and all the others</str>
      <str name="description">
        This is a book about Solr and Lucene
      </str>
    </doc>
  </result>
</response>

In order to change this, let's introduce a new handler in our solrconfig.xml file:

<requestHandler name="/better" 
  class="solr.StandardRequestHandler">
  <lst name="defaults">
    <str name="indent">true</str>
    <str name="q">
      _query_:"{!edismax qf=$qfQuery mm=$mmQuery pf=$pfQuery
        bq=$boostQuery v=$mainQuery}"
    </str>
    <str name="qfQuery">name^100000 description</str>
    <str name="mmQuery">1</str>
    <str name="pfQuery">name description</str>
    <str name="boostQuery">
      _query_:"{!edismax qf=$boostQueryQf mm=100%
        v=$mainQuery}"^100000
    </str>
    <str name="boostQueryQf">name description</str>
  </lst>
</requestHandler>

So, let's send a query to our new handler:

http://localhost:8983/solr/better?mainQuery=solr book

We get the following results:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">2</str>
      <str name="name">Lucene and all the others</str>
      <str name="description">
        This is a book about Solr and Lucene
      </str>
    </doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Solr and all the others</str>
      <str name="description">This is about Solr</str>
    </doc>
  </result>
</response>

As you can see, it works. So let's discuss how.

How it works...

For the purpose of the recipe, we've used a simple index structure that consists of a document identifier, its name, and description. Our data is very simple as well; it just contains two documents.

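With the first query, the document with the identifier 1 is placed at the top of the results: its name field, which carries a very high boost, contains one of the query words, even though the document doesn't contain the other one. What we would like to achieve is to keep the boost on the name field, but still have the documents that contain all the query words first. In addition to this, we would like to have the documents with words from the query close to each other at the top of the results.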

In order to do this, we've defined a new request handler named /better, which leverages local params. The first thing it defines is the q parameter, which holds the standard query. It uses the Extended DisMax parser (the {!edismax part of the query), and defines several additional parameters:

qf: This defines the fields against which edismax should run the query. By using the $qfQuery value, we tell Solr that we will provide the fields in the qfQuery parameter.
mm: This is the "minimum should match" parameter, which tells edismax how many words from the query should be found in a document for the document to be considered a match. By using the $mmQuery value, we tell Solr that we will provide the value in the mmQuery parameter.
pf: This is the phrase fields definition, which specifies the fields on which Solr should generate phrase queries automatically. Similar to the previous parameters, by using the $pfQuery value we tell Solr that we will provide the fields in the pfQuery parameter.
bq: This is the boost query that will be used to boost the documents. Again, we use the parameter dereferencing functionality: by using the $boostQuery value, we tell Solr that we will provide the value in the boostQuery parameter.
v: This is the final parameter, which specifies the content of the query; in our case, the user query will be specified in the mainQuery parameter.
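To make the parameter dereferencing described in the list above more tangible, here is a toy sketch in Python. It is not how Solr implements local params; the function resolve_local_params is hypothetical and only mimics the idea that a $name value is looked up among the other request parameters (the handler defaults plus whatever the client sends).

```python
def resolve_local_params(spec: str, request_params: dict) -> dict:
    """Toy model of dereferencing: map each key to its resolved value.

    Hypothetical helper for illustration only; Solr's real parser also
    handles quoting, nesting, and many other details.
    """
    resolved = {}
    for token in spec.split():
        key, _, value = token.partition("=")
        if value.startswith("$"):
            # $name means: take the value of the request parameter "name".
            value = request_params[value[1:]]
        resolved[key] = value
    return resolved


# Defaults copied from the /better handler definition in the recipe.
handler_defaults = {
    "qfQuery": "name^100000 description",
    "mmQuery": "1",
    "pfQuery": "name description",
    "boostQuery": '_query_:"{!edismax qf=$boostQueryQf mm=100% v=$mainQuery}"^100000',
    "boostQueryQf": "name description",
}
# The client only has to send the user query in mainQuery.
request = dict(handler_defaults, mainQuery="solr book")

main = resolve_local_params(
    "qf=$qfQuery mm=$mmQuery pf=$pfQuery bq=$boostQuery v=$mainQuery", request
)
for key, value in main.items():
    print(key, "->", value)
```

Note that the bq value still contains its own $boostQueryQf and $mainQuery references; in this sketch they stay unresolved, because they belong to the nested boost query and would be resolved when that query is parsed.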

Basically, the preceding query says that we will use the edismax query parser together with phrase and boost queries. Now let's discuss the values of the parameters.

The first one is the qfQuery parameter, which plays the same role as the qf parameter in the first query we sent to Solr: it specifies the fields that we want to be searched and their boosts. Next, we have the mmQuery parameter set to 1, which will be used as mm in edismax; this means that a document is considered a match when at least a single word from the query is found in it. As you will remember, the pfQuery parameter value is passed to the pf parameter, so phrase queries will be generated automatically on the fields listed there.

Now, the last and probably the most important part of the query, the boostQuery parameter, specifies the value that will be passed to the bq parameter. Our boost query is very similar to our main query; however, we say that it should only match the documents that have all the words from the query (the mm=100% parameter). We also specify that the documents that match that query should be boosted by adding the ^100000 part at the end of it.
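The difference between mm=1 and mm=100% can be illustrated with a small toy model. This is a sketch under simplifying assumptions, not Solr's matching or scoring: text is only lowercased and split on whitespace (no analysis chain), and the function names are hypothetical. It only shows which of the two sample documents would be returned by the main query and which would additionally match the boost query.

```python
def words(text):
    """Lowercase and split on whitespace; a crude stand-in for text analysis."""
    return set(text.lower().split())


def matched_count(doc, query, fields):
    """Count the query words found in any of the given fields."""
    doc_words = set()
    for field in fields:
        doc_words |= words(doc[field])
    return len(words(query) & doc_words)


def satisfies_mm(doc, query, fields, mm):
    """mm is an absolute count such as "1" or a percentage such as "100%"."""
    total = len(words(query))
    if mm.endswith("%"):
        required = total * int(mm[:-1]) // 100
    else:
        required = int(mm)
    return matched_count(doc, query, fields) >= required


# The two sample documents from the recipe.
docs = [
    {"id": "1", "name": "Solr and all the others",
     "description": "This is about Solr"},
    {"id": "2", "name": "Lucene and all the others",
     "description": "This is a book about Solr and Lucene"},
]
fields = ["name", "description"]

for doc in docs:
    in_results = satisfies_mm(doc, "solr book", fields, "1")    # main query
    boosted = satisfies_mm(doc, "solr book", fields, "100%")    # boost query
    print(doc["id"], in_results, boosted)
```

Both documents satisfy mm=1, so both are returned, but only the document with the identifier 2 contains both query words and therefore matches the boost query.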

To sum up, the parameters of our query promote the documents that have all the words from the query present in the fields we want to search on. In addition to this, we promote the documents that have phrases matched. So finally, let's look at how the newly created handler works. As you can see, when we provide our query in the mainQuery parameter, the document with the identifier 2, which was previously second, is now placed first. So, we have achieved what we wanted.
