Apache Solr 4 Cookbook

By: Rafał Kuć

Overview of this book

Apache Solr is a blazing-fast, scalable, open source enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, and relevancy tuning, among numerous other features. "Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance, and index and analyze your data to provide better, more precise, and more useful search data. "Apache Solr 4 Cookbook" will make your search better, more accurate, and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration. With numerous practical chapters centered on important Solr techniques and methods, "Apache Solr 4 Cookbook" is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4, including the capabilities of SolrCloud.

Clustering your data

After the release of Apache Solr 4.0, many users will want to leverage SolrCloud's distributed indexing and querying capabilities. It's not hard to upgrade your current cluster to SolrCloud, but there are some things you need to take care of. The following recipe walks you through the required configuration changes.

Getting ready

Before continuing, it is advisable to read the Installing a standalone ZooKeeper recipe in this chapter. It shows how to set up a ZooKeeper cluster that is ready for production use.

How to do it...

In order to use your old index structure with SolrCloud, you will need to add the following field to your fields definition (add the following fragment to the schema.xml file, to its fields section):

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

Now let's switch to the solrconfig.xml file, starting with the replication handler. First, you need to ensure that you have the replication handler set up. Remember that you shouldn't add master or slave specific configuration to it. So the replication handler's configuration should look like the following code:

<requestHandler name="/replication" class="solr.ReplicationHandler" />

In addition to that, you will need to have the administration panel handlers present, so the following configuration entry should be present in your solrconfig.xml file:

<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />

The last request handler that should be present is the real-time get handler, which should be defined as follows (the following should also be added to the solrconfig.xml file):

<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
    <str name="omitHeader">true</str>
  </lst>
</requestHandler>

The next thing SolrCloud needs in order to operate properly is the transaction log configuration. The following fragment should be added to the solrconfig.xml file (in the example configuration shipped with Solr 4, it sits inside the updateHandler section):

<updateLog>
  <str name="dir">${solr.data.dir:}</str>
</updateLog>

The last thing is the solr.xml file. It should point to the default cores administration address: the cores tag should have the adminPath property set to the /admin/cores value. An example solr.xml file could look like the following code:

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1" host="localhost" hostPort="8983" zkClientTimeout="15000">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>

That's all; the configuration files of your Solr instances are now ready to be used with SolrCloud.
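The checklist above can also be verified mechanically. The following Python sketch is not part of Solr: the function name check_solrcloud_readiness and the inline sample documents are hypothetical, and the check only looks for the elements this recipe adds. It uses nothing beyond the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample files containing only what this recipe adds.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
  </fields>
</schema>
"""

SAMPLE_SOLRCONFIG = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.data.dir:}</str>
    </updateLog>
  </updateHandler>
  <requestHandler name="/replication" class="solr.ReplicationHandler" />
  <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
  <requestHandler name="/get" class="solr.RealTimeGetHandler">
    <lst name="defaults">
      <str name="omitHeader">true</str>
    </lst>
  </requestHandler>
</config>
"""

# Handler name -> class expected by this recipe.
REQUIRED_HANDLERS = {
    "/replication": "solr.ReplicationHandler",
    "/admin/": "solr.admin.AdminHandlers",
    "/get": "solr.RealTimeGetHandler",
}


def check_solrcloud_readiness(schema_xml, solrconfig_xml):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    schema = ET.fromstring(schema_xml)
    config = ET.fromstring(solrconfig_xml)

    # 1. The _version_ field must exist and be both indexed and stored.
    version_field = next(
        (f for f in schema.iter("field") if f.get("name") == "_version_"), None
    )
    if version_field is None:
        problems.append("schema.xml: the _version_ field is missing")
    elif not (
        version_field.get("indexed") == "true"
        and version_field.get("stored") == "true"
    ):
        problems.append("schema.xml: _version_ must be indexed and stored")

    # 2. The three request handlers must be present with the expected classes.
    handlers = {h.get("name"): h for h in config.iter("requestHandler")}
    for name, cls in REQUIRED_HANDLERS.items():
        handler = handlers.get(name)
        if handler is None or handler.get("class") != cls:
            problems.append(f"solrconfig.xml: handler {name} should use {cls}")

    # 3. The replication handler must not carry master or slave sections.
    replication = handlers.get("/replication")
    if replication is not None:
        for lst in replication.iter("lst"):
            if lst.get("name") in ("master", "slave"):
                problems.append(
                    f"solrconfig.xml: remove the {lst.get('name')} section "
                    "from /replication"
                )

    # 4. The transaction log must be configured.
    if config.find(".//updateLog") is None:
        problems.append("solrconfig.xml: the updateLog section is missing")

    return problems


print(check_solrcloud_readiness(SAMPLE_SCHEMA, SAMPLE_SOLRCONFIG))
```

To check real files, read their contents from disk and pass the two strings to the function; each returned message names the file and the element that needs attention.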

How it works...

So now let's see why all those changes are needed in order to use our old configuration files with SolrCloud.

The _version_ field is used by Solr to enable document versioning and optimistic locking, which ensures that the newest version of your document won't be overwritten by mistake. Because of that, SolrCloud requires the _version_ field to be present in your index structure. Adding it is simple: you just need to add one more field definition that is stored, indexed, and based on the long type.
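To make the idea of optimistic locking concrete, here is a toy in-memory model in Python. It is only an illustration, not Solr code: the VersionedStore class and the simple counter used for version numbers are assumptions made for this sketch. The rules it applies follow the optimistic concurrency semantics described in the Solr documentation: a _version_ greater than 1 must match the stored version exactly, a value of 1 requires the document to exist, a negative value requires it not to exist, and 0 (or no value) disables the check.

```python
import itertools


class VersionConflict(Exception):
    """Raised when the supplied _version_ does not fit the stored state."""


class VersionedStore:
    """Toy document store that mimics optimistic locking on _version_."""

    def __init__(self):
        self._docs = {}
        # Assumption: a plain counter stands in for the version numbers
        # that a real Solr instance assigns to documents.
        self._versions = itertools.count(1000)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def update(self, doc):
        doc_id = doc["id"]
        expected = doc.get("_version_", 0)
        current = self._docs.get(doc_id)

        if expected > 1:
            # The caller claims to hold a specific version of the document.
            if current is None or current["_version_"] != expected:
                raise VersionConflict(f"{doc_id}: version {expected} is stale")
        elif expected == 1 and current is None:
            raise VersionConflict(f"{doc_id}: document must already exist")
        elif expected < 0 and current is not None:
            raise VersionConflict(f"{doc_id}: document must not exist yet")

        stored = dict(doc)
        stored["_version_"] = next(self._versions)
        self._docs[doc_id] = stored
        return stored["_version_"]


store = VersionedStore()
first = store.update({"id": "1", "title": "first draft"})
second = store.update({"id": "1", "title": "second draft", "_version_": first})

try:
    # A writer still holding the old version must not win.
    store.update({"id": "1", "title": "stale edit", "_version_": first})
except VersionConflict as error:
    print("rejected:", error)

print(store.get("1")["title"])
```

The stale writer is rejected instead of silently overwriting the newer document, which is the protection the _version_ field provides.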

As for the replication handler, you should remember not to add slave or master specific configuration, only the simple request handler definition, as shown in the previous example. The same applies to the administration panel handlers: they need to be available under the default URL address.

The real-time get handler is responsible for returning updated documents right away, even if neither a commit nor a softCommit command has been executed. This handler allows Solr (and also you) to retrieve the latest version of a document without reopening the searcher, and thus even if the document is not yet visible to usual search operations. The configuration is very similar to the usual request handler configuration: you need to add a new handler with the name property set to /get and the class property set to solr.RealTimeGetHandler. In addition to that, we want the handler to omit response headers (the omitHeader property set to true).
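As an illustration of how a client might address this handler, the following Python sketch builds a real-time get URL. The helper name realtime_get_url is hypothetical, and the base address http://localhost:8983/solr and the core name collection1 are assumptions taken from the example solr.xml in this recipe. The handler accepts a single identifier in the id parameter or a comma-separated list in the ids parameter.

```python
from urllib.parse import urlencode


def realtime_get_url(base_url, core, doc_ids, wt="json"):
    """Build a URL for the /get handler of the given core."""
    if not doc_ids:
        raise ValueError("at least one document identifier is required")

    if len(doc_ids) == 1:
        params = {"id": doc_ids[0]}
    else:
        # Several identifiers are sent as one comma-separated value.
        params = {"ids": ",".join(doc_ids)}
    params["wt"] = wt

    return f"{base_url.rstrip('/')}/{core}/get?{urlencode(params)}"


print(realtime_get_url("http://localhost:8983/solr", "collection1", ["1"]))
print(realtime_get_url("http://localhost:8983/solr", "collection1", ["1", "2"]))
```

Requesting such a URL right after sending a document for indexing should return that document even before a commit, provided the transaction log is enabled.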

One of the last things SolrCloud needs is the transaction log, which makes real-time get operations possible. The transaction log keeps track of all the uncommitted changes and enables the real-time get handler to retrieve them. To turn on the transaction log, add the updateLog tag to the solrconfig.xml file and specify the directory in which the transaction log directory should be created (by adding the dir property as shown in the example). In the configuration shown previously, we tell Solr to use the Solr data directory as the place to store the transaction log directory.
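The relationship between the transaction log, commits, and real-time get can be pictured with a small conceptual model. The Python sketch below is a deliberate simplification and not how Solr is implemented: the ToyCore class and its method names are hypothetical, and a dictionary stands in for both the log and the searchable index.

```python
class ToyCore:
    """Conceptual model: uncommitted updates live in a log until commit."""

    def __init__(self):
        self._searchable = {}   # what the current searcher can see
        self._update_log = {}   # uncommitted changes, keyed by document id

    def add(self, doc):
        # New and updated documents are recorded in the log first.
        self._update_log[doc["id"]] = doc

    def commit(self):
        # A commit makes the logged changes visible to searches.
        self._searchable.update(self._update_log)
        self._update_log.clear()

    def search_by_id(self, doc_id):
        # Regular searches see only committed documents.
        return self._searchable.get(doc_id)

    def realtime_get(self, doc_id):
        # Real-time get consults the log before the searchable index.
        if doc_id in self._update_log:
            return self._update_log[doc_id]
        return self._searchable.get(doc_id)


core = ToyCore()
core.add({"id": "1", "title": "fresh document"})

print(core.search_by_id("1"))   # not visible to search before the commit
print(core.realtime_get("1"))   # already retrievable through real-time get

core.commit()
print(core.search_by_id("1"))   # visible to search after the commit
```

The model shows why the real-time get handler depends on the transaction log: without a record of uncommitted changes there would be nothing to consult before the searcher is reopened.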

Finally, Solr needs you to keep the default address for the core administrative interface, so you should remember to have the adminPath property set to the value shown in the example (in the solr.xml file). This is needed in order for Solr to be able to manipulate cores.
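A short Python sketch can confirm that a solr.xml file keeps the expected adminPath and show the address under which the core administration interface would then be reachable. The function name core_admin_status_url is hypothetical, and the /solr context path is an assumption about how Solr is deployed; STATUS is one of the actions supported by the core administration handler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# The example solr.xml from this recipe.
SOLR_XML = """
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="localhost" hostPort="8983" zkClientTimeout="15000">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
"""


def core_admin_status_url(solr_xml, context="solr"):
    """Validate adminPath and return a core administration STATUS URL."""
    cores = ET.fromstring(solr_xml).find("cores")
    if cores is None:
        raise ValueError("solr.xml has no cores section")

    admin_path = cores.get("adminPath")
    if admin_path != "/admin/cores":
        raise ValueError(f"adminPath is {admin_path!r}, expected '/admin/cores'")

    # Assumption: fall back to common defaults when host or port is absent.
    host = cores.get("host", "localhost")
    port = cores.get("hostPort", "8983")
    query = urlencode({"action": "STATUS", "wt": "json"})
    return f"http://{host}:{port}/{context}{admin_path}?{query}"


print(core_admin_status_url(SOLR_XML))
```

If adminPath has been changed to something else, the function raises an error instead of returning a URL, mirroring the requirement that the default address be kept.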
