Apache Solr 4 Cookbook

By: Rafał Kuć

Overview of this book

Apache Solr is a blazing fast, scalable, open source enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, and relevancy tuning, among numerous other features.

"Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance, as well as index and analyze your data to provide better, more precise, and useful search data.

"Apache Solr 4 Cookbook" will make your search better, more accurate, and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration.

With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4, including the awesome capabilities of SolrCloud.

Solr cache configuration


As you may already know, caches play a major role in a Solr deployment. And I'm not talking about some external cache; I'm talking about the three Solr caches:

  • Filter cache: This is used for storing the results of filters (the fq query parameter) and by faceting, mainly the enum method

  • Document cache: This is used for storing Lucene documents which hold stored fields

  • Query result cache: This is used for storing results of queries

There is a fourth cache – Lucene's internal cache – which is a field cache, but you can't control its behavior. It is managed by Lucene and created when it is first used by the Searcher object.

With the help of these caches we can tune the behavior of the Solr searcher instance. In this task we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember – Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.

Getting ready

Before you start tuning Solr caches you should get some information about your Solr instance. That information is as follows:

  • Number of documents in your index

  • Number of queries per second made to that index

  • Number of unique filter (the fq parameter) values in your queries

  • Maximum number of documents returned in a single query

  • Number of different queries and different sorts

All these numbers can be derived from Solr logs.

How to do it...

For the purpose of this task I assumed the following numbers:

  • Number of documents in the index: 1,000,000

  • Number of queries per second: 100

  • Number of unique filters: 200

  • Maximum number of documents returned in a single query: 100

  • Number of different queries and different sorts: 500

Let's open the solrconfig.xml file and tune our caches. All the changes should be made in the query section of the file (the section between <query> and </query> XML tags).

  1. First goes the filter cache:

    <filterCache
       class="solr.FastLRUCache"
       size="200"
       initialSize="200"
       autowarmCount="100"/>
  2. Second goes the query result cache:

    <queryResultCache
       class="solr.FastLRUCache"
       size="500"
       initialSize="500"
       autowarmCount="250"/>
  3. Third we have the document cache:

    <documentCache
       class="solr.FastLRUCache"
       size="11000"
       initialSize="11000" />

    Of course the above configuration is based on the example values.

  4. Next, let's set the query result window to match our needs. We sometimes need 20–30 more results than we asked for during query execution, so we change the appropriate value in the solrconfig.xml file to something like this:

    <queryResultWindowSize>200</queryResultWindowSize>

And that's all!

How it works...

Let's start with a little bit of explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. FastLRUCache tends to be faster when Solr gets entries from a cache more often than it puts them in. This is the opposite of LRUCache, which tends to be more efficient when there are more put operations than get operations. That's why we use FastLRUCache here: we expect more gets than puts.

This could be the first time you see a cache configuration, so I'll explain what the cache configuration parameters mean:

  • class: You probably figured that out by now. Yes, this is the class implementing the cache.

  • size: This is the maximum number of entries that the cache can hold.

  • initialSize: This is the initial capacity (number of entries) that the cache is created with.

  • autowarmCount: This is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object – for example, during a commit operation.

As you can see, I tend to use the same number of entries for size and initialSize, and half of those values for autowarmCount. The size and initialSize properties can be set to the same size in order to avoid the underlying Java object resizing, which consumes additional processing time.

There is one thing you should be aware of. Some of the Solr caches (the documentCache, actually) operate on internal identifiers called docid. Those caches cannot be automatically warmed, because the docid values change after every commit operation, so copying them to the new cache would be useless.

Please keep in mind that the cache sizes you choose are usually good only for the moment you set them. During the life cycle of your application your data may change, your queries may change, and your users' behavior may, and probably will, change. That's why you should keep track of cache usage with the Solr administration pages, JMX, or specialized software such as Scalable Performance Monitoring from Sematext (see more at http://sematext.com/spm/index.html), see how the utilization of each cache changes over time, and make the proper changes to the configuration.
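
If you want to follow the cache statistics over JMX, Solr first has to register its MBeans. A minimal sketch, assuming the JVM already has an MBean server running (for example, one started with the standard com.sun.management.jmxremote system property), is to place the following element directly under the root element of solrconfig.xml, outside the query section:

```xml
<!-- Registers Solr's MBeans, including the cache statistics,
     with an MBean server that already exists in the JVM -->
<jmx />
```

The counters exposed this way (such as lookups, hits, hit ratio, inserts, and evictions) should be the same ones that the administration pages display for each cache.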

There's more...

There are a few additional things that you should know when configuring your caches.

Using a filter cache with faceting

If you use the term enumeration faceting method (parameter facet.method=enum), Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache should be at least as large as the number of unique facet values in all your faceted fields. This is crucial, and you may experience performance loss if this cache is not configured the right way.
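
As an illustration only, suppose that the faceted fields hold about 5,000 unique values in total, on top of the 200 unique filters assumed earlier in this recipe. The facet value count is a made-up figure, not one measured on a real index. Under that assumption the filter cache could be sized like this:

```xml
<!-- Hypothetical sizing: 200 unique filters plus roughly 5,000 unique
     facet values (the facet value count is an assumed figure) -->
<filterCache
   class="solr.FastLRUCache"
   size="5200"
   initialSize="5200"
   autowarmCount="2600"/>
```

The autowarmCount value simply follows the habit used in this recipe of warming half of the entries; you may want a lower number if warming takes too long after a commit.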

When we have no cache hits

When your Solr instance has a low cache hit ratio you should consider not using caches at all (to see the hit ratio you can use the administration pages of Solr). Cache insertion is not free – it costs CPU time and resources. So if you see that you have a very low cache hit ratio, you should consider turning your caches off – it may speed up your Solr instance. Before you turn off the caches please ensure that you have the right cache setup – a small hit ratio can be a result of bad cache configuration.
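
One way to switch a cache off, assuming you would like to keep the old settings around for later, is to comment its element out in the query section of solrconfig.xml. The following sketch does that for the query result cache, using the example values from this recipe:

```xml
<!-- Query result cache disabled: with no queryResultCache element
     present, Solr does not create this cache at all.
<queryResultCache
   class="solr.FastLRUCache"
   size="500"
   initialSize="500"
   autowarmCount="250"/>
-->
```

It is worth comparing response times before and after such a change, as the effect depends on your queries.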

When we have more "puts" than "gets"

When your Solr instance performs more put operations than get operations you should consider using the solr.LRUCache implementation. This implementation tends to behave better when there are more insertions into the cache than lookups.
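
Switching the implementation only requires changing the class attribute. A sketch for the filter cache, reusing the example sizes from this recipe (which may not suit a put-heavy workload and are shown only to keep the comparison simple):

```xml
<!-- Same example sizes as before; only the implementation class changes -->
<filterCache
   class="solr.LRUCache"
   size="200"
   initialSize="200"
   autowarmCount="100"/>
```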

Filter cache

This cache is responsible for holding information about the filters and the documents that match the filter. Actually this cache holds an unordered set of document IDs that match the filter. If you don't use the faceting mechanism with a filter cache, you should at least set its size to the number of unique filters that are present in your queries. This way it will be possible for Solr to store all the unique filters with their matching document IDs and this will speed up the queries that use filters.

Query result cache

The query result cache holds the ordered set of internal IDs of documents that match the given query and the sort specified. That's why, if you use caches, you should move as many conditions as you can into filters and keep your query (the q parameter) as clean as possible. For example, pass only the search box content of your search application to the query parameter. If the same query is run more than once and the cache has enough capacity to hold the entry, the cache will supply the IDs of the matching documents and no Lucene query will be executed (Solr uses Lucene to index and query data). This saves precious I/O operations for the queries that are not in the cache and will boost your Solr instance's performance.

The maximum size of this cache that I tend to set is the number of unique queries and their sorts that are handled by my Solr instance between two consecutive invalidations of the Searcher object. This tends to be enough in most cases.

Document cache

The document cache holds the Lucene documents that were fetched from the index. Basically, this cache holds the stored fields of all the documents that are gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum results you get from Solr. This cache can't be automatically warmed – that is because every commit is changing the internal IDs of the documents. Remember that the cache can be memory consuming in case you have many stored fields, so there will be times when you just have to live with evictions.
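
The value of 11000 used earlier in this recipe can be traced back to that rule. The sketch below assumes that the 100 queries per second from the example can be treated as an upper bound on the number of concurrent queries, which is a simplification rather than a measured figure:

```xml
<!-- Assumed upper bound on concurrent queries: 100 (the queries per second figure)
     Maximum results per query: 100
     Lower bound: 100 * 100 = 10,000 entries; 11,000 leaves some headroom -->
<documentCache
   class="solr.FastLRUCache"
   size="11000"
   initialSize="11000"/>
```

There is no autowarmCount attribute here because this cache cannot be automatically warmed.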

Query result window

The last thing is the query result window. This parameter tells Solr how many documents to fetch from the index in a single Lucene query. This is a kind of super set of documents fetched. In our example, we tell Solr that we want the maximum of one hundred documents as a result of a single query. Our query result window tells Solr to always gather two hundred documents. Then when we need some more documents that follow the first hundred they will be fetched from the cache, and therefore we will be saving our resources. The size of the query result window is mostly dependent on the application and how it is using Solr. If you tend to do a lot of paging, you should consider using a higher query result window value.
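
The example setting can be read as follows. The comment describes the expected behavior under the assumption that the query result cache is enabled and large enough to keep the entry:

```xml
<!-- With a window of 200, a request for results 0 to 99 makes Solr collect
     the IDs of results 0 to 199 and store them in the query result cache.
     A follow-up request for results 100 to 199 of the same query and sort
     can then be answered from that cache entry. -->
<queryResultWindowSize>200</queryResultWindowSize>
```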

Tip

You should remember that the cache sizes shown in this task are not final, and you should adapt them to your application's needs. The values and the method of their calculation should only be taken as a starting point for further observation and optimization. Also, please remember to monitor your Solr instance's memory usage, as using caches will affect the memory that is used by the JVM.

See also

There is another way to warm your caches if you know the most common queries that are sent to your Solr instance – auto-warming queries. Please refer to the Improving Solr performance right after a startup or commit operation recipe in Chapter 6, Improving Solr Performance. For information on how to cache whole pages of results please refer to the Caching whole result pages recipe in Chapter 6, Improving Solr Performance.
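
As a rough sketch of what auto-warming queries look like, the following listener can be placed in the query section of solrconfig.xml. The queries, the price field, and the category field are hypothetical and should be replaced with the ones that are actually common in your application:

```xml
<!-- Runs the listed queries against every new Searcher before it starts serving requests -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Hypothetical warming queries; replace them with your own frequent ones -->
    <lst>
      <str name="q">solr cookbook</str>
      <str name="sort">price asc</str>
    </lst>
    <lst>
      <str name="q">*:*</str>
      <str name="fq">category:books</str>
    </lst>
  </arr>
</listener>
```

A similar listener with event="firstSearcher" can be used for the queries that should run right after Solr starts.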