As you may already know, caches play a major role in a Solr deployment. And I'm not talking about some external cache; I'm talking about the three Solr caches:
Filter cache
Query result cache
Document cache
There is a fourth cache, Lucene's internal field cache, but you can't control its behavior. It is managed by Lucene and created when it is first used by the Searcher object.
With the help of these caches we can tune the behavior of the Solr searcher instance. In this task we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember – Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
Before you start tuning Solr caches you should get some information about your Solr instance. That information is as follows:
Number of documents in your index
Number of queries per second made to that index
Number of unique filter (the fq parameter) values in your queries
Maximum number of documents returned in a single query
Number of different queries and different sorts
For the purpose of this task I assumed the following numbers:
Number of documents in the index: 1,000,000
Number of queries per second: 100
Number of unique filters: 200
Maximum number of documents returned in a single query: 100
Number of different queries and different sorts: 500
Let's open the solrconfig.xml file and tune our caches. All the changes should be made in the query section of the file (the section between the <query> and </query> XML tags).
First goes the filter cache:
<filterCache class="solr.FastLRUCache" size="200" initialSize="200" autowarmCount="100"/>
Second goes the query result cache:
<queryResultCache class="solr.FastLRUCache" size="500" initialSize="500" autowarmCount="250"/>
Third we have the document cache:
<documentCache class="solr.FastLRUCache" size="11000" initialSize="11000" />
Of course the above configuration is based on the example values.
Next, let's set our query result window to match our needs; we sometimes need 20 to 30 more results than we originally asked for during query execution. So we change the appropriate value in the solrconfig.xml file to something like this:
<queryResultWindowSize>200</queryResultWindowSize>
And that's all!
Let's start with a little bit of explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. The so-called FastLRUCache tends to be faster when Solr puts less into the caches and gets more. This is the opposite of LRUCache, which tends to be more efficient when there are more put than get operations. We expect far more lookups than insertions, and that's why we use it.
This could be the first time you see a cache configuration, so I'll explain what the cache configuration parameters mean:
class: You probably figured that out by now. Yes, this is the class implementing the cache.
size: This is the maximum size that the cache can have.
initialSize: This is the initial size that the cache will have.
autowarmCount: This is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object, for example, during a commit operation.
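To make size and autowarmCount a bit more tangible, here is a toy LRU cache in Python. It is only a sketch and not Solr's implementation: the ToyLRUCache class and its warm method are hypothetical names, values are copied as-is during warming (a simplification), and initialSize is not modeled because Python dictionaries grow on their own.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy LRU cache for illustration only; not Solr's implementation."""

    def __init__(self, size):
        self.size = size                 # maximum number of entries
        self._entries = OrderedDict()    # least recently used entry first

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.size:
            self._entries.popitem(last=False)  # evict least recently used

    def keys(self):
        return list(self._entries)

    def warm(self, old_cache, autowarm_count):
        """Carry over the autowarm_count most recently used entries."""
        if autowarm_count <= 0:
            return
        for key in old_cache.keys()[-autowarm_count:]:
            self.put(key, old_cache._entries[key])


old = ToyLRUCache(size=3)
for key in "abcd":
    old.put(key, key.upper())   # "a" is evicted when "d" arrives
old.get("b")                    # "b" becomes the most recently used entry

new = ToyLRUCache(size=3)       # stands in for the cache of a new Searcher
new.warm(old, autowarm_count=2)
print(old.keys())               # ['c', 'd', 'b']
print(new.keys())               # ['d', 'b']
```

The old cache never holds more than size entries, and the new cache starts with only the autowarmCount most recently used ones instead of starting empty.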
As you can see, I tend to use the same number of entries for size and initialSize, and half of those values for autowarmCount. The size and initialSize properties can be set to the same value in order to avoid resizing the underlying Java object, which consumes additional processing time.
There is one thing you should be aware of. Some of the Solr caches (documentCache, actually) operate on internal identifiers called docid. Those caches cannot be automatically warmed. That's because the docid values change after every commit operation, and thus copying them is useless.
Please keep in mind that the cache sizes are usually good only for the moment you set them. During the life cycle of your application your data may change, your queries may change, and your users' behavior may, and probably will, change. That's why you should keep track of the cache usage with the Solr administration pages, JMX, or specialized software such as Scalable Performance Monitoring from Sematext (see more at http://sematext.com/spm/index.html), see how the utilization of each cache changes over time, and make the proper changes to the configuration.
There are a few additional things that you should know when configuring your caches.
If you use the term enumeration faceting method (parameter facet.method=enum), Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache size should be at least the number of unique facet values in all your faceted fields. This is crucial, and you may experience performance loss if this cache is not configured the right way.
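The rule above can be sketched as a small calculation. This is only a rough estimate under stated assumptions: the function name, the field names (category, brand), and their value counts are hypothetical, and adding the number of unique fq filters on top of the facet values is my own assumption so that ordinary filters are not pushed out by facet terms.

```python
def min_filter_cache_size(unique_filters, unique_values_per_facet_field,
                          uses_enum_faceting):
    """Estimate a lower bound for the filterCache size.

    unique_values_per_facet_field maps a faceted field name to the number
    of unique values in that field (hypothetical numbers in this sketch).
    """
    if not uses_enum_faceting:
        # Without facet.method=enum only the fq filters need a slot.
        return unique_filters
    # With facet.method=enum every unique facet value needs a slot too.
    # Keeping room for the fq filters as well is an assumption of this sketch.
    return unique_filters + sum(unique_values_per_facet_field.values())


facet_fields = {"category": 50, "brand": 1200}  # hypothetical fields and counts
print(min_filter_cache_size(200, facet_fields, uses_enum_faceting=True))   # 1450
print(min_filter_cache_size(200, facet_fields, uses_enum_faceting=False))  # 200
```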
When your Solr instance has a low cache hit ratio you should consider not using caches at all (to see the hit ratio you can use the administration pages of Solr). Cache insertion is not free – it costs CPU time and resources. So if you see that you have a very low cache hit ratio, you should consider turning your caches off – it may speed up your Solr instance. Before you turn off the caches please ensure that you have the right cache setup – a small hit ratio can be a result of bad cache configuration.
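The hit ratio is simply the number of hits divided by the number of lookups, two counters that the Solr cache statistics report. The following sketch shows the calculation; the looks_underused helper and its 0.1 threshold are hypothetical and only illustrate the idea, so pick a threshold that fits your own deployment.

```python
def hit_ratio(lookups, hits):
    """Fraction of cache lookups that were answered from the cache."""
    if lookups == 0:
        return 0.0  # no traffic yet, so nothing to judge
    return hits / lookups


def looks_underused(lookups, hits, threshold=0.1):
    """Flag a cache whose hit ratio is below a (hypothetical) threshold."""
    return lookups > 0 and hit_ratio(lookups, hits) < threshold


print(hit_ratio(lookups=2000, hits=1500))       # 0.75
print(looks_underused(lookups=2000, hits=100))  # True
```

A flagged cache is a reason to check its configuration first, not to switch it off right away.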
When your Solr instance uses put operations more than get operations, you should consider using the solr.LRUCache implementation. It's confirmed that this implementation behaves better when there are more insertions into the cache than lookups.
The filter cache is responsible for holding information about the filters and the documents that match each filter. More precisely, this cache holds an unordered set of document IDs that match the filter. If you don't use the faceting mechanism with a filter cache, you should at least set its size to the number of unique filters that are present in your queries. This way it will be possible for Solr to store all the unique filters with their matching document IDs, and this will speed up the queries that use filters.
The query result cache holds the ordered set of internal IDs of documents that match the given query and the sort specified. That's why, if you use caches, you should add as many filters as you can and keep your query (the q parameter) as clean as possible. For example, pass only the search box content of your search application to the query parameter. If the same query is run more than once and the cache has enough capacity to hold the entry, the cache will be used to return the IDs of the documents that match the query. Thus no Lucene query will be made (Solr uses Lucene to index and query data), saving the precious I/O operations for the queries that are not in the cache; this will boost your Solr instance's performance.
The maximum size of this cache that I tend to set is the number of unique queries and their sorts that are handled by my Solr instance in the time between invalidations of the Searcher object. This tends to be enough in most cases.
The document cache holds the Lucene documents that were fetched from the index. Basically, this cache holds the stored fields of all the documents that are gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum number of results you get from Solr. This cache can't be automatically warmed because every commit changes the internal IDs of the documents. Remember that the cache can consume a lot of memory if you have many stored fields, so there will be times when you just have to live with evictions.
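Putting the sizing rules for the three caches together, the example values can be reproduced with a short calculation. Treat it as a sketch under assumptions: the function names are hypothetical, the number of queries per second is used as a stand-in for the number of concurrent queries, and the 10 percent headroom for the document cache is my own choice that happens to give 11000; the rule itself only says the size should be greater than the product.

```python
def suggest_cache_sizes(unique_filters, unique_queries_and_sorts,
                        concurrent_queries, max_rows, headroom_percent=10):
    """Return suggested (size, autowarmCount) values per cache."""
    product = concurrent_queries * max_rows
    # Integer arithmetic avoids floating point surprises; the result is
    # always strictly greater than the product.
    document_size = max(product + product * headroom_percent // 100,
                        product + 1)
    return {
        "filterCache": (unique_filters, unique_filters // 2),
        "queryResultCache": (unique_queries_and_sorts,
                             unique_queries_and_sorts // 2),
        "documentCache": (document_size, None),  # cannot be autowarmed
    }


def render(name, size, autowarm):
    """Render one cache element for the query section of solrconfig.xml."""
    attrs = f'class="solr.FastLRUCache" size="{size}" initialSize="{size}"'
    if autowarm is not None:
        attrs += f' autowarmCount="{autowarm}"'
    return f"<{name} {attrs}/>"


sizes = suggest_cache_sizes(unique_filters=200, unique_queries_and_sorts=500,
                            concurrent_queries=100, max_rows=100)
for name, (size, autowarm) in sizes.items():
    print(render(name, size, autowarm))
```

The printed elements match the sizes used in this task, which makes it easy to rerun the calculation when your numbers change.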
The last thing is the query result window. This parameter tells Solr how many documents to fetch from the index in a single Lucene query. This is a kind of superset of the documents requested. In our example, we tell Solr that we want a maximum of one hundred documents as a result of a single query. Our query result window tells Solr to always gather two hundred documents. Then, when we need some more documents that follow the first hundred, they will be fetched from the cache, and therefore we will be saving our resources. The size of the query result window mostly depends on the application and how it is using Solr. If you tend to do a lot of paging, you should consider using a higher query result window value.
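The effect of the query result window on paging can be modeled with a little arithmetic. This is a simplified model, not Solr's code: the function names are hypothetical, and it assumes the same query and sort are repeated and that the cached entry has not been evicted in the meantime.

```python
def superset_end(start, rows, window_size):
    """Number of leading results gathered for a request, rounded up to a
    whole multiple of the query result window."""
    requested_end = start + rows
    windows = max(1, -(-requested_end // window_size))  # ceiling division
    return windows * window_size


def served_from_cache(first, following, window_size):
    """True if the following (start, rows) request fits inside the superset
    gathered for the first request."""
    cached_end = superset_end(first[0], first[1], window_size)
    return following[0] + following[1] <= cached_end


window = 200
print(superset_end(0, 100, window))                    # 200
print(served_from_cache((0, 100), (100, 100), window)) # True
print(served_from_cache((0, 100), (200, 100), window)) # False
```

With a window of 200 and pages of 100, the second page comes from the cache and only the third page needs a new Lucene query, which is why heavy paging favors a larger window.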
Tip
You should remember that the size of caches shown in this task is not final, and you should adapt them to your application needs. The values and the method of their calculation should only be taken as a starting point to further observation and optimization of the process. Also, please remember to monitor your Solr instance memory usage as using caches will affect the memory that is used by the JVM.
There is another way to warm your caches if you know the most common queries that are sent to your Solr instance – auto-warming queries. Please refer to the Improving Solr performance right after a startup or commit operation recipe in Chapter 6, Improving Solr Performance. For information on how to cache whole pages of results please refer to the Caching whole result pages recipe in Chapter 6, Improving Solr Performance.