One of the most common problems that users struggle with when using Apache Solr is how to improve the relevancy of their results. Relevancy tuning is, in most cases, tied to your business needs, but one requirement comes up again and again: documents that contain all the query words in their fields should appear at the top of the results list. Imagine a situation where you search for all the documents that match at least a single query word, but you would like to show the ones with all the query words first. This recipe will show you how to achieve that.
Let's start with the following index structure (add it to the field section in your schema.xml file):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="description" type="text" indexed="true" stored="true" />
The second step is to index the following sample data:
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Solr and all the others</field>
    <field name="description">This is about Solr</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Lucene and all the others</field>
    <field name="description">This is a book about Solr and Lucene</field>
  </doc>
</add>
Let's assume that our usual queries look similar to the following code snippet:
http://localhost:8983/solr/select?q=solr+book&defType=edismax&mm=1&qf=name^10000+description
Nothing complicated; however, the results of such a query don't satisfy us, because the document that contains only one of the query words is returned first:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="qf">name^10000 description</str>
      <str name="mm">1</str>
      <str name="q">solr book</str>
      <str name="defType">edismax</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">Solr and all the others</str>
      <str name="description">This is about Solr</str>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="name">Lucene and all the others</str>
      <str name="description">This is a book about Solr and Lucene</str>
    </doc>
  </result>
</response>
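Because the query contains a space and a caret, it can be convenient to let a library take care of the URL encoding. The following is only a minimal sketch: it assumes the host, port, and path used in this recipe, and the helper name build_select_url is ours, not part of Solr.

```python
from urllib.parse import urlencode

# Base URL taken from the recipe; adjust it to your own Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(user_query, base=SOLR_SELECT):
    """Return the recipe's first query as an encoded URL (illustrative helper)."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "mm": "1",
        "qf": "name^10000 description",
    }
    # urlencode turns spaces into '+' and '^' into '%5E'.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url("solr book"))
```

The encoded form is equivalent to the URL shown earlier; Solr decodes the parameters before parsing the query.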
To get the document with all the query words ranked first, let's introduce a new handler in our solrconfig.xml file:
<requestHandler name="/better" class="solr.StandardRequestHandler">
  <lst name="defaults">
    <str name="indent">true</str>
    <str name="q">
      _query_:"{!edismax qf=$qfQuery mm=$mmQuery pf=$pfQuery bq=$boostQuery v=$mainQuery}"
    </str>
    <str name="qfQuery">name^100000 description</str>
    <str name="mmQuery">1</str>
    <str name="pfQuery">name description</str>
    <str name="boostQuery">
      _query_:"{!edismax qf=$boostQueryQf mm=100% v=$mainQuery}"^100000
    </str>
    <str name="boostQueryQf">name description</str>
  </lst>
</requestHandler>
So, let's send a query to our new handler:
http://localhost:8983/solr/better?mainQuery=solr+book
We get the following results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">2</str>
      <str name="name">Lucene and all the others</str>
      <str name="description">This is a book about Solr and Lucene</str>
    </doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Solr and all the others</str>
      <str name="description">This is about Solr</str>
    </doc>
  </result>
</response>
As you can see, it works. So let's discuss how.
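If you would rather check the ordering programmatically than by eye, the response can be parsed with a few lines of code. This is just a sketch: the helper name result_ids is ours, and the sample string is a trimmed copy of the response returned by the /better handler in this recipe (the XML declaration and header are left out for brevity).

```python
import xml.etree.ElementTree as ET


def result_ids(response_xml):
    """Return the document ids in the order Solr listed them (illustrative helper)."""
    root = ET.fromstring(response_xml)
    ids = []
    for doc in root.iter("doc"):
        for field in doc.findall("str"):
            if field.get("name") == "id":
                ids.append(field.text.strip())
    return ids


# Trimmed copy of the response returned by the /better handler.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">2</str>
      <str name="name">Lucene and all the others</str>
    </doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Solr and all the others</str>
    </doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(result_ids(SAMPLE))
```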
For the purpose of the recipe, we've used a simple index structure that consists of a document identifier, its name, and description. Our data is very simple as well; it just contains two documents.
In the first query, the document with the identifier 1 is placed at the top of the results, because its match in the heavily boosted name field dominates the score. What we would like to achieve is to keep the boost on the name field while also having the documents that contain all the words from the query, and the documents in which those words appear close to each other, at the top of the results.
In order to do this, we've defined a new request handler named /better, which will leverage the local params. The first thing is the defined q parameter, which is the standard query. It uses the Extended DisMax parser (the {!edismax part of the query), and defines several additional parameters:
qf: This defines the fields against which edismax should run the query. We tell Solr that we will provide the fields in the qfQuery parameter by using the $qfQuery value.
mm: This is the "minimum should match" parameter, which tells edismax how many words from the query must be found in a document for the document to be considered a match. We tell Solr that we will provide the value in the mmQuery parameter by using the $mmQuery value.
pf: This is the phrase fields definition, which specifies the fields on which Solr should generate phrase queries automatically. As with the previous parameters, we will provide the fields in the pfQuery parameter by using the $pfQuery value.
bq: This is the boost query that will be used to boost the documents. Again, we use the parameter dereferencing functionality and tell Solr that we will provide the value in the boostQuery parameter by using the $boostQuery value.
v: This is the final parameter, which specifies the content of the query; in our case, the user query will be specified in the mainQuery parameter.
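To make the parameter dereferencing described above more tangible, here is a toy model of the substitution. It is not Solr code and does not reflect how Solr implements dereferencing internally; the function name dereference is ours, the values are copied from the /better handler defaults, and the bq part is left out to keep the output short.

```python
import re

# Values copied from the /better handler defaults in this recipe.
DEFAULTS = {
    "qfQuery": "name^100000 description",
    "mmQuery": "1",
    "pfQuery": "name description",
}

# The local params of the q parameter, without bq for brevity.
LOCAL_PARAMS = "{!edismax qf=$qfQuery mm=$mmQuery pf=$pfQuery v=$mainQuery}"


def dereference(template, params):
    """Replace each $name with its quoted value (toy model, not Solr code)."""
    def substitute(match):
        return "'" + params[match.group(1)] + "'"
    return re.sub(r"\$(\w+)", substitute, template)


if __name__ == "__main__":
    # mainQuery is what the user sends; the rest comes from the defaults.
    request = dict(DEFAULTS, mainQuery="solr book")
    print(dereference(LOCAL_PARAMS, request))
```

The point of the indirection is that the user only has to send mainQuery, while the fields, boosts, and mm values stay in the handler configuration.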
Basically, this q definition says that we will use the edismax query parser together with phrase and boost queries. Now let's discuss the values of the parameters.
The first thing is the qfQuery parameter, which plays the same role as the qf parameter in the first query we sent to Solr (the handler uses a larger boost on the name field, 100000 instead of 10000). Using it, we just specify the fields that we want to be searched and their boosts. Next, we have the mmQuery parameter set to 1, which will be used as mm in edismax; this means that a document will be considered a match when at least a single word from the query is found in it. As you will remember, the pfQuery parameter value will be passed to the pf parameter, and thus phrase queries will be automatically made on the fields listed in it.
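The effect of mm on our two sample documents can be illustrated with a small toy model. It is a deliberate simplification: the tokenization below is a stand-in for whatever analysis your text field type performs, only plain positive mm values such as 1 or 100% are handled, and all function names are ours.

```python
import re

# The two sample documents from this recipe.
DOCS = {
    "1": {"name": "Solr and all the others",
          "description": "This is about Solr"},
    "2": {"name": "Lucene and all the others",
          "description": "This is a book about Solr and Lucene"},
}


def words(text):
    """Lowercase and split on non-word characters (stand-in for Solr analysis)."""
    return set(re.findall(r"\w+", text.lower()))


def matched_count(query, doc):
    """Count the query words found in any field of the document."""
    doc_words = set()
    for value in doc.values():
        doc_words |= words(value)
    return len(words(query) & doc_words)


def required_count(query, mm):
    """Translate a simple mm value ('1' or '100%') into a number of words."""
    total = len(words(query))
    if mm.endswith("%"):
        return total * int(mm[:-1]) // 100
    return min(int(mm), total)


def is_match(query, doc, mm):
    return matched_count(query, doc) >= required_count(query, mm)


if __name__ == "__main__":
    for doc_id, doc in DOCS.items():
        print(doc_id,
              is_match("solr book", doc, "1"),
              is_match("solr book", doc, "100%"))
```

With mm set to 1 both documents match, while with 100% only the document with the identifier 2 does, because it is the only one that contains both solr and book.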
Now for the last and probably the most important part of the query: the boostQuery parameter, which specifies the value that will be passed to the bq parameter. Our boost query is very similar to our main query; however, we say that it should only match the documents that have all the words from the query (the mm=100% parameter). We also specify that the documents that match that query should be boosted, by adding the ^100000 part at the end of it.
To sum up, the parameters of our query promote the documents that have all the words from the query present in the fields we search on. In addition to this, they promote the documents that have phrases matched. So finally, let's look at how the newly created handler works. As you can see, when we provide our query to it with the mainQuery parameter, the document that was previously second (the one with the identifier 2) is now placed first. So, we have achieved what we wanted.