Sometimes indexing prepared text files (such as XML, CSV, JSON, and so on) is not enough. There are numerous situations where you need to extract data from binary files. For example, one of my clients wanted to index PDF files – actually their contents. To do that, we either need to parse the data in some external application or set up Solr to use Apache Tika. This task will guide you through the process of setting up Apache Tika with Solr.
In order to set up the extracting request handler, we need to follow these simple steps:
First, let's edit our Solr instance's solrconfig.xml file and add the following configuration:

```xml
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```

Next, create an extract folder anywhere on your system (I created that folder in the directory where Solr is installed) and place the apache-solr-cell-4.0.0.jar file from the dist directory in it (you can find it in the Solr distribution archive). After that, copy all the libraries from the contrib/extraction/lib/ directory to the extract directory you created before. In addition to that, we need the following entry added to the solrconfig.xml file:

```xml
<lib dir="../../extract" regex=".*\.jar" />
```
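The folder-and-copy steps above can be sketched in the shell as follows. This is only an illustration run against a throwaway mock of the distribution layout; in a real install you would run the mkdir and cp commands from your actual Solr 4.0.0 directory, and the tika-core.jar name here is just a stand-in for the many jars that live in contrib/extraction/lib/.

```shell
# Illustration only: build a throwaway mock of the Solr distribution layout.
# (In a real install these files come from the Solr 4.0.0 archive.)
demo=$(mktemp -d)
mkdir -p "$demo/dist" "$demo/contrib/extraction/lib"
touch "$demo/dist/apache-solr-cell-4.0.0.jar"
touch "$demo/contrib/extraction/lib/tika-core.jar"   # stand-in name for the Tika jars

# The actual recipe steps: create the extract folder and copy the jars into it.
mkdir -p "$demo/extract"
cp "$demo/dist/apache-solr-cell-4.0.0.jar" "$demo/extract/"
cp "$demo/contrib/extraction/lib/"*.jar "$demo/extract/"

ls "$demo/extract"
```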
And that's actually all that you need to do in terms of configuration.
To simplify the example, I decided to choose the following index structure (place it in the fields section of your schema.xml file):

```xml
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true" />
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true" />
```
To test the indexing process, I've created a PDF file named book.pdf using PDFCreator, which contained only the following text: This is a Solr cookbook. To index that file, I've used the following command:

```shell
curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "[email protected]"
```

You should see the following response:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
  </lst>
</response>
```
Binary file parsing in Solr is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents; it handles not only binary files but also HTML and XML files. To add a handler that uses Apache Tika, we need to add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example.
In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory we created. The dir attribute of the lib tag should point to the path of that directory. The regex attribute is a regular expression telling Solr which files in the directory to load.
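As a side note, if you'd rather not copy the jars at all, lib directives can point straight at the distribution directories instead. A sketch, assuming the stock Solr 4 directory layout relative to the core's conf directory (adjust the relative paths to your setup):

```xml
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../dist/" regex="apache-solr-cell-.*\.jar" />
```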
Let's now discuss the default configuration parameters. The fmap.content parameter tells Solr to which field the content of the parsed document should be mapped; in our case, the parsed content will go to the field named text. The next parameter, lowernames, is set to true; this tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The value of this parameter will be prepended to the name of any such field returned from Tika before the document is sent to Solr. For example, if Tika returned a field named creator and we didn't have such a field in our index, Solr would index it under the field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index attributes of the Tika XHTML elements into separate fields named after those elements.
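To see the effect of these parameters after indexing, you can query for the document and list the captured fields. A sketch, assuming the example setup above is running on localhost (the exact attr_* field names depend on what metadata Tika finds in your file, so this command is illustrative rather than definitive):

```shell
curl "http://localhost:8983/solr/select?q=text:cookbook&fl=id,text,attr_*&wt=xml"
```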
Next, we have the command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier; it's useful to be able to pass one while sending the document, because most binary documents won't carry an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
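If you want to check what Tika extracts from a file without actually indexing anything, Solr Cell also accepts the extractOnly parameter, which makes the handler return the extracted content and metadata in the response instead of adding a document to the index. A sketch, assuming the handler configured above and a local Solr instance:

```shell
curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "[email protected]"
```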
To see more on how to index binary files, please refer to the Indexing PDF files and Extracting metadata from binary files recipes in Chapter 2, Indexing Your Data.