Imagine that the library on the corner that we used to go to wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that it can share them with online users. With all the samples provided by the suppliers came a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do this with the help of Apache Tika. This recipe will show you how to handle such a task.
Before going deeper into the task, please refer to the How to set up the extracting request handler recipe in Chapter 1, Apache Solr Configuration, which will guide you through the process of configuring Solr to use Apache Tika. We will use the same index structure and Solr configuration presented in that recipe, and I assume that you already have Solr properly configured (according to that recipe) and ready to work.