Indexing data is one of the most crucial tasks in every Lucene and Solr deployment. When your data is not indexed properly, your search results will be poor. When the search results are poor, users are almost certain to be dissatisfied with the application that uses Solr. That's why we need our data to be prepared and indexed as well as possible.
On the other hand, preparing data is not an easy task. We have more and more data to handle, and we need to index multiple formats of data from multiple sources. Do we need to parse the data manually and prepare it in XML format? The answer is no: we can let Solr do that for us. This chapter concentrates on the indexing process and data preparation. It begins with how to index binary data such as PDF files, then shows how to use the Data Import Handler to fetch data from a database and index it with Apache Solr, and finally describes how to detect a document's language during indexing.