Book Image

Apache Solr for Indexing Data

Book Image

Apache Solr for Indexing Data

Overview of this book

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help fetch relevant information from various sources and documentation. Solr also combines with other open source tools such as Apache Tika and Apache Nutch to provide more powerful features. This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, to give you a better understanding of Solr indexing. You’ll quickly move on to indexing text and boosting the indexing time. Next, you’ll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler. Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we’ll help you set up a cluster of Solr servers that combine fault tolerance and high availability. You will also gain insights into working scenarios of different aspects of Solr and how to use Solr with e-commerce data. By the end of the book, you will be competent and confident working with indexing and will have a good knowledge base to efficiently program elements.
Table of Contents (18 chapters)
Apache Solr for Indexing Data
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Cores in Solr (Multicore Solr)


Solr cores make it possible to run multiple indexes with different configurations and schemas in a single Solr instance. The multicore feature of Solr helps in unified administration of Solr instances for complete and different applications. Cores in Solr are fairly isolated and have their own configuration and schema files. This helps manage cores at runtime (create or remove) from a Solr instance without restarting the process.

Cores in Solr are managed through a configuration file called solr.xml. The solr.xml file is present in your Solr Home directory. Since its inception, solr.xml has evolved from configuring one core to managing multiple cores and eventually defining parameters for SolrCloud. Do not worry much about SolrCloud if you are not aware of it, as we have a dedicated chapter that covers SolrCloud in detail. In brief, SolrCloud is a terminology used in distributed search and indexing. When we need to index huge amounts of data, we need to think of scalability and performance. This is where SolrCloud comes into the picture.

Starting from Solr 4.3, Solr will maintain two distinct formats for solr.xml; one is legacy and the other is discovery mode. The legacy format will be supported until the 4.x.0 series and it will be deprecated in the 5.0 release of Solr. The default solr.xml config file looks something like this:

<solr>

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory"
    class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>

</solr>

The preceding configuration shows that Solr configurations are SolrCloud friendly, but this does not mean that Solr is running in SolrCloud mode, unless you start Solr with some special parameters (explained in the SolrCloud Chapter 10, Distributed Indexing). To configure multiple cores in Solr in legacy format, you need to edit the solr.xml file with the following code snippet and remove the existing discovery code from solr.xml:

<solr persistent="false">
    <cores adminPath="/admin/cores" defaultCoreName="core1">
    <core name="core1" instanceDir="core1" />
    <core name="core2" instanceDir="core2" />
  </cores>
</solr>

Now you need to create two cores (new directories, core1 and core2) in the Solr directory. You also need to create Solr configuration files for new cores. To do this, just copy the same configuration files (the conf directory in collections1) in both cores for now and restart the Solr server after you have made these settings.

Once you restart the Solr server with the preceding configuration, two cores will be created, with names core1 and core2 and the existing default Solr configuration settings. The instanceDir variable defines the directory name relative to solr.xml—where to look for configuration and data files. You can modify the paths of these cores according to your wishes and the configuration files according to your use case. You can also change the names of the cores.

You can verify your settings by opening the following URL in your browser: http://localhost:8983/solr/.

You will see two new cores created in the Solr dashboard. Currently, there is no document in any of the cores because we have not indexed any data so far. So, this concludes the process of creating multiple cores in Solr.