Apache Solr for Indexing Data

Overview of this book

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help you fetch relevant information from a variety of sources and documents. Solr also combines with other open source tools, such as Apache Tika and Apache Nutch, to provide more powerful features. This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, giving you a better understanding of Solr indexing. You’ll quickly move on to indexing text and improving indexing time. Next, you’ll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler. Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we’ll help you set up a cluster of Solr servers that combines fault tolerance and high availability. You will also gain insights into working scenarios for different aspects of Solr and learn how to use Solr with e-commerce data. By the end of the book, you will be competent and confident working with indexing and will have a good knowledge base for programming with Solr efficiently.

Cores in Solr (Multicore Solr)

Solr cores make it possible to run multiple indexes, each with its own configuration and schema, in a single Solr instance. The multicore feature gives you unified administration of indexes that serve entirely different applications. Cores are fairly isolated from one another and have their own configuration and schema files. This isolation makes it possible to manage cores at runtime (create or remove them) without restarting the Solr process.
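As a rough illustration of managing cores at runtime, the following Python sketch builds request URLs for Solr's CoreAdmin API, which accepts actions such as CREATE and UNLOAD. The base URL and the core name core3 are assumptions made for this example only, and the sketch just constructs the URLs without sending them. Note that, in the Solr 4.x releases discussed here, CREATE typically expects the instance directory and its conf folder to exist already.

```python
from urllib.parse import urlencode

# Assumed base URL; it matches the default port used elsewhere in this section.
SOLR_BASE = "http://localhost:8983/solr"


def core_admin_url(action, **params):
    """Build a CoreAdmin request URL for the given action and parameters."""
    query = urlencode({"action": action, **params, "wt": "json"})
    return f"{SOLR_BASE}/admin/cores?{query}"


# "core3" is a hypothetical core name used only for illustration.
create_url = core_admin_url("CREATE", name="core3", instanceDir="core3")
unload_url = core_admin_url("UNLOAD", core="core3")

print(create_url)
print(unload_url)
```

Opening either URL in a browser (or with any HTTP client) against a running Solr instance would issue the request, so treat the UNLOAD URL with care on a server that holds real data.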

Cores in Solr are managed through a configuration file called solr.xml, which lives in your Solr home directory. Since its inception, solr.xml has evolved from configuring one core to managing multiple cores and eventually defining parameters for SolrCloud. Do not worry if you are not yet familiar with SolrCloud; a dedicated chapter covers it in detail. In brief, SolrCloud is the name for Solr's distributed search and indexing capabilities. When we need to index huge amounts of data, we have to think about scalability and performance, and this is where SolrCloud comes into the picture.

Starting from Solr 4.3, Solr maintains two distinct formats for solr.xml: legacy and discovery mode. The legacy format is supported throughout the 4.x series and is no longer supported as of the 5.0 release. The default solr.xml config file, which uses the discovery format, looks something like this:

<solr>

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory"
    class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>

</solr>

The preceding configuration shows that the default Solr configuration is SolrCloud friendly, but this does not mean that Solr is running in SolrCloud mode; for that, you must start Solr with some special parameters (explained in Chapter 10, Distributed Indexing). To configure multiple cores in the legacy format, replace the existing discovery-mode content of solr.xml with the following snippet:

<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="core1">
    <core name="core1" instanceDir="core1" />
    <core name="core2" instanceDir="core2" />
  </cores>
</solr>

Now you need to create the two cores as new directories, core1 and core2, in the Solr home directory. Each new core also needs its own Solr configuration files. For now, just copy the same configuration files (the conf directory in collection1) into both core directories, and restart the Solr server after you have made these changes.
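The directory setup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name create_core_dirs is made up for this example, and the demonstration runs against a throwaway temporary directory with stand-in configuration files rather than a real Solr home.

```python
import shutil
import tempfile
from pathlib import Path


def create_core_dirs(solr_home, core_names, template_core="collection1"):
    """Create one directory per core and copy the template core's conf into it."""
    solr_home = Path(solr_home)
    template_conf = solr_home / template_core / "conf"
    created = []
    for name in core_names:
        core_conf = solr_home / name / "conf"
        # copytree also creates the missing parent directory (core1, core2).
        # It raises FileExistsError if the conf directory is already there.
        shutil.copytree(template_conf, core_conf)
        created.append(core_conf)
    return created


# Demonstration on a temporary stand-in for the Solr home directory.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    conf = home / "collection1" / "conf"
    conf.mkdir(parents=True)
    (conf / "solrconfig.xml").write_text("<config/>")  # placeholder content
    (conf / "schema.xml").write_text("<schema/>")      # placeholder content

    create_core_dirs(home, ["core1", "core2"])
    copied = sorted(
        p.relative_to(home).as_posix() for p in home.glob("core[12]/conf/*")
    )

print(copied)
```

To use it for real, you would pass the path of your own Solr home directory instead of the temporary one, and then restart Solr as described.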

Once you restart the Solr server with the preceding configuration, two cores named core1 and core2 will be created, both using the existing default Solr configuration settings. The instanceDir attribute defines the directory, relative to solr.xml, in which Solr looks for a core's configuration and data files. You can change the paths of these cores as you wish and adapt the configuration files to your use case. You can also change the names of the cores.
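To make the role of instanceDir more concrete, here is a small Python sketch that reads the legacy solr.xml shown earlier and resolves each core's instanceDir against the directory that holds solr.xml. The Solr home path /opt/solr/example/solr is a hypothetical value chosen for illustration, and the function only mimics the path resolution; it is not how Solr itself is implemented.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# The legacy multicore configuration from this section.
LEGACY_SOLR_XML = """<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="core1">
    <core name="core1" instanceDir="core1" />
    <core name="core2" instanceDir="core2" />
  </cores>
</solr>"""


def resolve_instance_dirs(solr_xml_text, solr_home):
    """Map each core name to its instanceDir resolved against solr_home."""
    root = ET.fromstring(solr_xml_text)
    home = PurePosixPath(solr_home)
    return {
        core.get("name"): str(home / core.get("instanceDir"))
        for core in root.iter("core")
    }


# "/opt/solr/example/solr" is an assumed location of solr.xml.
core_dirs = resolve_instance_dirs(LEGACY_SOLR_XML, "/opt/solr/example/solr")
for name, path in core_dirs.items():
    print(name, "->", path)
```

Parsing the file this way is also a quick check that an edited solr.xml is still well-formed before you restart the server.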

You can verify your settings by opening the following URL in your browser: http://localhost:8983/solr/.

You will see the two new cores in the Solr dashboard. Neither core contains any documents yet because we have not indexed any data so far. This concludes the process of creating multiple cores in Solr.
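If you prefer to check the cores from a script rather than the dashboard, the CoreAdmin STATUS action returns a summary of every loaded core. The Python sketch below assumes the JSON response contains a status map keyed by core name with an index.numDocs field per core; the SAMPLE_RESPONSE shown is a heavily abbreviated, hypothetical payload used so that the example runs without a live server.

```python
import json
from urllib.request import urlopen

# Assumed URL: default port plus the adminPath configured in solr.xml.
STATUS_URL = "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"


def summarize_cores(status_payload):
    """Map each core name to its document count from a STATUS response."""
    return {
        name: info.get("index", {}).get("numDocs")
        for name, info in status_payload.get("status", {}).items()
    }


def fetch_core_summary(url=STATUS_URL):
    """Query a running Solr instance; requires the server to be reachable."""
    with urlopen(url) as response:
        return summarize_cores(json.load(response))


# Hypothetical, abbreviated response used instead of a live request.
SAMPLE_RESPONSE = {
    "status": {
        "core1": {"name": "core1", "index": {"numDocs": 0}},
        "core2": {"name": "core2", "index": {"numDocs": 0}},
    }
}

summary = summarize_cores(SAMPLE_RESPONSE)
print(summary)
```

Against a running server, calling fetch_core_summary() should list both cores with a document count of zero until some data has been indexed.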
