Apache Solr for Indexing Data

Overview of this book

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help fetch relevant information from various sources and documentation. Solr also combines with other open source tools such as Apache Tika and Apache Nutch to provide more powerful features. This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, to give you a better understanding of Solr indexing. You’ll quickly move on to indexing text and boosting the indexing time. Next, you’ll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler. Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we’ll help you set up a cluster of Solr servers that combine fault tolerance and high availability. You will also gain insights into working scenarios of different aspects of Solr and how to use Solr with e-commerce data. By the end of the book, you will be competent and confident working with indexing and will have a good knowledge base to efficiently program elements.

The Solr architecture and directory structure

In real-world scenarios, Solr runs alongside other applications on a web server. A typical example is an online store application. The store provides a user interface, a shopping cart, an item catalogue, and a way to make purchases, and it needs to store this information in some sort of database. Solr makes it easy to add search capability to the online store. To make data searchable, you need to feed it to Solr for indexing. Data can be fed to Solr in various ways and in various formats, such as .pdf, .doc, .txt, and so on. In the process of feeding data to Solr, you need to define a schema. A schema tells Solr about your data and how you want it indexed. Many factors need to be considered while feeding data, which we will discuss in detail in upcoming chapters.

Solr queries are RESTful, which means that a Solr query is just a simple HTTP request and the response is a structured document, mainly in XML, although it can also be JSON, CSV, or another format, depending on your requirements. A typical architecture of Solr in the real world looks something like this:

Do not worry if you do not understand the preceding diagram right now. We will cover every component related to indexing in detail. The purpose of this diagram is to give you a feel for the current architecture of Solr and how it works in the real world. If you look closely at the preceding diagram, you will find two XML files named schema.xml and solrconfig.xml. These are the two most important files in the Solr configuration and are considered the building blocks of Solr.
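To make the HTTP-based query interface described above more concrete, here is a minimal Python sketch that only builds the URL of a query request; it does not contact a server. The host and port (localhost:8983), the core name collection1, and the helper name build_select_url are assumptions chosen for illustration. The q, wt, and rows parameters select the query, the response format, and the number of results.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, response_format="json", rows=10):
    """Return the URL of a Solr select request (nothing is sent over the network)."""
    # q = the query, wt = response writer (xml, json, csv, ...), rows = result count
    params = {"q": query, "wt": response_format, "rows": rows}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Assumed local installation and core name; adjust both for your setup.
url = build_select_url("http://localhost:8983/solr", "collection1", "title:solr")
print(url)
```

Opening such a URL in a browser, or fetching it with any HTTP client, should return the matching documents in the requested format, provided a Solr instance with that core is actually running.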

Solr directory structure

Here's the directory layout of a typical Solr Home directory:

| + conf 
|     - schema.xml 
|     - solrconfig.xml 
|     - stopwords.txt
|     - synonyms.txt etc
| + data 
|     - index 
|     - spellchecker
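
As a quick sanity check of the layout above, the following Python sketch reports which of the two main configuration files are missing from a conf directory. The function name missing_conf_files is hypothetical, and the example runs against a throwaway temporary directory rather than a real Solr installation.

```python
import tempfile
from pathlib import Path

# The two configuration files this section treats as Solr's building blocks.
REQUIRED_CONF_FILES = ("schema.xml", "solrconfig.xml")


def missing_conf_files(core_dir):
    """Return the required files that are absent from <core_dir>/conf."""
    conf_dir = Path(core_dir) / "conf"
    return [name for name in REQUIRED_CONF_FILES if not (conf_dir / name).is_file()]


# Demonstration on a temporary directory that contains only schema.xml.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "conf"
    conf.mkdir()
    (conf / "schema.xml").write_text("<schema/>")
    missing = missing_conf_files(tmp)

print(missing)
```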

Let's get a brief understanding of solrconfig.xml and schema.xml here before we proceed further, as these are the building blocks of Solr (as stated earlier). We will cover them in detail in the next few chapters.

The solrconfig.xml file is the core configuration file of Solr, with most parameters affecting Solr itself directly. This file can be found in the solr/collection1/conf/ directory. When configuring Solr, you'll work with solrconfig.xml often. The file consists of a series of XML statements that set configuration values, and some of the most important configurations are:

Data directory (the directory where the index files are stored)
Request handlers (handle incoming HTTP requests)
Listeners
Request dispatchers (used to manage HTTP communications)
Admin web interface settings
Replication and duplication parameters

These are some of the important configurations defined in solrconfig.xml. This file is well commented; I would advise you to go through it from the start and read all the comments. You will get a very good understanding of the various components involved in the Solr configuration.
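
To give a feel for what these settings look like, the sketch below parses a heavily trimmed, illustrative solrconfig.xml fragment with Python's standard library and extracts the data directory and the request handlers. The fragment is an assumption made for this example, not the full file shipped with Solr, and summarize_config is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real solrconfig.xml contains many more settings.
SOLRCONFIG_FRAGMENT = """
<config>
  <dataDir>${solr.data.dir:}</dataDir>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
</config>
"""


def summarize_config(xml_text):
    """Return (data_dir, {handler_name: handler_class}) from a solrconfig document."""
    root = ET.fromstring(xml_text)
    data_dir = root.findtext("dataDir")
    handlers = {h.get("name"): h.get("class") for h in root.findall("requestHandler")}
    return data_dir, handlers


data_dir, handlers = summarize_config(SOLRCONFIG_FRAGMENT)
print(data_dir)
print(handlers)
```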

The second most important configuration file is schema.xml. This file can also be found in the solr/collection1/conf/ directory. As the name suggests, it defines the schema of the data (content) that you want to index and make searchable. Each unit of data is called a document in Solr terminology. The schema.xml file contains all the details about the fields that your documents can contain, and how these fields should be dealt with when adding documents to the index or when querying those fields. This file can be divided broadly into two sections:

The types section (the definitions of all types)
The fields section (the definitions of the document structure using types)

Each part of your document's structure should be defined as a field under the fields section. Let's say you have to define a book as a document in Solr with the fields isbn, title, author, and price. The schema will be as follows:

<field name="isbn" type="string" required="true" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="price" type="int" indexed="true" stored="true"/>

In the preceding schema, you see a type attribute, which defines the data type of the field. You can change the behavior of the field by changing the type. The multiValued attribute tells Solr that the field can hold multiple values, while the required attribute makes the field mandatory for creating a document. After the fields section ends, we need to declare which field is the unique key. In our case, it is isbn:

<uniqueKey>isbn</uniqueKey>
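
The effect of the required and multiValued attributes can be sketched in a few lines of Python. This is a simplified illustration of the rules described above, not how Solr itself validates documents: it reads the book schema (wrapped in an assumed minimal schema element), then checks a document for missing required fields and for lists in single-valued fields. The names load_fields and validate_document are hypothetical.

```python
import xml.etree.ElementTree as ET

# The book schema from this section, wrapped in a minimal <schema> element.
SCHEMA_FRAGMENT = """
<schema>
  <fields>
    <field name="isbn" type="string" required="true" indexed="true" stored="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="price" type="int" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>isbn</uniqueKey>
</schema>
"""


def load_fields(schema_xml):
    """Return ({field_name: rules}, unique_key) parsed from a schema document."""
    root = ET.fromstring(schema_xml)
    fields = {}
    for field in root.iter("field"):
        fields[field.get("name")] = {
            "required": field.get("required") == "true",
            "multi_valued": field.get("multiValued") == "true",
        }
    return fields, root.findtext("uniqueKey")


def validate_document(doc, fields):
    """Return a list of problems found in doc (empty if the document looks fine)."""
    problems = []
    for name, rules in fields.items():
        if rules["required"] and name not in doc:
            problems.append(f"missing required field: {name}")
    for name, value in doc.items():
        if name not in fields:
            problems.append(f"unknown field: {name}")
        elif isinstance(value, list) and not fields[name]["multi_valued"]:
            problems.append(f"field is not multiValued: {name}")
    return problems


fields, unique_key = load_fields(SCHEMA_FRAGMENT)
book = {"isbn": "0000000000", "title": "Example", "author": ["A", "B"], "price": 30}
print(unique_key, validate_document(book, fields))
```

The sample book passes because only author holds a list of values, which the schema allows through multiValued, and the required isbn field is present.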

The schema.xml file is also well commented. Again, I advise you to go through its comments; for a start, this will help you understand the various field types and data types in detail.
