There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem – how to fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.
For the purpose of this task we will be using Version 1.5.1 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.
Let's assume that the website we want to fetch and index is http://lucene.apache.org.
First of all we need to install Apache Nutch. To do that we just need to extract the downloaded archive to the directory of our choice; for example, I installed it in the /usr/share/nutch directory. Of course this is a single-server installation and it doesn't include the Hadoop filesystem, but for the purpose of this recipe it will be enough. This directory will be referred to as $NUTCH_HOME.

Then we open the $NUTCH_HOME/conf/nutch-default.xml file and set the value of http.agent.name to the desired name of our crawler (we've taken SolrCookbookCrawler as the name). It should look like the following code:

<property>
 <name>http.agent.name</name>
 <value>SolrCookbookCrawler</value>
 <description>HTTP 'User-Agent' request header.</description>
</property>
Now let's create empty directories called crawl and urls in the $NUTCH_HOME directory. After that we need to create the seed.txt file inside the created urls directory with the following contents:

http://lucene.apache.org
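The directory and seed file setup can be sketched as shell commands. As an assumption for the sketch, a temporary directory stands in for $NUTCH_HOME so the commands run anywhere; in a real setup you would work inside your actual Nutch installation directory:

```shell
# $NUTCH_HOME simulated with a temp dir for this sketch;
# in practice it would be e.g. /usr/share/nutch.
NUTCH_HOME=$(mktemp -d)

# One directory for crawl data, one for the seed list
mkdir "$NUTCH_HOME/crawl" "$NUTCH_HOME/urls"

# seed.txt lists the start URLs, one per line
echo 'http://lucene.apache.org' > "$NUTCH_HOME/urls/seed.txt"

cat "$NUTCH_HOME/urls/seed.txt"   # prints http://lucene.apache.org
```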
Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace the +. at the bottom of the file with +^http://([a-z0-9]*\.)*lucene.apache.org/, so the appropriate entry looks like the following:

+^http://([a-z0-9]*\.)*lucene.apache.org/
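To see what this filter accepts, we can test the same regular expression against a few candidate URLs with grep (the wiki subdomain URL below is purely illustrative):

```shell
# Only URLs matching the filter pattern pass through grep.
printf '%s\n' \
  'http://lucene.apache.org/' \
  'http://wiki.lucene.apache.org/core/' \
  'http://example.com/' \
  | grep -E '^http://([a-z0-9]*\.)*lucene.apache.org/'
# prints the two lucene.apache.org URLs; http://example.com/ is filtered out
```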
One last thing before fetching the data is Solr configuration.
We start with copying the index structure definition file (called schema-solr4.xml) from the $NUTCH_HOME/conf/ directory to your Solr installation configuration directory (which in my case was /usr/share/solr/collection1/conf/). We also rename the copied file to schema.xml.

We also create an empty stopwords_en.txt file in the same directory, or use the one provided with Solr if we want stop words removal.

Now we need to make two corrections to the schema.xml file we've copied:
The first one is the correction of the version attribute in the schema tag. We need to change its value from 1.5.1 to 1.5, so the final schema tag looks like this:

<schema name="nutch" version="1.5">
Then we change the boost field type (in the same schema.xml file) from string to float, so the boost field definition looks like this:

<field name="boost" type="float" stored="true" indexed="false"/>
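Both schema corrections can be scripted with sed. The snippet below demonstrates them on a minimal mock schema.xml rather than the real file (which lives in your Solr configuration directory); GNU sed's -i flag is assumed:

```shell
# Minimal mock of the buggy schema.xml shipped with Nutch 1.5.1
cat > schema.xml <<'EOF'
<schema name="nutch" version="1.5.1">
 <field name="boost" type="string" stored="true" indexed="false"/>
</schema>
EOF

# Correction 1: fix the schema version attribute
sed -i 's/version="1.5.1"/version="1.5"/' schema.xml
# Correction 2: change the boost field type from string to float
sed -i 's/name="boost" type="string"/name="boost" type="float"/' schema.xml

cat schema.xml
```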
Now we can start crawling and indexing by running the following command from the $NUTCH_HOME directory:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50
Depending on your Internet connection and your machine configuration, after some time you should see a message similar to the following one:
crawl finished: crawl-20120830171434
This means that the crawl is completed and the data was indexed to Solr.
After installing Nutch and Solr, the first thing we did was set our crawler's name. Nutch does not allow empty names, so we must choose one. The nutch-default.xml file defines many more properties than the one mentioned, but at this point that is the only one we need to change.
In the next step, we created two directories: one (crawl) that will hold the crawl data, and the second (urls) to store the addresses we want to crawl. The seed.txt file we created contains the addresses we want to crawl, one address per line.

The crawl-urlfilter.txt file contains the filters that will be used to check the URLs that Nutch crawls. In the example, we told Nutch to accept every URL that begins with http://lucene.apache.org.
The schema.xml file we copied from the Nutch configuration directory is prepared to be used when Solr handles the indexing. However, the one for Solr 4.0 is a bit buggy, at least in the Nutch 1.5.1 distribution, which is why we needed to make the changes previously mentioned.
We finally came to the point where we ran the Nutch crawl command. The first parameter (urls) points to the directory holding the addresses we want to crawl; because we didn't pass the -dir switch, Nutch stores the crawled data in an automatically named crawl-<timestamp> directory, as the finishing message shows. The -solr switch lets you specify the address of the Solr server that will be responsible for indexing the crawled data, and it is mandatory if you want the data indexed with Solr; we decided to index the data to Solr installed on the same server. The -depth parameter specifies how deep to go following the links; in our example, we go at most three links away from the main page. The -topN parameter specifies how many documents will be retrieved from each level, which we set to 50.
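As a back-of-the-envelope check, these two parameters bound the size of the crawl: each of the -depth rounds selects at most -topN URLs to fetch, so:

```shell
# Upper bound on pages fetched with -depth 3 -topN 50:
# each generate/fetch round picks at most topN URLs.
depth=3
topN=50
echo "$((depth * topN))"   # prints 150
```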
There is one more thing worth knowing when you start your journey in the land of Apache Nutch. The crawl command of the Nutch command-line utility has another option: it can be configured to run crawling with multiple threads. To achieve that, you add the following parameter:

-threads N

So if you would like to crawl with 20 threads, you should run the crawl command like so:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50 -threads 20
If you are seeking more information about Apache Nutch, please visit http://nutch.apache.org and go to the Wiki section.