Apache Solr for Indexing Data

Overview of this book

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help fetch relevant information from various sources and documentation. Solr also combines with other open source tools such as Apache Tika and Apache Nutch to provide more powerful features. This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, to give you a better understanding of Solr indexing. You’ll quickly move on to indexing text and boosting the indexing time. Next, you’ll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler. Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we’ll help you set up a cluster of Solr servers that combine fault tolerance and high availability. You will also gain insights into working scenarios of different aspects of Solr and how to use Solr with e-commerce data. By the end of the book, you will be competent and confident working with indexing and will have a good knowledge base to efficiently program elements.

Installing Apache Nutch


Apache Nutch comes in two versions (1.x and 2.x). For this example, we'll use version 1.x because it is available as a prebuilt binary, which saves the time we would otherwise spend building version 2.x from source. At the time of writing this book, the latest stable version of Apache Nutch is v1.10, which is also distributed as a binary. It can be installed by following these steps:

  1. Download the Apache Nutch binary archive (apache-nutch-1.10-bin.tar.gz) from http://nutch.apache.org/downloads.html.

  2. Extract the archive into a folder of your choice. We'll use %NUTCH_HOME% to refer to the folder that contains the extracted files.
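
The extraction step above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name extract_nutch and the destination folder are hypothetical, the archive is assumed to have been downloaded already, and the archive is assumed to unpack into a top-level folder named after the archive (apache-nutch-1.10 for this release).

```shell
#!/usr/bin/env bash
# Sketch: extract an already-downloaded Nutch binary archive and print
# the folder that can serve as NUTCH_HOME.
# extract_nutch is a hypothetical helper, not part of Nutch itself.

extract_nutch() {
  local archive="$1"   # path to apache-nutch-<version>-bin.tar.gz
  local dest="$2"      # folder to extract into (created if missing)

  mkdir -p "$dest" || return 1

  # -x extract, -z decompress gzip, -f archive file, -C target folder
  tar -xzf "$archive" -C "$dest" || return 1

  # Assumption: the top-level folder inside the archive is the archive
  # file name without the "-bin.tar.gz" suffix.
  local top="${archive##*/}"
  top="${top%-bin.tar.gz}"

  printf '%s\n' "$dest/$top"
}

# Usage, once the archive is in the current folder:
#   NUTCH_HOME="$(extract_nutch apache-nutch-1.10-bin.tar.gz "$HOME/tools")"
#   export NUTCH_HOME
#   ls "$NUTCH_HOME/bin"
```

Capturing the printed path in a NUTCH_HOME variable keeps later commands independent of where the archive was actually extracted.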

Note

On Windows, the nutch script needs a Unix-like shell environment to run. We can use Cygwin for this, which can be installed from http://cygwin.com/install.html.

Let's verify the extracted archive by going to %NUTCH_HOME%/bin. This folder contains the nutch script, which we can execute. Running the script without any arguments prints a list of the available options:

$ cd %NUTCH_HOME%/bin
$ ./nutch

We should get the following output from the command:

Usage...