Apache Solr is an open source search platform that is part of the Apache Lucene project. It supports powerful full-text search, faceted search, dynamic clustering, database integration, rich document (for example, Word and PDF) handling, and geospatial search. In this recipe, we are going to index the web pages crawled by Apache Nutch for use by Apache Solr and use Apache Solr to search through those web pages.
Crawl a set of web pages using Apache Nutch by following the Intradomain web crawling using Apache Nutch recipe.
Solr 4.8 and later versions require JDK 1.7 or later.
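Before installing Solr, it can help to confirm that the installed Java meets the JDK 1.7 requirement. The following is a minimal Python sketch, not part of the recipe itself: the function names `parse_java_version`, `meets_solr_requirement`, and `installed_java_banner` are hypothetical helpers, and the sketch assumes that `java -version` prints a banner containing `version "<major>.<minor>..."` to standard error, as common JDK builds do.

```python
import re
import subprocess


def parse_java_version(banner):
    """Return (major, minor) parsed from `java -version` banner text.

    Assumes the banner contains text such as: java version "1.7.0_79"
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError("unrecognized java -version output")
    # A banner with no minor component (for example "17") is treated as minor 0.
    return int(match.group(1)), int(match.group(2) or 0)


def meets_solr_requirement(banner, required=(1, 7)):
    """True if the parsed version is at least `required` (JDK 1.7 by default)."""
    return parse_java_version(banner) >= required


def installed_java_banner():
    """Run `java -version` and return its banner, which is written to stderr."""
    result = subprocess.run(
        ["java", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr
```

Calling `meets_solr_requirement(installed_java_banner())` on the machine where Solr will run returns `True` when the Java found on the `PATH` is version 1.7 or newer. Newer JDKs report versions such as `11.0.2`, which the tuple comparison also treats as satisfying the requirement.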
The following steps show you how to index and search your crawled web pages dataset:
Download and extract Apache Solr from http://lucene.apache.org/solr/. We use Apache Solr 4.10.3 for the examples in this chapter. From here on, we refer to the extracted directory as $SOLR_HOME.
Replace the schema.xml file located under $SOLR_HOME/examples/solr/collection1...