Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Indexing and searching web documents using Apache Solr


Apache Solr is an open source search platform that is part of the Apache Lucene project. It supports powerful full-text search, faceted search, dynamic clustering, database integration, rich document (for example, Word and PDF) handling, and geospatial search. In this recipe, we index the web pages crawled by Apache Nutch in Apache Solr and then use Solr to search through those pages.
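To give a feel for what searching through Solr looks like from a client, the following minimal Python sketch builds a query URL for Solr's HTTP select handler. The host, port, and core name (`collection1`) are assumptions based on a default local Solr 4.x example server, and `build_select_url` is a hypothetical helper name; none of these come from the recipe itself, so adjust them to your installation.

```python
from urllib.parse import urlencode

# Assumed location of a default local Solr 4.x example core (not from the recipe).
SOLR_BASE_URL = "http://localhost:8983/solr/collection1"


def build_select_url(query, rows=10, base_url=SOLR_BASE_URL):
    """Return a URL for Solr's /select handler that requests JSON results."""
    # q = the search query, rows = maximum number of hits, wt = response format.
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"


if __name__ == "__main__":
    # Prints the URL only; no request is sent to Solr.
    print(build_select_url("hadoop mapreduce"))
```

Once the crawled pages are indexed and Solr is running, opening such a URL in a browser or fetching it with `urllib.request.urlopen` should return the matching documents as JSON.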

Getting ready

  1. Crawl a set of web pages using Apache Nutch by following the Intradomain web crawling using Apache Nutch recipe.

  2. Make sure that JDK 1.7 or later is installed. Solr 4.8 and later versions require it.
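The JDK requirement above can be checked before installing Solr. The following is a minimal sketch, assuming the `java` executable on your `PATH` is the one Solr will use. The helper names are hypothetical, and the check only enforces the lower bound of 1.7; it does not say whether a much newer JDK works with a given Solr release.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Extract (major, minor) from `java -version` output, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def meets_solr_requirement(version_output):
    """Return True if the reported Java version is 1.7 or newer."""
    version = parse_java_version(version_output)
    return version is not None and version >= (1, 7)


def installed_java_ok():
    """Run `java -version` and check its banner; False if java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except OSError:
        return False
    # The version banner is normally written to stderr, so inspect both streams.
    return meets_solr_requirement(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java 1.7 or later found:", installed_java_ok())
```

Old-style version strings such as `1.7.0_80` and new-style ones such as `11.0.2` are both handled, because tuples compare element by element.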

How to do it...

The following steps show you how to index and search your crawled web pages dataset:

  1. Download and extract Apache Solr from http://lucene.apache.org/solr/. We use Apache Solr 4.10.3 for the examples in this chapter. From here on, we refer to the extracted directory as $SOLR_HOME.

  2. Replace the schema.xml file located under $SOLR_HOME/example/solr/collection1...