Apache Solr Enterprise Search Server - Third Edition

By: David Smiley, Eric Pugh, Kranti Parisa, Matt Mitchell

Overview of this book

Apache Solr is a widely popular open source enterprise search server that delivers powerful search and faceted navigation, features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, and it will take you from first exposure to development and deployment in no time. Even if you wish to use Solr 5, you should find the information to be just as applicable due to Solr's high regard for backward compatibility. The book also includes some useful information specific to Solr 5.

Nutch for crawling web pages


A very common source of data to search is content in web pages, either from the Internet or from inside the firewall. The long-time popular solution for crawling and indexing web pages, especially millions of them, is Nutch, a former Lucene subproject. If you need to scale to millions of pages or more, consider Nutch or Heritrix. For smaller scales, there are many simpler options, including ManifoldCF, which is discussed later.

Tip

What about Heritrix?

In the previous editions of the book, we highlighted Heritrix, a crawler sponsored by the Internet Archive that was arguably more scalable than Nutch. The output files from that crawler are used in the SolrJ example, and there is an example in /examples/9/heritrix-2.0.2/. However, Nutch has shown more development activity than Heritrix in the past couple of years, so this edition focuses only on Nutch.

Nutch is an Internet scale web crawler similar to Google with components such...