A very common source of data to search is content in web pages, either from the Internet or inside the firewall. The long-time popular solution for crawling and indexing web pages, especially for millions of them, is Nutch, a former Lucene subproject. If you need to scale to millions of pages or more, consider Nutch or Heritrix. For smaller scales, there are many simpler options, including ManifoldCF, which is discussed later.
Tip
What about Heritrix?
In the previous editions of the book, we highlighted Heritrix, a crawler sponsored by the Internet Archive that was arguably more scalable than Nutch. The output files from the crawler are used in the SolrJ example, and there is an example in /examples/9/heritrix-2.0.2/. However, Nutch has shown more development activity than Heritrix in the past couple of years, and thus we are focusing only on it in this edition.
Nutch is an Internet-scale web crawler similar to Google with components such...