Web crawling is the process of visiting and downloading all or a subset of the web pages on the Internet. Although the concept of crawling, and even implementing a simple crawler, sounds simple, building a full-fledged crawler takes a great deal of work. Such a crawler needs to be distributed and has to follow best practices such as not overloading servers, obeying robots.txt directives, performing periodic crawls, prioritizing the pages to crawl, identifying many document formats, and so on. Apache Nutch is an open source search engine that provides a highly scalable crawler offering politeness, robustness, and scalability.
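To make the politeness requirement concrete, the following is a minimal sketch, independent of Nutch (which handles robots.txt internally), of how a crawler checks a site's robots.txt rules before fetching a page. It uses Python's standard library; the robots.txt content and URLs shown are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; a real crawler
# downloads this file from http://<host>/robots.txt before crawling.
robots_txt = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler skips disallowed paths and honors the crawl delay.
print(parser.can_fetch("*", "http://example.com/private/data.html"))  # False
print(parser.can_fetch("*", "http://example.com/docs/index.html"))    # True
print(parser.crawl_delay("*"))  # 5 (seconds to wait between requests)
```

A production crawler such as Nutch applies these checks per host across a distributed fetch, which is one reason a full-fledged implementation is substantially harder than this sketch suggests.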
In this recipe, we are going to use Apache Nutch in standalone mode for small-scale intra-domain web crawling. Almost all the Nutch commands are implemented as Hadoop MapReduce applications, as you will notice when executing steps 10 to 18 of this recipe. Nutch standalone executes these applications using...