Hadoop MapReduce v2 Cookbook - Second Edition

Intradomain web crawling using Apache Nutch


Web crawling is the process of visiting and downloading all or a subset of the web pages on the Internet. Although the concept sounds simple, building a full-fledged crawler takes a great deal of work: it needs to be distributed, follow best practices such as not overloading servers and obeying robots.txt, perform periodic crawls, prioritize the pages to crawl, identify the many document formats found on the Web, and so on. Apache Nutch is an open source search engine that provides a highly scalable crawler, offering features such as politeness, robustness, and scalability.
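As a concrete illustration of these politeness and scoping concerns, the sketch below shows the two Nutch configuration files typically edited before a crawl. Assume a Nutch 1.x installation; the agent name and the example.com domain are placeholders, not values taken from this recipe:

    <!-- conf/nutch-site.xml: identify your crawler to web servers.
         Nutch refuses to fetch anything until http.agent.name is set.
         The value below is a placeholder. -->
    <property>
      <name>http.agent.name</name>
      <value>MyNutchCrawler</value>
    </property>

    # conf/regex-urlfilter.txt: replace the catch-all "+." rule at the
    # end of the file with a domain-scoped rule, so that an intradomain
    # crawl does not wander onto other sites (example.com is a placeholder).
    +^https?://([a-z0-9-]*\.)*example\.com/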

In this recipe, we are going to use Apache Nutch in standalone mode for small-scale intradomain web crawling. Almost all Nutch commands are implemented as Hadoop MapReduce applications, as you will notice when executing steps 10 to 18 of this recipe. Nutch standalone executes these applications using...
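Since the numbered steps themselves are not included in this excerpt, the following is a minimal sketch of one standalone Nutch 1.x crawl cycle; the urls and crawl directory names are conventional placeholders, and the seed URL is hypothetical:

    # Seed the crawl with a list of start URLs (one URL per line)
    $ mkdir urls
    $ echo "http://example.com/" > urls/seed.txt

    # Inject the seed URLs into a new CrawlDB
    $ bin/nutch inject crawl/crawldb urls

    # One fetch cycle: generate a segment of URLs to fetch, fetch it,
    # parse the fetched content, and merge the results into the CrawlDB
    $ bin/nutch generate crawl/crawldb crawl/segments -topN 100
    $ s1=$(ls -d crawl/segments/2* | tail -1)
    $ bin/nutch fetch $s1
    $ bin/nutch parse $s1
    $ bin/nutch updatedb crawl/crawldb $s1

    # Build the link database from the fetched segments
    $ bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Each of these commands launches a MapReduce application; in standalone mode they run through Hadoop's local job runner, so no cluster is required. Repeating the generate, fetch, parse, and updatedb sequence crawls one level deeper on each round.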