Hadoop MapReduce v2 Cookbook - Second Edition: RAW


Whole web crawling with Apache Nutch using a Hadoop/HBase cluster


Crawling a large number of web documents can be done efficiently by utilizing the power of a MapReduce cluster.

Note

As of the Apache Nutch 2.2.1 release, the Nutch project has not officially migrated to Hadoop 2.x and still depends on Hadoop 1.x for whole web crawling. However, it is possible to execute Nutch jobs on a Hadoop 2.x cluster thanks to Hadoop's backward compatibility.

Nutch's HBaseStore integration further depends on HBase 0.90.6, which does not support Hadoop 2. Hence, this recipe works only with a Hadoop 1.x cluster. We look forward to a new Nutch release with full Hadoop 2.x support.

Getting ready

We assume you have already deployed your Hadoop 1.x and HBase clusters.

How to do it...

The following steps show you how to use Apache Nutch with a Hadoop MapReduce cluster and an HBase data store to perform large-scale web crawling:

  1. Make sure the hadoop command is accessible from the command line. If...
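A quick way to perform the check in step 1 is sketched below. The `check_cmd` helper is a hypothetical convenience function, not part of Hadoop or Nutch; it simply reports whether a given CLI is on the `PATH`, which is what the step asks you to verify for `hadoop` (and, for this recipe, `hbase`).

```shell
# check_cmd: report whether a given command-line tool is on the PATH.
# (Hypothetical helper for illustration; adjust paths for your install.)
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
    return 0
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# In this recipe's context you would run, for example:
#   check_cmd hadoop && hadoop version
#   check_cmd hbase  && hbase version
# If hadoop is missing, add $HADOOP_HOME/bin to your PATH first.
check_cmd sh
```

If `check_cmd hadoop` fails, export `HADOOP_HOME` to your Hadoop 1.x installation directory and prepend `$HADOOP_HOME/bin` to `PATH` before continuing.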