Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Introduction


In this chapter, we will discuss how to use Hadoop to process a dataset and understand its basic characteristics. More complex methods, such as data mining, classification, and clustering, are covered in later chapters.

This chapter shows how to calculate basic analytics for a given dataset. The recipes in this chapter use two datasets:

  • The NASA weblog dataset, available at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html, is a real-life dataset collected from the requests received by NASA web servers. A description of the structure of this data is available at the same link. A small extract of this dataset, suitable for testing, is available in the chapter5/resources folder of the code repository.

  • The e-mail archives of the Apache Tomcat developer mailing list, available from http://tomcat.apache.org/mail/dev/. These archives are in the MBOX format.
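
To give a feel for the first dataset before the recipes start, the following is a minimal sketch of how a single weblog line could be split into fields. It assumes the log lines follow the Common Log Format (host, identity, user, timestamp, quoted request, status code, and response size). The class name NasaLogLine, its field names, and the sample line are hypothetical and are not taken from the book's code repository.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the book's repository) that parses one
// weblog line, assuming the Common Log Format:
//   host ident user [timestamp] "request" status bytes
public class NasaLogLine {
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    public final String host;
    public final String timestamp;
    public final String request;
    public final int status;
    public final long bytes;

    private NasaLogLine(String host, String timestamp, String request,
                        int status, long bytes) {
        this.host = host;
        this.timestamp = timestamp;
        this.request = request;
        this.status = status;
        this.bytes = bytes;
    }

    // Returns the parsed entry, or null when the line does not match the
    // assumed format, so callers can skip malformed records.
    public static NasaLogLine parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // A size of "-" means no response body was sent; treat it as 0 bytes.
        long bytes = "-".equals(m.group(7)) ? 0L : Long.parseLong(m.group(7));
        return new NasaLogLine(m.group(1), m.group(4), m.group(5),
                               Integer.parseInt(m.group(6)), bytes);
    }

    public static void main(String[] args) {
        // Illustrative line in the assumed format, not copied from the dataset.
        String sample = "host.example.com - - [01/Jul/1995:00:00:01 -0400] "
                      + "\"GET /index.html HTTP/1.0\" 200 6245";
        NasaLogLine entry = parse(sample);
        System.out.println(entry.host + " " + entry.status + " " + entry.bytes);
    }
}
```

In a MapReduce job, a parser along these lines would typically be called from the map function, with lines that return null skipped or counted as malformed records.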

Note

The contents of this chapter are based on Chapter 6, Analytics, of the previous edition of this...