In this chapter, we discuss how to use Hadoop to process a dataset and understand its basic characteristics. More complex methods such as data mining, classification, and clustering are covered in later chapters.
This chapter shows how to calculate basic analytics for a given dataset. The recipes in this chapter use two datasets:
The NASA weblog dataset, available at http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html, is a real-life dataset collected from the requests received by NASA web servers. A description of the structure of this data is available at the same link. A small extract of this dataset that can be used for testing is available inside the chapter5/resources folder of the code repository.
The list of e-mail archives of Apache Tomcat developers, available from http://tomcat.apache.org/mail/dev/. These archives are in the MBOX format.
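To get a feel for the NASA weblog data before running any Hadoop job, it can help to parse a single record by hand. The following is a minimal sketch, assuming each record follows the common web server log layout (host, timestamp in square brackets, quoted request, status code, and response size). The class name NasaLogLine, its fields, and the regular expression are hypothetical and are not taken from the code repository; check the dataset description at the link above before relying on them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the book's code repository.
public class NasaLogLine {
    // Assumed layout: host - - [timestamp] "request" status bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(.*)\" (\\d{3}) (\\d+|-)$");

    public final String host;
    public final String timestamp;
    public final String request;
    public final int status;
    public final long bytes;

    private NasaLogLine(String host, String timestamp, String request,
                        int status, long bytes) {
        this.host = host;
        this.timestamp = timestamp;
        this.request = request;
        this.status = status;
        this.bytes = bytes;
    }

    // Returns null for lines that do not match the assumed layout,
    // so that a caller can skip malformed records.
    public static NasaLogLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // A "-" in the size field is treated as zero bytes.
        long bytes = "-".equals(m.group(5)) ? 0L : Long.parseLong(m.group(5));
        return new NasaLogLine(m.group(1), m.group(2), m.group(3),
                Integer.parseInt(m.group(4)), bytes);
    }

    public static void main(String[] args) {
        // Illustrative record in the assumed layout.
        String sample = "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
                + "\"GET /history/apollo/ HTTP/1.0\" 200 6245";
        NasaLogLine entry = parse(sample);
        System.out.println(entry.host + " " + entry.status + " " + entry.bytes);
    }
}
```

Returning null for unmatched lines is a design choice made for this sketch: real-world logs usually contain a few malformed records, and skipping them is often simpler than failing the whole job.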
Note
The contents of this chapter are based on Chapter 6, Analytics, of the previous edition of this...