Cloudera Administration Handbook

By Rohit Menon

History of Apache Hadoop and its trends


We live in an era where almost everything around us generates some kind of data. A click on a web page is logged by the server. The flipping of channels while watching TV is captured by cable companies. A search on a search engine is logged. A patient's heartbeat in a hospital generates data. A single phone call generates data, which is stored and maintained by telecom companies. An order for pizza generates data. It is very difficult these days to find a process that doesn't generate and store data.

Why would any organization want to store data? The present and the future belong to those who hold onto their data and work with it to improve their current operations and innovate to generate newer products and opportunities. Data, and the creative use of it, is at the heart of organizations such as Google, Facebook, Netflix, Amazon, and Yahoo!. They have proven that data, combined with powerful analysis, helps build fantastic and powerful products.

Organizations have been storing data for several years now. However, that data has typically remained on backup tapes or drives; once archived on storage devices such as tapes, it was retrieved only in emergencies. Processing or analyzing this data efficiently to gain insight was very difficult. This is changing. Organizations now want to use this data to understand existing problems, seize new opportunities, and be more profitable. The study and analysis of these vast volumes of data has given birth to the term big data, a phrase often used to describe the importance of ever-growing data and the technologies applied to analyze it.

Big and small companies now understand the importance of data and are adding loggers to their operations with the intention of generating more data every day. This has given rise to a very important problem: the storage and efficient retrieval of data for analysis. With data growing at such a rapid rate, traditional tools for storage and analysis fall short. Though the cost per byte has dropped considerably and the capacity to store data has increased, disk transfer rates have not kept pace. This has been a bottleneck for processing large volumes of data. Data in many organizations has reached petabytes and is continuing to grow. Several companies have been working to solve this problem and have come out with a few commercial offerings that leverage the power of distributed computing. In this solution, multiple computers work together (a cluster) to store and process large volumes of data in parallel, thus making the analysis of large volumes of data possible.

Google, the Internet search engine giant, ran into issues when the data it acquired by crawling the Web grew to such large volumes that it was becoming increasingly difficult to process. Google had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.

GFS, or GoogleFS, is a filesystem created by Google that enables it to store large amounts of data easily across multiple nodes in a cluster. Once the data is stored, Google uses MapReduce, a programming model it developed, to process (or query) the data in GFS efficiently. The MapReduce programming model implements a parallel, distributed algorithm on the cluster in which the processing is sent to the location where the data resides; this generates results faster than moving the data to the processing, which can be a very time-consuming activity. Google found tremendous success with this architecture and released white papers on GFS in 2003 and MapReduce in 2004.

Around 2002, Doug Cutting and Mike Cafarella were working on Nutch, an open source web search engine, and faced scalability problems when trying to store the billions of web pages that Nutch crawled every day. In 2004, the Nutch team recognized that the GFS architecture was the solution to their problem and started working on an implementation based on the GFS white paper. They called their filesystem the Nutch Distributed File System (NDFS). In 2005, they also implemented MapReduce for NDFS based on Google's MapReduce white paper.

In 2006, the Nutch team realized that their implementations, NDFS and MapReduce, could be applied to more areas and could solve the problem of processing large data volumes. This led to the formation of a project called Hadoop. Under Hadoop, NDFS was renamed the Hadoop Distributed File System (HDFS). After Doug Cutting joined Yahoo! in 2006, Hadoop received a lot of attention within Yahoo! and became a very important system, running successfully on top of a very large cluster (around 1,000 nodes). In 2008, Hadoop became one of Apache's top-level projects.

So, Apache Hadoop is a framework written in Java that:

  • Is used for the distributed storage and processing of large volumes of data on a cluster that can scale from a single computer to thousands of computers

  • Uses the MapReduce programming model to process data (a minimal sketch of this model follows this list)

  • Stores and processes data on worker nodes (the nodes in the cluster responsible for storing and processing data) and handles hardware failures gracefully, providing high availability
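To make the MapReduce model concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. It is an illustrative example, not code from this book's cluster setup: the class names and the input and output HDFS paths passed on the command line are placeholders. The mapper runs close to each block of input data and emits a (word, 1) pair for every word it sees; the reducer receives all the pairs for a given word and sums them.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: invoked once per input record (here, one line of text),
      // typically on the worker node that holds that block of data in HDFS.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1)
          }
        }
      }

      // Reducer: receives every count emitted for a given word and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result); // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a JAR, a job like this would be submitted with the hadoop jar command, with the two arguments pointing at HDFS input and output directories. Note how the framework, not the programmer, handles splitting the input, scheduling map tasks near the data, and shuffling each word's counts to a single reducer.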

Apache Hadoop has made distributed computing accessible to anyone who wants to process large volumes of data without shelling out big bucks for commercial offerings. The success of Apache Hadoop implementations in organizations such as Facebook, Netflix, LinkedIn, Twitter, The New York Times, and many more has given Apache Hadoop much-deserved recognition and, in turn, given other organizations the confidence to make it a core part of their systems. By making the analysis of large data volumes possible, Hadoop has also given rise to many startups that build analytics products on top of it.