Hadoop is an open source software framework that supports data-intensive distributed applications available through the Apache Open Source community. It consists of a distributed file system HDFS, the Hadoop Distributed File System and an approach to distributed processing of analysis called MapReduce. It is written in Java and based on the Linux/Unix platform.
It's used (extensively now) in the processing of streams of data that go well beyond even the largest enterprise datasets in size. Whether it's sensor, clickstream, social media, location-based, or other data that is generated and collected in large gobs, Hadoop is often on the scene in the service of processing and analyzing it. The real magic of Hadoop is its ability to move the processing or computing logic to the data where it resides as opposed to traditional systems, which focus on a scaled-up single server, move the data to that central processing unit and process the data there. This model does not work on the volume, velocity, and variety of data that present day industry is looking to mine for business intelligence. Hence, Hadoop with its powerful fault tolerant and reliable file system and highly optimized distributed computing model, is one of the leaders in the Big Data world.
The core of Hadoop is its storage system and its distributed computing model:
Hadoop Distributed File System is a program level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.
MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: dividing the data across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set (called Reduce). Between Map and Reduce, shuffle and sort occur. Hadoop cluster, once successfully configured on a system, has the following basic components:
This is also called the Head Node/Master Node of the cluster. Primarily, it holds the metadata for HDFS during processing of data which is distributed across the nodes; it keeps track of each HDFS data block in the nodes.
This is an optional node that you can have in your cluster to back up the NameNode if it goes down. If a secondary NameNode is configured, it keeps a periodic snapshot of the NameNode configuration to serve as a backup when needed. However, there is no automated way for failing over to the secondary NameNode; if the primary NameNode goes down, a manual intervention is needed. This essentially means that there would be an obvious down time in your cluster in case the NameNode goes down.
The following figure shows you the core components of the Apache Hadoop framework:
Additionally, there are a number of supporting projects for Hadoop, each having their unique purpose for example, to feed input data to Hadoop system, a data warehousing system for ad hoc queries on top of Hadoop, and many more. The following are a few worth mentioning:
Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce, which allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) to fetch data from Hive store. This makes it easier for people with SQL skills to adapt to Hadoop environment quickly.
Pig is an alternative abstraction on MapReduce, which uses dataflow scripting language called PigLatin. This is favored by programmers who already have scripting skills. You can run PigLatin statements interactively in a command line Pig shell named Grunt. You can also combine a sequence of PigLatin statements in a script, which can then be executed as a unit. These PigLatin statements are used to generate MapReduce jobs by the Pig interpreter and are executed on the HDFS data.
Mahout is a machine-learning library that contains algorithms for clustering and classification. One major focus of machine-learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
The following figure gives you a 1000 feet view of the Apache Hadoop and the various supporting projects that form this amazing ecosystem:
We will be exploring some of these components in the subsequent chapters of this book, but for a complete reference, please visit the Apache website http://hadoop.apache.org/.
Setting up this ecosystem along with the required supporting projects could be really non-trivial. In fact the only drawback this implementation has, is the effort needed to set up and administer a Hadoop cluster. This is basically the reason that many vendors are coming up with their own distribution of Hadoop bundled and distributed as a data processing platform. Using these distributions, enterprises would be able to set up Hadoop clusters in minutes through simplified and user-friendly cluster deployment wizards and also use the various dashboards for monitoring and instrumentation purposes. Some of the present day distributions are CH4 from Cloudera, Hortonworks Data Platform, and Microsoft HDInsight, which are quickly gaining popularity. These distributions are outside the scope of this book and won't be covered; please visit the respective websites for detailed information about these distributions.