Mastering Hadoop 3

By: Chanchal Singh, Manish Kumar

Hadoop origins and timelines


Hadoop is changing the way people think about data, so we need to know what led to the origin of this magical innovation. Who developed Hadoop, and why? What problems existed before Hadoop, and how did it solve them? What challenges were encountered during its development? How did Hadoop evolve from version 1 to version 3? Let's walk through the origins of Hadoop and its journey to version 3.

Origins

In 1997, Doug Cutting, a co-founder of Hadoop, started working on Lucene, a full-text search library written entirely in Java. Lucene analyzes text and builds an index on it; an index is simply a mapping from terms to their locations, so it can quickly return all locations that match a particular search pattern. A few years later, Doug made the Lucene project open source; it got a tremendous response from the community and later became an Apache Software Foundation project.

Once Doug realized that enough people were looking after Lucene, he started focusing on indexing web pages. Mike Cafarella joined him to develop a product that could index web pages, and they named this project Apache Nutch. Nutch is also known as a subproject of Apache Lucene, since it uses the Lucene library to index the content of web pages. With hard work, they made good progress and deployed Nutch on a single machine that was able to index around 100 pages per second.

Scalability is something that people often don't consider while developing the initial versions of an application. This was also true of Doug and Mike: the number of web pages that could be indexed was limited to 100 million. To index more pages, they increased the number of machines. However, adding nodes created operational problems, because they did not have an underlying cluster manager to perform operational tasks. They wanted to focus on optimizing and developing a robust Nutch application without worrying about scalability issues.

Doug and Mike wanted a system with the following features:

  • Fault tolerance: The system should handle the failure of any machine automatically and in an isolated manner, so that the failure of one machine does not affect the entire application.
  • Load balancing: If one machine fails, its work should be redistributed automatically and fairly among the working machines.
  • Durability: Once data is written to disk, it should never be lost, even if one or two machines fail.

They spent a few months working on a system that could fulfill these requirements. Around the same time, however, Google published its Google File System paper. When they read it, they found that it described solutions to problems very similar to the ones they were trying to solve. They decided to build an implementation based on this research paper and started developing the Nutch Distributed File System (NDFS), which they completed in 2004.

With the help of the Google File System design, they solved the scalability and fault tolerance problems discussed previously, using the concepts of blocks and replication. Each file is split into 64 MB blocks (the size is configurable), and each block is replicated three times by default, so that if a machine holding one copy fails, the data can be served from another machine. This implementation solved the operational problems they had faced with Apache Nutch. A small illustration of these settings follows, and the next section explains the origin of MapReduce.
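
To make the block and replication idea concrete, here is a minimal sketch using the present-day HDFS Java client API (not the original NDFS code). The dfs.blocksize and dfs.replication properties and the FileSystem.create overload shown are standard Hadoop APIs, but the NameNode URI (hdfs://namenode:8020) and the file path are placeholders; also note that modern Hadoop defaults to 128 MB blocks rather than the 64 MB mentioned above.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side defaults: 64 MB blocks (as in early HDFS) and 3 replicas.
        conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024));
        conf.set("dfs.replication", "3");

        // hdfs://namenode:8020 is a placeholder for your NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Per-file override: create(path, overwrite, bufferSize, replication, blockSize).
        short replication = 3;
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(new Path("/tmp/example.txt"), true, 4096, replication, blockSize)) {
            out.writeUTF("Each file is split into blocks and each block is replicated.");
        }
    }
}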

MapReduce origin

Doug and Mike then started working on an algorithm that could process the data stored on NDFS. They wanted a system whose performance could be doubled simply by doubling the number of machines running the program. At the same time, Google published MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html).

The core ideas behind the MapReduce model are parallelism, fault tolerance, and data locality. Data locality means that the program is executed where the data is stored, instead of bringing the data to the program. MapReduce was integrated into Nutch in 2005. In 2006, Doug created a new incubating project consisting of HDFS (Hadoop Distributed File System, which grew out of NDFS), MapReduce, and Hadoop Common.
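
As an illustration of this model, the classic word count job counts how often each word appears in the input: the map tasks run in parallel next to the data blocks, and the reduce tasks aggregate their output. This is a minimal sketch against the standard org.apache.hadoop.mapreduce API rather than anything from the original Nutch code; the input and output paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map task runs on the node holding the input block (data locality)
    // and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce task receives all the counts for a given word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}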

At that time, Yahoo! was struggling with the performance of its backend search. Engineers at Yahoo! already knew the benefits of the Google File System and MapReduce implemented at Google. Yahoo! decided to adopt Hadoop and hired Doug to help its engineering team do so. In 2007, a few more companies started contributing to Hadoop, and Yahoo! reported that it was running 1,000-node Hadoop clusters.

NameNodes and DataNodes have specific roles in managing the overall cluster: the NameNode is responsible for maintaining metadata, while the DataNodes store the actual blocks. The MapReduce engine in Hadoop 1 consists of a JobTracker and TaskTrackers, and its scalability is limited to roughly 4,000 nodes and 40,000 concurrent tasks, because the overall work of scheduling and tracking is handled by the JobTracker alone. YARN was introduced in Hadoop version 2 to overcome these scalability issues and to separate resource management from job scheduling. It gave Hadoop a new lease of life, and Hadoop became a more robust, faster, and more scalable system.
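
As a small illustration of this separation, the sketch below uses the standard YarnClient API to ask the ResourceManager which NodeManagers it is tracking and what resources each one offers. It is only a hedged example, not taken from any Hadoop release: it assumes a reachable cluster whose configuration (yarn-site.xml) is available on the client's classpath.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterNodesExample {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager tracks every NodeManager and its resources;
        // scheduling is no longer handled by a single JobTracker.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> "
                    + node.getCapability().getMemorySize() + " MB, "
                    + node.getCapability().getVirtualCores() + " vcores");
        }
        yarnClient.stop();
    }
}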

Timelines

We will talk about MapReduce and HDFS in detail later. For now, let's go through the evolution of Hadoop, year by year:

2003

  • Research paper for the Google File System released

2004

  • Research paper for MapReduce released

2006

  • JIRA, mailing lists, and other project documentation created for Hadoop
  • Hadoop created by moving NDFS and MapReduce out of Nutch
  • Doug Cutting named the project Hadoop after his son's yellow toy elephant
  • Hadoop 0.1.0 released
  • 1.8 TB of data sorted on 188 nodes in 47.9 hours
  • 300 machines deployed at Yahoo! for the Hadoop cluster
  • Cluster size at Yahoo! increased to 600 machines

2007

  • Two clusters of 1,000 machines run by Yahoo!
  • Hadoop released with HBase
  • Apache Pig created at Yahoo!

2008

  • JIRA for YARN opened
  • Twenty companies listed on the Powered by Hadoop page
  • Web index at Yahoo! moved to Hadoop
  • 10,000-core Hadoop cluster used to generate Yahoo!'s production search index
  • First Hadoop Summit held
  • World record set for the fastest terabyte sort (1 TB in 209 seconds) on a 910-node Hadoop cluster, winning the terabyte sort benchmark
  • Yahoo! cluster loading 10 TB into Hadoop every day
  • Cloudera founded as a Hadoop distribution company
  • Google claimed to sort 1 TB in 68 seconds with its MapReduce implementation

2009

  • 24,000 machines across seven clusters run at Yahoo!
  • Hadoop set a record by sorting a petabyte of data
  • Yahoo! claimed to sort 1 TB in 62 seconds
  • Second Hadoop Summit held
  • Hadoop Core renamed Hadoop Common
  • MapR founded as a Hadoop distribution company
  • HDFS became a separate subproject
  • MapReduce became a separate subproject

2010

  • Kerberos authentication added to Hadoop
  • Stable version of Apache HBase released
  • Yahoo! ran 4,000 nodes to process 70 PB
  • Facebook ran a 2,300-node cluster to process 40 PB
  • Apache Hive released
  • Apache Pig released

2011

  • Apache ZooKeeper released
  • 200,000 lines of code contributed by Facebook, LinkedIn, eBay, and IBM
  • Hadoop won the top prize at the Media Guardian Innovation Awards
  • Hortonworks founded by Rob Bearden and Eric Baldeschwieler, core members of the Hadoop team at Yahoo!
  • 42,000 Hadoop nodes at Yahoo!, processing petabytes of data

2012

  • Hadoop community moved to integrate YARN
  • Hadoop Summit held in San Jose
  • Hadoop version 1.0 released

2013

  • Yahoo! deployed YARN in production
  • Hadoop version 2.2 released

2014

  • Apache Spark became an Apache top-level project
  • Hadoop version 2.6 released

2015

  • Hadoop version 2.7 released

2017

  • Hadoop version 2.8 released

Hadoop 3 introduces a few more important changes, which we will discuss in the upcoming sections of this chapter.