Book Image

Apache Hadoop 3 Quick Start Guide

By : Hrishikesh Vijay Karambelkar
Book Image

Apache Hadoop 3 Quick Start Guide

By: Hrishikesh Vijay Karambelkar

Overview of this book

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be efficiently processed instead of using one large computer to store and process the data. This book will get you started with the Hadoop ecosystem, and introduce you to the main technical topics, including MapReduce, YARN, and HDFS. The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how the parallel programming paradigm, such as MapReduce, can solve many complex data processing problems. The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring. You will then learn about the Hadoop ecosystem, and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real time streaming using Apache Storm, and data analytics using Apache Spark. By the end of the book, you will be well versed with different configurations of the Hadoop 3 cluster.
Table of Contents (10 chapters)

Hadoop 3.0 releases and new features

Apache Hadoop development is happening on multiple tracks. The releases of 2.X, 3.0.X, and 3.1.X were simultaneous. Hadoop 3.X was separated from Hadoop 2.x six years ago. We will look at major improvements in the latest releases: 3.X and 2.X. In Hadoop version 3.0, each area has seen a major overhaul, as can be seen in the following quick overview:

  • HDFS benefited from the following:
    • Erasure code
    • Multiple secondary Name Node support
    • Intra-Data Node Balancer
  • Improvements to YARN include the following:
    • Improved support for long-running services
    • Docker support and isolation
    • Enhancements in the Scheduler
    • Application Timeline Service v.2
    • A new User Interface for YARN
    • YARN Federation
  • MapReduce received the following overhaul:
    • Task-level native optimization
    • Feature to device heap-size automatically
  • Overall feature enhancements include the following:
    • Migration to JDK 8
    • Changes in hosted ports
    • Classpath Isolation
    • Shell script rewrite and ShellDoc

Erasure Code (EC) is a one of the major features of the Hadoop 3.X release. It changes the way HDFS stores data blocks. In earlier implementations, the replication of data blocks was achieved by creating replicas of blocks on different node. For a file of 192 MB with a HDFS block size of 64 MB, the old HDFS would create three blocks and, if a cluster has a replication of three, it would require the cluster to store nine different blocks of data—576 MB. So the overhead becomes 200%, additional to the original 192 MB. In the case of EC, instead of replicating the data blocks, it creates parity blocks. In this case, for three blocks of data, the system would create two parity blocks, resulting in a total of 320 MB, which is approximately 66.67% overhead. Although EC achieves significant gain on data storage, it requires additional computing to recover data blocks in case of corruption, slowing down recovery with respect to the traditional way in old Hadoop versions.

A parity drive is a hard drive used in a RAID array to provide fault tolerance. A parity can be achieved with the Boolean XOR function to reconstruct missing data.

We have already seen multiple secondary Name Node support in the architecture section. Intra-Data Node Balancer is used to balance skewed data resulting from the addition or replacement of disks among Hadoop slave nodes. This balancer can be explicitly called from the HDFS shell asynchronously. This can be used when new nodes are added to the system.

In Hadoop v3, YARN Scheduler has been improved in terms of its scheduling strategies and prioritization between queues and applications. Scheduling can be performed among the most eligible nodes rather than one node at a time, driven by heartbeat reporting, as in older versions. YARN is being enhanced with abstract framework to support long-running services; it provides features to manage the life cycle of these services and support upgrades, resizing containers dynamically rather than statically. Another major enhancement is the release of Application Timeline Service v2. This service now supports multiple instances of readers and writes (compared to single instances in older Hadoop versions) with pluggable storage options. The overall metric computation can be done in real time, and it can perform aggregations on collected information. The RESTful APIs are also enhanced to support queries for metric data. YARN User Interface is enhanced significantly, for example, to show better statistics and more information, such as queue. We will be looking at it in Chapter 5, Building Rich YARN Applications and Chapter 6, Monitoring and Administration of a Hadoop Cluster.

Hadoop version 3 and above allows developers to define new resource types (earlier there were only two managed resources: CPU and memory). This enables applications to consider GPUs and disks as resources too. There have been new proposals to allow static resources such as hardware profiles and software versions to be part of the resourcing. Docker has been one of the most successful container applications that the world has adapted rapidly. In Hadoop version 3.0 onward, the experimental/alpha dockerization of YARN tasks is now made part of standard features. So, YARN can be deployed in dockerized containers, giving a complete isolation of tasks. Similarly, MapReduce Tasks are optimized (https://issues.apache.org/jira/browse/MAPREDUCE-2841) further with native implementation of Map output collector for activities such as sort and spill. This enhancement is intended to improve the performance of MapReduce tasks by two to three times.

YARN Federation is a new feature that enables YARN to scale over 100,000 of nodes. This feature allows a very large cluster to be divided into multiple sub-clusters, each running YARN Resource Manager and computations. YARN Federation will bring all these clusters together, making them appear as a single large YARN cluster to the applications. More information about YARN Federation can be obtained from this source.

Another interesting enhancement is migration to newer JDK 8. Here is the supportability matrix for previous and new Hadoop versions and JDK:

Releases Supported JDK
Hadoop 2.6.X JDK 6 onward
Hadoop 2.7.X/2.8.X/2.9.X JDK 7 onward
Hadoop 3.X JDK 8 onward

Earlier, applications often had conflicts due to the single JAR file; however, the new release has two separate jar libraries: server side and client side. This achieves isolation of classpaths between server and client jars. The filesystem is being enhanced to support various types of storage such as Amazon S3, Azure Data Lake storage, and OpenStack Swift storage. Hadoop Command-line interface has been renewed and so are the daemons/processes to start, stop, and configure clusters. With older Hadoop (version 2.X), the heap size for Java and other tasks was required to be set through the map/reduce.java.opts and map/reduce.memory.mb properties. With Hadoop version 3.X, the heap size is derived automatically. All of the default ports used for NameNode, DataNode, and so forth are changed. We will be looking at new ports in the next chapter. In Hadoop 3, the shell scripts are rewritten completely to address some long-standing defects. The new enhancement allows users to add build directories to classpaths; the command to change permissions and the owner of HDFS folder structure will be done as a MapReduce job.