Mastering Hadoop 3

By: Chanchal Singh, Manish Kumar

Overview of this book

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large volumes of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability and efficiency. With this guide, you’ll understand advanced concepts of the Hadoop ecosystem tools. You’ll learn how Hadoop works internally, study advanced concepts of different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. It will then walk you through HDFS, YARN, MapReduce, and Hadoop 3 concepts. You’ll be able to address common challenges like using Kafka efficiently, designing low-latency, reliable Kafka-based message delivery systems, and handling high data volumes. As you advance, you’ll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals. By the end of this book, you’ll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you’ll be equipped to tackle a range of real-world problems in data pipelines.

Hadoop distributions


Hadoop is an open-source project under the Apache Software Foundation, and most components in the Hadoop ecosystem are also open source. Many companies take these components and bundle them together into a complete distribution package that is easier to install, use, and manage. A Hadoop distribution offers the following benefits:

  • Installation: The distribution provides an easy way to install its components on a cluster, typically as RPM-like packages, along with a simple installation interface.
  • Packaging: A distribution bundles multiple open-source tools that are configured to work well together. Imagine installing and configuring each component separately on a multi-node cluster and then testing whether everything works correctly; if you miss a test scenario, the cluster may behave unexpectedly. A distribution spares you such problems because its components are tested together, and it also lets you install or upgrade components from its package library.
  • Maintenance: Maintaining a cluster and its components is also a challenging task, but these distribution packages make it much simpler. They provide a GUI for monitoring the health and status of each component, and they let you change the configuration to tune or maintain a component so that it performs well.
  • Support: Most distributions come with 24/7 support. This means that if you get stuck with any cluster- or distribution-related issue, you don't need to spend much effort finding resources to solve the problem; the distribution's support package assures you of technical support and help as and when needed.

On-premise distribution

There are many distributions available on the market; we will look at the most widely used ones:

  • Cloudera: Cloudera was founded in 2008, just when Hadoop started gaining popularity, and its open-source Hadoop distribution is the oldest one available. People at Cloudera are committed to the open-source community and have contributed to Hive, Impala, Hadoop, Pig, and other popular open-source projects. Cloudera packages a good set of tools together to provide a solid Hadoop experience, and it also provides a GUI to manage and monitor clusters, known as Cloudera Manager.
  • Hortonworks: Hortonworks was founded in 2011 and offers the Hortonworks Data Platform (HDP), an open-source Hadoop distribution. HDP is widely used in organizations and provides the Apache Ambari GUI-based interface to manage and monitor clusters (a minimal sketch of Ambari's REST API follows this list). Hortonworks contributes to many open-source projects such as Apache Tez, Hadoop, YARN, and Hive. Hortonworks has recently launched the Hortonworks Data Flow (HDF) platform for data ingestion and storage. The distribution also focuses on the security aspects of Hadoop, integrating security features such as Ranger, Kerberos, and SSL into the HDP and HDF platforms.
  • MapR: MapR was founded in 2009 and has its own filesystem, MapR-FS, which is quite similar to HDFS but adds some new features built by MapR. It boasts higher performance, ships a nice set of tools to manage and administer a cluster, and does not suffer from a single point of failure. It also offers some useful features, such as mirroring and snapshots.
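
Both Cloudera Manager and Ambari expose their management functions over a REST API as well as through the GUI. The following is a minimal sketch, not taken from the book, of polling Ambari's REST API for the state of each service in a cluster; the host name, cluster name, and credentials are placeholder assumptions for a default Ambari installation listening on port 8080 with basic authentication.

import requests

AMBARI_URL = "http://ambari-host.example.com:8080"  # placeholder Ambari server
CLUSTER_NAME = "demo_cluster"                       # placeholder cluster name
AUTH = ("admin", "admin")                           # default credentials; change for real clusters

def service_states():
    """Return a mapping of service name to state, for example HDFS -> STARTED."""
    response = requests.get(
        f"{AMBARI_URL}/api/v1/clusters/{CLUSTER_NAME}/services",
        auth=AUTH,
        params={"fields": "ServiceInfo/state"},  # ask Ambari for only the state field
    )
    response.raise_for_status()
    return {
        item["ServiceInfo"]["service_name"]: item["ServiceInfo"]["state"]
        for item in response.json()["items"]
    }

if __name__ == "__main__":
    for name, state in sorted(service_states().items()):
        print(f"{name}: {state}")

Cloudera Manager offers a comparable REST API, so the same kind of scripted health check can be run against either distribution.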


Cloud distributions

Cloud services offer cost-effective solutions in terms of infrastructure setup, monitoring, and maintenance, and a large number of organizations prefer to move their Hadoop infrastructure to the cloud. There are a few popular distributions available for the cloud:

  • Amazon's Elastic MapReduce: Amazon was already a major cloud infrastructure provider before it moved into Hadoop. It provides Elastic MapReduce (EMR) along with many other Hadoop ecosystem tools in its distribution, and its S3 filesystem serves as an alternative to HDFS. Amazon offers a cost-effective setup for Hadoop on the cloud, and EMR is currently the most actively used Hadoop-on-cloud distribution (a minimal boto3 sketch for launching an EMR cluster follows this list).
  • Microsoft Azure: Microsoft offers HDInsight as its Hadoop distribution. It also offers a cost-effective solution for setting up Hadoop infrastructure and for monitoring and managing cluster resources. Azure claims to provide a fully cloud-based cluster with a 99.9% service level agreement (SLA).
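
To illustrate how little setup a cloud distribution demands, the following sketch uses the AWS SDK for Python (boto3) to start a small EMR cluster that writes its logs to S3. It is only an illustrative sketch under assumed defaults: the region, bucket name, EMR release label, instance types, and IAM role names are placeholders you would replace for a real account.

import boto3

# Assumes AWS credentials are already configured (environment, profile, or IAM role).
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="hadoop-demo-cluster",
    ReleaseLabel="emr-6.3.0",                  # example EMR release; pick a current one
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    LogUri="s3://my-demo-bucket/emr-logs/",    # placeholder S3 bucket for cluster logs
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                    # one master and two core nodes
        "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster running after startup
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",         # default EMR instance profile
    ServiceRole="EMR_DefaultRole",             # default EMR service role
    VisibleToAllUsers=True,
)

print("Started EMR cluster:", response["JobFlowId"])

Once the cluster is running, jobs can read and write s3:// paths directly instead of HDFS, which is what makes S3 a practical storage alternative on EMR.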

Other big companies have also started providing Hadoop on the cloud, such as Google Cloud Platform, IBM BigInsights, and Cloudera's cloud offering. You may choose any distribution based on the feasibility and stability of its Hadoop tools and components. Most providers offer a one-year free trial with plenty of free credits for organizational use.