Mastering Hadoop 3

By Chanchal Singh and Manish Kumar

Overview of this book

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large volumes of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability and increased efficiency. This guide walks you through HDFS, YARN, MapReduce, and other Hadoop 3 concepts. You'll learn how Hadoop works internally, study advanced concepts of the different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. You'll be able to address common challenges such as using Kafka efficiently, designing low-latency, reliable message delivery systems with Kafka, and handling high data volumes. As you advance, you'll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals. By the end of this book, you'll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you'll be equipped to tackle a range of real-world problems in data pipelines.

Preface

In this book, we will examine advanced concepts of the Hadoop ecosystem and build high-performance Hadoop data pipelines with security, monitoring, and data governance.

We will also build enterprise-grade applications using Apache Spark and Flink. This book teaches the internal workings of Hadoop, including building solutions to some real-world use cases. We will master the best practices for enterprises using Hadoop 3 as a data platform, including authorization and authentication. We will also learn how to model data in Hadoop, gain an in-depth understanding of distributed computing using Hadoop 3, and explore the different batch data-processing patterns.

Lastly, we will understand how components in the Hadoop ecosystem can be integrated effectively to implement a fast and reliable big data pipeline.

Who this book is for

If you want to become a big data professional by mastering the advanced concepts of Hadoop, this book is for you. You'll also find this book useful if you're a Hadoop professional looking to strengthen your knowledge of the Hadoop ecosystem. Fundamental knowledge of the Java programming language and of the basics of Hadoop is necessary to get started with this book.

What this book covers

Chapter 1, Journey to Hadoop 3, introduces the main concepts of Hadoop and outlines its origin. It further focuses on the features of Hadoop 3. This chapter also provides a logical overview of the Hadoop ecosystem and different Hadoop distributions.

Chapter 2, Deep Dive into the Hadoop Distributed File System, focuses on the Hadoop Distributed File System and its internal concepts. It covers HDFS operations in depth, introduces you to the new functionality added to HDFS in Hadoop 3, and covers HDFS caching and HDFS Federation in detail.
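
To give a flavor of the HDFS operations covered there, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode URI, file path, and class name are placeholders of our own choosing, and the snippet assumes a reachable HDFS instance:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder URI: point this at your own NameNode.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            // Create (or overwrite) a file and write a short record to it.
            Path path = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS!");
            }

            // Confirm the file exists and report its length in bytes.
            System.out.println("Exists: " + fs.exists(path));
            System.out.println("Length: " + fs.getFileStatus(path).getLen());
            fs.close();
        }
    }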

Chapter 3, YARN Resource Management in Hadoop, introduces you to YARN's resource management framework. It focuses on the efficient scheduling of jobs submitted to YARN and provides a brief overview of the pros and cons of the schedulers available in YARN. It also covers the YARN features introduced in Hadoop 3, especially the YARN REST API, along with the architecture and internals of Apache Slider. It then focuses on Apache Tez, a distributed processing engine that helps us optimize applications running on YARN.
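
As a taste of the YARN REST API mentioned there, the following sketch queries the ResourceManager's cluster metrics endpoint over plain HTTP. The host, port, and class name are placeholders (8088 is the common ResourceManager web default):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class YarnClusterMetrics {
        public static void main(String[] args) throws Exception {
            // Placeholder address: replace with your ResourceManager host.
            URL url = new URL("http://localhost:8088/ws/v1/cluster/metrics");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");

            // Print the raw JSON response containing cluster-wide metrics.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            conn.disconnect();
        }
    }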

Chapter 4, Internals of MapReduce, introduces the distributed batch processing engine known as MapReduce. It covers some of the internal concepts of MapReduce and walks you through each step in detail. It then focuses on a few important parameters and some common patterns in MapReduce.
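
To illustrate the shape of the MapReduce steps walked through there, here is a minimal word-count mapper sketch against the org.apache.hadoop.mapreduce API (the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token in each input line; the framework
    // then shuffles these pairs to reducers, which sum the counts.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }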

Chapter 5, SQL on Hadoop, covers a few important SQL-like engines present in the Hadoop ecosystem. It starts with the details of the architecture of Presto and then covers some examples with a few popular connectors. It then covers the popular query engine, Hive, and focuses on its architecture and a number of advanced-level concepts. Finally, it covers Impala, a fast processing engine, and its internal architectural concepts in detail.

Chapter 6, Real-Time Processing Engines, focuses on the different engines available for processing, discussing each processing engine individually. It includes details on the internal workings of the Spark framework and the concept of Resilient Distributed Datasets (RDDs). An introduction to the internals of Apache Flink and Apache Storm/Heron is also a focal point of this chapter.
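
For a glimpse of the RDD abstraction covered there, here is a minimal sketch using Spark's Java API. It assumes Spark is on the classpath and runs in local mode; the class name is our own:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddSumExample {
        public static void main(String[] args) {
            // Local mode is enough to demonstrate the RDD abstraction.
            SparkConf conf = new SparkConf()
                    .setAppName("RddSumExample")
                    .setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Build an RDD from an in-memory list, transform it lazily,
                // and trigger execution with an action.
                JavaRDD<Integer> numbers =
                        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
                int sumOfSquares = numbers.map(n -> n * n)
                                          .reduce(Integer::sum);
                System.out.println("Sum of squares: " + sumOfSquares);
            }
        }
    }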

Chapter 7, Widely Used Hadoop Ecosystem Components, introduces you to a few important tools used on the Hadoop platform. It covers Apache Pig, used for ETL operations, and introduces you to a few of the internal concepts of its architecture and operations. It takes you through the details of Apache Kafka and Apache Flume. Apache HBase is also a primary focus of this chapter.
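
As a small taste of the Kafka coverage there, this sketch publishes a single message with Kafka's Java producer API; the broker address, topic name, and class name are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker address: replace with your own cluster.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Send one record to a placeholder topic and flush before exiting.
            try (KafkaProducer<String, String> producer =
                    new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key-1", "hello"));
                producer.flush();
            }
        }
    }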

Chapter 8, Designing Applications in Hadoop, starts with a few advanced-level concepts related to file formats. It then focuses on data compression and serialization concepts in depth, before covering concepts of data processing and data access and moving on to use case examples.

Chapter 9, Real-Time Stream Processing in Hadoop, focuses on designing and implementing real-time and microbatch-oriented applications in Hadoop. This chapter covers how to perform stream data ingestion, along with the role of message queues. It then explores some common stream data-processing patterns and low-latency design considerations, elaborating on these concepts with real-time and microbatch case studies.

Chapter 10, Machine Learning in Hadoop, covers how to design and architect machine learning applications on the Hadoop platform. It addresses some of the common machine learning challenges that you can face in Hadoop, and how to solve them. It walks through the different machine learning libraries and processing engines, covers some of the common steps involved in machine learning, and further elaborates with a case study.

Chapter 11, Hadoop in the Cloud, provides an overview of Hadoop operations in the cloud. It covers detailed information on how the Hadoop ecosystem looks in the cloud, how we should manage resources in the cloud, how we create a data pipeline in the cloud, and how we can ensure high availability across the cloud.

Chapter 12, Hadoop Cluster Profiling, covers tools and techniques for benchmarking and profiling the Hadoop cluster. It also examines aspects of profiling different Hadoop workloads.

Chapter 13, Who Can Do What in Hadoop, is about securing a Hadoop cluster. It covers the basics of Hadoop security. It further focuses on implementing and designing Hadoop authentication and authorization.

Chapter 14, Network and Data Security, is an extension to the previous chapter, covering some advanced concepts in Hadoop network and data security. It covers advanced concepts, such as network segmentation, perimeter security, and row/column level security. It also covers encrypting data in motion and data at rest in Hadoop.

Chapter 15, Monitoring Hadoop, covers the fundamentals of monitoring Hadoop. The chapter is divided into two major sections. One section concerns general Hadoop monitoring, and the remainder of the chapter discusses specialized monitoring for identifying security breaches.

To get the most out of this book

You won't need too much hardware to set up Hadoop. The minimum setup is a single machine / virtual machine, and the recommended setup is three machines.

It is better to have some hands-on experience of writing and running basic programs in Java, as well as some experience of using developer tools such as Eclipse.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the on-screen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Hadoop-3. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781788620444_ColorImages.pdf.

Code in action

Visit the following link to check out videos of the code being run: http://bit.ly/2XvW2SD

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2,nn3</value>
    </property>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2,nn3</value>
    </property>

Any command-line input or output is written as follows:

      hdfs dfsadmin -fetchImage /home/packt

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Note

Warnings or important notes appear like this.

Note

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.