Learning YARN

By: Akhil Arora, Shrey Mehrotra, Shreyank Gupta

Overview of this book

Today, enterprises generate huge volumes of data. To provide effective services and make smarter decisions from this data, enterprises use big data analytics. In recent years, Hadoop has been used for massive data storage and efficient distributed processing of data. The Yet Another Resource Negotiator (YARN) framework solves the resource management design problems of the Hadoop 1.x framework by providing a more scalable, efficient, flexible, and highly available resource management framework for distributed data processing. This book starts with an overview of the YARN features and explains how YARN provides a business solution for growing big data needs. You will learn to provision and manage single-node as well as multi-node Hadoop-YARN clusters. You will walk through YARN administration, life cycle management, application execution, REST APIs, schedulers, the security framework, and so on. You will gain insights into YARN components and features such as ResourceManager, NodeManager, ApplicationMaster, Container, Timeline Server, High Availability, Resource Localization, and so on. The book explains Hadoop-YARN commands and the configuration of components, and explores topics such as High Availability, Resource Localization, and Log Aggregation. You will then be ready to develop your own ApplicationMaster and execute it over a Hadoop-YARN cluster. Toward the end of the book, you will learn about the security architecture and the integration of YARN with big data technologies such as Spark and Storm. This book promises conceptual as well as practical knowledge of resource management using YARN.

Preface

Today, enterprises generate huge volumes of data. To provide effective services and make smarter, more intelligent decisions from this data, enterprises use big data analytics. In recent years, Hadoop has been used for massive data storage and efficient distributed processing of data. The YARN framework solves design problems faced by the Hadoop 1.x framework by providing a more scalable, efficient, flexible, and highly available resource management framework for distributed data processing. It provides efficient scheduling algorithms and utility components for optimized use of the resources of a cluster with thousands of nodes, running millions of jobs in parallel.

In this book, you'll explore what YARN provides as a business solution for distributed resource management. You will learn to configure and manage single-node as well as multi-node Hadoop-YARN clusters. You will also learn about YARN components such as ResourceManager, NodeManager, ApplicationMaster, Container, and Timeline Server.

In subsequent chapters, you will walk through YARN application life cycle management, scheduling, and application execution over a Hadoop-YARN cluster. The book also gives a detailed explanation of features such as High Availability, Resource Localization, and Log Aggregation. You will learn to write and manage YARN applications with ease.

Toward the end, you will learn about the security architecture and integration of YARN with big data technologies such as Spark and Storm. This book promises conceptual as well as practical knowledge of resource management using YARN.

What this book covers

Chapter 1, Starting with YARN Basics, gives a theoretical overview of YARN, its background, and the need for it. This chapter starts with the limitations in Hadoop 1.x that led to the evolution of YARN as a resource management framework. It also covers the features provided by YARN, its architecture, and the advantages of using YARN as a cluster resource manager for a variety of batch and real-time frameworks.

Chapter 2, Setting up a Hadoop-YARN Cluster, provides a step-by-step process for setting up Hadoop-YARN single-node and multi-node clusters, explains the configuration of different YARN components, and gives an overview of YARN's web user interface.

Chapter 3, Administering a Hadoop-YARN Cluster, provides a detailed explanation of the administrative and user commands provided by YARN. It also provides how-to guides for configuring YARN and enabling log aggregation, auxiliary services, Ganglia integration, JMX monitoring, health management, and so on.

Chapter 4, Executing Applications Using YARN, explains the process of executing a YARN application over a Hadoop-YARN cluster and monitoring it. This chapter describes the application flow and how the components interact during application execution in a cluster.

Chapter 5, Understanding YARN Life Cycle Management, gives a detailed description of the internal classes involved and their core functionalities. It will help readers understand the state transitions of the services involved in a YARN application. It will also help in troubleshooting failures and examining the current application state.

Chapter 6, Migrating from MRv1 to MRv2, covers the steps and configuration changes required to migrate from MRv1 to MRv2 (YARN). It showcases the enhancements made in MRv2 scheduling and job management, and shows how to reuse MRv1 jobs in YARN. It also introduces the MRv2 components integrated with YARN, such as the MR Job History Server and the Application Master.

Chapter 7, Writing Your Own YARN Applications, describes the steps to write your own YARN applications. This includes Java code snippets that define the various application components and show their order of execution. It also includes a detailed explanation of the YARN API for creating YARN applications.

Chapter 8, Dive Deep into YARN Components, provides a detailed description of the various YARN components and their roles and responsibilities. It also gives an overview of additional features provided by YARN, such as resource localization, log management, auxiliary services, and so on.

Chapter 9, Exploring YARN REST Services, provides a detailed description of REST-based web services provided by YARN and how we can use the REST services in our applications.

Chapter 10, Scheduling YARN Applications, gives a detailed explanation of the schedulers and queues provided by YARN for better and more efficient scheduling of YARN applications. This chapter also covers the limitations of scheduling in Hadoop 1.x and how the new scheduling framework optimizes cluster resource utilization.

Chapter 11, Enabling Security in YARN, explains the component-level and application-level security provided by YARN. It also gives an overview of the YARN security architecture for interprocess and intercomponent communication and for token management.

Chapter 12, Real-time Data Analytics Using YARN, explains YARN adoption as a resource manager by various real-time analytics tools such as Apache Spark, Storm, and Giraph.

What you need for this book

The following software is required for this book:

  • Operating systems:

    • Any Linux operating system (Ubuntu or CentOS)

    • If you wish to use Windows, then you need to use Oracle VirtualBox to create a Linux VM on the Windows machine

  • Software Frameworks:

    • Java (1.6 or higher)

    • Apache Hadoop (2.5.1 or higher)

    • Apache Spark (1.1.1 or higher)

    • Apache Storm (0.9.2 or higher)

  • Development Environment:

    • Eclipse IDE for Java
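
As a convenience, the minimum versions listed above can be checked with a short script. The following is only a minimal sketch and not part of the book's example code: the names MINIMUM_VERSIONS, parse_version, and meets_minimum are hypothetical, and the sketch assumes that versions are plain dotted numbers (for example, 2.5.1). Version strings with suffixes, such as an underscore build number or a -SNAPSHOT tag, are not handled.

```python
# Minimum versions copied from the requirements list above.
# The dictionary keys are illustrative labels, not official tool names.
MINIMUM_VERSIONS = {
    "java": "1.6",
    "hadoop": "2.5.1",
    "spark": "1.1.1",
    "storm": "0.9.2",
}


def parse_version(text):
    """Turn a dotted version string such as '2.5.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(tool, installed):
    """Return True if `installed` is at least the minimum listed for `tool`."""
    required = parse_version(MINIMUM_VERSIONS[tool])
    found = parse_version(installed)
    # Pad the shorter tuple with zeros so that '1.6' and '1.6.0' compare as equal.
    width = max(len(required), len(found))
    required += (0,) * (width - len(required))
    found += (0,) * (width - len(found))
    return found >= required


if __name__ == "__main__":
    print(meets_minimum("hadoop", "2.7.0"))  # True
    print(meets_minimum("hadoop", "2.4.1"))  # False
```

Comparing tuples of integers rather than raw strings avoids the common mistake of treating 1.10.0 as older than 1.9.0.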

Who this book is for

Yet Another Resource Negotiator (YARN) is a resource management framework currently integrated with major big data technologies such as Hadoop, Spark, Storm, and so on. People working on big data can use YARN for real-time as well as batch-oriented data analysis. This book is intended for those who want to understand what YARN is and how efficiently it is used for resource management of large clusters. For cluster administrators, it gives a detailed explanation of how to provision and manage YARN clusters. If you are a Java developer or an open source contributor, this book will help you drill down into the YARN architecture, application execution phases, and application development in YARN. It also helps big data engineers explore YARN integration with real-time analytics technologies such as Spark, Storm, and so on. This book is a complete package for YARN, starting with YARN's basics and taking things forward to enable readers to create their own YARN applications and integrate them with other technologies.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This chapter uses the Apache tar.gz bundles for setting up Hadoop-YARN clusters and gives an overview of Hortonworks and Cloudera installations."

A block of code is set as follows:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <final>true</final>
</property>

Any command-line input or output is written as follows:

hdfs namenode -format

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "View the list of DataNodes connected to the NameNode."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail us and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.