Dealing with failures in distributed systems is challenging and time consuming. Hadoop and YARN run on commodity hardware, and cluster sizes today range from several nodes to several thousand nodes, so handling failure scenarios and ever-growing scaling issues is very important. This chapter focuses on failures in the YARN framework: what causes them and how to overcome them.
In this chapter, we will cover the following topics:
ResourceManager failures
ApplicationMaster failures
NodeManager failures
Container failures
Hardware failures
For each of these, we will examine the root causes of the failure and the solutions to it.