Apache Mesos Essentials

By: Dharmesh Kakadia

Overview of this book

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It allows developers to concurrently run the likes of Hadoop, Spark, Storm, and other applications on a dynamically shared pool of nodes. With Mesos, you have the power to manage a wide range of resources in a multi-tenant environment.

Starting with the basics, this book will give you an insight into all the features that Mesos has to offer. You will first learn how to set up Mesos in various environments, from data centers to the cloud. You will then learn how to implement a self-managed Platform as a Service environment with Mesos using various service schedulers, such as Chronos, Aurora, and Marathon. You will then delve into the depths of Mesos fundamentals and learn how to build distributed applications using Mesos primitives.

Finally, you will round things off by covering the operational aspects of Mesos, including logging, monitoring, high availability, and recovery.

Fault tolerance


Fault tolerance is an important requirement for a data center OS. The ability to keep functioning in the event of a failure becomes indispensable when working with large-scale systems. Mesos has no single point of failure in its architecture and can continue to operate when various entities fail. There are three kinds of faults that we have to deal with: machine failures, bugs in Mesos processes, and upgrades. Note that any of these faults can happen to any entity in Mesos. There are three main components of Mesos that need to be resilient to these faults: the master, the slave, and the framework.

If the machine running the Mesos slave fails, the master will notice and inform the frameworks about the slave failure event. The framework can choose to reschedule the tasks that were running on that slave onto other healthy slaves. Once the machine is fixed and the slave process is restarted in a healthy mode, it will reregister with the master and will, again...
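
The slave-failure flow described above can be illustrated with a small, self-contained Python sketch. It is a toy model rather than the Mesos scheduler API: the class `ToyFramework`, its methods `launch`, `on_slave_lost`, and `on_slave_reregistered`, and the least-loaded placement policy are all hypothetical names and choices introduced here for illustration. It only shows the bookkeeping a framework might do when told that a slave has failed.

```python
from collections import defaultdict


class ToyFramework:
    """Hypothetical framework-side bookkeeping; NOT the Mesos scheduler API."""

    def __init__(self, healthy_slaves):
        self.healthy_slaves = set(healthy_slaves)
        # slave id -> list of task ids currently placed on that slave
        self.tasks_by_slave = defaultdict(list)
        # tasks that could not be placed because no healthy slave was left
        self.pending = []

    def launch(self, task_id, slave_id):
        if slave_id not in self.healthy_slaves:
            raise ValueError(f"slave {slave_id} is not healthy")
        self.tasks_by_slave[slave_id].append(task_id)

    def on_slave_lost(self, slave_id):
        """Invoked when the master reports that a slave has failed."""
        self.healthy_slaves.discard(slave_id)
        orphaned = self.tasks_by_slave.pop(slave_id, [])
        moved = {}
        for task_id in orphaned:
            if not self.healthy_slaves:
                # Nowhere to run the task right now; keep it for later.
                self.pending.append(task_id)
                continue
            # Assumed policy: least-loaded healthy slave, ties broken by name.
            target = min(
                sorted(self.healthy_slaves),
                key=lambda s: len(self.tasks_by_slave[s]),
            )
            self.tasks_by_slave[target].append(task_id)
            moved[task_id] = target
        return moved

    def on_slave_reregistered(self, slave_id):
        """Invoked when a repaired slave reregisters with the master."""
        self.healthy_slaves.add(slave_id)


# Usage: slave s1 fails while running two tasks.
fw = ToyFramework(["s1", "s2", "s3"])
fw.launch("t1", "s1")
fw.launch("t2", "s1")
fw.launch("t3", "s2")
moved = fw.on_slave_lost("s1")
print(moved)  # {'t1': 's3', 't2': 's2'}
```

Rescheduling is the framework's decision, as the text notes. A framework could just as well drop the orphaned tasks or wait for the slave to come back; the least-loaded policy here is only one possible choice.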