Practical Site Reliability Engineering

By: Pethuru Raj Chelliah, Shreyash Naithani, Shailender Singh

Overview of this book

Site reliability engineering (SRE) is being touted as the most competent paradigm for establishing and ensuring next-generation, high-quality software solutions. This book starts by introducing you to the SRE paradigm and covers the need for highly reliable IT platforms and infrastructures. As you make your way through the next set of chapters, you will learn to develop microservices using Spring Boot and make use of RESTful frameworks. You will also learn about GitHub for deployment, containerization, and Docker containers. Practical Site Reliability Engineering teaches you to set up and sustain containerized cloud environments, and also covers architectural and design patterns, reliability implementation techniques such as reactive programming, and languages such as Ballerina and Rust. In the concluding chapters, you will become well-versed in service mesh solutions such as Istio and Linkerd, and understand service resilience test practices, API gateways, and edge/fog computing. By the end of this book, you will have gained experience working with SRE concepts and be able to deliver highly reliable apps and services.

Title Page

Dedication

About Packt

Contributors

Preface

Demystifying the Site Reliability Engineering Paradigm

Setting the context for practical SRE

Plunging into the SRE discipline

The need for highly reliable platforms and infrastructures

Reactive systems

Highly reliable IT infrastructures

The vitality of the SRE domain

Summary

Microservices Architecture and Containers

What are microservices?

Microservice design principles

Deploying microservices

Practical examples of microservice deployment

Microservices using Spring Boot and the RESTful framework

Jersey Framework

Representational State Transfer (REST)

Important facts about microservices

Summary

Microservice Resiliency Patterns

Briefing microservices and containers

IT reliability challenges and solution approaches

The promising and potential approaches for resiliency and reliability

Summary

DevOps as a Service

What is DaaS?

Collaboration with development and QA teams

Summary

Container Cluster and Orchestration Platforms

Resilient microservices

Application and volume containers

Clustering and managing containers

Container orchestration and management

Summary

Architectural and Design Patterns

Architecture pattern

Design pattern

Summary

Reliability Implementation Techniques

Ballerina programming

Reliability

Rust programming

Summary

Realizing Reliable Systems - the Best Practices

Reliable IT systems – the emerging traits and tips

MSA for reliable software

Service mesh solutions

Microservices design – best practices

Asynchronous messaging patterns for event-driven microservices

The role of EDA to produce reactive applications

Reliable IT infrastructures

Infrastructure as code

Summary

Service Resiliency

Delineating the containerization paradigm

Demystifying microservices architecture

Decoding the growing role of Kubernetes for the container era

Describing the service mesh concept

Why is service mesh paramount?

Service mesh architectures

Summary

Containers, Kubernetes, and Istio Monitoring

Prometheus

Grafana

Summary

Post-Production Activities for Ensuring and Enhancing IT Reliability

Modern IT infrastructure

Monitoring clouds, clusters, and containers

Cloud infrastructure and application monitoring

The monitoring tool capabilities

Prognostic, predictive, and prescriptive analytics

Log analytics

IT operational analytics

IT performance and scalability analytics

IT security analytics

The importance of root-cause analysis

Summary

Reactive systems

We have seen how reliable systems are being realized through the service mesh concept. Reactive systems are another approach to building reliable software. The idea is based on the widely circulated Reactive Manifesto, and there are reactive programming models and techniques for building viable reactive systems. As described previously, any software system is composed of multiple modules, and multiple components and applications need to interact with each other reliably to accomplish complex business functionality. In a reactive system, the individual parts are intelligent, but the key differentiator is the interaction between them: the ability to operate individually yet act in concert to achieve the intended outcome is what sets reactive systems apart. A reactive system architecture allows multiple individual applications to co-exist, coalesce as a single unit, and react to their surroundings adaptively. This means that they are able to scale up or down based on user and data loads, balance load, and remain sensitive and responsive.

It is possible to write a single application in a reactive style using proven reactive programming processes, patterns, and platforms. However, getting several applications to work together to meet evolving business needs quickly takes a lot more. In short, making a whole system reactive is not easy. Reactive systems are generally designed and built according to the tenets of the Reactive Manifesto, which prescribes an architecture that is responsive, resilient, elastic, and message-driven. Increasingly, microservices and message-based service interactions are becoming the standard way to build flexible, elastic, resilient, and loosely coupled systems. These characteristics are the core concepts of reactive systems.

Reactive programming is a subset of asynchronous programming. It is an emerging paradigm in which the availability of new information (events and messages) drives the processing logic forward. Traditionally, by contrast, actions are activated and carried out by threads of execution that follow control and data flows.

This programming style intrinsically supports decomposing a problem into multiple discrete steps, each of which can be executed in an asynchronous and non-blocking fashion. Those steps can then be composed into a workflow that is possibly unbounded in its inputs or outputs. Asynchronous processing means that the processing of incoming messages or events happens at some point in the future. Event creators and message senders need not wait for that processing to finish before proceeding with their own responsibilities. This is generally called non-blocking execution. Threads of execution need not compete for a shared resource to get things done immediately. If a resource is not available, a thread does not sit idle waiting for it; it continues with other tasks at hand and returns to the pending task once the resource is free. In other words, a non-blocking operation does not prevent the thread of execution from performing other useful work while the resource is occupied.
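The decomposition into non-blocking steps described above can be sketched with Python's standard asyncio library. This is only an illustration: the step names fetch and transform, the source names, and the delays are assumptions, and asyncio.sleep stands in for a real network or disk wait.

```python
import asyncio


async def fetch(source: str, delay: float) -> str:
    # Hypothetical I/O-bound step. While this coroutine waits, the
    # event loop is free to run other coroutines on the same thread.
    await asyncio.sleep(delay)
    return f"data from {source}"


async def transform(payload: str) -> str:
    # A second discrete step, composed after fetch.
    return payload.upper()


async def pipeline(source: str, delay: float) -> str:
    # Compose the two steps into one small workflow.
    return await transform(await fetch(source, delay))


async def main() -> list:
    # Both pipelines run concurrently; gather returns the results
    # in argument order, not in completion order.
    return await asyncio.gather(
        pipeline("orders", 0.02),
        pipeline("inventory", 0.01),
    )


results = asyncio.run(main())
print(results)
```

Neither pipeline waits for the other: the total running time is close to the longest single delay rather than the sum of the two.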

In the future, software applications will have to be sensitive and responsive. Futuristic, people-centric applications therefore have to be capable of receiving events in order to adapt. Event capturing, storing, and processing are becoming important for enterprise, embedded, and cloud applications, and reactive programming is emerging as an important concept for producing event-driven software applications. There are simple as well as complex events. Events are primarily streamed continuously, and hence event processing is nowadays often called streaming analytics. There are several streaming analytics platforms, such as Spark Streaming, Kafka Streams, Apache Flink, and Apache Storm, for extracting actionable insights from streams.

In an increasingly event-driven world, event-driven architectures (EDAs) and programming models are gaining market and mind share. Reactive programming is backed by an ambitious initiative, Reactive Streams, which aims to provide a standard for asynchronous stream processing with non-blocking back pressure. The key benefits of reactive programming include increased utilization of computing resources on multi-core and multi-processor hardware. There are several competent event-driven programming libraries, middleware solutions, enabling frameworks, and architectures to capture, cleanse, and crunch millions of events per second. Popular libraries for event-driven programming include Akka Streams, Reactor, RxJava, and Vert.x.
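Non-blocking back pressure can be approximated in a few lines with a bounded asyncio.Queue. This is a minimal sketch of the idea only, not the protocol that libraries such as Reactor or RxJava implement; the capacity, item count, and per-item delay are assumed values.

```python
import asyncio


async def producer(queue: asyncio.Queue, count: int, depths: list) -> None:
    for i in range(count):
        # put() suspends this coroutine (without blocking the thread)
        # whenever the queue is full, so a fast producer is slowed to
        # the pace of the consumer.
        await queue.put(i)
        depths.append(queue.qsize())
    await queue.put(None)  # sentinel: no more items


async def consumer(queue: asyncio.Queue, received: list) -> None:
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0.001)  # simulated slow processing
        received.append(item)


async def run(count: int = 20, capacity: int = 4):
    queue = asyncio.Queue(maxsize=capacity)
    depths, received = [], []
    await asyncio.gather(
        producer(queue, count, depths),
        consumer(queue, received),
    )
    return received, max(depths)


received, peak_depth = asyncio.run(run())
print(len(received), peak_depth)
```

The buffer never grows beyond its capacity, however quickly the producer could emit items, and no item is dropped.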

Reactive programming versus reactive systems: There is a significant difference between reactive programming and reactive systems. As indicated previously, reactive programming is primarily event-driven. Reactive systems, on the other hand, are message-driven and focus on creating resilient and elastic software systems. Messages are the prime form of communication and collaboration: distributed systems coordinate by sending, receiving, and processing messages. Messages are inherently directed, whereas events are not. A message has a clear direction and destination; an event is a fact for others to observe and act upon. Messaging is typically asynchronous, with the sender and the receiver decoupled. In a message-driven system, addressable recipients wait for messages to arrive. In an event-driven system, consumers attach themselves to sources of events and event stores.

In a reactive system, especially one that uses reactive programming, both events and messages will be present. Messages are a great tool for communication, whereas events are the best bet for unambiguously representing facts. Messages are transmitted across the network and form the basis for communication in distributed systems, and messaging is used to bridge event-driven systems across the network. Event-driven programming can therefore remain a simple model even in a distributed computing environment, because messaging takes on the hard parts. Messaging has to deal with the many constraints and challenges of distributed computing: partial failures, failure detection, dropped, duplicated, or reordered messages, eventual consistency, and managing multiple concurrent realities. These differences in semantics and applicability have deep implications for application design, including resilience, elasticity, mobility, location transparency, and the management complexity of distributed systems.
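The distinction between directed messages and observable events can be made concrete with two small in-process classes. The names MessageRouter and EventBus, the addresses, and the topic are all hypothetical, and the sketch deliberately ignores the network concerns listed above.

```python
from collections import defaultdict


class MessageRouter:
    """Message-driven: each message is sent to one addressable recipient."""

    def __init__(self):
        self.mailboxes = {}

    def register(self, address: str) -> None:
        self.mailboxes[address] = []

    def send(self, address: str, message: dict) -> None:
        # Directed delivery: only the named mailbox receives the message.
        self.mailboxes[address].append(message)


class EventBus:
    """Event-driven: a fact is published and any observer may react to it."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, fact: dict) -> None:
        # Undirected: the publisher does not know who is listening.
        for handler in self.subscribers[topic]:
            handler(fact)


router = MessageRouter()
router.register("billing")
router.register("shipping")
router.send("billing", {"command": "charge", "order": 42})

bus = EventBus()
audit_log, metrics = [], []
bus.subscribe("order-placed", audit_log.append)
bus.subscribe("order-placed", metrics.append)
bus.publish("order-placed", {"order": 42})

print(router.mailboxes, audit_log, metrics)
```

The message reaches only the billing mailbox, while the published fact is seen by every subscriber of the topic.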

Reactive systems are highly reliable

Reactive systems comply with the Reactive Manifesto (responsive, resilient, elastic, and message-driven), which was conceived and released by a group of IT product vendors. A variety of architectural design and decision principles are being formulated and firmed up for building modern systems that are innately capable of fulfilling today's complicated and sophisticated requirements. Messages are the most suitable unit of information exchange for reactive systems. They create a temporal boundary between application components: messages enable components to be decoupled in time (this allows for concurrency) and in space (this allows for distribution and mobility). This decoupling provides the much-needed isolation among application services, and ultimately ensures resiliency and elasticity, which are the most sought-after properties of reliable systems.

Resilience is the capability to stay responsive even under failure, and it is an inherent functional property of the system. Resilience goes beyond fault tolerance, which is about graceful degradation: resilience is about fully recovering from any failure, empowering systems to self-diagnose and self-heal. This property requires component isolation and containment of failures so that they do not spread to neighboring components. If errors and failures are allowed to cascade into other components, the whole system is bound to fail.

So, the key to designing, developing, and deploying resilient and self-healing systems is to allow any type of failure to be proactively found and contained, encoded as a message, and sent to a supervisor component. From there, failures can be monitored, measured, and managed from a safe distance. Here, being message-driven is the greatest enabler. Moving away from tightly coupled systems to loosely and lightly coupled systems is the way forward. With fewer dependencies, the affected component can be singled out, and the spread of errors can be nipped in the bud.
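The idea of containing a failure, encoding it as a message, and handing it to a supervisor can be sketched as follows. Everything here is an illustrative assumption: the worker fails on negative input, the supervisor's policy is simply to restart the worker on its remaining jobs, and the failed job is dropped rather than retried.

```python
import asyncio


async def worker(name: str, jobs: list, results: list) -> None:
    # Hypothetical component that fails on negative input.
    while jobs:
        job = jobs.pop(0)
        if job < 0:
            raise ValueError(f"cannot process {job}")
        await asyncio.sleep(0)  # yield to other tasks
        results.append((name, job * 2))


async def isolated(name, jobs, results, inbox: asyncio.Queue) -> None:
    # Isolation boundary: a failure is contained here and reported to
    # the supervisor as a message instead of cascading to other workers.
    try:
        await worker(name, jobs, results)
        await inbox.put({"worker": name, "status": "done"})
    except Exception as exc:
        await inbox.put({"worker": name, "status": "failed", "error": str(exc)})


async def supervisor(workloads: dict):
    inbox = asyncio.Queue()
    results = []
    restarts = {name: 0 for name in workloads}
    tasks = set()  # keep references so running tasks are not garbage collected
    for name, jobs in workloads.items():
        tasks.add(asyncio.create_task(isolated(name, jobs, results, inbox)))
    running = len(workloads)
    while running:
        message = await inbox.get()
        if message["status"] == "failed":
            # Recovery policy (an assumption): restart on the remaining jobs.
            name = message["worker"]
            restarts[name] += 1
            tasks.add(asyncio.create_task(
                isolated(name, workloads[name], results, inbox)))
        else:
            running -= 1
    return sorted(results), restarts


results, restarts = asyncio.run(supervisor({"a": [1, -1, 2], "b": [3, 4]}))
print(results, restarts)
```

Worker "b" is unaffected by the failure in worker "a", and "a" resumes after one restart, which is the containment behavior the paragraph above describes.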

The elasticity of reactive systems

Elasticity is the capability to stay responsive under load. A system may suddenly be used by many users, or hundreds of thousands of sensors and devices may pump large volumes of data into it. To tackle this unplanned rush of users and data, systems have to scale up or out automatically by adding resources (bare metal servers, virtual machines, and containers). Cloud environments are innately capable of auto-scaling based on varying resource needs. This capability lets systems use their expensive resources in an optimized manner. When resource utilization goes up, the capital and operational costs of systems come down sharply.
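An auto-scaling decision of the kind described above can be reduced to a small proportional rule. The function below is a hypothetical sketch, not the algorithm of any particular cloud platform; the parameter names, the target load, and the replica bounds are assumptions.

```python
import math


def desired_replicas(current: int, load_per_replica: float,
                     target_load: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Scale the replica count in proportion to observed versus target load."""
    # Total load is current * load_per_replica; divide it by the load one
    # replica should carry, and round up so capacity is never short.
    raw = math.ceil(current * load_per_replica / target_load)
    # Clamp to the configured bounds to avoid runaway scaling.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, 90, 60))   # overloaded: scale out
print(desired_replicas(4, 15, 60))   # underused: scale in
print(desired_replicas(5, 300, 60))  # extreme spike: capped at the maximum
```

Scaling in when load drops is what keeps utilization high, which is the cost argument made in the paragraph above.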

Systems need to be adaptive enough to perform auto-scaling, replication of state and behavior, load balancing, fail-over, and upgrades without any manual intervention, instruction, or interpretation. In short, designing, developing, and deploying reactive systems through messaging is the need of the hour.
