Practical Site Reliability Engineering

By: Pethuru Raj Chelliah, Shreyash Naithani, Shailender Singh

Overview of this book

Site reliability engineering (SRE) is being touted as the most competent paradigm for establishing and ensuring next-generation, high-quality software solutions. This book starts by introducing you to the SRE paradigm and covers the need for highly reliable IT platforms and infrastructures. As you make your way through the next set of chapters, you will learn to develop microservices using Spring Boot and make use of RESTful frameworks. You will also learn about GitHub for deployment, containerization, and Docker containers. Practical Site Reliability Engineering teaches you to set up and sustain containerized cloud environments, and also covers architectural and design patterns, reliability implementation techniques such as reactive programming, and languages such as Ballerina and Rust. In the concluding chapters, you will become well-versed in service mesh solutions such as Istio and Linkerd, and understand service resilience test practices, API gateways, and edge/fog computing. By the end of this book, you will have gained experience in working with SRE concepts and be able to deliver highly reliable apps and services.

Title Page

Dedication

About Packt

Contributors

Preface

Demystifying the Site Reliability Engineering Paradigm

Setting the context for practical SRE

Plunging into the SRE discipline

The need for highly reliable platforms and infrastructures

Reactive systems

Highly reliable IT infrastructures

The vitality of the SRE domain

Summary

Microservices Architecture and Containers

What are microservices?

Microservice design principles

Deploying microservices

Practical examples of microservice deployment

Microservices using Spring Boot and the RESTful framework

Jersey Framework

Representational State Transfer (REST)

Important facts about microservices

Summary

Microservice Resiliency Patterns

Briefing microservices and containers

IT reliability challenges and solution approaches

The promising and potential approaches for resiliency and reliability

Summary

DevOps as a Service

What is DaaS?

Collaboration with development and QA teams

Summary

Container Cluster and Orchestration Platforms

Resilient microservices

Application and volume containers

Clustering and managing containers

Container orchestration and management

Summary

Architectural and Design Patterns

Architecture pattern

Design pattern

Summary

Reliability Implementation Techniques

Ballerina programming

Reliability

Rust programming

Summary

Realizing Reliable Systems - the Best Practices

Reliable IT systems – the emerging traits and tips

MSA for reliable software

Service mesh solutions

Microservices design – best practices

Asynchronous messaging patterns for event-driven microservices

The role of EDA to produce reactive applications

Reliable IT infrastructures

Infrastructure as code

Summary

Service Resiliency

Delineating the containerization paradigm

Demystifying microservices architecture

Decoding the growing role of Kubernetes for the container era

Describing the service mesh concept

Why is service mesh paramount?

Service mesh architectures

Summary

Containers, Kubernetes, and Istio Monitoring

Prometheus

Grafana

Summary

Post-Production Activities for Ensuring and Enhancing IT Reliability

Modern IT infrastructure

Monitoring clouds, clusters, and containers

Cloud infrastructure and application monitoring

The monitoring tool capabilities

Prognostic, predictive, and prescriptive analytics

Log analytics

IT operational analytics

IT performance and scalability analytics

IT security analytics

The importance of root-cause analysis

Summary

The vitality of the SRE domain

As discussed previously, the software engineering field is going through a number of disruptions and transformations to keep pace with the growth being achieved in hardware engineering. There are agile, aspect, agent, composition, service-oriented, polyglot, and adaptive programming styles. At the time of writing this book, the building of reactive and cognitive applications by leveraging competent development frameworks is being stepped up. On the infrastructure side, we have powerful cloud environments as the one-stop IT solution for hosting and running business workloads. Still, there are a number of crucial challenges in achieving the much-wanted cloud operations with less intervention, interpretation, and involvement from human administrators.

Several tasks are already being automated via breakthrough algorithms and tools, but there are still gaps to be filled with technologically powerful solutions. The well-known and widely performed tasks include dynamic and automated capacity planning and management, cloud infrastructure provisioning and resource allocation, software deployment and configuration, patching, infrastructure and software monitoring, measurement and management, and so on. Furthermore, these days, software packages are frequently updated, patched, and released to production environments to meet the emerging and evolving demands of clients, customers, and consumers. Also, the number of application components (microservices) is growing rapidly. In short, true IT agility has to be ensured through a whole range of automated tools. The operations team, with the undivided support of SREs, has to envision and safeguard highly optimized and organized IT infrastructures to successfully and sagaciously host and run next-generation software applications. Precisely speaking, the brewing challenge is to automate and orchestrate cloud operations. To be autonomic, the cloud has to be self-servicing, self-configuring, self-healing, self-diagnosing, self-defending, and self-governing.

The new and emerging SRE domain is being prescribed as the viable way forward. A new breed of software engineers, who have a special liking for systems engineering, is being touted as the best fit for the SRE role. These specially skilled engineers are going to train software developers and system administrators to astutely realize highly competent and dependable software solutions, scripts, and automated tools to speedily set up and sustain highly dependable, dynamic, responsive, and programmable IT infrastructures. An SRE team cares about anything that makes complex software systems work in production in a risk-free and continuous manner. In short, a site reliability engineer is a hybrid software and system engineer. Due to the ubiquity and usability of cloud centers for meeting the world's IT needs, the word "site" here refers to cloud environments.

Site reliability engineers usually care about infrastructure orchestration, automated software deployment, proper monitoring and alerting, scalability and capacity estimation, release procedures, disaster preparedness, fail-over and fail-back capabilities, performance engineering and enhancement (PE2), garbage collector tuning, release automation, capacity uplifts, and so on. They will usually also take an interest in good test coverage. SREs are software engineers who specialize in reliability. They are expected to apply the proven and promising principles of computer science and engineering to the design and development of enterprise-class, modular, and web-scale software applications.

The importance of SREs

An SRE is responsible for ensuring the system availability, performance monitoring, and incident response of cloud IT platforms and services. SREs must make sure that all software applications entering production environments meet a set of important requirements: they must come with diagrams, network topology illustrations, service dependency details, monitoring and logging plans, backups, and so on. A software application may fully comply with all of its functional requirements, but there are other sources of disruption and interruption. Hardware degradation, networking problems, high usage of resources, or slow responses from applications and services could happen at any time, so SREs always need to be extremely sensitive and responsive. An SRE's effectiveness may be measured as a function of mean time to recover (MTTR) and mean time to failure (MTTF). In other words, the availability of system functions in the midst of failures and faults has to be guaranteed. Similarly, when the system load varies sharply, the system has to have the inherent potential to scale up and out.
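To make the link between MTTR, MTTF, and availability concrete, the following is a minimal sketch based on the standard steady-state approximation, availability = MTTF / (MTTF + MTTR). The function name and the sample figures are illustrative assumptions, not values taken from this book.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return the long-run fraction of time a system is up.

    Uses the common approximation MTTF / (MTTF + MTTR), where MTTF is the
    mean time to failure and MTTR is the mean time to recover.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails on average every 999 hours, takes 1 hour to recover.
baseline = steady_state_availability(mttf_hours=999, mttr_hours=1)

# Halving the recovery time raises availability without touching the failure rate.
faster_recovery = steady_state_availability(mttf_hours=999, mttr_hours=0.5)
```

Under this approximation, shortening recovery time and lengthening the time between failures both raise availability, which is why both metrics appear together when the work of an SRE team is assessed.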

Software developers typically develop the business functionality of the application and write the necessary unit tests for the functionality they created from scratch or composed out of different, distributed, and decentralized services. But they don't always focus on creating and incorporating the code for achieving scalability, availability, reliability, and so on. System administrators, on the other hand, do everything to design, build, and maintain an organization's IT infrastructure (computing, storage, networking, and security). System administrators do try to achieve these quality of service (QoS) attributes through infrastructure sizing and by provisioning additional infrastructural modules (bare metal (BM) servers, virtual machines (VMs), and containers) to tackle any sudden rush of users and bigger payloads. As described previously, the central goal of DevOps is to build a healthy working relationship between the operations and the development teams. Any gaps and friction between developers and operators ought to be identified and eliminated at the earliest by SREs so that any application can run on any machine or cluster without many twists and tweaks. The most critical challenge is how to ensure the non-functional requirements (NFRs), that is, the QoS attributes.

SREs solve a very basic yet important problem that administrators and DevOps professionals do not. The infrastructure's resiliency and elasticity, which safeguard application scalability and reliability, have to be ensured. Business continuity and productivity, along with other delights for customers, have to be guaranteed through minute monitoring of business applications and IT services. Meeting the identified NFRs through infrastructure optimization alone is neither viable nor sustainable. Rather, NFRs have to be realized by skillfully etching all the relevant code snippets and segments into the application source code itself. In short, the source code for any application has to be made aware of, and capable of easily absorbing, the capacity and capability of the underlying infrastructure. That is, we are heading toward the era of infrastructure-aware applications and, on the other side, toward application-aware infrastructures.
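One way to picture reliability code that lives in the application itself is a retry wrapper with exponential backoff around a call to a remote dependency. The following is only a sketch of that general technique under stated assumptions: the function name, the default limits, and the choice of ConnectionError as the transient failure are all hypothetical and are not prescribed by this book.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff.

    The sleep function is injectable so that callers (and tests) can avoid
    real waiting. All names and defaults here are illustrative assumptions.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # The delay cap doubles on each attempt, bounded by max_delay;
            # random jitter spreads out retries from many clients.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A caller would wrap the remote call in a zero-argument function, for example call_with_retries(lambda: fetch_inventory()), where fetch_inventory stands in for any operation that may fail transiently. The application absorbs short infrastructure hiccups on its own instead of relying solely on over-provisioned infrastructure.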

This is where SREs pitch in. These specially empowered professionals, with all their education, experience, and expertise, are to assist both developers and system administrators in developing, deploying, and delivering highly reliable software systems via software-defined cloud environments. SREs spend half of their time with developers and the other half with the operations team to ensure the much-needed reliability. SREs define clear and mathematically modeled service-level agreements (SLAs) that set thresholds for the stability and reliability of software applications.
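As a rough illustration of what a mathematically modeled threshold can look like, an availability target can be translated into the amount of downtime it permits over a reporting period. The sketch below assumes a 30-day period and a 99.9% target purely for illustration; the function names and figures are hypothetical and do not come from any particular SLA.

```python
def allowed_downtime_minutes(target_availability: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that a target availability permits."""
    if not 0 < target_availability <= 1:
        raise ValueError("target_availability must be in the interval (0, 1]")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total_minutes = period_days * 24 * 60
    return (1 - target_availability) * total_minutes


def threshold_breached(
    observed_downtime_minutes: float,
    target_availability: float,
    period_days: int = 30,
) -> bool:
    """Return True if the observed downtime exceeds what the target allows."""
    return observed_downtime_minutes > allowed_downtime_minutes(
        target_availability, period_days
    )


# A 99.9% target over 30 days permits roughly 43.2 minutes of downtime.
budget = allowed_downtime_minutes(0.999, period_days=30)
```

With these assumed numbers, a service that has been down for 50 minutes in the period has crossed its threshold, while one that has been down for 20 minutes has not.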

SREs have many skills:

They have a deep knowledge of complex software systems
They are experts in data structures
They are excellent at designing and analyzing computer algorithms
They have a broad understanding of emerging technologies, tools, and techniques
They are passionate when it comes to coding, debugging, and problem-solving
They have strong analytical skills and intuition
They learn quickly from mistakes and eliminate them in the subsequent assignments
They are team players, willing to share the knowledge they have gained and gathered
They like the adrenaline rush of fast-paced work
They are good at reading technical books, blogs, and publications
They produce and publish technology papers, patents, and best practices

Furthermore, SREs learn and position themselves to be a single point of contact (SPOC) in the following areas:

They have a good understanding of code design, analysis, debugging, and optimization
They have a wide understanding of various IT systems, ranging from applications to appliances (servers, storage, and network components such as switches, routers, firewalls, load balancers, and intrusion detection and prevention systems)
They are competent in emerging technologies:
- Software-defined clouds for highly optimized and organized IT infrastructures
- Data analytics for extracting actionable insights in time
- IoT for people-centric application design and delivery
- Containerization-sponsored DevOps
- FaaS for simplified IT operations
- Enterprise mobility
- Blockchain for IoT data and device security
- AI (machine and deep-learning algorithms) for predictive and prescriptive insights
- Cognitive computing for realizing smarter applications
- Digital twins for performance improvement, failure detection, product productivity, and resilient infrastructures
They are conversant with a variety of automated tools
They are familiar with reliability engineering concepts
They are well-versed in key terms and buzzwords such as scalability, availability, maneuverability, extensibility, and dependability
They are good at IT systems operations, application performance management, and cyber security attacks and solution approaches
They practice insights-driven IT operations, administration, maintenance, and enhancement

Toolsets that SREs typically use

For SREs, ensuring the stability and the highest possible uptime of software applications is the top priority. However, they should be able to take responsibility and code their own way out of hazards, hurdles, and hitches rather than adding to the to-do lists of the development teams. SREs are typically software engineers with a passion for system, network, storage, and security administration. They need to combine the strengths of development and operations, and they are highly comfortable with a bevy of scripting languages, automation tools, and other software solutions to speedily automate the various aspects of IT operations, monitoring, and management, especially application performance management and IT infrastructure orchestration, automation, and optimization. Though automation is the key competency of SREs, they ought to educate themselves and gain hands-on expertise in the following technologies and tools:

Object-oriented, functional, and scripting languages
Digital technologies (cloud, mobility, IoT, data analytics, and security)
Server, storage, network, and security technologies
System, database, middleware, and platform administration
Compartmentalization (virtualization and containerization) paradigms, DevOps tools
The MSA pattern
Design, integration, performance, scalability, and resiliency patterns
Cluster, grid, utility, and cloud computing models
Troubleshooting software and hardware systems
Dynamic capacity planning, task and resource scheduling, workload optimization, VM and container placement, distributed computing, and serverless computing
AI-enabled operational, performance, security, and log analytics platforms
Cloud orchestration, governance, and brokerage tools
Automated software testing and deployment models
OpenStack and other cloud infrastructure management platforms
Data center optimization and transformation
