Becoming a Rockstar SRE

By: Jeremy Proffitt, Rod Anami

Overview of this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters, exploring observability and outages, and understanding why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
Table of Contents (27 chapters)

  • Part 1 - Understanding the Basics of Who, What, and Why
  • Part 2 - Implementing Observability for Site Reliability Engineering
  • Part 3 - Applying Architecture for Reliability
  • Part 4 - Mastering the Outage Moments
  • Part 5 - Looking into Future Trends and Preparing for SRE Interviews

An overview of the daily activities of an SRE

Now that we have examined SRE responsibilities, it’s time to look at what you, as an SRE, should be doing on a regular basis. There’s no better way to understand a profession than by asking what someone in it actually does. When you go to a job interview, you probably want to know the activities a person in that position will carry out. SREs often keep a list of assignments as sticky notes on their displays. We have separated the most notable activities into two sections:

  • Reactive work activities
  • Proactive work activities

We’ll start by understanding reactive activities.

Reactive work activities

SREs execute many tasks that don’t lift (or shift) system reliability directly; these are usually operational in nature. Nevertheless, those activities either reduce service downtime or mitigate risks. Examples of tasks that SREs perform daily in this category are as follows:

  • Repair or restore a system or multiple services to their original state
  • Follow and execute instructions from a runbook (standard operating procedure) during an incident to diagnose the application
  • Implement a change request to apply a patch to a software component
  • Attend a meeting to run a postmortem with system administrators and developers about the recent service or system outage
  • Install a new Kubernetes cluster for a new application according to the development team’s specifications and enable monitoring of it
  • Configure a new cloud-based service for a new application following the architecture design and include it in cloud monitoring
  • Deploy a new software release to VMs and execute the testing scripts

Proactive work activities

SREs also carry out jobs that improve the quality, scalability, observability, manageability, resiliency, or availability of a system or service. Since those tasks increase the reliability levels of specific systems or services, they are considered proactive and are mostly engineering work. Such assignments help reduce toil and technical debt. Examples of this category are as follows:

  • Maintain a runbook on how to diagnose problems with a specific application
  • Design and develop an automaton to execute procedures previously documented in a runbook automatically
  • Establish, together with the DevOps team, the release strategy, such as a canary release, A/B testing, or blue-green deployment
  • Work with the software engineers (SWEs) to add management code to the application so SREs can instruct the application to perform self-administration or self-healing operations
  • Work with the development team to adopt an immutable infrastructure philosophy into the application-building process
  • Instrument the application code to increase its observability with logs and traces
  • Design and implement observability to obtain good metrics, events, logs, and traces from a critical application
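
To make the runbook automation and instrumentation items above more concrete, here is a minimal Python sketch of an automaton that walks through runbook steps, attempts a remediation when a check fails, and logs the outcome and duration of each step. It is an illustration only, not the authors' implementation: the names Step, StepResult, and run_runbook, the log format, and the simulated service state are all assumptions made for this example.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

logger = logging.getLogger("runbook")


@dataclass
class Step:
    """One runbook step (hypothetical structure): a check plus an optional fix."""
    name: str
    check: Callable[[], bool]  # returns True when this part of the system is healthy
    remediate: Optional[Callable[[], None]] = None


@dataclass
class StepResult:
    name: str
    healthy: bool
    remediated: bool
    duration_s: float


def run_runbook(steps: List[Step]) -> List[StepResult]:
    """Run steps in order; stop at the first step that stays unhealthy."""
    results: List[StepResult] = []
    for step in steps:
        started = time.perf_counter()
        healthy = step.check()
        remediated = False
        if not healthy and step.remediate is not None:
            logger.warning("step=%s status=unhealthy action=remediate", step.name)
            step.remediate()
            remediated = True
            healthy = step.check()  # verify that the remediation worked
        duration = time.perf_counter() - started
        logger.info(
            "step=%s healthy=%s remediated=%s duration_s=%.4f",
            step.name, healthy, remediated, duration,
        )
        results.append(StepResult(step.name, healthy, remediated, duration))
        if not healthy:
            # Automation ends here; a human should take over the incident.
            logger.error("step=%s still unhealthy, escalating", step.name)
            break
    return results


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    # Simulated state standing in for a real service (assumption for the demo).
    service = {"running": False, "disk_free_pct": 40}

    def restart_service() -> None:
        service["running"] = True

    runbook = [
        Step("disk-space", check=lambda: service["disk_free_pct"] > 10),
        Step("process-up", check=lambda: service["running"], remediate=restart_service),
    ]
    for result in run_runbook(runbook):
        print(result)
```

In a real environment, the check and remediate callables would wrap whatever the documented runbook prescribes, such as a health endpoint query or a service restart. Stopping at the first step that remains unhealthy is one possible design choice: it keeps the automation conservative and leaves the escalation decision to a person.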

Note

Site reliability engineers perform many more activities than the ones listed here. This is not a comprehensive list; the only intention is to show you how SREs work across multiple dimensions and aspects of systems and services.

We have listed what an SRE does frequently to give you a good sense of their day-to-day activities and how they differ from those of other roles. Again, this is not a complete or closed list. We want to close this chapter by telling you who our SRE rockstars are.