Book Image

Becoming a Rockstar SRE

By : Jeremy Proffitt, Rod Anami
Book Image

Becoming a Rockstar SRE

By: Jeremy Proffitt, Rod Anami

Overview of this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
Table of Contents (27 chapters)
1
Part 1 - Understanding the Basics of Who, What, and Why
5
Part 2 - Implementing Observability for Site Reliability Engineering
10
Part 3 - Applying Architecture for Reliability
16
Part 4 - Mastering the Outage Moments
20
Part 5 - Looking into Future Trends and Preparing for SRE Interviews

DevOps engineers versus SRE versus others

This is one of the most frequently asked questions we receive from customers and organizations: how does the site reliability engineering profession differ from other existing technical roles? We already talked about how SREs are the connection between the different steps of the solution life cycle. Here, we’ll focus our discussion on the DevOps engineer role, and later, we’ll broaden it. We have split this discussion into two sections:

  • DevOps and site reliability engineers
  • Software and site reliability engineers

DevOps and site reliability engineers

Google described the relationship between DevOps and SRE with a famous subtitle in their The Site Reliability Workbook publication:

Class SRE implements interface DevOps

This statement is an elegant way to define this link and refers to Java programming. It implies that site reliability engineering describes and deepens the implementation of whatever DevOps is. Moreover, we can say that site reliability engineering has commonalities with DevOps as a logically derived conclusion. However, what exactly does site reliability engineering implement from DevOps, or what are the differences between a site reliability engineer and a DevOps engineer? We have visualized these similarities and divergences in an infographic as follows:

Figure 1.3 – An infographic on SRE and DevOps

Figure 1.3 – An infographic on SRE and DevOps

Notice that they have shared values. Both SREs and DevOps engineers require those values in the orange (bottom right in the above diagram) box. In the bottom-left table, you can see the difference between those roles. Typically, site reliability engineers resolve operational problems by applying the right software engineering disciplines. On the other hand, DevOps engineers resolve development and delivery pipeline issues with systems management techniques mainly by using automation and infrastructure-as-code. They also concentrate different levels of effort on distinct phases of the solution life cycle, as depicted in the infographic.

It’s not rare to hear that DevOps is a shift-right transformation while site reliability engineering is a shift-left one. That implies moving from the left (development side of the equation) to the right (operations side of the equation), and vice versa. Another term we hear a lot is DevSecOps, which has the addition of security. Since security has always been implied in these roles, we think including new letters in the middle is confusing and redundant.

SREs and DevOps engineers are, in our opinion, different sides of the same coin. They should be more like best friends forever than opposing roles as they share values. Let’s check how SREs fulfill those values from the five main areas of DevOps:

  • Reduce organizational silos: SREs use the same tooling as developers or DevOps engineers. They also share objectives and performance metrics with them.
  • Accept failure as normal: SREs embrace risks using the error budget for new features. They quantify failure through SLIs and SLOs. And they run postmortems in a blameless culture.
  • Implement gradual changes: SREs work to increase reliability, and more reliable systems allow more frequent changes and releases.
  • Leverage tooling and automation: SREs eliminate toil by automating operational tasks at a constant pace.
  • Measure everything: SREs measure reliability by implementing MELT data and observability. They also have ways to identify and size toil.

Software and site reliability engineers

Another frequently asked question is how site reliability engineers differ from software engineers (SWEs). The short answer is simple: they have the same core skills but specific work scopes.

What are SWEs? SWEs design, engineer, and architect applications using modeling languages and requirements analysis techniques. They implement an integrated development environment (IDE) and develop code for use cases using one of the multiple available programming languages. They create test cases and testing suites. Also, they integrate software and service components and handle their dependencies. SWEs work with many software development life cycle tools and processes.

Site reliability engineers may execute the same activities, but they intend to improve reliability when doing so. For instance, developing code for an SRE translates much more to instrumenting the application code, so it generates more logs, than coding a use case. Also, SREs treat operations as a software problem and see daily systems management tasks as possible software coding opportunities. Besides that, SREs have other core skills, relating to systems thinking, systems management, and data science.

Indeed, an SRE could become an SWE and vice versa, and that leads us to another principle that we find in the Google materials.

Common staffing pool

Another principle is hiring site reliability engineers and SWEs from the same staffing pool. This principle works well for companies where most employees are software developers and engineers, and having a shared pool means that site reliability and software engineering job roles are interchangeable. However, this principle may be much more challenging for enterprises with a mix of systems administrators and developers. Hence, we left it out of our list in the previous section.

We could compare the SRE’s unique profession to many others, but we limited this topic to the most common comparisons. SREs are not architects, developers, systems administrators, or data scientists; they are more than all of these roles combined. Up next, we are going to understand the primary responsibilities of an SRE.