Real-World SRE

By: Pavlos Ratis, Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of a catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster: uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are after a first job or a valued promotion.

What is in the book?


I worked as an SRE at Google for four years, and that is where I started specializing, moving away from being a full stack engineer, and instead considering myself an SRE. Google had lots of internal education courses, and when I left, I found it difficult to continue my education. I also quickly discovered that SRE at Google is a very different beast than SRE at much smaller organizations. I decided to write this book for people interested in starting with SRE or applying it to organizations that are much smaller than Google.

To do this, the book is broken up into two parts. The first eight chapters walk through the hierarchy of reliability. This hierarchy was originally designed by Mikey Dickerson of the United States Digital Service (and, surprise, surprise, Google). The hierarchy says that as you are trying to add reliability to a system, you need to walk through each level before you get to the next one.

The following diagram shows a slightly modified version of Mikey's original pyramid. I have updated it to include the all-encompassing aspect of communication:

Figure 2: This seven-layer pyramid is encircled with communication. Each layer builds upon and needs the previous layer. It is surrounded by communication because each layer needs communication to succeed.

Let us walk through the layers as a preview of what you can expect in each chapter.

  • Chapter 2, Monitoring: The first level is monitoring, which makes sure that you have insight into a system by tracking its health, its availability, and what is happening inside it. Monitoring is not just about tools, though, because it also requires communication. Monitoring is a very contentious part of SRE and operations because, depending on implementation, it can be either very useful or pointless. Figuring out what to monitor, how to monitor it, where to store the monitoring data, who can access historical monitoring data, and how to look at the data often takes time. Many people in your engineering organization will have opinions on these points based on past experiences.

    Some engineers will have had bad experiences and will not think monitoring is worth the investment, whereas others will have religious zealotry toward certain tools, and some will just ignore you. This chapter will help you to navigate all of these competing opinions and find and create the implementation that is best for your project and team.

  • Chapter 3, Incident Response: The next level is incident response. If something is broken, how do you alert people and respond? While tools help with this, as they define the rules by which to alert humans, most of incident response is about defining policy and setting up training so humans know what to do when they get alerts. If team members see an automated message in Slack, what should they do? If they get a phone call, how quickly do they need to respond? Will employees be paid extra if they have to work on a Saturday due to an outage? These are all questions we will address in the "What is incident response" section. We will also cover setting up on-call rotations, best practices for working together as a team, and building infrastructure to make incidents as low-stress as possible.

  • Chapter 4, Postmortems: The third level is postmortems. Once you have had an outage, how do you make sure the problem does not happen again? Should you have a meeting about your incident? Does there need to be documentation? In this chapter, we will consider how to talk about past incidents and make it an enjoyable process for all involved. A postmortem is the act of recording for history how an incident happened, how the team fixed it, and how the team is working to prevent a similar incident in the future. We want to set up a culture of blameless and transparent postmortems, so people can work together.

    Individuals should not be afraid of incidents, but rather feel confident that if an incident happens, the team will respond and improve the system for the future, instead of focusing on the shame and anger that can come with failure. Incidents are things to learn from, not things to be afraid and ashamed of!

  • Chapter 5, Testing and Releasing: The fourth level is testing and releasing your software. In this chapter, we will talk about the tooling and strategies that can be used to test and release software. This is the first level of the hierarchy where, instead of focusing on things that have happened, we focus on prevention. Prevention is about trying to limit the number of incidents that happen and making sure that infrastructure and services stay stable when releasing new code. The chapter will cover the different types of testing that exist and how to make them useful for you and your team. It will also explore releasing software, when to use methodologies like continuous deployment, and some tools you can use.

  • Chapter 6, Capacity Planning: The fifth level is capacity planning. While Chapter 5, Testing and Releasing, focused on the current world, this chapter is all about predicting the future and finding the limits of your system. Capacity planning is also about making sure you can grow over time. Once you are monitoring your system and running it reliably, you can start thinking about how to grow it over time, and how to find and anticipate bottlenecks and resource limits. In this chapter, we will talk about planning for long-term growth, writing budgets, communicating with outside teams about the future, and things to keep in mind as your service shrinks and grows.

  • Chapter 7, Building Tools: The sixth level is the development of new tools and services. SRE is not only about operations but also about software development. We hope SREs will spend around half of their time developing new tools and services. Some of these tools will exist to automate tasks that an employee has been doing by hand, while others will exist to improve another part of the hierarchy, such as automated load testing, or services to improve performance. In this chapter, we will talk about finding these projects, defining them, planning them, and building them. We will also talk about communicating their usefulness to your fellow engineers.

  • Chapter 8, User Experience: The final tier is user experience, which is about making sure the user has a good experience. We will talk about measuring performance, working with user researchers, and defining what a good experience means to your team. We will also discuss how the experience of using tools and processes can cause outages. The goal is to make sure that, no matter the tool or the user, people enjoy using it, understand how to use it, and cannot easily hurt themselves with it.

    Nori Heikkinen, an SRE at Google with many years of experience, adds that "the hierarchy does not include prevention, partly because 100% uptime is impossible, and partly because the bottom three needs in the hierarchy must be addressed within an organization before prevention can be examined." (https://www.infoq.com/news/2015/06/too-big-to-fail)

    The last two chapters of this book serve as a cheat section and an introduction to commonly useful topics.

  • Chapter 9, Networking Foundations: This is a selection of tools and definitions of important ideas in networking. We discuss network packets, DNS, UDP and TCP, and lots of other things. After this chapter, you should feel like you know the basics of networking and be able to research more advanced topics.

  • Chapter 10, Linux and Cloud Foundations: This is a selection of tools and important concepts involved in Linux and modern cloud products. We cover what the Linux kernel is, common parts of public clouds, and other topics. After this chapter, you should feel like you know the basics of Linux and most public cloud products, and feel comfortable researching specific clouds and more advanced Linux topics.