Real-World SRE

By: Pavlos Ratis, Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.

SRE as a framework for new projects

One way to use this book is as a framework for working on a new project. Because each chapter covers a different level of the hierarchy, you can work through the book to figure out where in the hierarchy your project sits. A new project will often sit right at the bottom of the hierarchy, with little or no monitoring implemented.

At each level, if there are others on the team, begin a conversation to figure out what exists and whether it meets the team's needs. Each chapter will provide a rough rubric for that discussion, but remember that every team and project is unique. If you are the only person who is thinking about reliability and infrastructure, then you may end up spending a significant amount of time proposing solutions and pushing the project in a certain direction. Just remember that the point is to improve the reliability of the service, help the business, and improve the user's experience of the service.

You may find yourself distracted by each thing that you could fix. It is highly recommended that you document the problems you see before diving in. Diving in is very satisfying, but documenting first helps in a few ways: it keeps you from skipping over requirements, and from spending too much time on a solution that doesn't work for your business (for example, integrating your system with a monitoring service you can't afford, or building a distributed job scheduler when you could have just used a piece of open source software).

So, when joining a new project, or evaluating a new service, here is a set of steps to follow:

Figure out the team structure. Who owns what? Who is in charge?
Find any documentation the team has for their service or the project.
Get someone to draw out the system architecture. Have them show you what connects to which service, what depends on the project, how data flows through the service, and how the project is deployed.

Figure 3: An example system architecture diagram. This is a very simple diagram that someone might draw on a whiteboard. Most companies will have something much more complex or detailed than this, but this is often the level of detail you need. Boxes with names and arrows show what talks to what.

Figure 4: Second example of an architecture diagram. This system is a classic static site generator model. The admin service creates or modifies things and writes update notifications into a queue. A worker reads data from the queue, does work on the data, and uploads the result to a static object store, in this case vendor 2. Then, we put some sort of CDN or serving system, in this case vendor 1, in front of vendor 2.
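
The data flow in Figure 4 can be sketched in a few lines of Python. This is only an illustration of the shape of the system, not how any particular vendor works: the in-memory queue, the dictionary standing in for the object store, and every function name here are assumptions made for the example.

```python
import queue

# Hypothetical stand-ins: an in-memory queue for the update queue and a
# dict for the static object store ("vendor 2" in the figure).
update_queue = queue.Queue()
object_store = {}


def admin_publish(page_id, title, body):
    """Admin service: record a change and enqueue an update notification."""
    update_queue.put({"page_id": page_id, "title": title, "body": body})


def render(notification):
    """The worker's 'work on the data': turn a notification into static HTML."""
    return "<h1>{title}</h1><p>{body}</p>".format(**notification)


def worker_drain():
    """Worker: read every pending notification, render it, upload the result."""
    processed = 0
    while True:
        try:
            notification = update_queue.get_nowait()
        except queue.Empty:
            return processed
        key = "pages/{}.html".format(notification["page_id"])
        object_store[key] = render(notification)
        processed += 1


def serve(path):
    """CDN or serving layer ("vendor 1"): a read-only view over the store."""
    return object_store.get(path)
```

Even at this level of detail, the sketch makes the dependencies visible: readers only ever touch the serving layer, so an outage in the admin service or the worker delays updates but does not take the site down.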

Name	Role	Manager	Things they know/specializations
Akil	Junior Full Stack Dev	Jeff	Seems pretty new and jumps around a lot.
Catherine	Senior Frontend Dev	Jeff	Does a lot of initial design prototyping and built most of the frontend originally.
Kareem	Senior Mobile Dev	Melissa	Wrote both mobile apps.
Steph	Senior Backend Dev	Melissa	TO DO: Set up a one-on-one to understand mobile backend.
Suzy	Full Stack Dev	Jeff	Animation wizard who knows the database for CMS better than anyone.
Tom	Full Stack Dev	Jeff	Frontend architecture, made initial protocol buffers and knows sync queue best.

Table 1: An example table with notes on people in the project. With this, we have a reference on team structure. If we need to know who to talk to about mobile apps, we can look at our handy chart and see that we need to talk to Kareem or the manager, Melissa.
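
If you would rather keep these notes somewhere searchable, the same table can be held as structured data. The following is a minimal sketch: the `Teammate` class and the `who_to_ask` helper are hypothetical names invented for this example, and the rows are copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Teammate:
    name: str
    role: str
    manager: str
    notes: str


# Rows copied from Table 1.
TEAM = [
    Teammate("Akil", "Junior Full Stack Dev", "Jeff",
             "Seems pretty new and jumps around a lot."),
    Teammate("Catherine", "Senior Frontend Dev", "Jeff",
             "Does a lot of initial design prototyping and built most of "
             "the frontend originally."),
    Teammate("Kareem", "Senior Mobile Dev", "Melissa",
             "Wrote both mobile apps."),
    Teammate("Steph", "Senior Backend Dev", "Melissa",
             "TO DO: Set up a one-on-one to understand mobile backend."),
    Teammate("Suzy", "Full Stack Dev", "Jeff",
             "Animation wizard who knows the database for CMS better than "
             "anyone."),
    Teammate("Tom", "Full Stack Dev", "Jeff",
             "Frontend architecture, made initial protocol buffers and "
             "knows sync queue best."),
]


def who_to_ask(topic, team=TEAM):
    """Return people whose role or notes mention the topic, then their manager."""
    topic = topic.lower()
    contacts = []
    for person in team:
        if topic in person.role.lower() or topic in person.notes.lower():
            for name in (person.name, person.manager):
                if name not in contacts:
                    contacts.append(name)
    return contacts
```

Calling `who_to_ask("mobile apps")` returns `["Kareem", "Melissa"]`, the same answer you would get by reading the chart.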

Now that you have context for the project, or service, start working through each chapter of the book and ask:

Does the service have monitoring?
Does the team have plans for incident response?
Does the team create postmortems? Are they stored anywhere?
How is the service tested? Does the project have a release plan?
Has anyone done any capacity planning?
What tools could we build to improve the service?
Is the current level of reliability providing a positive user experience?

Note

The trick to note here is that these questions could be asked about a piece of software that has been running for years, as well as one that is just being created.
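
One way to keep track of the answers is to treat the questions as an ordered rubric and look for the lowest level that is not yet satisfied. This is a rough sketch rather than a prescribed tool: the level names follow the chapter titles, while the `RUBRIC` list, the `first_gap` function, and the idea of recording each answer as a simple yes or no are assumptions made for illustration.

```python
# Levels listed from the bottom of the hierarchy upward, each paired with
# the question asked about it.
RUBRIC = [
    ("monitoring", "Does the service have monitoring?"),
    ("incident response", "Does the team have plans for incident response?"),
    ("postmortems",
     "Does the team create postmortems? Are they stored anywhere?"),
    ("testing and releasing",
     "How is the service tested? Does the project have a release plan?"),
    ("capacity planning", "Has anyone done any capacity planning?"),
    ("building tools", "What tools could we build to improve the service?"),
    ("user experience",
     "Is the current level of reliability providing a positive user "
     "experience?"),
]


def first_gap(answers):
    """Return the lowest level not marked as satisfied, or None if all are.

    `answers` maps a level name to True once the team agrees that level
    meets its needs. Missing levels count as not yet satisfied.
    """
    for level, _question in RUBRIC:
        if not answers.get(level, False):
            return level
    return None
```

A brand new project with no answers recorded yields `"monitoring"`, which matches the observation that new projects usually sit at the bottom of the hierarchy.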

The service you are investigating could be a large project with many pieces of software (a service-oriented architecture (SOA), for example) or a single monolithic application. If you are working on a project with many services, then work through each service one at a time. The downside is that if you want to build a framework that fits all of the services you interact with, you will not know how best to address their problems and needs until after you have done a good deal of research and work. The upside is that you will not be pulled immediately in many directions and will be able to focus on one specific service's problems.

Your time and energy are limited resources, and you will always need to work with more people than you have time for, so make sure to take it slow. Going slowly means that things do not fall through the cracks. You also do not want to burn out before each service has the bottom few levels of its hierarchy filled in.
