Real-World SRE

By: Pavlos Ratis, Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.

Summary

In this chapter, we talked about what postmortems are, when we should write them, and how to write, discuss and analyze them. We also talked about keeping blame out of postmortems and helping your team to prevent future issues by prioritizing what comes out of the postmortem process.

For an SRE, or for any organization that wants reliability, postmortems are the transition from dealing with the present to dealing with the future. Everything above postmortems in the hierarchy is about the future: planning and improving processes. Everything below (monitoring and incident response) is about dealing with the present. Postmortems allow us to look at the past before we start thinking about the future.

In the next chapter, we will talk about testing and releasing, where we think about the code we have written and how we introduce it into the world. We have finished looking at incidents and will now move on to the constant evolution of our products and services.
