Real-World SRE

By: Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of a catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster: uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are after a first job or a valued promotion.
Table of Contents (13 chapters)

Holding a postmortem meeting

A postmortem meeting is the follow-up to the postmortem document. It is often where action items are finalized, the root-cause findings are discussed, and attendees get a safe setting for discussion. Usually, the best people to invite to a postmortem meeting are those involved in the incident, the tech leads for the affected services, the product managers for the affected services, and any interested engineers. Finding the right balance is hard: if the meeting is too large, you will not get much done, but if it is too small, knowledge won't be disseminated well.

It's a good idea to have those involved in the incident present because they know what happened and can fill in any data missing from the document. Tech leads from affected services should be there to correct any untrue assumptions made about their services and to take responsibility for making sure that the action items get implemented. Product managers for affected services are important because they can help...