Becoming a Rockstar SRE

By: Jeremy Proffitt, Rod Anami

Overview of this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from automating CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters, and explores observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
Table of Contents (27 chapters)

  • Part 1 - Understanding the Basics of Who, What, and Why
  • Part 2 - Implementing Observability for Site Reliability Engineering
  • Part 3 - Applying Architecture for Reliability
  • Part 4 - Mastering the Outage Moments
  • Part 5 - Looking into Future Trends and Preparing for SRE Interviews

Timeline

This outage started at 5:37 P.M. and ended at 8:23 P.M. Here is the detailed timeline:

  • 5:37 P.M. – Channel 9 aired a segment on our volunteerism, driving customers to our website.
  • 5:49 P.M. – The alarm for slow pricing responses fired.
  • 6:06 P.M. – The secondary on-call engineer responded to the call, opened a bridge, and began investigating.
  • 6:19 P.M. – The call center manager joined the call in progress, which included two DevOps engineers, the secondary on-call engineer, the product owner, and a junior developer.
  • 6:37 P.M. – Additional containers were added to the web application layer.
  • 6:46 P.M. – Caching was enabled on DynamoDB.
  • 7:24 P.M. – Read and write capacity was adjusted for DynamoDB.
  • 7:37 P.M. – Caching was disabled on DynamoDB.
  • 7:41 P.M. – The system was operating properly.
  • 8:23 P.M. – Marketing sent out targeted emails, offering a discount to those...
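
The gaps between timeline entries are often easier to discuss once they are written down as durations. The following is a minimal Python sketch of that arithmetic, using only timestamps listed in the timeline. The interval names (`time_to_detect`, `time_to_respond`, `time_to_recover`) are illustrative labels, not terms taken from this timeline, and treating the 7:41 P.M. entry as the recovery point is an assumption; the summary line above gives 8:23 P.M. as the end of the outage.

```python
from datetime import datetime


def at(clock: str) -> datetime:
    """Parse a time of day such as "5:37 PM".

    The timeline gives no date, so a placeholder date is used;
    only the differences between times matter here.
    """
    return datetime.strptime(f"2000-01-01 {clock}", "%Y-%m-%d %I:%M %p")


# Timestamps copied from the timeline; the keys are hypothetical labels.
events = {
    "traffic_surge": at("5:37 PM"),       # TV segment aired
    "alarm_fired": at("5:49 PM"),         # pricing alarm
    "engineer_responded": at("6:06 PM"),  # secondary on-call joined
    "system_recovered": at("7:41 PM"),    # assumed recovery point
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


intervals = {
    "time_to_detect": minutes_between("traffic_surge", "alarm_fired"),
    "time_to_respond": minutes_between("alarm_fired", "engineer_responded"),
    "time_to_recover": minutes_between("traffic_surge", "system_recovered"),
}

for name, minutes in intervals.items():
    print(f"{name}: {minutes} min")
```

Under these assumptions, the alarm fired 12 minutes after traffic began to climb, an engineer responded 17 minutes after the alarm, and the system was operating properly 124 minutes after the start.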