Becoming a Rockstar SRE

By: Jeremy Proffitt, Rod Anami

Overview of this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you with the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!

Preface

Who is this book for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Share Your Thoughts

Download a free PDF copy of this book

Part 1 - Understanding the Basics of Who, What, and Why

Chapter 1: SRE Job Role – Activities and Responsibilities

Making this journey personal

Understanding the mindset and hobbies of an SRE

DevOps engineers versus SRE versus others

Describing an SRE’s main responsibilities

An overview of the daily activities of an SRE

People that inspire

Further reading

Chapter 2: Fundamental Numbers – Reliability Statistics

SLA commitment – a conversation, not a number

Defining and leveraging SLOs and SLIs

Measuring the downtime with the MTTR

Understanding the customer and revenue impact

Chapter 3: Imperfect Habits – Duct Tape Architecture and Spaghetti Code

The business of software development – let’s start with the dollars

The A/B testing mindset – the art of change in customer interaction

Dedication to the craft of development – and why some are just here for a job

Reviewing the merge request – it’s about training, oversight, and reliability

Why businesses want us to outright ignore best practices

Mixing good and bad – tricks to wrapping bad code and making it resilient

Part 2 - Implementing Observability for Site Reliability Engineering

Chapter 4: Essential Observability – Metrics, Events, Logs, and Traces (MELT)

Technical requirements

Accomplishing systems monitoring and telemetry

Understanding APM

Getting to know topology self-discovery, the blast radius, predictability, and correlation

Alerting – the art of doing it quietly

Mixing everything into observability

In practice – applying what you have learned

Further reading

Chapter 5: Resolution Path – Master Troubleshooting

Properly defining the problem – and what to ask and not ask

Breaking down and testing systems

Previous and common events – checking for the simple problems

Effective research both online and among peers

Breaking down source code efficiently

Logging plus code

In practice – applying what you’ve learned

Chapter 6: Operational Framework – Managing Infrastructure and Systems

Technical requirements

Approaching systems administration as a discipline

Understanding IT service management

Seeing systems administration as multiple layers and multiple towers

Automating systems provisioning and management

In practice – applying what you’ve learned

Further readings

Chapter 7: Data Consumed – Observability Data Science

Technical requirements

Making data-driven decisions

Solving problems through a scientific approach

Understanding the most common statistical methods

Using other mathematical models in observability

Visualizing histograms with Grafana

In practice – applying what you’ve learned

Further reading

Part 3 - Applying Architecture for Reliability

Chapter 8: Reliable Architecture – Systems Strategy and Design

Technical requirements

Designing for reliability

Splitting and balancing the workload

Failing over – almost as good

Scaling up and out – horizontal versus vertical

In practice – applying what you’ve learned

Further reading

Chapter 9: Valued Automation – Toil Discovery and Elimination

Technical requirements

Eliminating toil

Treating automation as a software problem

Automating the (in)famous CI/CD pipeline

In practice – applying what you’ve learned

Further reading

Chapter 10: Exposing Pipelines – GitOps and Testing Essentials

A basic pipeline – building automation to deploy infrastructure as code architecture and code

Automating compliance and security in pipelines

Automated linting for code quality and standards

Validating functionality during deployment with automated testing

The reduction of developer toil through automated processes

In practice – applying what you’ve learned

Chapter 11: Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes

Technical requirements

The multiple definitions of serverless

Containers and why we love them

Kubernetes and other ways to orchestrate containers

Deployment techniques and workers

Automation and rolling back failed deployments

In practice – applying what you’ve learned

Chapter 12: Final Exam – Tests and Capacity Planning

Technical requirements

Understanding types of testing

Using test automation frameworks

Staying ahead with capacity planning

In practice – applying what you’ve learned

Further reading

Part 4 - Mastering the Outage Moments

Chapter 13: First Thing – Runbooks and Low Noise Outage Notifications

Technical requirements

What makes a good runbook – the basics

Beyond the runbook – code and comments

What’s in a good dashboard?

The basics of priority levels

In practice – applying what you’ve learned

Chapter 14: Rapid Response – Outage Management Techniques

Where to meet – an effective strategy for communicating good information

Leveraging the people involved in the response

The opportunity to respond at the right time

Messaging customers and leadership

In practice – applying what you’ve learned

Chapter 15: Postmortem Candor – Long-Term Resolution

The content of the postmortem in executive summary style

Decisions are not blame

The cost of more reliability as a business decision

Training and skill sets – they matter

Creating future action plans

In-practice – an example of a postmortem

Custom Hat Company postmortem

Technical details and response

Part 5 - Looking into Future Trends and Preparing for SRE Interviews

Chapter 16: Chaos Injector – Advanced Systems Stability

Technical requirements

Comprehending the wheel-of-misfortune game

Understanding chaos engineering for reliability

In practice – employing the wheel-of-misfortune game

In practice – injecting chaos into systems

Further reading

Chapter 17: Interview Advice – Hiring and Being Hired

What we’re looking for in a candidate

Common interview questions and answers

Researching the company

Are you over-or under-certified?

Tips for landing the job with a great salary

Index

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

Appendix A – The Site Reliability Engineer Manifesto

How to adopt it

How to contribute to it

Appendix B – The 12-Factor App Questionnaire

The questionnaire

How to adopt this questionnaire

How to contribute to this questionnaire

Measuring the downtime with the MTTR

The MTTR is the average amount of time an issue takes to be resolved, calculated by averaging that time span across outages. The MTTR is often thought of as response time – how effectively a fire department can get to your fire and put it out. It is also the amount of time each outage typically impacts us, so when the MTTR goes up, we often see a decrease in revenue and customer satisfaction. The MTTR is made up of multiple smaller elements, each contributing to the overall outage time. Let’s step through a typical outage and quickly examine each of these elements:
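Before looking at those elements one by one, the averaging itself can be illustrated with a minimal sketch. The outage timestamps and the function name `average_resolution_time` are hypothetical, invented only for this example. The sketch also assumes each outage is recorded as a start time and a resolved time.

```python
from datetime import datetime, timedelta

# Hypothetical outage records as (start, resolved) pairs; the values are
# invented for illustration and do not come from a real system.
outages = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 9, 45)),      # 45 minutes
    (datetime(2023, 1, 17, 14, 30), datetime(2023, 1, 17, 16, 0)),  # 90 minutes
    (datetime(2023, 2, 2, 22, 15), datetime(2023, 2, 2, 22, 30)),   # 15 minutes
]


def average_resolution_time(records):
    """Return the average outage duration as a timedelta."""
    if not records:
        raise ValueError("at least one outage is needed to compute an average")
    # Sum the individual durations, starting from a zero-length timedelta.
    total = sum((resolved - start for start, resolved in records), timedelta())
    return total / len(records)


mttr = average_resolution_time(outages)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # prints "MTTR: 50 minutes"
```

Each duration in this sketch is a single span from start to resolution. The elements described next break that span into smaller pieces.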

Detection time: The time between the start of the outage and when someone notices it. It is often measured from the root cause until the first person or automated notification reports that something is wrong.
Notification time: The time it takes between detection and when engineering assets first respond. This could be the time it takes for someone to...