An overview of the daily activities of an SRE

Chapter 2: Fundamental Numbers – Reliability Statistics

SLA commitment – a conversation, not a number

Defining and leveraging SLOs and SLIs

Measuring the downtime with the MTTR

Understanding the customer and revenue impact

Chapter 3: Imperfect Habits – Duct Tape Architecture and Spaghetti Code

The business of software development – let’s start with the dollars

The A/B testing mindset – the art of change in customer interaction

Dedication to the craft of development – and why some are just here for a job

Reviewing the merge request – it’s about training, oversight, and reliability

Why businesses want us to outright ignore best practices

Mixing good and bad – tricks to wrapping bad code and making it resilient

Part 2 - Implementing Observability for Site Reliability Engineering

Chapter 4: Essential Observability – Metrics, Events, Logs, and Traces (MELT)

Accomplishing systems monitoring and telemetry

Understanding APM

Getting to know topology self-discovery, the blast radius, predictability, and correlation

Alerting – the art of doing it quietly

Mixing everything into observability

In practice – applying what you have learned

Chapter 5: Resolution Path – Master Troubleshooting

Properly defining the problem – and what to ask and not ask

Breaking down and testing systems

Previous and common events – checking for the simple problems

Effective research both online and among peers

Breaking down source code efficiently

Logging plus code

Chapter 6: Operational Framework – Managing Infrastructure and Systems

Approaching systems administration as a discipline

Understanding IT service management

Seeing systems administration as multiple layers and multiple towers

Automating systems provisioning and management

Part 3 - Applying Architecture for Reliability

Chapter 8: Reliable Architecture – Systems Strategy and Design

Designing for reliability

Splitting and balancing the workload

Failing over – almost as good

Scaling up and out – horizontal versus vertical

Chapter 9: Valued Automation – Toil Discovery and Elimination

Treating automation as a software problem

Eliminating toil

Automating the (in)famous CI/CD pipeline

A basic pipeline – building automation to deploy infrastructure as code architecture and code

Chapter 10: Exposing Pipelines – GitOps and Testing Essentials

Automating compliance and security in pipelines

Automated linting for code quality and standards

Validating functionality during deployment with automated testing

The reduction of developer toil through automated processes

Chapter 11: Worker Bees – Orchestrations of Serverless, Containers, and Kubernetes

The multiple definitions of serverless

Containers and why we love them

Kubernetes and other ways to orchestrate containers

Deployment techniques and workers

Automation and rolling back failed deployments

Chapter 12: Final Exam – Tests and Capacity Planning

Understanding types of testing

Adopting TDD

Using test automation frameworks

Staying ahead with capacity planning

Part 4 - Mastering the Outage Moments

Chapter 13: First Thing – Runbooks and Low Noise Outage Notifications

What makes a good runbook – the basics

Beyond the runbook – code and comments

What’s in a good dashboard?

The basics of priority levels

Where to meet – an effective strategy for communicating good information

Chapter 14: Rapid Response – Outage Management Techniques

Leveraging the people involved in the response

The opportunity to respond at the right time

Messaging customers and leadership

The content of the postmortem in executive summary style

Chapter 15: Postmortem Candor – Long-Term Resolution

Decisions are not blame

The cost of more reliability as a business decision

Training and skill sets – they matter

Creating future action plans

In practice – an example of a postmortem

Custom Hat Company postmortem

Impact

Timeline

Technical details and response

Resolution

Future actions

Part 5 - Looking into Future Trends and Preparing for SRE Interviews

Chapter 16: Chaos Injector – Advanced Systems Stability

Comprehending the wheel-of-misfortune game

Understanding chaos engineering for reliability

In practice – employing the wheel-of-misfortune game

In practice – injecting chaos into systems

Chapter 17: Interview Advice – Hiring and Being Hired

What we’re looking for in a candidate

Common interview questions and answers

Researching the company

Are you over- or under-certified?

Tips for landing the job with a great salary