Real-World SRE

By: Pavlos Ratis, Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.

A brief history

SRE is a relatively new field, but it is a slightly different take on many existing ideas. In 1958, the term IT was coined in the Harvard Business Review, and eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward having integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users used the terminals.

Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.

As computers became smaller and more commodified, programmers began spending more time interacting with infrastructure and configuring their software and infrastructure to work well together. At the same time, system admins were writing more and more complex code to maintain infrastructure. As the work of these two groups overlapped, they began working together more closely. In smaller teams, people would often start focusing on both infrastructure code and business code. In larger organizations, teams were created that focused on tools for managing infrastructure reliably, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (development and operations) teams.

Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the creation of the idea that operations work is now just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.

SRE is still a burgeoning field and, like DevOps, the term is often used to describe roles that cover a wide variety of work. Some companies give the title of SRE to a position that is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it. However, the goal of this book is to introduce you to the SRE mindset and help you apply it to an organization, regardless of your past experience in the tech world.
