Book Image

Real-World SRE

By : Nat Welch
Book Image

Real-World SRE

By: Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, either in getting a first job or a valued promotion.
Table of Contents (13 chapters)

A brief history

SRE is a relatively new field, but it is a slightly different take on many existing ideas. In 1958, the term IT was coined in the Harvard Business Review, and eventually became the descriptor for the maintenance of technology used for collecting, storing, and distributing data and information. At that time, computers were transitioning toward having integrated circuits, but they were still the size of a room and were maintained and programmed by a team of people. As computers shrank, that team started focusing on multiple computers. Over time, some people started to specialize in programming those computers, and others focused on keeping them running. "Dumb terminals" would connect to a single computer, which was maintained by a team while programmers and users used the terminals.

Eventually, these maintainers started taking care of both the machines that individuals used, as well as large arrays of machines that provided services. Users would use a word processor on their local machine, and then upload files to a remote machine. Those who maintained the remote machines became known as system engineers, system administrators, and system operators.

As computers became smaller and more commodified, programmers began spending more time interacting with infrastructure, and configuring their software and infrastructure to work together well. On the other end, system admins were writing more and more complex code to maintain infrastructure. The closer these teams became, the more they began working together. In smaller teams, often, people would start focusing on both code for infrastructure and business code. In larger organizations, teams were created that focused on tools for managing infrastructure in reliable ways, so that product teams could quickly and easily manage the infrastructure they needed. These joint teams were often described as SRE or DevOps (developer and operations) teams.

Benjamin Treynor Sloss of Google, often referred to as just Treynor, says in Google's Site Reliability Engineering book, "SRE is what happens when you ask a software engineer to design an operations team." He is often credited with the creation of the idea that operations work is now just a specialization of software engineering. Given Google's success with reliability, the idea has caught on at many companies.

SRE is still a burgeoning field and, like DevOps, is often used to describe roles that include a wide diversity of work. Some companies give the title of SRE to a position, but it is much closer to a traditional system admin role. You can use this book's framework to evaluate a job before you apply for it, however, the goal of this book is to introduce you to the SRE mindset and help you to apply it to an organization, regardless of your past experience in the tech world.