Real-World SRE

By: Pavlos Ratis, Nat Welch

Overview of this book

Real-World SRE is the go-to survival guide for the software developer in the middle of a catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it. Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response. Real-World SRE goes beyond just reacting to disaster: uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis. The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are seeking a first job or a valued promotion.

Real-World SRE

Contributors

Preface

Other Books You May Enjoy

Introduction

A brief history

What is SRE?

What is in the book?

SRE as a framework for new projects

Summary

References

Monitoring

Why monitoring?

Instrumenting an application

Collecting and saving monitoring data

Displaying monitoring information

Managing and maintaining monitoring data

Communicating about monitoring

References and related reading

Summary

Incident Response

What is an incident?

What is incident response?

Alerting

Being on call

Communication

Recovering the system

Calling all clear

Summary

Postmortems

What is a postmortem?

Why write a postmortem?

When to write a postmortem document

Carrying out incident analysis

How to write a postmortem document

Blameless postmortems

Holding a postmortem meeting

Analyzing past postmortems

Summary

References

Testing and Releasing

Testing

Releasing

Automation

Summary

Capacity Planning

A quick introduction to business finance

Why plan?

Defining a plan

Architecture–where performance changes come from

Tech as a profit center and procurement

Summary

Building Tools

Documenting and maintaining projects

Summary

User Experience

An introduction to design and UX

Summary

Networking Foundations

The internet

Sending an HTTP request

Tools for watching the network

Summary

Linux and Cloud Foundations

Linux fundamentals

Cloud fundamentals

Units of scale

Example architecture interview

Summary

References

Index

Index

A

ACM code of ethics / ACM code of ethics
action items / Action items
Address Resolution Protocol (ARP) / IP
alert
- situations / When do you alert?
- types / When do you alert?
- sending / How do you alert?
- services / Alerting services
- initial string / What is in an alert?
- audience, selecting / Who do you alert?
alerting / Alerting
all clear scenario
- declaring / Calling all clear
Amazon Web Services (AWS) / Real-world interaction design, Ethernet and TCP/IP, Cloud fundamentals
Ansible / How should we change our capacity?
application
- instrumenting / Instrumenting an application
- measuring / What should we measure?
application programming interface (API) / Separation of concerns
authentication
- suggestions / Authentication
authorization
- about / Authorization
automation
- about / Automation
- building / Automation
- testing / Automation
- distributing / Automation
- as continuous category / Continuous everything
autoscaling / Autoscaling

B

Beats / ELK
Best Current Practice (BCP) / A quick introduction to business finance
best practices, code writing
- version control, using / Advice for writing code
- code reviews, conducting / Advice for writing code
- designated owner, assigning to projects / Advice for writing code
- humans / Advice for writing code
blameless postmortems
- about / Blameless postmortems
Border Gateway Protocol (BGP) / IP
business finance
- about / A quick introduction to business finance
- economics jargon / A quick introduction to business finance

C

Cacti / Cacti
capacity, modifying
- about / How should we change our capacity?
- state and concurrency, checking / State and concurrency
- service limitations / Is your service limited by another service?
- events, scaling for / Scaling for events
- user-generated content (UGC) / Unpredictable growth–user-generated content
- preplanned, versus autoscaling / Preplanned versus autoscaling
- delivery criteria / Delivering
categories, testing code
- unit tests / Unit, feature, and integration tests, Unit tests
- feature tests / Unit, feature, and integration tests, Feature tests
- integration tests / Unit, feature, and integration tests, Integration tests
Chaos Engineering / Testing infrastructure
checklist, for software writing
- monitoring / Building projects
- incident response / Building projects
- postmortems / Building projects
- testing and releasing / Building projects
- capacity planning / Building projects
Chef / How should we change our capacity?
Classless Inter-Domain Routing (CIDR) / CIDR notation
cloud
- fundamentals / Cloud fundamentals
- containers / Containers
- load balancing / Load balancing
- queues / Queues and Pub/Sub
- Pub/Sub / Queues and Pub/Sub
- VMs / VMs
- autoscaling / Autoscaling
- storage, types / Storage
CMS (content management systems) / Architecture–where performance changes come from
command line interface (CLI) / An introduction to design and UX
communication
- starting / Communication
- Incident Command System (ICS), using / Incident Command System (ICS)
- instances / Where do you communicate?
containers / Containers
Content Distribution Network (CDN) / Architecture–where performance changes come from
content management system (CMS) / Developer experience
cron job / Preplanned versus autoscaling
curl / curl and wget

D

data recovery testing / Testing infrastructure
dd tool / How should we change our capacity?
design
- about / An introduction to design and UX
devices / Devices
Disaster Recovery Testing (DIRT)
- reference / Testing processes
DNS (domain name system)
- about / DNS
- Top Level Domain (TLD) / DNS
- dots / DNS
- Second Level Domain (SLD) / DNS
- subdomain / DNS
- configuring / DNS
- dig / dig
Docker / How should we change our capacity?

E

economics jargon, business finance
- cash flow / A quick introduction to business finance
- Profits and losses (P&L) / A quick introduction to business finance
- balance sheet / A quick introduction to business finance
- Capex / A quick introduction to business finance
- Opex / A quick introduction to business finance
- Return on investment (ROI) / A quick introduction to business finance
- cost center versus profit center / A quick introduction to business finance
ElasticSearch, Logstash, and Kibana (ELK) / ELK
ELK / ELK
Engineering / What is SRE?
error budgets / A short introduction to SLIs, SLOs, and error budgets, Error budgets
Ethernet / Ethernet and TCP/IP, Ethernet
example architecture interview / Example architecture interview

F

18F / Design documents
Federal Aviation Administration (FAA) / Alerting
Free Software Foundation (FSF) / Linux fundamentals

G

General Public License (GPL) / Linux fundamentals
Glad Mad Sad framework / Retrospectives and standups
Go language
- reference / References and related reading
goreplay
- reference / Testing infrastructure
graphical user interface (GUI) / An introduction to design and UX

H

HTTP (Hypertext Transfer Protocol) / HTTP
HTTP request
- sending / Sending an HTTP request
- DNS (domain name system) / DNS
- Ethernet / Ethernet and TCP/IP
- TCP/IP / Ethernet and TCP/IP
- HTTP / HTTP
- wget / curl and wget
- curl / curl and wget
human resources (HR) / A quick introduction to business finance

I

incident / What is an incident?
incident analysis
- carrying out / Carrying out incident analysis
incident response
- about / What is incident response?
- actions / What is incident response?
infrastructure
- testing / Testing infrastructure
Infrastructure as a Service (IaaS) / Preplanned versus autoscaling
inodes
- reference / Files, directories, and inodes
Interior Gateway Protocol (IGP) / IP
internet / The internet
Internet Control Message Protocol (ICMP) / ICMP
Internet Service Providers (ISPs) / The internet
IP / IP

J

JavaScript Chart Library / When are we going to run out of capacity?

L

4Ls
- Learned / Retrospectives and standups
- Liked / Retrospectives and standups
- Lacked / Retrospectives and standups
- Longed For / Retrospectives and standups
linting / Code reviews
Linux
- fundamentals / Linux fundamentals
- file / Everything is a file
- directories / Files, directories, and inodes
- inodes / Files, directories, and inodes
- permissions / Permissions
- sockets / Sockets
- devices / Devices
- /proc / /proc
- filesystem layout / Filesystem layout
- process / What is a process?
- syscalls / syscalls
- programs, exploring / Build your own
load balancers (LB) / Ethernet and TCP/IP, Cloud fundamentals
load balancing / Load balancing
LRU (Least Recently Used) / Example architecture interview

M

MAC (media access control) / Ethernet
mean time between failures (MTBF) / MTTR and MTBF
mean time to recovery (MTTR) / MTTR and MTBF
measuring mean time to recovery (MTTR) / Recovering the system
mongoreplay
- reference / Testing infrastructure
monitoring
- need for / Why monitoring?
- awareness, creating / Communicating about monitoring
- about / Do they even know there is monitoring?
monitoring data
- collecting / Collecting and saving monitoring data
- saving / Collecting and saving monitoring data
- polling applications / Polling applications, Push applications
- managing / Managing and maintaining monitoring data
- maintaining / Managing and maintaining monitoring data
monitoring information
- displaying / Displaying monitoring information
- arbitrary queries, using / Arbitrary queries
- graphs, using / Graphs
- dashboards, using / Dashboards
- chatbots, using / Chatbots
multi-factor authentication (MFA)
- about / Real-world interaction design

N

Nagios / Nagios
National Incident Management System (NIMS) / Incident Command System (ICS)
National Institute of Standards and Technology (NIST) / Sockets
nc / nc
negative testing / Integration tests
netstat / netstat
network
- watching, tools / Tools for watching the network
nines
- 90% (one nine of uptime) / Service levels
- 99% (two nines of uptime) / Service levels
- 99.9% (three nines of uptime) / Service levels
- 99.95% (three and a half nines of uptime) / Service levels
- 99.99% (four nines of uptime) / Service levels
- 99.999% (five nines of uptime) / Service levels
not invented here syndrome (NIHS) / Developer experience

O

objectives and key results (OKRs)
- about / Long-term work
- example / Example OKRs
on-call
- connecting / Being on call
Open Systems Interconnection (OSI) model
- layers / Ethernet and TCP/IP

P

past postmortems
- analyzing / Analyzing past postmortems
- mean time between failures (MTBF) / MTTR and MTBF
- mean time to recovery (MTTR) / MTTR and MTBF
- alert fatigue / Alert fatigue
- past outages, discussing / Discussing past outages
Paxos / State and concurrency
performance changes
- detecting / Architecture–where performance changes come from
phishing
- about / Phishing
plan
- need for / Why plan?
- risk, managing / Managing risk and managing expectations
- expectations, managing / Managing risk and managing expectations
- defining / Defining a plan
- current capacity, measuring / What is our current capacity?
- capacity, running out of / When are we going to run out of capacity?
- capacity, modifying / How should we change our capacity?
- executing / Execute the plan
Platform as a Service (PaaS) / Cloud fundamentals
polling applications
- about / Polling applications, Push applications
- Nagios / Nagios
- Prometheus / Prometheus
- Cacti / Cacti
- Sensu / Sensu
- StatsD / StatsD
- Telegraf / Telegraf
- ELK / ELK
postmortem
- about / What is a postmortem?
- writing / Why write a postmortem?
- root cause / Root cause
- without action items / Postmortems without action items
postmortem-templates
- reference / How to write a postmortem document
postmortem document
- writing, situations / When to write a postmortem document
- writing / How to write a postmortem document
postmortem meeting
- holding / Holding a postmortem meeting
process / What is a process?
- zombies / Zombies
- orphans / Orphans
- nice command / What is nice?
processes
- testing / Testing processes
proc filesystem (procfs) / /proc
projects
- finding / Finding projects
- defining / Defining projects
- Readme Driven Development (RDD) / Defining projects, RDD
- planning / Planning projects
- building / Building projects
- documenting / Documenting and maintaining projects
- maintaining / Documenting and maintaining projects
projects, building
- writing code, best practices / Advice for writing code
- separation of concerns / Separation of concerns
- long-term work / Long-term work
- notebooks / Notebooks
projects, planning
- example / Example
- Tak server, building / Example
- retrospectives / Retrospectives and standups
- standups / Retrospectives and standups
- work, allocating / Allocation
Prometheus
- about / Prometheus
- reference / References and related reading
/ When are we going to run out of capacity?
Pub/Sub / Queues and Pub/Sub
Pub/Sub queues / Queues and Pub/Sub
Puppet / How should we change our capacity?

Q

query-playback
- reference / Testing infrastructure
queues / Queues and Pub/Sub

R

Raft / State and concurrency
Readme Driven Development (RDD)
- about / RDD
- example / Example
- design documents / Design documents
real-world interaction design
- about / Real-world interaction design
red herring / Carrying out incident analysis
REL (requests, errors, latency) / Why monitoring?
release
- validating / Validating your release
releasing
- about / Releasing
- situations / When to release
- to production / Releasing to production
Reliability / What is SRE?
risk profile
- about / Risk profile
rollbacks
- about / Rollbacks
root cause analysis (RCA) / What is a postmortem?
Ruby 2.5.0
- reference / References and related reading

S

S3 outage
- reference / Why write a postmortem?
Salt / How should we change our capacity?
scientific method steps
- observe / What do you test?
- question / What do you test?
- hypothesis / What do you test?
- test / What do you test?
- reject or approve / What do you test?
security
- about / Security
- authentication / Security, Authentication
- authorization / Security, Authorization
- risk profile / Security, Risk profile
- phishing / Phishing
Sensu / Sensu
service-oriented architecture (SOA) / SRE as a framework for new projects
Service Level Agreement (SLA) / Service levels
Service Level Indicator (SLI) / Service levels
Service Level Objective (SLO) / Service levels
Service Level Objectives (SLOs) / Managing risk and managing expectations
shadow testing
- tools / Testing infrastructure
Simple Notification Service (SNS) / Alerting services, Queues and Pub/Sub
Simple Queue Service (SQS) / Queues and Pub/Sub
Sinatra
- reference / References and related reading
Site / What is SRE?
Site Reliability Engineering (SRE)
- history / A brief history, What is SRE?
- using, as framework for new projects / SRE as a framework for new projects
SLIs / A short introduction to SLIs, SLOs, and error budgets
SLOs / A short introduction to SLIs, SLOs, and error budgets
sockets / Sockets
Software as a Service (SaaS) / Service levels, Cloud fundamentals
SSL (Secure Sockets Layer) / Finding projects
StatsD
- about / StatsD
- reference / References and related reading
StatsD Ruby library
- reference / References and related reading
sticky bit / Permissions
storage
- types / Storage
syscalls
- about / syscalls
- tracing, with strace tool / How to trace
- processes, watching / Watching processes
- averages, loading / Load averages
system
- recovering / Recovering the system

T

tcpdump / tcpdump
tcpreplay
- reference / Testing infrastructure
tech
- as profit center / Tech as a profit center and procurement
- as procurement / Tech as a profit center and procurement
Telegraf / Telegraf
test-driven development (TDD) / Unit tests
testing
- about / Testing
- need for / What do you test?
testing code
- about / Testing code
- review / Code reviews
tools
- improving / Experience of tools
- performance budgets / Performance budgets
tools, for network watching
- about / Tools for watching the network
- netstat / netstat
- nc / nc
- tcpdump / tcpdump
Transmission Control Protocol (TCP) / TCP
Transportation Security Administration (TSA) / Performance budgets
Transport Layer Security (TLS) / HTTP
Twelve Factor App / State and concurrency
two device identifiers / Files, directories, and inodes

U

units of scale / Units of scale
USDS (United States Digital Service) / Design documents
user-generated content (UGC) / Unpredictable growth–user-generated content
User Datagram Protocol (UDP) / UDP
user interface (UI) / An introduction to design and UX
user testing
- about / User testing
- experience, picking / Picking an experience
- test, designing / Designing the test
- people, finding to test / Finding people to test
UTF-8 / HTTP
UX
- about / An introduction to design and UX

V

vegeta
- reference / Testing infrastructure
Vertica / What is our current capacity?
virtual machines (VMs) / Cloud fundamentals, VMs

W

wget / curl and wget
Wheel of Misfortune (WoM) / Testing processes
