Book Image

Becoming a Rockstar SRE

By : Jeremy Proffitt, Rod Anami
Book Image

Becoming a Rockstar SRE

By: Jeremy Proffitt, Rod Anami

Overview of this book

Site reliability engineering is all about continuous improvement, finding the balance between business and product demands while working within technological limitations to drive higher revenue. But quantifying and understanding reliability, handling resources, and meeting developer requirements can sometimes be overwhelming. With a focus on reliability from an infrastructure and coding perspective, Becoming a Rockstar SRE brings forth the site reliability engineer (SRE) persona using real-world examples. This book will acquaint you the role of an SRE, followed by the why and how of site reliability engineering. It walks you through the jobs of an SRE, from the automation of CI/CD pipelines and reducing toil to reliability best practices. You’ll learn what creates bad code and how to circumvent it with reliable design and patterns. The book also guides you through interacting and negotiating with businesses and vendors on various technical matters and exploring observability, outages, and why and how to craft an excellent runbook. Finally, you’ll learn how to elevate your site reliability engineering career, including certifications and interview tips and questions. By the end of this book, you’ll be able to identify and measure reliability, reduce downtime, troubleshoot outages, and enhance productivity to become a true rockstar SRE!
Table of Contents (27 chapters)
1
Part 1 - Understanding the Basics of Who, What, and Why
5
Part 2 - Implementing Observability for Site Reliability Engineering
10
Part 3 - Applying Architecture for Reliability
16
Part 4 - Mastering the Outage Moments
20
Part 5 - Looking into Future Trends and Preparing for SRE Interviews

Making this journey personal

Unfortunately, often when an enterprise starts to adopt SRE into their IT governance processes, they don’t use a people-processes-tools (PPT) model to transform their operations and software development areas, having a clear vision of these pillars. Even more often, they don’t emphasize or focus on the people element of PPT in such transformations. We want to change that by making this learning journey personal and centered on the individuals rather than the involved processes or technologies.

It’s critical to understand (and learn) what drives typical SREs forward, which fundamental skills they have developed, and how they hone their skills over time to go above and beyond at work. For that purpose, we will divide this subject into three sections:

  • SRE driving forces
  • SRE skills
  • SRE traits

Let’s start this personal journey by understanding why you should become an SRE.

SRE driving forces

We want to explore what motivates or incentivizes site reliability engineers. There’s no journey of any nature if there is no driving force pushing you through. As a word of advice, we should warn you that learning about site reliability engineering is more of an expedition than a tourism trip. In other words, it’s more a marathon than a sprint. Having clarified that, we’ll begin by putting the possible rewards of this journey on the table. Let’s depict each driving force as a mockup code snippet (JavaScript) to make it fun.

Money

If we could represent in the form of an algorithm how money drives people when they don’t earn enough, it would look like the following:

// money
if (money < MyMinimumSalary) {
motivated = false;
excitement--;
}
doMyWork();
if (motivated && jobSatisfaction) {
    honeSRESkills();
    doExtraWork();
} else lookForAnotherJob();

Site reliability engineers make more money than most other technical professionals. According to a Glassdoor (2022) report, they can earn more than USD 118K per year on average. In similar reports, SREs are even noted to have surpassed DevOps engineers in a salary comparison. Nevertheless, not making enough money can be a key demotivating factor. It is hard for anyone to move forward with their career if they are preoccupied with expenses.

Although SREs have a notorious income on average, their salaries will vary per country, years of experience, and employer. Companies justify SRE salary levels based on the reliability value they bring to the table. Rest assured, the site reliability engineering career is well paved in the compensation field.

Job satisfaction

What affects our job satisfaction can be depicted as code logic as follows:

// jobSatisfaction
if (interestingJob || purposefulWorkActivities || challengingSkillDevelopment || technicalAppreciation) {
    jobSatisfaction = true;
    excitement++;
}

Job satisfaction is another driving force of site reliability engineers, and it has many factors. We usually translate job satisfaction to employee happiness at work. Site reliability engineering leads to job satisfaction when we look at the following profession characteristics: exciting job content, purposeful work activities, challenging skill development, and technical appreciation.

The job content of site reliability engineering spans multiple domains. You can work with developers one day and help systems administrators the next. You may need to assist in redesigning an app to increase its service reliability. As with any generalist model job with technical depth in many subject areas, you will never get bored for sure.

As we will see later in this chapter, SRE work activities have clear business value. They improve not just the service quality, availability, and resiliency, but also the system’s reliability. Reliable services might help with customer loyalty, bringing additional revenue to the service provider. There is a direct relationship between SRE work and business metrics improvement, making their efforts purposeful.

Since site reliability engineering is a cross-technology domain engineering discipline, any skills acquisition is challenging. SREs have knowledge and skills that a systems administrator or software developer doesn’t have. They are required to keep those skills updated and hone them over time. This necessity to keep learning brings the always-moving-forward feeling that may not happen if you only need to master a single product or technology.

The last factor on our list is technical appreciation. According to Boston Consulting Group (BCG) research, appreciation is the number one job happiness factor. Being an SRE, you will aid customers, users, and other technical professionals because of your keen holistic view of the systems. Consequently, technical appreciation for the job you do is common, and who doesn’t like that?

Innovative solutions

The following code gives you an idea of how exciting exploring uncharted terrains is:

If (!solutionExists) {
    deviseNewSolution();
    excitement++;
}

Site reliability engineers are natural trailblazers as they explore new technologies and processes to obtain better reliability and eliminate toil (manual and repetitive tasks that are devoid of value). They face many scenarios and situations that are a first of their kind. Moreover, they are responsible for paving the path for others by documenting procedures in runbooks when none exist. There’s nothing more exciting than devising new solutions or improving existing ones. Imagine how you would feel if they named a technical operating procedure after you.

Nevertheless, SREs want to minimize complexity and reduce technical debt. They don’t create a solution just for the sake of doing it unless it adds value and resolves or prevents events that impact customers.

Good relationships

The following code snippet is a representation of how good relationships are a result of an exciting working environment:

If (excitement > HIGH) {
    motivateOthers();
    relationships.healthy = true;
}

Also, good work environment relationships are one of the top 10 factors contributing to employee happiness. SREs have good relationships in their work environment. The reason is straightforward; they act as integration hubs among different tribes and have the mission to break company siloes. SREs need cooperation from both development and operations teams. They are technical diplomats and have strong communication skills. Since they are usually excited about their work, they tend to socialize more with colleagues and leaders, potentially helping to improve the social environment around them. That doesn’t mean they need to be extroverts with progressive public-speaking skills, but certainly, SREs are good teachers because they are excited and compelled to talk about what they do.

SRE skills

Now that you know what’s in it for you, it’s time to check which skillsets SREs must develop throughout their careers. Site reliability engineers have a good mix of knowledge, skills, and experience that are shared with other roles and those that are unique to them. SREs have technical skills that span the entire solution life cycle, from the design to the manage step.

Figure 1.1 – SRE skills

Figure 1.1 – SRE skills

The preceding Venn diagram shows how SREs acquire skills common to other professions and how SRE skills connect the various steps of the solution life cycle. In essence, site reliability engineers are senior technical resources that follow a generalist proficiency model with good depth at certain areas of expertise.

There’s no consensus in the market about the canonical set of skills for SREs. It would not make sense for this to be the case because as soon as any technology-based skill becomes obsolete, we would need to remodel the whole profession. Instead, SRE core skills should be as technology-agnostic as possible.

We recommend a blend of distinct expertise from the IT architect, software developer, data scientist, DevOps engineer, and systems administrator roles. The proportion of each skill level varies per a multitude of factors. You will need to determine which skills are more in demand than the others, but an organization should have all of them in its toolkit.

Systems thinking

Site reliability engineers have a holistic view of the system’s reliability by understanding the availability, resiliency, and performance of each solution component at both the application and infrastructure levels.

Software engineering

SREs develop code and software. They know how to utilize algorithms and software development techniques such as agile frameworks. SREs need to be proficient enough in instrumenting the app code to increase its manageability and observability. SREs know how to use software development life cycle tools and technologies, including DevOps (continuous integration/continuous deliveryCI/CD) pipelines. They can provide testers with better test cases that consider service reliability targets.

Systems management

SREs know how to manage, administer, and operate systems. They share most of the skills from the systems administrator role on multiple technologies. Their technology knowledge spans the cloud, containers, storage, networking, operating systems, middleware, and databases. They have the skills to implement monitoring, event management, logging, tracing, service levels, observability, DevOps toolchains, and automation of toil.

Data science

SREs work with huge amounts of structured data. They must acquire the knowledge and skills to make sense of such datasets by using mathematical models. SREs know how to analyze data to uncover trends, anomalies, and insights – always from the user’s perspective.

We recommend that every SRE has the following selection of core skills:

  • Systems thinking for focusing on the reliability of the system
  • The ability to develop and test software
  • The ability to deploy and release apps
  • IT service management
  • Systems monitoring and observability
  • Working with DevOps tools and automation
  • The application of data science for reliability of systems

Although we didn’t explicitly mention security in any of the knowledge domains, enforcing security across multiple layers is present in all of them.

Important note

All fundamental SRE skills are covered in this book’s chapters. The chapters have been organized to optimize your learning journey, so they don’t follow the preceding order of skills. We structured this book based on our own experience acquired from a multitude of site reliability engineering coaching and mentoring sessions.

We provide a manifesto model in the Appendix A, The Site Reliability Engineer Manifesto, that acts as a more structured guide for site reliability engineering adoption, including the fundamental skills. We hope that helps your company in joining the site reliability engineering movement.

SRE traits

Besides what the SREs know and which skills they must develop, it’s relevant to know their other good traits.

Software is everywhere

Site reliability engineers have a software engineering mindset. The idea of approaching any issue as a software problem may be disruptive at first; however, there is a good reason for it. Imagine that you need to restart and verify a system by manually issuing a specific set of commands and parameters many times per week. If you handle it as a software development problem, the solution will be developing and scheduling a simple program or automation to execute this task instead. SREs embrace automation over toil as one of their best tools.

Comfortable to code

SREs are not just able to develop code; they really like doing it. As we will see later in this chapter, they develop code as a frequent activity and main responsibility. It’s not just a question of learning how to code or program when someone asks; SREs always feel confident in constructing good pieces of software.

Change as a constant

Frequent releases of new features, code enhancements, and reliability improvements are vital for any business. SREs are the first to accept calculated risks to provide more value to the system users. They are not risk averse but bring visibility to inherent risks so they can throttle the speed of change. They are always prepared to make progress and go above and beyond for service reliability.

Handle complexity and scale

They are not afraid of complexity or scale. They know that modern workloads are intrinsically complex and must be scalable horizontally and vertically. SREs work with large systems with multiple components running in hybrid multi-cloud environments. They understand the application’s full-stack design, its moving parts, and how they connect to each other.

Problems as opportunities

SREs participate in on-call rotations and schedules to respond to service disruptions. They see incidents and problems as opportunities to learn and advance the reliability of the system. Not just that, they also have the competence to translate technology into business language to measure the impact on users and customers. They advocate for a blameless culture by prioritizing answers to questions, such as how to detect and repair incidents faster next time. They also consider how technical challenges may affect business results.

We have just gone over what makes the SRE persona: their motivations, skills, and traits. Now we are going to understand how site reliability engineers think.