
DevOps for Databases

By David Jambor
About this book
In today's rapidly evolving world of DevOps, traditional silos are a thing of the past. Database administrators are no longer the only experts; site reliability engineers (SREs) and DevOps engineers are database experts as well. This blurring of the lines has led to increased responsibilities, making members of high-performing DevOps teams responsible for end-to-end ownership. This book helps you master DevOps for databases, making it a must-have resource for achieving success in the ever-changing world of DevOps. You’ll begin by exploring real-world examples of DevOps implementation and its significance in modern data-persistent technologies, before progressing into the various types of database technologies and recognizing their strengths, weaknesses, and commonalities. As you advance, the chapters will teach you about design, implementation, testing, and operations using practical examples, as well as common design patterns, combining them with tooling, technology, and strategies for different types of data-persistent technologies. You’ll also learn how to create complex end-to-end implementation, deployment, and cloud infrastructure strategies defined as code. By the end of this book, you’ll be equipped with the knowledge and tools to design, build, and operate complex systems efficiently.
Publication date:
December 2023
Publisher
Packt
Pages
446
ISBN
9781837637300

 

Data at Scale with DevOps

Welcome to the first chapter! In this book, you will learn the fundamentals of DevOps, its impact on the industry, and how to apply it to modern data persistence technologies.

When I first encountered the term DevOps years ago, I initially saw it as a way to grant development teams unrestricted access to production environments. This made me nervous, especially because there seemed to be a lack of clear accountability at that time, making the move toward DevOps appear risky.

At the time (around 2010), the roles of developers and operations were divided by a very strict line. Developers could gain read-only privileges, but that was about it. What I did not see back then was that this was the first step in blurring the lines between development and operations teams. We already had many siloed teams pointing fingers at one another. This made the work slow, segmented, and frustrating. I was worried this would just increase complexity and cause an even greater challenge. Luckily, today’s world of DevOps is very different, and we can all improve it together even further!

There are no more dividing lines between the development and operations teams – they are one team with a common objective. This improves quality, speed, and agility! This also means that traditional roles such as database admin are changing as well. We now have site reliability engineers (SREs) or DevOps engineers who are experts at using databases and able to perform operational and development tasks alike. Blurring the line means you increase the responsibilities, and in a high-performing DevOps team, this means you are responsible for everything from end to end. Modern tooling and orchestration frameworks can help you do way more than ever before, but it’s a very different landscape than it was many years ago.

This book will introduce you to this amazing new world, walk you through the journey that leads us to this ever-changing world of DevOps today, and give some indications as to where we might go next.

By the end of this book, you will be able to not only demonstrate your theoretical knowledge but also design, build, and operate complex systems with a heavy focus on data persistence technologies.

DevOps and data persistence technologies have a love-hate relationship, which makes this topic even more interesting.

In this chapter, we will take a deep dive into the following topics:

  • The modern data landscape
  • Why speed matters
  • Data management strategies
  • The early days of DevOps
  • SRE versus DevOps
  • Engineering principles
  • Objectives – SLOs/SLIs
 

The modern data landscape

Have you ever wondered how much data we generate every single day? Or the effort required to store and access your data on demand? What about the infrastructure or the services required to make all of this happen? Not to mention the engineering effort put in to make all of this happen. If you have, you are in the right place. These questions inspired me to dive deep into the realms of DevOps and SRE and inspired the creation of this book.

Technology impacts almost every aspect of our lives. We are more connected than ever, with access to more information and services than we even realize. It’s not just our computers, phones, or tablets that are connected to the internet, but our cars, cameras, watches, televisions, speakers, and more. The more digitally native we become, the bigger our digital footprint grows.

A digital footprint, also known as a digital shadow, is a collection of data that represents an individual’s interactions and activities across digital platforms and the internet. This data can be categorized as either passive, where it’s generated without direct interaction – such as browsing history – or active, resulting from deliberate online actions such as social media posts or emails. Your digital footprint serves as an online record of your digital presence, and it can have lasting implications for your privacy and reputation.

As of 2022, researchers estimate that out of 8 billion people (the world’s population as of 2022), approximately 5 billion use the internet daily. Compared to the 2 billion measured in 2012, this is a 150% increase (2.5 times as many users) over 10 years. See the following figure for reference:

Figure 1.1 – Daily internet users (in billions)


Each person who has a digital presence generates digital footprints in two ways.

The first is actively. When you browse a website, upload a picture, send an email, or make a video call, you generate data that will be used and stored for some time. The other, less obvious way is passive data generation. If you, like me, use digital services with push notifications on or have GPS enabled on your phone with a timeline, for example, you are generating data every minute of the day – even if you do not use these services actively. Prime examples are Internet of Things (IoT) devices, such as an internet-enabled security camera – even if you are not actively using it, it’s still generating data and constantly uploading it to your service provider for safekeeping. IoT devices are the second-largest source of data, right after us active internet surfers. Researchers estimate that approximately 13 billion IoT devices were connected and in daily use as of 2022, with the expectation that this figure will approach 30 billion by the end of 2030. See the following figure for reference:

Figure 1.2 – Connected IoT devices (in the billions)


Combining the 5 billion active internet users with the 13 billion connected IoT devices, it is easy to guess that our combined digital footprint must be ginormous. Yet trying to guess the exact number is much harder than you might think. Give it a try.

As of 2023, it is estimated that we generate approximately 3.5 exabytes of data every single day. This is about 1 exabyte more than what was estimated in 2021. To help visualize how much data we are talking about, let me try to put this into perspective. Let’s say you have a notebook (or one of the latest phones) with 1 TB of storage capacity. If you were to use this 1 TB storage to store all this information, it would be full in less than 0.025 seconds. An alternative way to think about it is that we can fill 3,670,016 devices with 1 TB storage within 24 hours.
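The figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming binary units (1 exabyte = 1,024 × 1,024 terabytes), which is the convention that reproduces the 3,670,016 figure; the variable names are illustrative only:

```python
# Back-of-the-envelope check of the daily data volume figures.
# Assumption: binary units, so 1 EB = 1024 PB = 1024 * 1024 TB.
DAILY_DATA_EB = 3.5
TB_PER_EB = 1024 * 1024
SECONDS_PER_DAY = 24 * 60 * 60

# Number of 1 TB devices that 3.5 EB would fill in one day.
devices_per_day = DAILY_DATA_EB * TB_PER_EB

# Average time it takes to fill a single 1 TB device.
seconds_per_device = SECONDS_PER_DAY / devices_per_day

print(f"1 TB devices filled per day: {devices_per_day:,.0f}")
print(f"Seconds to fill one device:  {seconds_per_device:.4f}")
```

Under these assumptions, the result is 3,670,016 devices per day, or roughly 0.0235 seconds per device, which is consistent with the "less than 0.025 seconds" estimate.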

How do we generate data today?

Well, for starters, we collectively send approximately 333.2 billion emails per day. This means that more than 3.5 million emails are sent per second. We also make over 0.5 billion hours of video calls, stream more than 200 million hours of media content, and share more than 5 billion videos and photos every single day.

So, yes, that’s a lot of us armed with many devices (on average, one active internet user had about 2.6 IoT devices in 2022) generating an unbelievable amount of data every single day. But the challenge does not stop at the amount of data alone. The speed and reliability of interacting with it are just as important as, if not more important than, the storage itself. Have you ever searched for one of your photos to show someone, but it was slow and took forever to find, so you gave up? We have all been there, but can you remember how long you waited before you decided to abandon your search?

As technology advances, we gain quicker access to information and multitask more efficiently, which may be contributing to a gradual decline in our attention spans. Research shows that in 2000, the average attention span was 12 seconds. Since then, significant technological milestones have occurred: the advent of the iPhone, YouTube, various generations of mobile networks, Wikipedia, and Spotify, to name a few. Internet speed has also soared, moving from an average of 127 kilobits per second in 2000 to 4.4 Mbps by 2010, and hitting an average of 50.8 Mbps by 2020 – with some areas experiencing speeds well over 200 Mbps today.

As the digital landscape accelerates, so do our expectations, resulting in further erosion of our attention spans. By 2015, that 12-second average had fallen to just 8.25 seconds and dropped slightly below 8 seconds by 2022.

 

Why speed matters

If you consider your attention span the full amount of time you would consider spending to complete a simple task, such as showing photos or videos to a friend, this means searching for it is just a small percentage of your total time. Let’s say you are using a type of cloud service to search for your photo or video. What would you consider to be an acceptable amount of time between you hitting search and receiving your content?

I still remember the time when “buffering” was a given, but if you saw something similar today, you would find it unacceptable. According to multiple studies, the ideal load time for “average content,” such as photos or videos, is somewhere between 1 and 2 seconds. 53% of mobile site visits are abandoned if pages take longer than three seconds to load. A further two-second delay in load time results in abandonment rates of up to 87%.

This shows us that storing our data is not enough – making it accessible reliably and with blazing speed is not only nice to have but an absolute necessity in today’s world.

 

Data management strategies

There are many strategies out there, and we will need to use most of them to meet and hopefully exceed our customers’ expectations. Reading this book, you will learn about some of the key data management strategies at length. For now, however, I would like to bring six of these techniques to your attention. We will take a much closer look at each of these in the upcoming chapters:

  • Bring your data closer: The closer the data is to users, the faster they can access it. Yes, it may sound obvious, but users can be anywhere in the world, and they might even be traveling while trying to access their data. For them, these details do not matter, but the expectation will remain the same.

    There are many different ways to keep data physically close. One of the most successful strategies is called edge computing, which is a distributed computing paradigm that brings computation and data storage closer to the sources of data. This is expected to improve response times and save bandwidth. Edge computing is an architecture (and a topology) rather than a specific technology, and is a location-sensitive form of distributed computing.

    The other very obvious strategy is to use the closest data center possible when working with a cloud provider. AWS, for example, spans 96 Availability Zones within 30 geographic Regions around the world as of 2022. Google Cloud offers a very similar 106 zones and 35 regions as of 2023.

    Leveraging the nearest physical location can greatly decrease your latency and therefore improve your customer experience.

  • Reduce the length of your data journey: Again, this is a very obvious one. Try to avoid any unnecessary steps to create the shortest journey between the end user and their data. Usually, the shortest will be the fastest (obviously it’s not that simple, but as a best practice, it can be applied). The more actions you perform to retrieve the required information, the more computational power you use, which directly increases the cost of the operation. Each extra step also adds complexity and, most of the time, latency.
  • Choose the right database solutions: There are many database solutions out there that you can categorize based on type, such as relational or non-relational (NoSQL), centralized or distributed, and so on. Each category has a high number of sub-categories and each can offer a unique set of solutions to your particular use case. It’s really hard to find the right tool for the job, considering that requirements are always changing. We will dive deeper into each type of system and their pros and cons a bit later in this book.
  • Apply clever analytics: Analytical systems, if applied correctly, can be a real game changer in terms of optimization, speed, and security. Analytics tools are there to help develop insights and understand trends and can be the basis of many business and operational decisions. Analytical services are well placed to provide the best performance and cost for each analytics job. They also automate many of the manual and time-consuming tasks involved in running analytics, all with high performance, so that customers can quickly gain insights.
  • Leverage machine learning (ML) and artificial intelligence (AI) to try to predict the future: ML and AI are critical for a modern data strategy to help businesses and customers predict what will happen in the future and build intelligence into their systems and applications. With the right security and governance control combined with AI and ML capabilities, you can make automated actions regarding where data is physically located, who has access to it, and what can be done with it at every step of the data journey. This will enable you to stick with the highest standards and greatest performance when it comes to data management.
  • Scale on demand: The aforementioned strategies are underpinned by the method you choose to operate your systems. This is where DevOps (and SRE) plays a crucial part and can be the deciding factor between success and failure. All major cloud providers provide you with literally hundreds of platform choices for virtually every workload (AWS offered 475 instance types at the end of 2022). Most major businesses have a very “curvy” utilization trend, which is why they find the on-demand offering of the cloud very attractive from a financial point of view.

You should only pay for resources when you need them and pay nothing when you don’t. This is one of the big benefits of using cloud services. However, this model only works in practice if the correct design and operational practices and the right automation and compatible tooling are utilized.
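To make the first strategy in the list above (bring your data closer) more concrete, here is a minimal sketch that picks the geographically nearest data center for a user. The data center names and coordinates are hypothetical placeholders, not real provider regions, and great-circle distance is only a rough proxy for network latency; in practice you would measure latency directly:

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical data center locations as (latitude, longitude) in degrees.
# These are illustrative city coordinates, not an official region list.
DATA_CENTERS = {
    "dublin": (53.35, -6.26),
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_data_center(user_location, data_centers=DATA_CENTERS):
    """Return the name of the data center closest to the user."""
    return min(data_centers, key=lambda name: haversine_km(user_location, data_centers[name]))


# A user in Budapest (approximate coordinates) is routed to Frankfurt.
budapest = (47.50, 19.04)
print(nearest_data_center(budapest))
```

A traveling user simply produces a different `user_location`, so the same function keeps routing them to whichever location is closest at that moment.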

A real-life example

A leading telecommunications company was set to unveil their most anticipated device of the year at precisely 2 P.M., a detail well publicized to all customers. As noon approached, their online store saw typical levels of traffic. By 1 P.M., it was slightly above average. However, a surge of customers flooded the site just 10 minutes before the launch, aiming to be among the first to secure the new phone. By the time the clock struck 2 P.M., the website had shattered previous records for unique visitors. In the 20 minutes from 1:50 P.M. to 2:10 P.M., the visitor count skyrocketed, increasing twelvefold.

This influx triggered an automated scaling event that expanded the company’s infrastructure from its baseline (designated as 1x) to an unprecedented 32x. Remarkably, this massive scaling was needed only for the initial half-hour. After that, it scaled down to 12x by 2:30 P.M., further reduced to 4x by 3 P.M., and returned to its baseline of 1x by 10 P.M.
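To get a feel for why this matters financially, the scaling timeline above can be turned into a rough comparison between scaling on demand and provisioning for the peak all day. This is only a sketch: the step boundaries (for example, holding 32x from 1:50 P.M. until 2:30 P.M.) are simplifying assumptions, and cost is assumed to be proportional to capacity:

```python
MINUTES_PER_DAY = 24 * 60

# (start minute of the day, scale factor) for each step of the launch day.
# Assumption: capacity changes in clean steps at these times.
SCHEDULE = [
    (0, 1),              # 00:00, baseline
    (13 * 60 + 50, 32),  # 13:50, surge begins and autoscaling kicks in
    (14 * 60 + 30, 12),  # 14:30, scaled down to 12x
    (15 * 60, 4),        # 15:00, scaled down to 4x
    (22 * 60, 1),        # 22:00, back to baseline
]


def capacity_unit_minutes(schedule, day_end=MINUTES_PER_DAY):
    """Sum (scale factor x duration in minutes) over every step of the schedule."""
    starts = [start for start, _ in schedule]
    ends = starts[1:] + [day_end]
    return sum((end - start) * factor for (start, factor), end in zip(schedule, ends))


on_demand = capacity_unit_minutes(SCHEDULE)
peak = max(factor for _, factor in SCHEDULE)
static_peak = peak * MINUTES_PER_DAY  # running at peak capacity for the whole day

print(f"On-demand usage:      {on_demand} unit-minutes")
print(f"Provisioned for peak: {static_peak} unit-minutes")
print(f"Ratio:                {on_demand / static_peak:.1%}")
```

Under these assumptions, scaling on demand consumes 4,270 unit-minutes against 46,080 for a fleet sized for the peak, which is less than a tenth of the capacity.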

This seamless adaptability was made possible through a strategic blend of declarative orchestration frameworks, infrastructure as code (IaC) methodologies, and fully automated CI/CD pipelines.

To summarize, the challenge is big. To be able to operate reliably yet cost-effectively, with consistent speed and security, all the while automatically scaling these services up and down on demand without human interaction in a matter of minutes, you need a set of best practices on how to design, build, test, and operate these systems. This sounds like DevOps.

 

The early days of DevOps

I first came across DevOps around 2014 or so, just after the first annual State of DevOps report was published. At the time, the idea sounded great, but I had no idea how it worked. It felt like – at least to me – it was still in its infancy or I was not knowledgeable and experienced enough to see the big picture just yet. Probably the latter. Anyway, a lot has happened since then, and the industry picked up the pace. Agile, CI/CD, DevSecOps, GitOps, and other approaches emerged on the back of the original idea, which was to bring software developers and operations together.

DevOps emerged as an obvious response to longstanding frictions between developers (Devs) and operations (Ops) within the IT industry. The term obvious seems apt here because, for anyone involved in IT during that period, the tension was palpable and constant. Devs traditionally focused solely on creating or fixing features, handing them off to Ops for deployment and ongoing management. Conversely, Ops prioritized maintaining a stable production environment, often without the expertise to fully comprehend the code they were deploying.

This set up an inherent conflict: introducing new elements into a production environment is risky, so operational stability usually involves minimizing changes. This gave rise to a “Devs versus Ops” culture, a divide that DevOps sought to bridge. However, achieving this required both sides to evolve and adapt.

In the past, traditional operational roles such as system administrators, network engineers, and monitoring teams largely relied on manual processes. I can recall my initial stint at IBM, where the pinnacle of automation was a Bash script. Much of the work in those days – such as setting up physical infrastructure, configuring routing and firewalls, or manually handling failovers – was done by hand.

While SysAdmin and networking roles remain essential, even in the cloud era, the trend is clearly toward automation. This shift enhances system reliability as automated configurations are both traceable and reproducible. If systems fail, they can be swiftly and accurately rebuilt.

Though foundational knowledge of network and systems engineering is irreplaceable, the push toward automation necessitates software skills – a proficiency often lacking among traditional operational engineers. What began with simple Bash scripts has evolved to include more complex programming languages such as Perl and Python, and specialized automation tools such as Puppet, Ansible, and Terraform.

On the development side, teams worked with very long development life cycles. They performed risky and infrequent “big-bang” releases that almost always caused massive headaches for the Ops teams and posed a reliability/stability risk to the business. Slowly but steadily, Dev teams moved to a more frequent, gradual approach that tolerated failures better. Today, we call this Agile development.

If you look at it from this point of view, you can say that a set of common practices designed to reduce friction between Dev and Ops teams is the basis of DevOps. However, simple common practices could not solve the Dev versus Ops mentality that the industry possessed at the time. Shared responsibility between Devs and Ops was necessary to drive this movement to success. Automation that enables the promotion of new features into production rapidly and safely in a repeatable manner could only be achieved if the two teams worked together, shared a common objective, and were accountable (and responsible) for the outcome together. This is where SRE came into the picture.

 

SRE versus DevOps

SRE originated at Google. In the words of Ben Treynor (VP of engineering at Google), “SRE is what happens when you ask a software engineer to design an operations function.”

If you want to put it simply (again, I am quoting Google here), “Class SRE implements DevOps.”

SRE is the (software) engineering discipline that aims to bridge the gap between Devs and Ops by treating all aspects of operations (infrastructure, monitoring, testing, and so on) as software, therefore implementing DevOps in its ultimate form. This is fully automated, with zero manual interaction, treating every single change to any of its components (again referring to any changes to infrastructure, monitoring, testing, and so on) as a release. Every change is done via a pipeline, in a version-controlled and tested manner. If a release fails, or a production issue is observed and traced back to a change, you can simply roll back your changes to the previously known, healthy state.

The fact that it is treated as any other software release allows the Dev teams to take on more responsibility and take part in Ops, almost fully blurring the line between the Dev and Ops functions. Ultimately, this creates a You build it, you run it culture – which makes “end-to-end” ownership possible.
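The idea of treating every change as a release that can be rolled back to the last known healthy state can be sketched in a few lines. This is a deliberately simplified illustration; the `ReleaseHistory` class and the health check callables are hypothetical, and a real pipeline would keep this history in version control and run proper automated tests:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReleaseHistory:
    """Hypothetical record of releases and the currently active version."""
    healthy: List[str] = field(default_factory=list)  # versions that passed checks
    current: Optional[str] = None

    def release(self, version: str, health_check: Callable[[str], bool]) -> Optional[str]:
        """Promote a version; roll back to the last healthy one if the check fails."""
        self.current = version
        if health_check(version):
            self.healthy.append(version)
        else:
            # Roll back to the previously known healthy state (or nothing at all).
            self.current = self.healthy[-1] if self.healthy else None
        return self.current


history = ReleaseHistory()
history.release("v1", lambda version: True)   # passes its checks and stays live
history.release("v2", lambda version: False)  # fails its checks and is rolled back
print(history.current)
```

Because infrastructure, monitoring, and test changes all go through the same path, the same rollback logic applies to each of them.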

So, are SRE and DevOps the same thing? No, they are not. SRE is an engineering function that can also be described as a specific implementation of DevOps that focuses specifically on building and running reliable systems, whereas DevOps is a set of practices that is more broadly focused on bringing the traditional Dev and Ops functions closer together.

Regardless of which way you go, you want to ensure that you set an objective, engineering principles, and a tooling strategy that can help you make consistent decisions as you embark on your journey as a DevOps/SRE professional.

 

Engineering principles

I offer the following engineering principles to start with:

  • Zero-touch automation for everything (if it’s manual – and you have to do it multiple times a month – it should be automated)
  • Project-agnostic solutions (defined in the configuration to avoid re-development for new projects, any tool/module should be reusable)
  • IaC (infrastructure should be immutable where possible and defined as code; provisioning tools should be reusable)
  • Continuous delivery (CD) with continuous integration (CI) (common approaches and environments across your delivery cycle; any service should be deployable immediately)
  • Reliability and security validated at every release (penetration testing, chaos testing, and more should be added to the CI/CD pipeline; always identify points of failure as early as possible)
  • Be data-driven (real-time data should be utilized to make decisions)

To fully realize your engineering goals and adhere to your principles without compromise, you should make “immutable IaC” a priority objective.

To enable this, I would recommend the following IaC principles:

  • Systems can be easily reproduced
  • Systems are immutable
  • Systems are disposable
  • Systems are consistent
  • Processes are repeatable
  • Code/config are version-controlled

Once you have defined your goals, it’s time for you to choose the right tools for the job. To do that, you must ensure these tools support the following:

  • Declarative orchestration framework(s):
    • The declarative orchestration approach uses structural models that describe the desired application structure and state. These are interpreted by a deployment engine to enforce this state.
    • It enables us to define the end state and interact in a declarative manner, thus making managing the application less resource-intensive (faster speed to market and cheaper costs).

    The following is an example Terraform file (main.tf):

    provider "aws" {
      region = "us-west-2"
    }
    # Create an S3 bucket (bucket names must be globally unique)
    # Note: the inline "acl" argument is deprecated in AWS provider v4 and later;
    # new buckets are private by default, so it is omitted here.
    resource "aws_s3_bucket" "my_bucket" {
      bucket = "my-unique-bucket-name"
    }
    # Create an EC2 instance
    resource "aws_instance" "my_instance" {
      # Example AMI ID only; AMI IDs are region-specific, so use one that is valid in your region
      ami           = "ami-0c55b159cbfafe1f0"
      instance_type = "t2.micro"
      tags = {
        Name = "MyInstance"
      }
    }
  • Declarative resource definition:
    • In a declarative style, you simply say what resources you would like, and any properties they should have so that you can create and deploy an entire infrastructure declaratively. For example, you can deploy not only agents (or sidecars) but also the network infrastructure, storage systems, and any other resources you may need.
    • This enables us to define what our infrastructure resources should look like and have the orchestrator create them (we focus on the what, while declarative orchestration takes care of the how).

    The following is an example that uses Kubernetes, which is a popular container orchestration platform that exemplifies the concept of declarative resource definition. In this example, we’ll define a Deployment for a simple web server and a Service to expose it.

    Here’s a YAML file (deployment-and-service.yaml) for Kubernetes:

    # Deployment definition to create a web server pod
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-web-server
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: web-server
      template:
        metadata:
          labels:
            app: web-server
        spec:
          containers:
          - name: nginx
            image: nginx:1.17
            ports:
            - containerPort: 80
    ---
    # Service definition to expose the web server
    apiVersion: v1
    kind: Service
    metadata:
      name: my-web-service
    spec:
      selector:
        app: web-server
      ports:
        - protocol: TCP
          port: 80
  • Idempotency:
    • Idempotency is the property that an operation can be applied multiple times without the result differing from that of the first application. In other words, multiple identical requests should have the same effect as a single request.
    • Idempotency means that the same request can be sent multiple times and the result is always the same (as declared, never different).
  • No secrets and environment config in code:
    • The main cloud providers all have a secure way to manage secrets. These solutions provide a good way to store secrets or environment config values for the application you host on their services.
    • Everything should be self-served and manageable in a standardized manner and therefore secrets and configs must be declarative and well defined to work with the aforementioned requirements.
  • Convention over configuration:
    • Convention over configuration (also known as environment tag-based convention over configuration) is a simple concept that is primarily used in programming. It means that the environment in which you work (systems, libraries, languages, and so on) assumes sensible defaults for many common situations, so if you adapt to them rather than creating your own rules each time, programming becomes an easier and more productive task.
    • This means that developers have to make fewer decisions when they’re developing and there are always logical default options. These logical default options have been created out of convention, not configuration.
  • Automation scripts packaged into an image:
    • This enables immutability and encourages sharing. A script no longer lives on one server and has to be copied to others; instead, it can be shipped just like the rest of our code and made available from a registry rather than depending on other servers.
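The idempotency principle listed above can be illustrated with a minimal sketch. The `apply` function and the in-memory `actual` dictionary are hypothetical stand-ins for a deployment engine and real infrastructure; no real provider API is involved. The point is only that applying the same declared state a second time results in no further changes.

```python
from typing import Dict, List


def apply(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[str]:
    """Bring `actual` in line with `desired` and return the actions taken."""
    actions = []
    for name, props in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != props:
            actions.append(f"update {name}")
        actual[name] = dict(props)
    # Anything that exists but is no longer declared gets removed.
    for name in list(actual):
        if name not in desired:
            actions.append(f"delete {name}")
            del actual[name]
    return actions


# Hypothetical declared state, loosely mirroring the Terraform example.
desired_state = {
    "my_bucket": {"bucket": "my-unique-bucket-name"},
    "my_instance": {"instance_type": "t2.micro"},
}
actual_state: Dict[str, dict] = {}

first_run = apply(desired_state, actual_state)   # creates both resources
second_run = apply(desired_state, actual_state)  # nothing left to do
print(first_run, second_run)
```

Because the function compares declared state with actual state rather than blindly executing steps, repeating the request is safe, which is the behavior you want from any tool in the pipeline.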

Thanks to the amazing progress in this field over the past 10+ years, customer expectations of modern solutions are sky-high. As we established earlier, content that does not load in under two seconds is considered slow, and if users have to wait longer than 3 to 5 seconds, they are likely to abandon it. Availability and customer happiness work in much the same way. Customer happiness (which evolved from customer experience) is a concept you cannot measure directly, so it cannot be data-driven on its own; this is why setting the right goals/objectives can be crucial to how you design your solutions.

 

Objectives – SLOs/SLIs

Service-level objectives (SLOs), a concept that’s referenced many times in Google’s SRE handbook, can be a great help in setting your direction from the start. Choosing the right objective, however, can be trickier than you might think.

My personal experience aligns with Google’s recommendation, which suggests that an SLO (the reliability target for a service as experienced by its customers) should be under 100%.

This is due to multiple reasons. Achieving 100% is not just very hard and extremely expensive, but almost impossible given that almost all services have soft/hard dependencies on other services. If just one of your dependencies offers less than 100% availability, your SLO cannot be met. Also, even with every precaution you can take, and every redundancy in place, there is a non-zero probability that something (or many things) will fail, resulting in less than 100% availability. More importantly, even if you could achieve 100% reliability for your services, your customers would very likely not experience that. The path your customers must take (the systems they have to use) to access your services is likely to offer less than 100% availability.
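To make the dependency argument concrete, here is a minimal sketch of how availability compounds along a request path. It assumes every component is a hard dependency and that failures are independent; the function name and the sample figures are illustrative assumptions, not measurements.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a path that needs every component to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical path: your service, one hard dependency, the customer's network.
path = [0.999, 0.999, 0.99]
print(f"{serial_availability(path):.4%}")
```

Under these assumptions the end-to-end figure is always lower than the weakest component, which is why a 100% target for your own service would not translate into 100% for the customer.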

Most commercial internet providers, for example, offer 99% availability. This also means that as you go higher and higher, say from 99% to 99.9% or IBM’s extreme five nines (99.999%), each additional “nine” makes achieving and maintaining that availability significantly more expensive, while your customers experience less and less of the benefit of your efforts, which makes the objective questionable.
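A short calculation shows what each additional “nine” means in practice. This sketch converts an availability target into the downtime it permits; the function name is an assumption, and the period defaults to a 365-day year.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 365) -> float:
    """Minutes of downtime permitted over the period for a given availability target."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (1 - slo_percent / 100) * period_days * 24 * 60


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Over a year, 99% permits roughly 87.6 hours of downtime, 99.9% about 8.8 hours, and 99.999% a little over 5 minutes, so each step removes a smaller absolute amount of downtime than the one before it.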

Above the selected SLO threshold, almost all users should be “happy,” and below this threshold, users are likely to be unhappy, raise concerns, or just stop using the service.

Once you’ve agreed that you should look for an SLO less than 100%, but likely somewhere above or around 99%, how do you define the right baseline?

This is where service-level indicators (SLIs), service-level agreements (SLAs), and error budgets come into play. I will not detail all of these here, but if you are interested, please refer to Google’s SRE book (https://sre.google/books/) for more details on the subject.
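As a rough illustration of how these concepts relate, the sketch below treats an SLI as the ratio of good events to total events and the error budget as the share of bad events the SLO tolerates. The function names and the sample request counts are assumptions made for this example, not definitions taken from the SRE book.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service-level indicator: fraction of events that met the success criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
print(f"SLI: {sli(999_500, 1_000_000):.4%}")
print(f"Error budget remaining: {error_budget_remaining(999_500, 1_000_000, 0.999):.0%}")
```

In this example the SLO tolerates about 1,000 failed requests, so 500 failures leave roughly half of the budget available for releases and experiments.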

Let’s say you picked an SLO of 99.9% – which is, based on my personal experience, the most common go-to for businesses these days. You now have to consider your core operational metrics. DevOps Research and Assessment (DORA) suggests four key metrics that indicate the performance of a DevOps team, ranking them from “low” to “elite,” where “elite” teams are more likely to meet or even exceed their goals and delight their customers compared to “low” ranking teams.

These four metrics are as follows:

  • Lead time for change, a metric that quantifies the duration from code commit to production deployment, is in my view one of the most crucial indicators. It serves as a measure of your team’s agility and responsiveness. How swiftly can you resolve a bug? Think about it this way:
    • Low-performing: 1 month to 6 months of lead time
    • Medium-performing: 1 week to 1 month of lead time
    • High-performing: 1 day to 1 week of lead time
    • Elite-performing: Less than 1 day of lead time
  • Deployment frequency, which measures the successful release count to production. The key word here is successful, as a Dev team that constantly pushes broken code through the pipeline is not great:
    • Low-performing: 1 month to 6 months between deployments
    • Medium-performing: 1 week to 1 month between deployments
    • High-performing: 1 day to 1 week between deployments
    • Elite-performing: Multiple deployments per day/less than 1 day between deployments
  • Change failure rate, which measures the percentage of deployments that result in a failure in production that requires a bug fix or rollback. The goal is to release as frequently as possible, but what is the point if your team is constantly rolling back those changes, or causing an incident by releasing a bad update? By tracking it, you can see how often your team is fixing something that could have been avoided:
    • Low-performing: 45% to 60% CFR
    • Medium-performing: 15% to 45% CFR
    • High- and elite-performing: 0% to 15% CFR
  • Mean time to restore (MTTR) measures how long it takes an organization to recover from a failure. This is measured from the initial moment of an outage until the incident team has recovered all services and operations. Another key and related metric is mean time to acknowledge (MTTA), which measures the time it takes to be aware of and confirm an issue in production:
    • Low-performing: 1 week to 1 month of downtime
    • Medium- and high-performing: Less than 24 hours of downtime
    • Elite-performing: Less than 1 hour of downtime
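The bands above can be turned into a small classifier, which is handy if you want to track these metrics over time. This is only a sketch under stated assumptions: a month is approximated as 30 days, values falling outside the listed bands are placed in the nearest lower tier, and all function names are hypothetical.

```python
from datetime import timedelta

MONTH = timedelta(days=30)  # assumption: a month is approximated as 30 days


def lead_time_tier(lead_time: timedelta) -> str:
    """Classify the time from code commit to production deployment."""
    if lead_time < timedelta(days=1):
        return "elite"
    if lead_time <= timedelta(weeks=1):
        return "high"
    if lead_time <= MONTH:
        return "medium"
    return "low"  # assumption: anything beyond the listed 6 months is also low


def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Percentage of deployments that needed a bug fix or rollback."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100.0 * failed_deployments / total_deployments


def cfr_tier(cfr_percent: float) -> str:
    """Classify a change failure rate given in percent."""
    if cfr_percent <= 15:
        return "high or elite"
    if cfr_percent <= 45:
        return "medium"
    return "low"  # assumption: rates above the listed 60% are also low


def restore_time_tier(time_to_restore: timedelta) -> str:
    """Classify the time from the start of an outage to full recovery."""
    if time_to_restore < timedelta(hours=1):
        return "elite"
    if time_to_restore < timedelta(hours=24):
        return "medium or high"
    return "low"  # assumption: the list gives no band between 24 hours and 1 week


print(lead_time_tier(timedelta(hours=3)))
print(cfr_tier(change_failure_rate(failed_deployments=4, total_deployments=40)))
print(restore_time_tier(timedelta(minutes=45)))
```

Feeding the classifier from your deployment and incident records, rather than from estimates, keeps the assessment in line with the data-driven principle discussed earlier.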

In conclusion, SLOs are crucial in setting reliability targets for a service, with a recommendation for these to be under 100% to account for dependencies and potential service failures. Utilizing tools such as SLIs, SLAs, and error budgets is essential in defining the appropriate SLO baseline, usually around or above 99%. We have also highlighted the importance of core operational metrics, as suggested by DORA, in assessing the performance of a DevOps team. These metrics, including lead time for change, deployment frequency, change failure rate, and MTTR, provide tangible criteria to measure and improve a team’s efficiency and effectiveness in service delivery and incident response.

 

Summary

DevOps presents challenges; introduce data and those challenges intensify. This book aims to explore that intricate landscape.

Consider this: immutable objects and IaC with declarative orchestration frameworks often yield secure, dependable, and repeatable results. But what happens when you must manage entities that resist immutability? Think about databases or message queues that house data that can’t be replicated easily. These technologies are integral to production but demand unique attention.

Picture this: a Formula 1 car swaps out an entire tire assembly in mere seconds during a pit stop. Similarly, with immutable objects such as load balancers, a quick destroy-and-recreate action often solves issues. It’s convenient and rapid, but try applying this quick-swap approach to databases and you risk data corruption. You must exercise caution when dealing with mutable, data-persistent technologies.

Fast forward to recent years, and you’ll find attempts to facilitate database automation via custom resource definitions (CRDs) or operators. However, such methods have proven costly and complex, shifting the trend toward managed services. Yet, for many, outsourcing data operations isn’t the ideal solution, given the priority of data security.

Navigating DevOps and SRE best practices reveals the looming complexities in managing data-centric technologies. Despite the valuable automation tools at our disposal, maintaining the highest DevOps standards while capitalizing on this automation is anything but straightforward. We’ll delve into these challenges and potential solutions in the chapters to come.

About the Author
  • David Jambor

    David Jambor is a seasoned technology expert with a 16-year career in building, designing, and managing large-scale, mission-critical systems. He has spent a decade honing his expertise in DevOps and data-persisting technologies, and he is widely regarded as an authority in the field. Currently serving as the Head of DevOps, Data, and Analytics at Amazon Web Services in the UK, David brings with him a wealth of experience from previous roles at top-tier companies such as Vodafone, Sky, Oracle, Symantec, Lufthansa and IBM. In addition to his professional achievements, David is a prominent figure in the DevOps community, frequently presenting technical and strategy-focused talks at various international events. He is also a respected judge and advisor for multiple DevOps awards, and he provides valuable support to technology vendors.
