Book Image

AWS DevOps Simplified

By : Akshay Kapoor
Book Image

AWS DevOps Simplified

By: Akshay Kapoor

Overview of this book

DevOps and AWS are the two key enablers for the success of any modern software-run business. DevOps accelerates software delivery, while AWS offers a plethora of services, allowing developers to prioritize business outcomes without worrying about undifferentiated heavy lifting. This book focuses on the synergy between them, equipping you with strong foundations, hands-on examples, and a strategy to accelerate your DevOps journey on AWS. AWS DevOps Simplified is a practical guide that starts with an introduction to AWS DevOps offerings and aids you in choosing a cloud service that fits your company's operating model. Following this, it provides hands-on tutorials on the GitOps approach to software delivery, covering immutable infrastructure and pipelines, using tools such as Packer, CDK, and CodeBuild/CodeDeploy. Additionally, it provides you with a deep understanding of AWS container services and how to implement observability and DevSecOps best practices to build and operate your multi-account, multi-Region AWS environments. By the end of this book, you’ll be equipped with solutions and ready-to-deploy code samples that address common DevOps challenges faced by enterprises hosting workloads in the cloud.
Table of Contents (19 chapters)
1
Part 1 Driving Transformation through AWS and DevOps
5
Part 2 Faster Software Delivery with Consistent and Reproducible Environments
9
Part 3 Security and Observability of Containerized Workloads
13
Part 4 Taking the Next Steps

AWS and DevOps – a perfect match

Gartner’s Cloud Infrastructure and Platform Services (CIPS) Magic Quadrant report (https://www.gartner.com/doc/reprints?id=1-2AOZQAQL&ct=220728) positioned AWS as a leader in both their metrics – Ability to Execute and Completeness of Vision. This speaks volumes about the reliability of the platform in hosting your mission-critical workloads that can adapt to changing customer demands. DevOps methodologies complement this with time-tested ways of working that have a positive impact on the overall IT service delivery.

However, let’s not sugar coat this – efficiently operating your enterprise-grade software environments on AWS can be challenging. The cloud provider has been expanding the list of services offered from the beginning. Your usage, implementation, and solid future strategy will be key.

Even before we dive into anything regarding AWS or DevOps, let’s first go through some real-life examples, covering aspects that are relevant to any software professional. The idea here is to discuss a few approaches that have helped me over many years of maintaining or writing software. I hope these topics are useful for you as well.

Production-like environments

While working as a DevOps specialist a few years ago, I was tasked with helping developers in my team to ship features faster. Upon understanding the challenges they faced in the huge monolithic LAMP stack (Linux, Apache, MySQL, and PHP) application, MySQL surfaced as a common pain point across the board. The continuously increasing size of these on-premises production databases (1 TB+) meant that the developers were frequently unaware of the challenges the application would face in the live environment when the production load kicked in. The application (warehouse management system) was used across several countries in Europe, with each country having its own dedicated MySQL instance and read replicas. Every minute of downtime or system degradation directly impacted shipping, packaging, and order invoicing. During that time, the developers were using local MySQL instances to develop and test new features and would later ship them off to staging, followed by production rollout.

With the problem statement clear, a promising next step was to enable them to develop and test in production-like environments. This would allow them to see how their systems were reacting to evolving customer usage, and to have a better understanding of the production issues at an earlier stage.

With a strong understanding of Bash scripting (…or at least I thought so) and basic MySQL administration skills, I decided to build shell scripts that could prepare these testbed environments, and refresh them with the most recent production data, on a daily basis. A request for a similar number of new MySQL servers was raised with the on-premises infrastructure team. They provisioned them in a week, set up the operating system, and required libraries for MySQL, before handing over to me my newly acquired liability. Moving forward, all maintenance and upkeep of these servers were my responsibility. I later scheduled cronjobs on all these servers to run the scripts created previously. They would perform the following steps at a pre-configured time of the day:

  1. Copy the previous day’s production backups from the fileserver to the local filesystem.
  2. Remove all existing data in the local MySQL database.
  3. Dump the new backup data into the local MySQL database.
  4. Perform data anonymization and remove certain other confidential tables.

You can see how the different entities communicated in the following figure:

Figure 1.1: Bash scripts to manage database operations

Figure 1.1: Bash scripts to manage database operations

The following day, the developers had recent production replicas available at their disposal, anonymized for development and testing. This was a huge step forward and the developers were excited about this as they were now developing and testing against environments that matched production-level scale and complexity. The excitement, however, was soon overshadowed by new issues. What happens if a particular cronjob execution does not terminate successfully? What if database import takes forever? What if fileservers are not responding? What if the local storage disk is full?

How about a parent orchestrator script that manages all these edge cases? Brilliant idea! I spent the next few days building an orchestrator layer that coordinated the execution of all these scripts. It was not too long before I had a new set of Bash scripting problems to solve: tracking child process executions, exception handling, graceful cleanups, handling kernel signals…the list goes on. Doing all this in Bash was a Herculean task. What started as a simple set of scripts now evolved into a framework that required a lot of investment in time and effort.

Days and weeks passed, and the framework kept evolving. Bugs were identified. New feature requests from the developers were implemented. The framework now had a fancy new name: Bicycleend-to-end life cycle management for Bash scripts. Finally, after two months, a YAML config-driven Bash framework came into being that sent error notifications on Slack, executed report attachments via email, orchestrated and measured the entire flow of operations, and was generic enough to be adopted by other teams easily. This was not solely for managing database operations, but rather any collection of Bash scripts put together to accomplish a task, as you can see in Figure 1.2:

Figure 1.2: Bicycle framework – managing the life cycle of Bash scripts

Figure 1.2: Bicycle framework – managing the life cycle of Bash scripts

The framework served the team well for two to three weeks and then the next wave of challenges became evident:

  • Data import time was increasing exponentially. To execute parallel MySQL threads, I needed more compute power – so, another request to the infrastructure team was needed.
  • High I/O operations required more performant disks. To upgrade these, the infrastructure team had to raise a new purchase order and install new SSDs.
  • Developers now wanted the capability to be able to maintain the latest X versions of the backups. Where should these be stored?

As you might have noticed already, the framework had matured considerably, but it was as good as the availability, scalability, and reliability of the underlying infrastructure. With underpowered machines, disk capacity issues, and frequent network timeouts, there was only so much the framework could do.

What started as an initiative to increase development velocity soon transitioned into a technology-focused framework. Were there some technical learnings from this exercise? A lot. In fact, there were so many that I wrote a Best Practices for Bash Scripts blog afterward, which can be found at https://medium.com/p/17229889774d. Did it resolve all the problems the developers were facing? Probably not. Before embarking on these lengthy development cycles and going down the rabbit hole of solving technical problems, it would have been better to build a little, test a little, and always challenge myself with the question, Is this what my customers really need?

Knowing your customers (and their future needs)

It’s of paramount importance to know the end beneficiary of your work. You will always have a customer – internal, external, or both. If you are not clear about it, I would strongly recommend discussing this with your manager or colleagues to understand for whom the solution is being built. It’s essential to put yourself in your customer’s shoes and approach the problems and solutions from their perspective. Always ask yourself whether what you are doing will address your customers’ issues and delight them. It took me at least two or three months to bring the Bash framework up to the desired level of performance and utility. However, a Bash script orchestrator was not something my customers (developers) required. They needed a simple, scalable, and reliable mechanism to reproduce production-like databases. In fact, operating this framework was an additional overhead for them. It was not helping them with what they did best – writing code and delivering business outcomes.

Focusing on iterative development and failing fast

Another prerequisite for delivering impactful customer solutions is to establish an iterative working model: deliver work into manageable pieces, monitor the success metrics, and establish a feedback mechanism to validate progress. Applying this to the aforementioned situation would have meant that developers had complete visibility of the implementation from the beginning. Collaborating together, we could have defined the success criteria (time to provision production database replicas) at the very start of the process.

This is very similar to the trunk-based development approach used by high-performing software teams. They frequently merge small segments of working code in the main branch of the repository, which greatly improves visibility and highlights problems more quickly.

Prioritizing business outcomes over technology

As an IT professional, it is very easy to have tunnel vision, whereby the entire focus is on technical implementations. Establishing feedback mechanisms to ensure that the business outcomes are met will avoid such situations and will help with effective team communications while accelerating delivery. This is what DevOps is all about.

Offering solutions as a service

It is important to take the cognitive load off of your customers and offer them services that are operationally light, are easy to consume, and require little to no intervention. This enables them to focus on their core job, without worrying about any add-on responsibilities.

Offering well-documented and easy-to-consume interfaces (or APIs) would have been far easier for the developers in my team rather than onboarding them to learn how to use the custom-built Bash framework. What they really needed was an easy method to provision production-like databases. Exposing them to underlying infrastructure scalability issues and Bash internals was an unnecessary cognitive load that ideally should have been avoided.

Similarly, the on-premises infrastructure team’s focus on their customer (myself) could have eased my job of requesting new infrastructure resources, without having to go through all the logistics and endure a long wait until something tangible was ready for use.

This is an area AWS excels in. It reduces the cognitive load for the end user and enables them to deliver business outcomes, instead of focusing on the underlying technology. The customer consumes the services and has less to worry about when it comes to the availability, scalability, and reliability of the service, as well as the underlying infrastructure. It offers these services in the form of APIs with which the developers can interact, using the tools of their choice.

Mapping the solution components to the AWS services

One exercise that will often help you when working with AWS is designing your solution, identifying key components, and then evaluating whether some of these un-differentiated tasks can be offloaded to AWS services. It’s good practice to compare these choices and to conduct a cost-benefit analysis before adopting the services immediately. Let’s dive quickly into the main components of the database replicas’ example discussed in the Production-like environments section previously, and consider whether AWS services could have been an option to avoid reinventing the wheel:

Figure 1.3: Mapping Bash framework components to AWS services

Figure 1.3: Mapping Bash framework components to AWS services

From a timeline perspective, building the entire stack from scratch, as seen in Figure 1.3, took around three months, but you can provision similar services in your AWS account in less than three hours. That’s the level of impact AWS can have in your DevOps journey. In addition, it’s important to understand that you need not go all-in on AWS. If Amazon S3 (data storage and retrieval service) is all that was needed, then retaining the other components on-premises and using AWS as an extension of the solution could also be considered as an approach for solving the problems at hand.

To summarize, understand your needs, evaluate the benefits provided by AWS services, and adopt only what helps you in the long run.

Now, let’s discuss another instance in which I helped the same developers scale their continuous integration activities with GitLab Continuous Integration and Continuous Deliver, but this time, with AWS. If you have not been exposed to these terms before, continuous integration is a practice that automates the integration of code changes from multiple developers into a single project, and GitLab is a software development platform that helps with the adoption of DevOps practices.

Scaling with the cloud

The GitLab Continuous Integration and Continuous Delivery suite helps software teams to collaborate better and frequently deploy small manageable chunks of code into production environments. My company at that time was using a self-hosted, on-premises version of GitLab Continuous Integration and Continuous Delivery.

There are three main architectural components of GitLab Continuous Integration and Continuous Delivery that are especially relevant to this discussion:

  • Control plane: This is the layer that interacts with the end user, so APIs, web portals, and so on all fall into this category. This was owned and managed by the central infrastructure and operations team.
  • Runners: Runners are the compute environments used by GitLab to run/execute the pipeline stages and the respective processes. As soon as developers commit code to their repository, a pipeline triggers and executes the pipeline stages in sequence by leveraging these compute resources. Due to the heterogeneous project requirements, each team owned and operated their own runners. Based on the technology stacks they worked with, they could decide which type of compute resources would best fit their needs. As a fallback mechanism, there was also a shared pool of GitLab runners, which could be used by teams. However, as you can imagine, these were not very reliable in terms of availability and spiky workloads. For example, if you need two cores of CPU and 1 GB RAM for your Java build immediately, the release of an urgent patch to production could be a challenge. Therefore, it was generally recommended to begin with these but to switch to custom-built, self-managed runners when needed.
  • Pipelines: Lastly, if you have used Jenkins or AWS CodePipeline in the past, GitLab Pipelines are similar in terms of functionality. You can define different phases of your software delivery process in a YAML file, commit it alongside your code, and let GitLab manage your software delivery from there on.

At that time, I was supporting five to six software projects for the developers of my team. Having started with the shared pool of GitLab runners hosted on-premises, we were able to leverage the compute resources for our needs, for roughly 80-120 builds per day. However, with increasing adoption throughout the company, the resources on these runners would frequently become exhausted, leading to several pipeline processes waiting for execution. Additionally, occasional VM failures meant that all software delivery processes dependent on these shared resources across the company would come to a halt. This was certainly not a good situation to be in. The central ops team added more resources to this shared pool, but this was still a static server farm, whereby my team’s build jobs were dependent on how others used these resources.

Having learned from the issues relating to unscalable on-premises infrastructure during the database setup, as discussed previously, I decided to leverage AWS cloud capabilities this time. Discussions with the developers (customers) led to the definition of the following requirements, which were all fulfilled with AWS services out of the box:

  • Flexibility to scale infrastructure up/down
  • Usage monitoring for the runners in AWS
  • Less operational work

Across the entire solution design, the only effort required from my side was the code to register/de-register these runners with the control plane when the compute instances were started or stopped.

The final design (see Figure 1.4) leveraged auto-scaling groups in AWS, which is a mechanism to dynamically scale up or trim down the compute resources depending on the usage patterns:

Figure 1.4: GitLab Continuous Integration and Continuous Delivery with runners hosted in AWS

Figure 1.4: GitLab Continuous Integration and Continuous Delivery with runners hosted in AWS

As soon as the new servers were started, they registered themselves with the GitLab control plane.

Extending your on-premises IT landscape with AWS

AWS cloud adoption need not always translate into shutting down data centers and migrating applications through a lift-and-shift approach. The real value lies in starting small, measuring impact, and utilizing cloud offerings as a natural extension to your on-premises IT landscape.

As seen in the previous scenario, the GitLab control plane still continued to remain in the on-premises data center, while the runners leveraged the elasticity of the cloud. This gave the developers immediate benefits in terms of compute selection, scalability, reliability, and elasticity of the cloud. Amazon EC2 is the Elastic Compute Cloud offering, which offers scalable computing capacity for virtual servers, security, networking, and storage. Combining this with the EC2 Auto Scaling service, I configured capacity thresholds that allowed us to maintain a default set of runners and scale, with the demand driven by real usage.

Infrastructure management in data centers usually lacks this level of flexibility unless there are interfaces or resource orchestrators made available to the end user for provisioning resources in an automated way.

Collecting metrics for understanding resource usage

Requirements relating to measuring usage and alerting on thresholds were further simplified with the use of Amazon CloudWatch. CloudWatch is a metrics repository in which different AWS services and external applications publish usage data. EC2, like other services, makes data points such as CPU, memory, and storage consumption available, which helps to identify threshold breaches, resulting in the automation of scaling decisions.

Having access to these metrics out of the box is a considerable automation accelerator for two reasons. You need not invest any effort in capturing this data with third-party agents and the close-to-real-time nature of this data helps with dynamic decision-making. Furthermore, AWS also offers native integrations around alarms and service triggers with CloudWatch. So, extending these to usual notification mechanisms, such as email, SMS, or an external API, is generally a low-effort implementation.

Paying for what you use

AWS offers a pay-as-you-go pricing model. In contrast to this, on-premises resources come with a fixed-priced costing model and require a lot of time and operational effort. Combining this with metrics from CloudWatch, it was possible to automatically scale down the EC2 compute resources during periods of low usage (after work hours and weekends). This further reduced AWS costs by ~20-30%.

Generally speaking, on-premises infrastructure resources are mostly over-provisioned. This is done to maintain an additional buffer of resources, should ad-hoc demand require it. AWS, on the other side, offers the capability to right-size all your resources based on your exact needs at the moment. This is a big win for agile teams to respond to changing customer demands and usage patterns.

Simplifying service delivery through cloud abstractions

Software technology these days is all about abstractions. This is a topic that we will explore in more depth in Chapter 2, Choosing the Right Cloud Service. All AWS services abstract the complexity from the end users around operational aspects. As a result, end users are empowered to focus on the differentiating features and business outcomes. Earlier, we discussed the need to take the cognitive load off the developers. AWS makes this a reality, and you can develop proofs of concept, demos, and production-grade applications in hours or days, which previously took months.

Leveraging the infrastructure elasticity of the cloud

AWS cloud benefits are not limited to procuring more resources when needed but are also about contracting when possible. Of course, this needs to align with the type of workload you plan to run in the cloud. Sometimes, there are known events that would require more resources to handle the increased load, such as festive sales and marketing initiatives. In other cases, when the usage spikes cannot be determined in advance, you can leverage AWS’ auto-scaling capabilities, as we did for GitLab runners.

So far, we have discussed two solution implementations and how adopting cloud services gives a big boost to reliability and scalability, leading to better customer outcomes. Next, let’s learn about some DevOps methodologies that help accelerate the software delivery process. We will later map these key areas to certain AWS services.

DevOps methodologies to accelerate software delivery

As we discussed at the beginning of the chapter, successful organizations use software automation to catapult their digital transformation journey.

As highlighted in the 2022 State of DevOps Report by DORA (https://cloud.google.com/devops/state-of-devops), DevOps methodologies positively influence your team culture and foster engineering best practices to help you be able to ship software with increased velocity and better reliability. In software engineering, the following principles have been well established and are known to optimize the way teams work and collaborate.

Continuous integration

Continuous Integration (CI) is a software engineering best practice that advocates the frequent merging of code from all software developers in a team to one central repository. This increases the confidence in and visibility of new features being released to the customers. At the same time, automated tests make releasing code multiple times a day seamless and easy. Developers also get quick feedback regarding any bugs that might have been introduced into the system as a result of implementing features in isolation.

Continuous delivery

Continuous Delivery (CD) is the practice of producing code in short cycles that can be released to production at any time. Automatically deploying to a production-like environment is key here. Fast-moving software teams leverage CD to confidently roll out features or patches, on demand, with lightweight release processes.

Continuous deployment

Continuous deployment enables teams to automatically release the code to production. This is indicative of high DevOps maturity and rock-solid automation practices. Using continuous deployment, code is automatically deployed to the production environments.

This requires deep integration into how the software stack functions. All ongoing operations and customer requests are automatically taken care of, and the release process is hardly noticeable to the end user.

In later chapters, we will go through some hands-on examples around CI/CD processes and use native AWS services to see things in action.

Infrastructure as Code (IaC)

Managing AWS infrastructure components with code, using SDKs, APIs, and so on, makes it very convenient to reliably manage environments at scale. Unlike static provisioning methods used on-premises, these practices enable the creation of complete infrastructure stacks with the use of programmable workflows.

This also reduces the ownership silos across the development and operations teams. The developers are free to use familiar programming language constructs and have end-to-end control of the foundational infrastructural elements.

Effective communication and collaboration

Collaboration between team members is crucial for faster software delivery. It is advisable to have small teams that share a common goal. Amazon uses the concept of the Two-Pizza Team rule, which suggests creating a workgroup that is no larger than one that can be fed by two pizzas, so roughly an 8-to-10-member team.

Furthermore, this enables the team to not just deliver software but own it end to end. Operations, deployment, support, and feature development are all owned by the members of this team.

Now, since we have a good understanding of key DevOps methodologies, let’s dive into several AWS services that make this a reality in the cloud.