Book Image

AWS Certified DevOps Engineer - Professional Certification and Beyond

By : Adam Book
Book Image

AWS Certified DevOps Engineer - Professional Certification and Beyond

By: Adam Book

Overview of this book

The AWS Certified DevOps Engineer certification is one of the highest AWS credentials, vastly recognized in cloud computing or software development industries. This book is an extensive guide to helping you strengthen your DevOps skills as you work with your AWS workloads on a day-to-day basis. You'll begin by learning how to create and deploy a workload using the AWS code suite of tools, and then move on to adding monitoring and fault tolerance to your workload. You'll explore enterprise scenarios that'll help you to understand various AWS tools and services. This book is packed with detailed explanations of essential concepts to help you get to grips with the domains needed to pass the DevOps professional exam. As you advance, you'll delve into AWS with the help of hands-on examples and practice questions to gain a holistic understanding of the services covered in the AWS DevOps professional exam. Throughout the book, you'll find real-world scenarios that you can easily incorporate in your daily activities when working with AWS, making you a valuable asset for any organization. By the end of this AWS certification book, you'll have gained the knowledge needed to pass the AWS Certified DevOps Engineer exam, and be able to implement different techniques for delivering each service in real-world scenarios.
Table of Contents (31 chapters)
1
Section 1: Establishing the Fundamentals
7
Section 2: Developing, Deploying, and Using Infrastructure as Code
16
Section 3: Monitoring and Logging Your Environment and Workloads
21
Section 4: Enabling Highly Available Workloads, Fault Tolerance, and Implementing Standards and Policies
27
Section 5: Exam Tips and Tricks

Reliability

There are five design principles for reliability in the cloud:

  • Automating recover from failure
  • Testing recovery procedures
  • Scaling horizontally to increase aggregate workload availability
  • Stopping guessing capacity
  • Managing changes in automation

Automating recovery from failure

When you think of automating recovery from failure, the first thing most people think of is a technology solution. However, this is not necessarily the context that is being referred to in the reliability service pillar. These points of failure really should be based on Key Performance Indicators (KPIs) set by the business.

As part of the recovery process, it's important to know both the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) of the organization or workload:

  • RTO (Recovery Time Objective): The maximum acceptable delay between the service being interrupted and restored
  • RPO (Recovery Point Objective): The maximum acceptable amount of time since the last data recovery point (backup) (Amazon Web Services, 2021)

Testing recovery procedures

In your cloud environment, you should not only test your workload functions properly, but also that they can recover from single or multiple component failures if they happen on a service, Availability Zone, or regional level.

Using practices such as Infrastructure as Code, CD pipelines, and regional backups, you can quickly spin up an entirely new environment. This could include your application and infrastructure layers, which will give you the ability to test that things work the same as in the current production environment and that data is restored correctly. You can also time how long the restoration takes and work to improve it by automating the recovery time.

Taking the proactive measure of documenting each of the necessary steps in a runbook or playbook allows for knowledge sharing, as well as fewer dependencies on specific team members who built the systems and processes.

Scaling horizontally to increase workload availability

When coming from a data center environment, planning for peak capacity means finding a machine that can run all the different components of your application. Once you hit the maximum resources for that machine, you need to move to a bigger machine.

As you move from development to production or as your product or service grows in popularity, you will need to scale out your resources. There are two main methods for achieving this: scaling vertically or scaling horizontally:

Figure 1.4 – Horizontal versus vertical scaling

Figure 1.4 – Horizontal versus vertical scaling

One of the main issues with scaling vertically is that you will hit the ceiling at some point in time, moving to larger and larger instances. At some point, you will find that there is no longer a bigger instance to move up to, or that the larger instance will be too cost-prohibitive to run.

Scaling horizontally, on the other hand, allows you to gain the capacity that you need at the time in a cost-effective manner.

Moving to a cloud mindset means decoupling your application components, placing multiple groupings of the same servers behind a load balancer, or pulling from a queue and optimally scaling up and down based on the current demand.

Stop guessing capacity

If resources become overwhelmed, then they have a tendency to fail, especially on-premises, as demands spike and those resources don't have the ability to scale up or out to meet demand.

There are service limits to be aware of, though many of them are called soft limits. These can be raised with a simple request or phone call to support. There are others called hard limits. They are set a specified number for every account, and there is no raising them.

Note

Although there isn't a necessity to memorize all these limitations, it is a good idea to become familiar with them and know about some of them since they do show up in some of the test questions – not as pure questions, but as context for the scenarios.

Managing changes in automation

Although it may seem easier and sometimes quicker to make a change to the infrastructure (or application) by hand, this can lead to infrastructure drift and is not a repeatable process. A best practice is to automate all changes using Infrastructure as Code, a code versioning system, and a deployment pipeline. This way, the changes can be tracked and reviewed.