Infrastructure Monitoring with Amazon CloudWatch

By: Ewere Diagboya

Overview of this book

CloudWatch is Amazon’s monitoring and observability service, designed to help those in the IT industry who are interested in optimizing resource utilization, visualizing operational health, and eventually increasing infrastructure performance. This book helps IT administrators, DevOps engineers, network engineers, and solutions architects to make optimum use of this cloud service for effective infrastructure productivity. You’ll start with a brief introduction to monitoring and Amazon CloudWatch and its core functionalities. Next, you’ll get to grips with CloudWatch features and their usability. Once the book has helped you develop your foundational knowledge of CloudWatch, you’ll be able to build your practical skills in monitoring and alerting various Amazon Web Services, such as EC2, EBS, RDS, ECS, EKS, DynamoDB, AWS Lambda, and ELB, with the help of real-world use cases. As you progress, you'll also learn how to use CloudWatch to detect anomalous behavior, set alarms, visualize logs and metrics, define automated actions, and rapidly troubleshoot issues. Finally, the book will take you through monitoring AWS billing and costs. By the end of this book, you'll be capable of making decisions that enhance your infrastructure performance and maintain it at its peak.
Table of Contents (16 chapters)
Section 1: Introduction to Monitoring and Amazon CloudWatch
Section 2: AWS Services and Amazon CloudWatch

Discovering the types of monitoring

We now have an understanding of what monitoring is and a brief history of how its techniques and tooling have evolved over time. When designing and architecting monitoring solutions, there are some concepts that we should keep in mind. These concepts apply to any monitoring tool or service that we want to implement, including the one we will be deploying in this book. Let's now take a look at the types of monitoring and the techniques peculiar to each of them, including the pros and cons associated with each.

Proactive monitoring

Before anything goes wrong, there are usually warning signs. In the earlier section, where we defined monitoring and looked at the Windows Event Viewer, we talked about a category of event called Warning. It is a signal that helps you to prepare for a failure. In most cases, when a warning recurs too often and is ignored, it can eventually lead to a failure of that part of the system, or it might affect another part of the system. Proactive monitoring helps you to prepare for the possibility of failure by surfacing warning signs as notifications and alarms, which can take the form of mobile push notifications, emails, or chat messages that hold details of the warning.

Acting on these warning signs can help to avert the failure they point to. An example is an application that used to be fast but, after a while, starts to slow down, and users start complaining about the speed. A monitoring tool can pick up that metric and show that the response time (the time it takes for a website, API, or web application to respond to a request) is high. A quick investigation can find what is making it slow, and once the issue is resolved, the application or service is fast and responsive again.

Another good example of a proactive monitoring scenario is tracking the percentage of disk space left on a server. The monitoring tool is configured to send warning alerts when disk utilization reaches 70% or more. This ensures that the Site Reliability Engineer or System Administrator in charge can take action and free up disk space. If that is not done and the disk fills up, the server where the application is deployed will no longer be available.
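The disk space check described above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular monitoring tool implements it: the function names and the `WARNING_THRESHOLD_PERCENT` constant are hypothetical, and the 70% value is simply taken from the example. A real setup would send the warning to a notification channel rather than return a string.

```python
import shutil

# Assumed threshold, taken from the 70% figure in the example.
WARNING_THRESHOLD_PERCENT = 70.0


def disk_used_percent(total_bytes: int, used_bytes: int) -> float:
    """Return the share of the disk that is in use, as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes * 100.0


def should_warn(used_percent: float,
                threshold: float = WARNING_THRESHOLD_PERCENT) -> bool:
    """True when utilization has reached or passed the warning threshold."""
    return used_percent >= threshold


def check_disk(path: str = "/") -> str:
    """Inspect the disk that holds `path` and return a status message."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = disk_used_percent(usage.total, usage.used)
    if should_warn(percent):
        return f"WARNING: {path} is {percent:.1f}% full"
    return f"OK: {path} is {percent:.1f}% full"


if __name__ == "__main__":
    print(check_disk("/"))
```

Running this on a schedule (for example, every few minutes) gives the warning time to arrive while there is still room on the disk to act on it.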

There are many scenarios where proactive monitoring can be used to predict failure, and it is immensely helpful in keeping a system from a total failure or shutdown. It requires that action is taken as soon as the signal is received. In certain scenarios, the notification can be tied to another event that is triggered automatically to help save the system from an interruption.

Proactive monitoring works with metrics and logs, or historical data, to understand the nature of the system it is managing. As events occur, they are captured in the form of logs, which are then used to estimate the behavior of the system and give feedback based on that. An example is collecting logs from an nginx server. Every request made to the nginx server adds an entry to its logs. The logs can be aggregated, and an alert can be configured to check the number of 404 errors received within a five-minute interval. If the count falls within the configured threshold, say, more than 10 but fewer than 20 404 errors within a five-minute interval, an alert is triggered. This is a warning sign that the website is not fully available to users, which is a symptom of a bigger problem that requires some investigation to find out the reason for that high number of 404 errors within such a short period of time.

Important Note

404 (Not Found) is the HTTP status code a server returns when the requested page does not exist.
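The 404 alert described above can be sketched as follows. This is an illustrative example under stated assumptions, not a production log pipeline: it assumes the access log uses nginx's default combined format, the helper names `count_404s` and `in_alert_range` are hypothetical, and the bounds of 10 and 20 are taken from the example. Many real alerts fire on anything above a single lower bound instead.

```python
import re
from datetime import datetime, timedelta, timezone

# Picks the timestamp and status code out of a line in nginx's default
# "combined" access log format (assumed here).
LOG_PATTERN = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def count_404s(lines, window_end, window=timedelta(minutes=5)):
    """Count 404 responses logged in the `window` that ends at `window_end`."""
    window_start = window_end - window
    count = 0
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        logged_at = datetime.strptime(match.group("time"), TIME_FORMAT)
        in_window = window_start < logged_at <= window_end
        if match.group("status") == "404" and in_window:
            count += 1
    return count


def in_alert_range(count, lower=10, upper=20):
    """True when the count is above `lower` and below `upper` (assumed bounds)."""
    return lower < count < upper


if __name__ == "__main__":
    sample = ('127.0.0.1 - - [10/Oct/2023:13:58:00 +0000] '
              '"GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"')
    end = datetime(2023, 10, 10, 14, 0, 0, tzinfo=timezone.utc)
    found = count_404s([sample] * 12, end)
    print(found, "alert" if in_alert_range(found) else "no alert")
```

Keeping the counting and the threshold decision in separate functions makes it easy to change the alert rule without touching the log parsing.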

Reactive monitoring

This type of monitoring is more of an after-the-fact monitoring. It is the type of monitoring that alerts you when a major incident has occurred. Reactive monitoring usually comes into play when the warnings of the proactive monitors are not heeded and actions are not taken to resolve the symptoms presented. This leads to an eventual failure of the full system or some part of it, depending on whether the system has a monolithic or a microservice architecture. A basic example of reactive monitoring is a check that continuously pings your application URL and looks for the HTTP status code for success, which is code 200. If it keeps getting this response, the service is up and running fine. In any situation where it does not get a 2xx or 3xx response, the service is no longer available, or the service is down.

Important Note

The 2xx response codes cover the range 200 to 299 and mean that the request succeeded. A 3xx response code signals a redirect, which can be permanent or temporary. Responses that indicate failure include 4xx, which are client errors (such as a bad request or a missing page), and 5xx, which are server-side errors.

This is what the monitoring tool checks, and it sends a notification or alert immediately if it does not get a 200 status code. This is usually used for APIs, web applications, websites, and any application that has a URL and serves requests over HTTP.
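A simple endpoint check of this kind can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the function names are hypothetical, any URL passed in is a placeholder, and treating 2xx and 3xx as healthy follows the description above. Note that `urllib` follows redirects by default, so a 3xx code rarely reaches the caller.

```python
import urllib.error
import urllib.request


def is_healthy(status_code: int) -> bool:
    """Treat 2xx (success) and 3xx (redirect) responses as 'service is up'."""
    return 200 <= status_code < 400


def fetch_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with a 4xx or 5xx code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, and so on


def check_endpoint(url: str, fetch=fetch_status) -> str:
    """Report whether the endpoint is up; `fetch` can be swapped out in tests."""
    status = fetch(url)
    if status is not None and is_healthy(status):
        return f"UP: {url} returned {status}"
    detail = status if status is not None else "no response"
    return f"DOWN: {url} returned {detail}"
```

Calling `check_endpoint` in a loop with a short sleep between calls, and sending a notification whenever the result starts with `DOWN`, gives the basic reactive monitor described here.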

Since this monitoring throws an alert after the failure, it is termed reactive monitoring. Only after the alert do you find out that something has gone wrong, and you then go in to restore the service, investigate what caused the failure, and work out how to fix the issue. In most cases, you have to do root cause analysis, which involves using techniques from proactive monitoring and looking at logs and events that have occurred in the system to understand what led to the failure.

Important Note

Root cause analysis is a method of problem solving that involves deep investigation into the main cause, or the trigger, of a system malfunction or failure. It involves analyzing different touch points of the system and corroborating all findings to reach a final conclusion about the cause of the failure. It is also called RCA for short.

Endpoint monitoring services, such as Amazon CloudWatch Synthetics Canary, which we will be talking about later in this book, are used for reactive monitoring. We will go beyond simple endpoint pinging for status codes, because Synthetics Canary can be configured to do much more than ping endpoints.