Infrastructure Monitoring with Amazon CloudWatch

By Ewere Diagboya

Overview of this book

CloudWatch is Amazon's monitoring and observability service, designed to help those in the IT industry who are interested in optimizing resource utilization, visualizing operational health, and ultimately increasing infrastructure performance. This book helps IT administrators, DevOps engineers, network engineers, and solutions architects make optimum use of this cloud service for effective infrastructure productivity. You'll start with a brief introduction to monitoring and Amazon CloudWatch and its core functionalities. Next, you'll get to grips with CloudWatch features and their usability. Once the book has helped you develop your foundational knowledge of CloudWatch, you'll be able to build your practical skills in monitoring and alerting for various AWS services, such as EC2, EBS, RDS, ECS, EKS, DynamoDB, AWS Lambda, and ELB, with the help of real-world use cases. As you progress, you'll also learn how to use CloudWatch to detect anomalous behavior, set alarms, visualize logs and metrics, define automated actions, and rapidly troubleshoot issues. Finally, the book will take you through monitoring AWS billing and costs. By the end of this book, you'll be capable of making decisions that enhance your infrastructure performance and maintain it at its peak.
Table of Contents (16 chapters)
Section 1: Introduction to Monitoring and Amazon CloudWatch
Section 2: AWS Services and Amazon CloudWatch

Introducing monitoring

Man has always found a way to take note of everything. In ancient times, man invented letters and characters; combinations of them formed words, then sentences, then paragraphs, and this information was stored in scrolls. Man also observed and monitored his environment and continued to document findings and draw insights from the collected information. In some cases, this information was kept in a raw form, with many details that might not be relevant; in other cases, it was processed into another form that removed irrelevant information, allowing for better understanding and insight.

This means the data was collected as historical data, after an activity had occurred. The activity could be a memorable coronation ceremony, a grand wedding, a festival, or a period of war or famine. Whatever the activity, it is documented for various purposes. One of those purposes is to look at the way things were done in the past and find ways they can either be stopped or made better. There is a saying that goes as follows:

"If you cannot measure it, you cannot improve it."

– Lord Kelvin (1824-1907)

So, keeping records of events is valuable not only for drawing insights but also for spurring the next line of action based on the insights drawn from the data.

Borrowing from this understanding of how man has recorded and documented events, we can list two major reasons for monitoring data: to draw insights from the data collected and to act based on those insights. The same applies to the systems we build. For every system man has developed, from the time of the pyramids of Egypt, where massive engineering was needed to draw, architect, and build the pyramids and other iconic structures, documentation of past work has been essential. It helped the engineers of those days understand the flaws in earlier designs and structures, figure out how new structures could be designed, and eventually fix the flaws that were identified. It is a continuous process: keep evaluating what was done before so as to get better with time, using past experience and results. Documented information is also very helpful when the new project is bigger than the earlier one, because the historical metrics that have been acquired provide foundational knowledge and understanding of what can be done.

Applying new methods goes beyond the data that has been collected; there is also the culture and mindset of understanding that change is constant and of always being positioned to learn from earlier implementations. Building new systems should be about applying what has been learned and building something better or, in some cases, improving the existing system based on close observation:

Figure 1.1 – A basic monitoring flow

What we have been describing so far is monitoring. Monitoring is the act or process of collecting, analyzing, and drawing insights from data collected from a system. In software systems and infrastructure, this means analyzing and drawing insights from data collected from systems performing one or more specific tasks. Every system or application is made up of a series of activities, which we also call events. Systems in this context can mean mechanical systems (cars, industrial machines, or trucks), electrical systems (home appliances, transformers, industrial electronic machines, or mobile phones), or computer systems (laptops, desktops, or web or mobile applications).

An algorithm is a step-by-step approach to solving a problem, and algorithms are the bedrock of how complex systems are built. When a complex system is built, each of the step-by-step processes built in to solve a specific problem, or set of problems, can be called an event.

Consider the following example of the process of making a bottle of beer:

  1. Malted barley or sorghum is put in huge tanks and blended.
  2. Yeast is added to the mixture to allow the fermentation process to occur to generate alcohol.
  3. After fermentation, sugar is added to the mixture to sweeten it.
  4. The beer is stored in huge drums.
  5. An old bottle undergoes a mechanical process that washes and disinfects the bottle.
  6. The washed bottle is taken through a conveyor belt to be filled up with beer.
  7. After being filled up, the bottle is corked under high pressure with CO2.
  8. The bottle is then inserted into a crate with other bottles.

In this algorithm for preparing beer, there are various stages; each stage has various touchpoints, and each touchpoint is a potential point of failure. The failure can occur within a process itself or during interaction with other processes. For example, fermentation might not proceed properly if the right amount of yeast is not added to the sorghum, or if the vessel holding the yeast and sorghum is not airtight enough, because air interferes with the fermentation process. Challenges could also arise in the machine that sends the bottle to be crated after corking: the conveyor belt could fail or, during corking, the bottle might explode. These are possibilities and touchpoints that need close monitoring.
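To make these touchpoints concrete, here is a minimal Python sketch (the step names, failure rate, and log format are hypothetical, and failures are simulated) in which each stage of the bottling algorithm emits an event, so a failure can be traced back to the exact touchpoint where it occurred:

```python
import logging
import random

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("brewery")

# Hypothetical touchpoints taken from the bottling algorithm above.
STEPS = [
    "blend_barley",
    "ferment_with_yeast",
    "add_sugar",
    "store_in_drums",
    "wash_bottle",
    "fill_bottle",
    "cork_with_co2",
    "crate_bottle",
]

def run_step(name: str) -> bool:
    """Simulate one touchpoint; a real system would do actual work here."""
    ok = random.random() > 0.05  # assume a 5% failure rate, for illustration
    if ok:
        log.info("event=%s status=success", name)
    else:
        log.error("event=%s status=failure", name)
    return ok

def run_pipeline() -> None:
    for step in STEPS:
        if not run_step(step):
            # Stop at the failing touchpoint so operators can investigate.
            log.error("pipeline halted at %s", step)
            return
    log.info("pipeline completed: bottle crated")

if __name__ == "__main__":
    run_pipeline()
```

In a real bottling plant, each step would be driven by actual sensors and machinery; the point is simply that every touchpoint reports an event that can later be collected and analyzed.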

In a nutshell, when a system is designed to perform a specific task, or a group of systems is integrated to achieve a common goal, there are always touchpoints, both internal and external, that need to be understood. Understanding these touchpoints means knowing the metrics that can be derived from each step of the operation, what normal or good working conditions look like both within a single system and across an integration of two systems, and the globally acceptable standards. All of this information helps in detecting and finding anomalies when they occur. The only way to detect that an activity or metric is an anomaly is by monitoring the system, then collecting and analyzing the data and comparing it with known good working conditions.
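As a small illustration of that comparison, the following Python sketch (the baseline values, readings, and threshold are made up for illustration) flags any metric sample that falls outside the normal range established by a known-good baseline:

```python
from statistics import mean, stdev

def find_anomalies(samples, baseline, z_threshold=3.0):
    """Flag samples that deviate from the baseline's normal working range."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [s for s in samples if abs(s - mu) > z_threshold * sigma]

# Baseline: temperatures (in Celsius) recorded under known-good conditions.
baseline = [18.2, 18.5, 18.1, 18.4, 18.3, 18.6, 18.2]
# New readings collected by the monitoring system.
readings = [18.3, 18.4, 22.9, 18.5]

print(find_anomalies(readings, baseline))  # -> [22.9]
```

Real monitoring services apply far more sophisticated statistics, but the principle is the same: collect the data, establish what normal looks like, and compare.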

Now that we have defined monitoring, the next step is to take a sneak peek into the history of monitoring and how it has evolved over time, down to present-day monitoring tools and techniques.

The history of monitoring

We can say for certain that monitoring is as old as man. As mentioned earlier, it dates back to when man started building systems, reviewing what had been done to find and fix faults, and looking for ways to improve the next system. But this book is focused on software monitoring, so we will stick to that.

A computer is made up of different components, such as the memory, CPU, hard disk, and operating system software. The ability to know what is going on with any of these components goes back to the operating system's event logs. Microsoft introduced the Event Viewer in 1993 as part of Windows NT. This built-in application takes note of every event in the system, and together these events form a list of logs. These logs help to track both the core operating system activities that keep the system running and the events of other applications installed on the operating system. The Event Viewer can log both normal activities and system failures. The following screenshot shows the Event Viewer:

Figure 1.2 – Windows Event Viewer

Windows Event Viewer categorizes events into three groups, as shown in Figure 1.2: Custom Views, Windows Logs, and Application and Services Logs. The events captured are also divided into the following categories:

  • Error: This means that the event failed to complete; it gives details about the failed event along with other relevant information.
  • Warning: This is a signal about an anomaly that could lead to an eventual error and requires attention.
  • Information: This is a notification of a successful event.
  • Audit Success: This means that an audited event was successful.
  • Audit Failure: This means that an audited event was unsuccessful.

The logs in the Windows Event Viewer look like this:

Figure 1.3 – Event Viewer logs

Figure 1.3 shows a list of events, which eventually forms a log.
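This severity-based categorization carries over directly to application logging in general. As a rough analogy (the application name and messages here are hypothetical, and Python's logging levels only approximate the Event Viewer categories), a minimal sketch might look like this:

```python
import logging

# Emit events at severities analogous to the Event Viewer's
# Information, Warning, and Error categories.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(name)s: %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("payroll_app")  # hypothetical application name

log.info("Service started")                      # Information
log.warning("Disk usage at 85%; nearing limit")  # Warning
log.error("Failed to write report: disk full")   # Error
```

Each call appends one event to the application's log, and the accumulated events form exactly the kind of log shown in Figure 1.3.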

Over time, monitoring systems have grown and evolved. Because of the importance of monitoring to applications, different organizations have designed purpose-built monitoring systems. There is now a whole industry around application and system monitoring, and it has gone from just events and logging to alerting and graphical visualization of log data. The list of monitoring tools and services goes on and on. Here is a summarized list:

  • Datadog
  • Nagios Core
  • ManageEngine OpManager
  • Zabbix
  • Netdata
  • Uptime Robot
  • Pingdom
  • Amazon CloudWatch

Now that we have shown the meaning of monitoring and given a brief history of how it started, we understand that monitoring is about making records of events, and that events can be labeled either as warnings of something to come or as records of something that has happened and needs resolution. Bearing that in mind, let's go deeper into the types of monitoring available, based on the way we respond to metrics and to the information in event logs.