Book Image

Datadog Cloud Monitoring Quick Start Guide

By : Thomas Kurian Theakanath
Book Image

Datadog Cloud Monitoring Quick Start Guide

By: Thomas Kurian Theakanath

Overview of this book

Datadog is an essential cloud monitoring and operational analytics tool which enables the monitoring of servers, virtual machines, containers, databases, third-party tools, and application services. IT and DevOps teams can easily leverage Datadog to monitor infrastructure and cloud services, and this book will show you how. The book starts by describing basic monitoring concepts and types of monitoring that are rolled out in a large-scale IT production engineering environment. Moving on, the book covers how standard monitoring features are implemented on the Datadog platform and how they can be rolled out in a real-world production environment. As you advance, you'll discover how Datadog is integrated with popular software components that are used to build cloud platforms. The book also provides details on how to use monitoring standards such as Java Management Extensions (JMX) and StatsD to extend the Datadog platform. Finally, you'll get to grips with monitoring fundamentals, learn how monitoring can be rolled out using Datadog proactively, and find out how to extend and customize the Datadog platform. By the end of this Datadog book, you will have gained the skills needed to monitor your cloud infrastructure and the software applications running on it using Datadog.
Table of Contents (19 chapters)
1
Section 1: Getting Started with Datadog
9
Section 2: Extending Datadog
14
Section 3: Advanced Monitoring

Types of monitoring

There are different types of monitoring. All types of monitoring that are relevant to a software system must be implemented to make it a comprehensive solution. Another aspect to consider is the business need of rolling out a certain type of monitoring. For example, if customers of a software service insist on securing the application they subscribe to, the software provider has to roll out security monitoring.

(The discussion on types of monitoring in this section originally appeared in the article Proactive Monitoring, published by me on DevOps.com.)

Figure 1.8 – Types of monitoring

Figure 1.8 – Types of monitoring

Let us now explore these types of monitoring in detail.

Infrastructure monitoring

The infrastructure that runs the application system is made up of multiple components: servers, storage devices, a load balancer, and so on. Checking the health of these devices is the most basic requirement of monitoring. The popular monitoring platforms support this feature out of the box. Very little customization is required except for setting up the right thresholds on those metrics for alerting.

Platform monitoring

An application system is usually built using multiple third-party tools such as the following:

  • Databases, both RDBMS (MySQL, Postgres) and NoSQL (MongoDB, Couchbase, Cassandra) data repositories
  • Full-text search engines (Elasticsearch)
  • Big data platforms (Hadoop, Spark)
  • Messaging systems (RabbitMQ)
  • Memory object caching systems (Memcached, Redis)
  • BI and reporting tools (MicroStrategy, Tableau)

Checking the health of these application components is important too. Most of these tools provide an interface, mainly via the REST API, that can be leveraged to implement plugins on the main monitoring platform.

Application monitoring

Having a healthy infrastructure and platform is not good enough for an application to function correctly. Defective code from a recent deployment or third-party component issues or incompatible changes with external systems can cause application failures. Application-level checks can be implemented to detect such issues. As mentioned earlier, a functional or integration test would unearth such issues in a testing/staging environment, and an equivalent of that should be implemented in the production environment also.

The implementation of application-level monitoring could be simplified by building hooks or API endpoints in the application. In general, improving the observability of the application is the key.

Monitoring is usually an after-thought and the requirement of such instrumentation is overlooked during the design phase of an application. The participation of the DevOps team in design reviews improves the operability of a system. Planning for application-level monitoring in production is one area where DevOps can provide inputs.

Business monitoring

The application system runs in production to meet certain business goals. You could have an application that runs flawlessly on a healthy infrastructure but still, the business might not be meeting its goals. It is important to provide that feedback to the business at the earliest opportunity to take corrective actions that might trigger enhancements of the application features and/or the way the business is run using the application.

These efforts should only complement the more complex BI-based data analysis methods that could provide deeper insights into the state of the business. Business-level monitoring can be based on transactional data readily available in the data repositories and the data aggregates generated by BI systems.

Both application- and business-level monitoring are company-specific, and plugins have to be developed for such monitoring requirements. Implementing a framework to access standard sources of information such as databases and REST APIs from the monitoring platform could minimize the requirement of building plugins from scratch every time.

Last-mile monitoring

A monitoring platform deployed in the same public cloud or data center environment as where the applications run cannot check the end user experience. To address that gap, there are several SaaS products available on the market, such as Catchpoint and Apica. These services are backed up by actual infrastructure to monitor the applications in specific geographical locations. For example, if you are keen on knowing how your mobile app performs on iPhones in Chicago, that could be tracked using the service provider's testing infrastructure in Chicago.

Alerts are set up on these tools to notify the site reliability engineering team if the application is not accessible externally or if there are performance issues with the application.

Log aggregation

In a production environment, a huge amount of information is logged in various log files, by operating system, platform components, and application. They will get some attention when issues happen and normally are ignored otherwise. Traditional monitoring tools such as Nagios couldn't handle the constantly changing log files except for alerting on some patterns.

The advent of log aggregation tools such as Splunk changed that situation. Using aggregated and indexed logs, it is possible to detect issues that would have gone unnoticed earlier. Alerts can be set up based on the info available in the indexed log data. For example, Splunk provides a custom query language to search indexes for operational insights. Using APIs provided by these tools, the alerting can actually be integrated with the main monitoring platform.

To leverage the aggregation and indexing capabilities of these tools, structured data outputs can be generated by the application or scripts that will be indexed by the log aggregation tool later.

Meta-monitoring

It is important to make sure that the monitoring infrastructure itself is up and running. Disabling alerting during deployment and forgetting about enabling it later is one of the common oversights that has been seen in operations. Such missteps are hard to monitor and only improvements in the deployment process can address such issues.

Let's look at a couple of popular methods used in meta-monitoring:

Pinging hosts

If there are multiple instances of the monitoring application running, or if there is a standby node, then cross-checks can be implemented to verify the availability of hosts used for monitoring. In AWS, CloudWatch can be used to monitor the availability of an EC2 node.

Health-check for monitoring

Checking on the availability of monitoring UI and activity in a monitoring tool's log files would ensure that the monitoring system itself is fully functional and it continues to watch for issues in the production environment. If a log aggregation tool is used, tracking the application's log files would be the most effective method to check whether there is any activity in the log file. The same index can also be queried for any potential issues by using standard keywords such as Error and Exception.

Noncore monitoring

The monitoring types that have been discussed thus far make up the components of a core monitoring solution. You will see most of these monitoring categories in a comprehensive solution. There are more monitoring types that are highly specialized and that would be important components in a specific business situation.

Security monitoring

Security monitoring is a vast area by itself and there are specialized tools available for that, such as SIEM tools. However, that is slowly changing and general-purpose monitoring tools including Datadog have started offering security features to be more competitive in the market. Security monitoring usually covers these aspects:

  • The vulnerability of the application system, including infrastructure, due to changes made to its state
  • The vulnerability of infrastructure components with respect to known issues
  • Monitoring attacks and catching security breaches

As you can see, these objectives might not strictly be covered by the core monitoring concepts we have discussed thus far and we'll have to bring in a new set of terminology and concepts to understand it better, and we will look at those details later in the book.

Application Performance Monitoring (APM)

As the name suggests, APM helps to fine-tune the application's performance. This is made possible by the improved observability of the application system in which the interoperability of various components is made more visible. Though these monitoring tools started off as dedicated APM solutions, full-stack monitoring is available these days so they can be used for general-purpose monitoring.