Book Image

Datadog Cloud Monitoring Quick Start Guide

By : Thomas Kurian Theakanath
Book Image

Datadog Cloud Monitoring Quick Start Guide

By: Thomas Kurian Theakanath

Overview of this book

Datadog is an essential cloud monitoring and operational analytics tool which enables the monitoring of servers, virtual machines, containers, databases, third-party tools, and application services. IT and DevOps teams can easily leverage Datadog to monitor infrastructure and cloud services, and this book will show you how. The book starts by describing basic monitoring concepts and types of monitoring that are rolled out in a large-scale IT production engineering environment. Moving on, the book covers how standard monitoring features are implemented on the Datadog platform and how they can be rolled out in a real-world production environment. As you advance, you'll discover how Datadog is integrated with popular software components that are used to build cloud platforms. The book also provides details on how to use monitoring standards such as Java Management Extensions (JMX) and StatsD to extend the Datadog platform. Finally, you'll get to grips with monitoring fundamentals, learn how monitoring can be rolled out using Datadog proactively, and find out how to extend and customize the Datadog platform. By the end of this Datadog book, you will have gained the skills needed to monitor your cloud infrastructure and the software applications running on it using Datadog.
Table of Contents (19 chapters)
Section 1: Getting Started with Datadog
Section 2: Extending Datadog
Section 3: Advanced Monitoring

Overview of monitoring tools

In this section, you will obtain a good understanding of all the popular monitoring tools available on the market that will help you to evaluate Datadog better.

There are lots of monitoring tools available on the market, from open source, freeware products through licensed and cloud-based. While lots of tools such as Datadog are general-purpose applications that cover various monitoring types we have discussed earlier, some tools, such as Splunk and AppDynamics, address very specialized monitoring problems.

One challenge a DevOps architect would encounter when planning a monitoring solution is to evaluate the available tools for rolling out a proactive monitoring solution. In that respect, as we will see in this book, Datadog stands out as one of the best general-purpose monitoring tools as it supports the core monitoring features and also provides some non-core features such as security monitoring.

To bring some structure to the large and varied collection of monitoring tools available on the market, they are classified into three broad categories on the basis of where they actually run. Some of these applications are offered both on-premises and as a SaaS solution.

We will briefly look at what other monitoring applications are available on the market besides Datadog. Some of these applications are competing with Datadog and the rest could be complementary solutions to complete the stack of tools needed for rolling out proactive monitoring.

On-premises tools

This group of monitoring applications have to be deployed on your infrastructure to run alongside the application system. Some of these tools might also be available as an SaaS, and that will be mentioned where needed.

The objective here is to introduce the landscape of the monitoring ecosystem to newcomers to the area and show how varied it is.


Nagios is a popular, first-generation monitoring application that is well known for monitoring systems and network infrastructure. Nagios is general-purpose, open source software that has both free and licensed versions. It is highly flexible software that could be extended using hundreds of plugins available widely. Also, writing plugins and deploying them to meet custom monitoring requirements is relatively easy.


Zabbix is another popular, first-generation monitoring application that is open source and free. It's a general-purpose monitoring application like Nagios.

TICK Stack

TICK stands for Telegraf, InfluxDB, Chronograf, and Kapacitor. These open source software components make up a highly distributed monitoring application stack and it is one of the popular new-generation monitoring platforms. While first-generation monitoring tools are basically monolithic software, new-generation platforms are divided into components that make them flexible and highly scalable. The core components of the TICK Stack perform these tasks:

  • Telegraf: Generates metrics time-series data.
  • InfluxDB: Stores time-series monitoring data for it to be consumed in various ways.
  • Chronograf: Provides a UI for metrics times-series data.
  • Kapacitor: Sets monitors on metrics time-series data.


Prometheus is a popular, new-generation, open source monitoring tool that collects metrics values by scraping the target systems. Basically, a monitoring system relies on collecting data using active checks or the pull method, as we discussed earlier. Prometheus-based monitoring has the following components:

  • The Prometheus server scrapes and stores time-series monitoring data.
  • Alertmanager handles alerts and integrates with other communication platforms, especially escalation tools such as PagerDuty and OpsGenie.
  • Node exporter is an agent that queries the operating system for a variety of metrics and exposes them over HTTP for other services to consume.
  • Grafana is not part of the Prometheus suite of tools specifically, but it is the most popular data visualization tool used along with Prometheus.

The ELK Stack

The ELK Stack is one of the most popular log aggregation and indexing systems currently in use. ELK stands for Elasticsearch, Logstash, and Kibana. Each component performs the following task in the stack:

  • Elasticsearch: It is the search and analytics engine.
  • Logstash: Logstash aggregates and indexes the logs for Elasticsearch.
  • Kibana: It is the UI visualization tool that users use to interact with the stack.

The ELK Stack components are open source software and free versions are available. SaaS versions of the stack are also available from multiple vendors as a licensed software service.


Splunk is pioneering licensed software with a large install base in the log aggregation category of monitoring applications.


Zenoss is a first-generation monitoring application like Nagios and Zabbix.


Cacti is a first-generation monitoring tool primarily known for network monitoring. Its features include automatic network discovery and network map drawing.


Sensu is a modern monitoring platform that recognizes the dynamic nature of infrastructure at various levels. Using Sensu, the monitoring requirements can be implemented as code. The latter feature makes it stand out in a market with a large number of competing monitoring products.


The Sysdig platform offers standard monitoring features available with a modern monitoring system. Its focus on microservices and security makes it an important product to consider.


AppDynamics is primarily known as an Application Performance Monitoring (APM) platform. However, its current version covers standard monitoring features as well. However, tools like this are usually an add-on to a more general-purpose monitoring platform.

SaaS solutions

Most new-generation monitoring tools such as Datadog are primarily offered as monitoring services in the cloud. What this means is that the backend of the monitoring solution is hosted on the cloud, and yet, its agent service must run on-premises to collect metrics data and ship that to the backend. Some tools are available both on-premises and as a cloud service.

Sumo Logic

Sumo Logic is a SaaS service offering for log aggregation and searching primarily. However, its impressive security-related features could also be used as a Security Information and Event Management (SIEM) platform.

New Relic

Though primarily known as an APM platform initially, like AppDynamics, it also supports standard monitoring features.


Dynatrace is also a major player in the APM space, like AppDynamics and New Relic. Besides having the standard APM features, it also positions itself as an AI-driven tool that correlates monitoring events and flags abnormal activities.


Catchpoint is an end user experience monitoring or last-mile monitoring solution. By design, such a service needs to be third-party provided as the related metrics have to be measured close to where the end users are.

There are several product offerings in this type of monitoring. Apica and Pingdom are other well-known vendors in this space.

Cloud-native tools

Popular public cloud platforms such as AWS, Azure, and GCP offer a plethora of services and monitoring is just one of them. Actually, there are multiple services that could be used for monitoring purposes. For example, AWS offers CloudWatch, which is primarily an infrastructure and platform monitoring service, and there are services such as GuardDuty that provide sophisticated security monitoring options.

Cloud-native monitoring services are yet to be widely used as general-purpose monitoring solutions outside of the related cloud platform even though Google operations and Azure Monitor are full-featured monitoring platforms.

However, when it comes to monitoring a cloud-specific compute, storage, or networking service, a cloud-native monitoring tool might be better suited. In such scenarios, the integration provided by the main monitoring platform can be used to consolidate monitoring in one place.

AWS CloudWatch

AWS CloudWatch provides infrastructure-level monitoring for the cloud services offered on AWS. It could be used as an independent platform to augment the main monitoring system or be integrated with the main monitoring system.

Google operations

This monitoring service available on GCP (formerly known as Stackdriver) is a full-stack, API-based monitoring platform that also provides log aggregation and APM features.

Azure Monitor

Azure Monitor is also a full-stack monitoring platform like operations on GCP.

Enterprise monitoring solutions

Though they don't strictly fall into the category of monitoring tools used for rolling out proactive monitoring, there have been other monitoring solutions used in large enterprises to cover varied requirements such as ITIL compliance. Let's look at some of those for the completeness of this overview:

  • IBM Tivoli Netcool/OMNIbus: An SLM system to monitor large, complex networks and IT domains. It's used in large IBM setups.
  • Oracle Enterprise Manager Grid Control: System management software that delivers centralized monitoring, administration, and life cycle management functionality for the complete Oracle IT infrastructure, including non-Oracle technologies. Commonly found in large Oracle hardware and software setups.
  • HPE Oneview: Hewlett Packard's Enterprise integrated IT solution for system management, monitoring, and software-defined infrastructure. Used in big HP, TANDEM, and HPE installations.