Technically, monitoring is not part of a software system running in production. The applications in a software system can run without any monitoring tools rolled out. As a best practice, software applications must be decoupled from monitoring tools anyway.
This scenario sometimes results in taking software systems to production with minimal or no monitoring, which would eventually result in issues going unnoticed, or, in the worst-case scenario, users of those systems discovering those issues while using the software service. Such situations are not good for the business due to these reasons:
- An issue in production impacts the business continuity of the customers and, usually, there would be a financial cost associated with it.
- Unscheduled downtime of a software service would leave a negative impression on the users about the software service and its provider.
- Unplanned downtime usually creates chaos at the business level and triaging and resolving such issues can be stressful for everyone involved and expensive to the businesses impacted by it.
One of the mitigating steps taken in response to a production issue is adding some monitoring so the same issue will be caught and reported to the operations team. Usually, such a reactive approach increases the coverage of monitoring organically, but not following a monitoring strategy. While such an approach will help to catch issues sometimes, there is no guarantee that an organically grown monitoring infrastructure would be capable of checking the health of the software system and warning about it, so remediation steps can be taken proactively to minimize outages in the future.
Proactive monitoring refers to rolling out monitoring solutions for a software system to report on issues with the components of the software system, and the infrastructure the system runs on. Such reporting can help with averting an impending issue by taking mitigating steps manually or automatically. The latter method is usually called self-healing, a highly desirable end state of monitoring, but hard to implement.
Implementing a comprehensive monitoring solution
Traditionally, the focus of monitoring has been the infrastructure components – compute, storage, and network. As you will see later in this chapter, there are more aspects of monitoring that would make the list complete. All relevant types of monitoring have to be implemented for a software system so issues with any component, software, or infrastructure would be caught and reported.
Setting up alerts to warn of impending issues
The monitoring solution must be designed to warn of impending issues with the software system. This is easy with infrastructure components as it is easy to track metrics such as memory usage, CPU utilization, and disk space, and alert on any usage over the limits.
However, such a requirement would be tricky at the application level. Sometimes applications can fail on perfectly configured infrastructure. To mitigate that, software applications should provide insights into what is going under the hood. In monitoring jargon, it is called observability these days and we will see later in the book how that can be implemented in Datadog.
Having a feedback loop
A mature monitoring system warning of impending issues that would help to take mitigation steps is not good enough. Such warnings must also be used to resolve issues automatically (for example, spinning off a new virtual machine with enough disk space when an existing virtual host runs out of disk space), or be fed into the redesigning of the application or infrastructure to avoid the issue from happening in the future.