Situations similar to the one described above, are actually more common than desired. A system fault that had no symptoms visible before is relatively rare. Probably a subsection of Unix Administration Horror Stories (visit http://www-uxsup.csx.cam.ac.uk/misc/horror.txt) could be easily compiled that contained only stories about faults that were not noticed on time.
As experience shows, problems tend to happen when we are least equipped to solve them. To work with them on our terms we turn to a class of software, commonly referred to as network monitoring software. Such software usually allows us to constantly monitor things happening in a computer network using one or more methods and notify the persons responsible if some metric passes a defined threshold.
One of the first monitoring solutions most administrators implement is a simple shell script, invoked from crontab, that checks some basic parameters like disk usage or some service state, like Apache server. As the server and monitored parameter count grows, a neat and clean script systems starts to grow into a performance-hogging script hairball that costs more time in upkeep than it saves. While do-it-yourself crowds claim that nobody needs dedicated software for most tasks (monitoring included), most administrators will disagree as soon as they have to add switches, UPSes, routers, IP cameras, and a myriad of other devices to the swarm of monitored objects.
So what basic functionality can one expect from a monitoring solution? They are as follows:
Data gathering: This is where everything starts. Usually data will be gathered using various methods, including SNMP, agents, IPMI, and others.
Alerting: Gathered data can be compared to thresholds and alerts sent out when needed using different channels, like e-mail or SMS.
Data storage: Once we have gathered the data it doesn't make sense to throw it away, so we will often want to store it for later analysis.
Visualization: Humans are better at distinguishing visualized data than raw numbers, especially when there are huge amounts of them. As we have data already gathered and stored, it is trivial to generate simple graphs from it.
Sounds simple? That's because it is. But then we start to desire more features like easy and efficient configuration, escalations, permission delegation, and so on. If we sit down and start listing the things we want to keep an eye out for, it may turn out that area of interest extends beyond the network for example, a hard drive that has SMART errors logged, an application that has too many threads, or a UPS that has one phase overloaded. It is much easier to manage monitoring of all these different problem categories from a single configuration point.
In the quest for a manageable monitoring system wondrous adventurers stumbled upon collections of scripts much like the way they implemented themselves, obscure and not so obscure workstation-level software, and heavy, expensive monitoring systems from big vendors.
Another group is open source monitoring systems that have various sophistication levels, one of which is Zabbix.