Learning Nagios - Third Edition

By: Wojciech Kocjan, Piotr Beltowski

Overview of this book

Nagios is a powerful and widely used IT monitoring and management tool for problem solving. It detects problems in your organization's infrastructure and helps resolve them before they impact the business. Following the success of the previous edition, this book continues to help you monitor the status of network devices and notify system administrators of network problems. Starting with the fundamentals, the book teaches you how to install and configure Nagios for your environment. You will then learn how to end downtimes, add comments, and generate reports using the built-in web interface of Nagios. Moving on, you will be introduced to third-party web interfaces and applications for checking status and reporting specific information. As you progress further in Learning Nagios, you will focus on the standard set of Nagios plugins and learn how to manage large configurations efficiently using templates. Once you are up to speed with this, you will get to know the concept and workings of notifications and events in Nagios. The book then covers the concept of passive checks and shows how to use NRDP (Nagios Remote Data Processor). The focus then shifts to how Nagios checks can be run on remote machines and how SNMP (Simple Network Management Protocol) can be used from Nagios. Lastly, the book demonstrates how to extend Nagios by creating custom check commands and custom ways of notifying users, and shows how passive checks and NRDP can be used to integrate your solutions with Nagios. By the end of the book, you will be a competent system administrator who can monitor mid-size businesses or even large-scale enterprises.
Table of Contents (19 chapters)
Chapter 1. Introducing Nagios

Imagine you're an administrator of a large IT infrastructure. You have just started receiving e-mails reporting that a web application has suddenly stopped working. When you try to access the same page, it just does not load. What are the possibilities? Is it the router? Maybe the firewall? Perhaps the machine hosting the page is down? Has the server process crashed? Before you even start thinking rationally about what to do, your boss calls about the critical situation and demands explanations. In all this panic, you'll probably start plugging everything in and out of the network and rebooting the machine... and it still doesn't help.

After hours of nervous digging into the issue, you've finally found the root cause: although the web server was working properly, it continuously timed out during communication with the database server. This is because the machine with the database did not get an IP address assigned. Your organization requires all IP addresses to be configured using DHCP, and the local DHCP server ran out of memory and killed several processes, including the dhcpd process responsible for assigning IP addresses. Imagine how much time it would take to determine all this manually! To make things worse, the database server could be located in another branch of the company or in a different time zone, and it could be the middle of the night over there.

But what if you had Nagios up and running across your entire company? You would just go to the web interface and see that there are no problems with the web server and the machine on which it is running. There would also be a list of issues: the machine serving IP addresses to the entire company is not doing its job, and the database is down. If the setup also monitored the DHCP server, you'd get a warning e-mail that little swap memory is available or that too many processes are running. Maybe it would even have an event handler for such cases to just kill or restart non-critical processes. Nagios would also try to restart the dhcpd process over the network in case it is down.

In the worst case, Nagios would reduce hours of investigation to ten minutes. Ideally, you would just get an e-mail that there was such a problem and another e-mail that it's already fixed. You would then disable a few services and increase the swap size on the DHCP machine, solving the problem permanently. Hopefully, it would be solved fast enough that nobody would notice there was a problem in the first place!