Book Image

Learning Nagios 3.0

Book Image

Learning Nagios 3.0

Overview of this book

Table of Contents (16 chapters)
Learning Nagios 3.0
Credits
About the Author
About the Reviewer
Preface

Soft and Hard States


Nagios works by checking if a particular host or service is working correctly and storing its status. Because the status of a service is only one of the four possible values, it is crucial that it actually reflects what the current status is. In order to avoid detecting random and temporary problems, Nagios uses soft and hard states to describe what the current status of a host or service is.

Imagine that an administrator is restarting a web server and this operation makes connection to the web pages unavailable for five seconds. As, usually, such restarts are done at night to lower the number of users affected, this is an acceptable period of time. However, a problem might arise when Nagios tries to connect to the server and notices that it is actually down. If it relies only on a single result, Nagios would trigger an alert that a web server is down. It would actually be up and running again in a few seconds, but it could take a couple of minutes for Nagios to find that out.

To handle situations when a service is down for a very short time, or the test has temporarily failed, soft states were introduced. When the status of a check is unknown, or it is different from the previous one, Nagios will retest the host or service several times to make sure that the change is persistent. The number of checks is specified in the host or service configuration. Nagios assumes that the new result is a soft state. After additional tests have verified that the new state is permanent, it is considered a hard state.

Each host and service definition specifies the number of retries to be performed before it can be assumed that a change is permanent. This allows more flexibility over how many failures should be treated as an actual problem instead of a temporary one. Setting the number of checks to one will cause all changes to be treated as hard instantly. The following is an illustration of soft and hard state changes, assuming that number of checks to be performed is set to three:

This feature allows ignoring short outages of a service. It is also very useful for performing checks that can periodically fail even if everything is working correctly. Monitoring devices over SNMP is also an example where a single check might fail, but the check will eventually succeed during the second or third check.