Nagios works by checking if a particular host or service is working correctly and storing its status. Because the status of a service is only one of our possible values, it is crucial that it actually reflects what the current status is. In order to avoid detecting random and temporary failures, Nagios uses soft and hard states to describe what the current status is for a host or service.
Imagine that an administrator is restarting a Web server, and this operation makes connecting to the webpages unavailable for 5 seconds. Since such restarts are usually done at night to lower the number of users affected, this is an acceptable period of time. However, a problem might be that Nagios will try to connect to the server and notice it is actually down. If it would only rely on a single result, Nagios could trigger an alert that a Web server is down. It would actually be up and running again in a few seconds, but it could take a couple of minutes for Nagios to find that out.
To handle situations where a service is down for a very short time, or the test has temporarily failed, soft states were introduced. When a previous status of a check is unknown or is different from the previous one, Nagios will re-test the host or service a couple of times to make sure the change is permanent. Nagios assumes that the new result is a soft state. After additional tests have verified that the new state is permanent, it is considered a hard state.
Each host and service check defines the number of retries to perform before assuming a change is permanent. This allows more flexibility over how many failures should be treated as an actual problem instead of a temporary one. Setting the number of checks to 1 will cause all changes to be treated as hard instantly. The following figure is an illustration of soft and hard state changes, assuming that number of checks to be performed is set to 3:
This feature is very useful for checks that should skip short outages of a service or use a protocol that might fail in case of extensive traffic—such as ICMP or UDP. Monitoring devices over SNMP is also an example of a check that can fail in cases where a single check fails; nevertheless, the check will eventually succeed during the second or third check.