Zabbix Network Monitoring - Second Edition

By Rihards Olups

Overview of this book

This book is a perfect starting point for monitoring with Zabbix. Even if you have never used a monitoring solution before, this book will get you up and running quickly, before guiding you into more sophisticated operations with ease. You'll soon feel in complete control of your network, ready to meet any challenges you might face. Beginning with installation, you'll learn the basics of data collection before diving deeper to get to grips with native Zabbix agents and SNMP devices. You will also explore Zabbix's integrated functionality for monitoring Java application servers and VMware. Beyond this, Zabbix Network Monitoring also covers notifications, permission management, system maintenance, and troubleshooting - so you can be confident that every potential challenge and task is under your control. If you're working with larger environments, you'll also be able to find out more about distributed data collection using Zabbix proxies. Once you're confident and ready to put these concepts into practice, you'll find out how to optimize and improve performance. Troubleshooting network issues is vital for anyone working with Zabbix, so the book is also on hand to help you work through any technical snags and glitches you might face. Network monitoring doesn't have to be a chore - learn the tricks of the Zabbix trade and make sure your network is performing for everyone who depends upon it.

Troubleshooting Zabbix


All of the previous Q&As cover a few of the most common issues new users might encounter. There are a lot of other issues one might run into, and with new versions of Zabbix, new issues will appear. While it's good to have quick solutions to common problems, let's look at some details that could be helpful when debugging Zabbix problems.

The Zabbix log file format

One of the first places we should check when there's an unexplained issue is log files. This is not just a Zabbix-specific thing; log files are great. Sometimes. Other times, they do not help, but we will discuss some other options for when log files do not provide the answer. To be able to find the answer, though, it is helpful to know some basics about the log file format. The Zabbix log format is as follows:

PPPPPP:YYYYMMDD:HHMMSS.mmm

Here, PPPPPP is the process ID, space-padded to 6 characters; YYYYMMDD is the current date; HHMMSS is the current time; and mmm is the milliseconds of the timestamp. The colons and the dot are literal symbols. This prefix is followed by a space and then the actual log message. Here's an example log entry:

10372:20151223:134406.865 database is down: reconnecting in 10 seconds

If there's a line in the log file without this prefix, it is most likely coming from an external source such as a script, or maybe from some library such as Net-SNMP.
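For example, to quickly pull out any such lines that lack the standard prefix, we can invert a grep on the prefix pattern. A minimal sketch, assuming the server log is at /tmp/zabbix_server.log as in the examples later in this chapter:

# grep -vE '^ *[0-9]+:[0-9]{8}:[0-9]{6}\.[0-9]{3} ' /tmp/zabbix_server.log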

During startup, output similar to the following will be logged:

12331:20151215:163629.968 Starting Zabbix Server. Zabbix 3.0.0 (revision {ZABBIX_REVISION}).
12331:20151215:163630.020 ****** Enabled features ******
12331:20151215:163630.020 SNMP monitoring:           YES
12331:20151215:163630.020 IPMI monitoring:           YES
12331:20151215:163630.020 Web monitoring:            YES
12331:20151215:163630.020 VMware monitoring:         YES
12331:20151215:163630.020 SMTP authentication:       YES
12331:20151215:163630.020 Jabber notifications:       NO
12331:20151215:163630.020 Ez Texting notifications:  YES
12331:20151215:163630.020 ODBC:                      YES
12331:20151215:163630.020 SSH2 support:              YES
12331:20151215:163630.020 IPv6 support:               NO
12331:20151215:163630.020 TLS support:                NO
12331:20151215:163630.020 ******************************
12331:20151215:163630.020 using configuration file: /usr/local/etc/zabbix_server.conf
12331:20151215:163630.067 current database version (mandatory/optional): 03000000/03000000
12331:20151215:163630.067 required mandatory version: 03000000

The first line prints out the daemon type and version. Depending on how it was compiled, it might also include the current SVN revision number. A list of the compiled-in features follows—very useful to know whether you should expect SNMP, IPMI, or VMware monitoring to work at all. Then, the path to the currently used configuration file is shown—helpful when we want to figure out whether the file we changed was the correct one. In the server and proxy log files, both the current and the required database versions are present—we discussed those in Chapter 22, Zabbix Maintenance.
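If all we need is to confirm a single feature or the configuration file path, grepping the log for the relevant startup line is quicker than scrolling through it. A couple of illustrative examples, again assuming the log is at /tmp/zabbix_server.log:

# grep 'SNMP monitoring' /tmp/zabbix_server.log | tail -n 1
12331:20151215:163630.020 SNMP monitoring:           YES
# grep 'using configuration file' /tmp/zabbix_server.log | tail -n 1
12331:20151215:163630.020 using configuration file: /usr/local/etc/zabbix_server.conf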

After the database versions, the internal process startup messages can be found:

2583:20151231:155712.323 server #0 started [main process]
2592:20151231:155712.334 server #5 started [poller #3]
2591:20151231:155712.336 server #4 started [poller #2]
2590:20151231:155712.337 server #3 started [poller #1]
2593:20151231:155712.339 server #6 started [poller #4]

There will be many more lines like these; the output here is trimmed. This might help verify that the expected number of processes of some type has been started. When looking at log file contents, it is not always obvious which process logged a specific line—and this is where the startup messages can help. If we see a line such as the following, we can find out which process logged it:

21974:20151231:184520.117 Zabbix agent item "vfs.fs.size[/,free]" on host "A test host" failed: another network error, wait for 15 seconds

We can do that by looking for the startup message with the same PID:

# grep 21974 zabbix_server.log | grep started
21974:20151231:184352.921 server #8 started [unreachable poller #1]

Tip

If more than one line is returned, apply common sense to figure out which one is the actual startup message.

This demonstrates that hosts are deferred to the unreachable poller after the first network failure.

But what if the log file has been rotated and the original startup messages are lost? Besides more advanced detective work, there's a simple method, provided that the daemon is still running. We will look at that method a bit later in this chapter.

Reloading the configuration cache

We met the configuration cache in Chapter 2, Getting Your First Notification, and we discussed ways to monitor it in Chapter 22, Zabbix Maintenance. While it helps a lot performance-wise, it can be a bit of a problem if we are trying to quickly test something. It is possible to force Zabbix server to reload the configuration cache. Run the following to display Zabbix server options:

# zabbix_server --help

Tip

We briefly discussed Zabbix proxy configuration cache reloading in Chapter 19, Using Proxies to Monitor Remote Locations.

In the output, look for the runtime control options section:

  -R --runtime-control runtime-option   Perform administrative functions

Runtime control options:
  config_cache_reload             Reload configuration cache

Thus, reloading the server configuration cache can be initiated by the following:

# zabbix_server --runtime-control config_cache_reload
zabbix_server [2682]: command sent successfully

Examining the server log file will reveal that it has received the signal:

forced reloading of the configuration cache
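One convenient way to catch that message is to watch the log in one terminal while sending the runtime command from another; a small example, assuming the log file location used elsewhere in this chapter:

# tail -f /tmp/zabbix_server.log | grep -i 'configuration cache'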

In the background, the sending of the signal happens like this (a quick way to check these steps by hand is sketched after the list):

  1. The server binary looks up the default configuration file.

  2. It then looks for the file specified in the PidFile option.

  3. It sends the signal to the process with that ID.
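
If the command does not seem to reach the daemon, these pieces can be checked by hand. A minimal sketch, assuming the configuration file path from the startup log earlier and the common default PID file location (adjust both to match your installation):

# grep -E '^PidFile' /usr/local/etc/zabbix_server.conf
PidFile=/tmp/zabbix_server.pid
# cat /tmp/zabbix_server.pid
12331

If PidFile is commented out in the configuration file, the first command will return nothing and the built-in default location is used.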

As discussed in Chapter 19, Using Proxies to Monitor Remote Locations, the great thing about this feature is that it's also supported for active Zabbix proxies. Even better, when an active proxy is instructed to reload its configuration cache, it connects to the Zabbix server, gets all the latest configuration, and then reloads the local configuration cache. If such a signal is sent to a passive proxy, it ignores the signal.

What if you have several proxies running on the same system—how can you tell the binary which exact instance should reload the configuration cache? Looking back at the steps that were taken to deliver the signal to the process, all that is needed is specifying the correct configuration file. If running several proxies on the same system, each must have its own configuration file already, specifying different PID files, log files, listening ports, and so on. Instructing a proxy that uses a specific configuration file to reload its configuration cache is as simple as this:

# zabbix_proxy -c /path/to/zabbix_proxy.conf --runtime-control config_cache_reload

Note

The full, absolute path must be provided for the configuration file; a relative path is not supported.

Tip

The same principle applies for servers and proxies, but it is even less common to run several Zabbix servers on the same system.

Manually reloading the configuration cache is useful if we have a large Zabbix server instance and have significantly increased the CacheUpdateFrequency parameter.
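For reference, that parameter is set in the server configuration file; a fragment like the following (the value is purely illustrative, the default being 60 seconds) would mean configuration changes are normally picked up only every 15 minutes, which makes manual reloading much more attractive:

# In zabbix_server.conf; illustrative value, the default is 60 seconds
CacheUpdateFrequency=900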

Controlling running daemons

A configuration cache reload was only one of the things available in the runtime control section. Let's look at the remaining options there:

housekeeper_execute             Execute the housekeeper
log_level_increase=target       Increase log level, affects all processes if target is not specified
log_level_decrease=target       Decrease log level, affects all processes if target is not specified

Log level control targets:
  pid                           Process identifier
  process-type                  All processes of specified type (for example, poller)
  process-type,N                Process type and number (for example, poller,3)

As discussed in Chapter 22, Zabbix Maintenance, the internal housekeeper is first run 30 minutes after the server or proxy startup. The housekeeper_execute runtime option allows us to run it at will:

# zabbix_server --runtime-control housekeeper_execute

Even more interesting is the ability to change the log level for a running process. This feature first appeared in Zabbix 2.4, and it made debugging much, much easier. Zabbix daemons are usually started and just work—until we have to change something. While we cannot tell any of the daemons to re-read their configuration file, there are a few more options that allow us to control some aspects of a running daemon. As briefly mentioned in Chapter 22, Zabbix Maintenance, the DebugLevel parameter allows us to set the log level when the daemon starts, with the default being 3. Log level 4 adds all the SQL queries, and log level 5 also adds the received content from web monitoring and VMware monitoring. For the uninitiated, anything above level 3 can be very surprising and intimidating. Even a very small Zabbix server can easily log tens of megabytes in a few minutes at log level 4. As some problem might not appear immediately, one might have to run it for hours or days at log level 4 or 5. Imagine dealing with gigabytes of logs you are not familiar with. The ability to set the log level for a running process allows us to increase the log level during a problem situation and lower it later, without requiring a daemon restart.
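As a reminder, the startup-time level comes from the daemon configuration file; a fragment like the following (the value shown is the default) is the baseline that the runtime commands below then raise or lower:

# In zabbix_server.conf; 3 is the default, 4 adds SQL queries, 5 adds web and VMware content
DebugLevel=3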

Even better, when using the runtime log level feature, we can select which exact components should have their log level changed. Individual processes can be identified by either their system PID or by the process number inside Zabbix. Specifying processes by the system PID could be done like this:

# zabbix_server --runtime-control log_level_increase=1313

Specifying an individual Zabbix process is done by choosing the process type and then passing the process number:

# zabbix_server --runtime-control log_level_increase=trapper,3

A fairly useful and common approach is changing the log level for all processes of a certain type—for example, we don't know which trapper will receive the connection that causes the problem, so we could easily increase the log level for all trappers by omitting the process number:

# zabbix_server --runtime-control log_level_increase=trapper

And if no parameter is passed to this runtime option, it will affect all Zabbix processes:

# zabbix_server --runtime-control log_level_increase

When processes are told to change their log level, they log an entry about it and then change the log level:

21975:20151231:190556.881 log level has been increased to 4 (debug)

Note that there is no way to query the current log level or set a specific level. If you are not sure about the current log level of all the processes, there are two ways to sort it out:

  • Restart the daemon

  • Decrease or increase the log level five times so that it's guaranteed to be at 0 or 5, then set the desired level (see the sketch after this list)
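
A minimal sketch of the second approach, assuming we want all pollers to end up at the default level 3: decrease five times to be certain they are at 0, then increase three times. The short sleep simply keeps the resulting log messages easy to follow:

# for i in 1 2 3 4 5; do zabbix_server --runtime-control log_level_decrease=poller; sleep 1; done
# for i in 1 2 3; do zabbix_server --runtime-control log_level_increase=poller; sleep 1; done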

As a simple test of the options we just explored, increase the log level for all pollers:

# zabbix_server --runtime-control log_level_increase=poller

Follow the Zabbix server log file:

# tail -f /tmp/zabbix_server.log

Notice the amount of data just 5 poller processes on a tiny Zabbix server can generate. Then decrease the log level:

# zabbix_server --runtime-control log_level_decrease=poller

Runtime process status

Zabbix has another small trick to help with debugging. Run top and see which sorting mode gives you a more stable and longer list of Zabbix processes: sorting by processor usage (hit Shift + P) or by memory usage (hit Shift + M).

Tip

Alternatively, hit o and type COMMAND=zabbix_server.

Press C and notice how the Zabbix processes have updated their command lines to show which exact internal process each one is and what it is doing:

zabbix_server: poller #1 [got 0 values in 0.000005 sec, idle 1 sec]
zabbix_server: poller #4 [got 1 values in 0.000089 sec, idle 1 sec]
zabbix_server: poller #5 [got 0 values in 0.000004 sec, idle 1 sec]

Follow their status and see how the task and the time it takes change for some of the processes. We can also produce output that can be redirected or filtered through other commands:

# top -c -b | grep zabbix_server

The -c option tells top to show the command line, the same thing we achieved by hitting C before. The -b option tells top to run in batch mode, not accepting input and just outputting the results. We could also specify -n 1 to run it only once, or any other number as needed.
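Putting those flags together, a one-shot snapshot that is easy to redirect to a file looks like this:

# top -c -b -n 1 | grep zabbix_server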

It might be more convenient to use ps:

# ps -f -C zabbix_server

The -f flag enables full output, which includes the command line. The -C flag filters by the executable name:

zabbix   21969 21962  0 18:43 ?        00:00:00 zabbix_server: poller #1 [got 0 values in 0.000006 sec, idle 1 sec] 
zabbix   21970 21962  0 18:43 ?        00:00:00 zabbix_server: poller #2 [got 0 values in 0.000008 sec, idle 1 sec] 
zabbix   21971 21962  0 18:43 ?        00:00:00 zabbix_server: poller #3 [got 0 values in 0.000004 sec, idle 1 sec]

The full format prints out some extra columns—if all we needed was the PID and the command line, we could limit columns in the output with the -o flag, like this:

# ps -o pid=,command= -C zabbix_server
21975 zabbix_server: trapper #1 [processed data in 0.000150 sec, waiting for connection]
21976 zabbix_server: trapper #2 [processed data in 0.001312 sec, waiting for connection]

Tip

The equals sign after pid and command tells ps not to use any header for these columns.

And to see a dynamic list that shows the current status, we can use the watch command:

# watch -n 1 'ps -o pid=,command= -C zabbix_server'

This list will be updated every second. Note that the interval parameter -n also accepts decimals, so to update twice every second, we could use -n 0.5.

This is also the method to find out which PID corresponds to which process type if startup messages are not available in the log file—we can see the process type and PID in the output of top or ps.
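For example, to check what the process with PID 21974 from the earlier log line is without relying on the startup messages, something like this will do, provided the process is still running:

# ps -o pid=,command= -C zabbix_server | awk '$1 == 21974'

In our case, this should identify PID 21974 as unreachable poller #1, matching the startup message we found earlier.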

Further debugging

There are a lot of things that could go wrong, and a lot of tools to help with finding out why. If you are familiar with the toolbox, including tools such as tcpdump, strace, ltrace, and pmap, you should be able to figure out most Zabbix problems.

Tip

Some people claim that everything is a DNS problem. Often, they are right—if nothing else helps, check the DNS. Just in case.

We won't discuss them here, as it would be quite out of scope—that's general Linux or Unix debugging. Of course, there are still a lot of Zabbix-specific things that could go wrong. You might want to check out the Zabbix troubleshooting page on the wiki: http://zabbix.org/wiki/Troubleshooting. If that does not help, make sure to check the community and commercial support options, such as the Zabbix IRC channel we will discuss in Appendix B, Being Part of the Community.