Troubleshooting the Nagios plugin execution environment can be quite tricky. Normally the method for troubleshooting a check is to execute that check from the command line via sudo. However, this is sometimes not enough, particularly when the plugin relies on a specific set of environment variables, and those variables aren’t set properly, or aren’t set for the user executing the plugin.
I recently upgraded the Oracle client on a Nagios host. The upgrade was prompted by other service requirements that share that host – we needed the newest oci8 for PHP. As a result of an oversight, I forgot to restart Nagios at the time of the upgrade.
Several weeks later, I upgraded the memory in the Oracle DBs – which prompted a restart of Nagios.
All the local Oracle monitors broke.
Needless to say, this caused a bit of concern, given I’d just touched a bunch of Oracle systems, and now Nagios was yelling that “all the Oracles are dead!”
The first problem was that I’d replaced the old client with the Instant Client, which is a no-no given that the Instant Client doesn’t include TNSPING. I backed out the Instant Client installation, and installed the newest full Client from source.
All the local Oracle monitors were still broken. Huh?
If I set the checks to a passive OK value, they would return to CRITICAL during the next round.
Now I had to troubleshoot Nagios plugin execution in earnest. Running the check commands on the command line, as Nagios, worked just fine:
rbellamy@admin ~ $ sudo -u nagios /usr/local/nagios/libexec/check_oracle --tns 10.0.xxx.xxx
OK - reply time 0 msec from 10.0.xxx.xxx
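A clean result under sudo does not prove much on its own, because the sudo session may carry environment variables (ORACLE_HOME, a longer PATH, and so on) that the daemon never sees. A minimal sketch of a stricter test is to rerun the same check with the environment stripped down. The helper name run_with_bare_env and the PATH value are assumptions for illustration, not anything Nagios provides:

```shell
#!/bin/sh
# Hypothetical helper: run a command with an almost empty environment,
# roughly what a daemon started from an init script might see.
# The PATH value below is an assumption; adjust it to match your system.
run_with_bare_env() {
    env -i PATH=/usr/bin:/bin "$@"
}

# Example (not executed here), using the plugin path from this post:
#   sudo -u nagios env -i PATH=/usr/bin:/bin \
#       /usr/local/nagios/libexec/check_oracle --tns 10.0.xxx.xxx
```

If the check passes under plain sudo but fails when run this way, the environment is the likely culprit rather than the plugin or the database.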
Trying to track down the plugin execution inside Nagios, using the normal tools, didn’t yield anything actionable. Setting the Debug Log Level to -1 in nagios.cfg, and monitoring the logs, didn’t give me much insight – everything looked like it was behaving itself. I think this is mostly a function of the check_oracle plugin swallowing errors and giving only sanitized output. Basically check_oracle tests for success or failure, and if it fails sets the return value accordingly and sends a generic failure status to stdout, and doesn’t ‘leak’ anything via stderr.
So, then I altered the check_oracle command definition in checkcommands.cfg to use the shell debug switch:
# 'debug_check_oracle' command definition
command_line sh -x $USER1$/check_oracle $ARG1$ $ARG2$
And then I created an appropriate entry in my org.cfg file to check for a TNS Listener, and a Dummy Login.
Now I could tell that check_oracle couldn’t find TNSPING or SQLPLUS (for the --tns and --login checks, respectively). I reasoned that something was wrong with $PATH: it looked like the default path set for all shells, without regard for /etc/profile or /etc/bashrc. However, I couldn’t say this was the case with certainty, and given I’d already spent a considerable amount of time tracking this problem down by the hunt-and-guess method, I had no patience for anything less than a definitive troubleshooting session.
I knew I was on the right track, but was confounded by the difficulty in finding the execution environment under which the service check was running. I did some further research and happened on this little gem called capture_plugin.pl.
As its title suggests, and the author notes:
This little perl-scripts (sic) captures STDOUT and STDERR of a nagios plugin. It writes the captured output to a temp. (sic) file. Then the output is returned to Nagios for further processing.
Although this wonderfully simple piece of perl is very helpful just as it is, it wasn’t quite enough for me since I needed to know what the execution environment looked like during the execution of the plugin. So I went about mucking with this plugin, and finally came up with something that allowed me to get a snapshot of the environment variables before and after.
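For readers who just want the general idea, the same approach can be sketched in a few lines of shell. This is not capture_plugin.pl and not the modified version described here; the function name and the CAPTURE_LOG variable are assumptions, and it takes a single environment snapshot because a wrapper's own environment does not change while the plugin runs:

```shell
#!/bin/sh
# Hypothetical shell sketch of the capture idea; this is NOT capture_plugin.pl.
# CAPTURE_LOG is an assumed variable name; the default path matches the log
# location mentioned in this post.
capture_plugin() {
    log="${CAPTURE_LOG:-/tmp/captured-plugins.log}"
    err_file=$(mktemp) || return 3   # 3 is UNKNOWN in Nagios terms

    # Record who is running the plugin and with what environment.
    {
        echo "---- $(date) ----"
        echo "command: $*"
        echo "user: $(id -un 2>/dev/null) (uid $(id -u))"
        echo "environment:"
        env
    } >> "$log"

    # Run the plugin, keeping stdout in a variable and stderr in a temp file.
    out=$("$@" 2> "$err_file")
    rc=$?

    {
        echo "return code: $rc"
        echo "stdout: $out"
        echo "stderr: $(cat "$err_file")"
    } >> "$log"
    rm -f "$err_file"

    # Hand the original output and exit status back to Nagios unchanged.
    printf '%s\n' "$out"
    return "$rc"
}
```

Because the wrapper passes stdout and the exit status through untouched, Nagios should see the same result it would get from calling the plugin directly.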
You can find my modified version of the file here. Just set things up so you’ve got a command defined for the specific plugin you’re capturing. In my case I called it capture_check_oracle:
# Used for debugging.
# Found at http://www.waggy.at/nagios/capture_plugin.txt
# G. Richard Bellamy <firstname.lastname@example.org>
# 'capture_plugin' command definition
command_line $USER1$/capture_plugin $USER1$/check_oracle $ARG1$ $ARG2$
And then I created a check duplicating one of the checks that was failing (the most important part is in bold):
service_description Capture Oracle 10.0.xxx.xxx Login
The log is output to /tmp/captured-plugins.log, and gives you nice data about the return code, stdout and stderr of your plugin. The first thing I noticed was that in fact I was correct, $PATH had the default values, and that the rest of the environment was completely missing.
On inspection, $USER was equal to “root”: the plugin was running as root, not as the nagios user I had expected. Once again: huh? This is contrary to everything I understood about Nagios, and to all the documentation. In fact, I’d always thought you had to force a plugin to run as root, via setuid or some other check-specific or plugin-specific construct to elevate to that user. Nagios runs as the nagios user!
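Since $USER is only an environment variable and can be inherited from whoever started the daemon, it may be worth confirming the effective user directly. One way, sketched below under the assumption that you can add a throwaway check, is a trivial plugin that reports what id says along with the PATH it was given. The name check_whoami is hypothetical:

```shell
#!/bin/sh
# Hypothetical diagnostic check: report the effective user and PATH that a
# plugin actually runs with. Exit status 0 maps to OK in Nagios.
# When saving this as a standalone script, call check_whoami on the last line.
check_whoami() {
    printf 'OK - user=%s uid=%s PATH=%s\n' \
        "$(id -un 2>/dev/null)" "$(id -u)" "$PATH"
    return 0
}
```

Wired to a command definition like the ones above, the status line in the web interface would then show which user and PATH the daemon hands to its plugins.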
I’ve asked around on the IRC channel, and done some more reading, without much luck. So my solution is to modify the init script and execute the Nagios binary as the Nagios user:
#$NagiosBin -d $NagiosCfgFile <== this is the default - I don't use this anymore
su - $NagiosUser -c "$NagiosBin -d $NagiosCfgFile" # here we go, this does the trick.
It’s working now, and I’m continuing my investigations to see why this happened, why this fixed it, and how Nagios should be launched so that this doesn’t happen again. Until I learn otherwise, I’ll be explicitly launching the Nagios binary as the Nagios user.