Tuesday, November 1, 2011

Nagios Plugin – Advanced Traceroute to check the time between two devices

We had to create a plugin that does the following:
1) Run a typical traceroute from the Nagios box to a destination IP
2) Instead of measuring the time from the Nagios box to the destination host, report the time between two hosts along the path

In other words, a typical traceroute follows a path like
NagiosServer –> Gateway –> Hop 1 –> Hop 2 –> Hop 3 –> Destination

When configured correctly, this plugin checks the time (in ms) from Hop 1 through Hop 3 by adding up the times traceroute reports for each hop in that range, outputs performance data for graphing, and applies warning and critical thresholds for your alerting.
Here’s the sample plugin and the relevant configuration files you will probably need.
NOTE: You may need to tweak it for OSes other than Debian, as it was created and tested on Debian.
The plugin
  • The plugin (typically placed in /usr/local/nagios/libexec)
  • Paste the script below (the lines between the START and END markers) into a file named, say, trace-time
  • Make sure it belongs to user <nagios> and is executable; e.g.
  • chown nagios:nagios /usr/local/nagios/libexec/trace-time
  • chmod +x /usr/local/nagios/libexec/trace-time
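#####START PLUGIN#####
#!/bin/bash
# usage
# ./trace-time <final-dest> <startip> <endip> <warning> <critical>
# Note: You must define all five; the only error checking is for empty values
# tip: do a traceroute first, then determine from which ip to which ip you want to calculate.
DEST=$1        # arguments mapped in the order given by the usage line
IP1=$2
IP2=$3
WARNING=$4
CRITICAL=$5
PROG=`which traceroute`
if [[ $DEST == "" ]]; then
    echo "UNKNOWN: No destination ip defined"
    exit 3
fi

if [[ $IP1 == "" ]]; then
    echo "UNKNOWN: No start ip defined"
    exit 3
fi

if [[ $IP2 == "" ]]; then
    echo "UNKNOWN: No end ip defined"
    exit 3
fi
if [[ $WARNING == "" ]]; then
    echo "UNKNOWN: No warning value defined"
    exit 3
fi
if [[ $CRITICAL == "" ]]; then
    echo "UNKNOWN: No critical value defined"
    exit 3
fi

# numeric comparison via awk; [[ a > b ]] would compare as strings
if awk -v w="$WARNING" -v c="$CRITICAL" 'BEGIN{exit !(w+0 > c+0)}'; then
    echo "UNKNOWN: Warning value larger than critical value"
    exit 3
fi
myepoch=`date +%s`
# ASSUMPTION: the original scratch-file paths were lost; these /tmp names are placeholders
filename=/tmp/trace-time.$myepoch.$$.ms
tempfile=/tmp/trace-time.$myepoch.$$.out
# ASSUMPTION: perfdata label; the original definition of $grapher was lost
grapher="$IP1>$IP2"
/bin/touch $filename
/bin/touch $tempfile
/bin/chown nagios:nagios $filename
/bin/chown nagios:nagios $tempfile
$PROG -n -q 1 $DEST > $tempfile
# hop numbers of the start and end IPs (field 2 is the IP when -n is used)
numberip1=`awk -v ip="$IP1" '$2 == ip {print $1; exit}' $tempfile`
numberip2=`awk -v ip="$IP2" '$2 == ip {print $1; exit}' $tempfile`
if [[ $numberip1 == "" || $numberip2 == "" ]]; then
    rm $filename $tempfile
    echo "UNKNOWN: start or end ip not found in traceroute output"
    exit 3
fi
for i in $(seq $numberip1 $numberip2)
do
    # exact match on the hop number, so hop 1 does not also match 10, 11, ...
    getms=`awk -v hop="$i" '$1 == hop {print $3}' $tempfile`
    echo $getms >> $filename
done
startcalc=`awk '{s+=$0} END {print s+0}' $filename`
rm $filename
rm $tempfile
if awk -v t="$startcalc" -v c="$CRITICAL" 'BEGIN{exit !(t+0 > c+0)}'; then
    echo "CRITICAL($startcalc): Time exceed critical value|'$grapher'=$startcalc;$WARNING;$CRITICAL;;"
    exit 2
elif awk -v t="$startcalc" -v w="$WARNING" 'BEGIN{exit !(t+0 > w+0)}'; then
    echo "WARNING($startcalc): Time exceed warning value|'$grapher'=$startcalc;$WARNING;$CRITICAL;;"
    exit 1
else
    echo "OK($startcalc): Time OK|'$grapher'=$startcalc;$WARNING;$CRITICAL;;"
    exit 0
fi
#####END PLUGIN#####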
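
If you want to see what the summing step does without hitting the network, here is a minimal sketch that runs the same idea against canned traceroute output. The function name sum_hop_range is made up for illustration, the 192.0.2.x addresses are documentation placeholders (not real hops), and the times are borrowed from the sample trace further down. It is only a sketch of the approach, not the plugin itself.

```shell
#!/bin/bash
# Hypothetical helper: sum the per-hop times between two hop IPs (inclusive).
# Expects `traceroute -n -q 1` style lines on stdin: <hop> <ip> <time> ms
sum_hop_range() {
    awk -v first="$1" -v last="$2" '
        $2 == first { inside = 1 }                      # range starts here
        inside && $3 ~ /^[0-9.]+$/ { sum += $3 }        # ignore "*" timeouts
        $2 == last && inside { found = 1; inside = 0 }  # range ends here
        END {
            if (!found) exit 1      # start or end IP missing from the trace
            printf "%.3f\n", sum
        }'
}

# Canned output; IPs are placeholders from the documentation range
sample='traceroute to 192.0.2.50 (192.0.2.50), 30 hops max, 60 byte packets
 1  192.0.2.1  0.554 ms
 2  192.0.2.2  0.667 ms
 3  192.0.2.3  1.026 ms
 4  192.0.2.4  1.218 ms
 5  192.0.2.5  1.488 ms'

# Sum hops 3 through 5: 1.026 + 1.218 + 1.488
printf '%s\n' "$sample" | sum_hop_range 192.0.2.3 192.0.2.5   # prints 3.732
```

Matching on the whole IP field (rather than grepping for a substring) avoids picking up 192.0.2.30 when you asked for 192.0.2.3.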

Nagios – Host.cfg
define host{
        use                     debian5-linuxserver
        host_name               Google WWW server
        alias                   For Tracing TimeHop Distances
        address                 <final-dest>
        }

Nagios – commands.cfg
define command{
        command_name    check_time_between_hosts
        command_line    $USER1$/trace-time $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
        }

Nagios – services.cfg
define service{
        use                     debian5-linuxservice
        host_name               Google WWW server
        service_description     Between IP <startip> to <endip>
        action_url              /nagios/pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
        check_command           check_time_between_hosts!<startip>!<endip>!10!20
        }
* Note: the templates debian5-linuxservice and debian5-linuxserver are not defaults; you need to define them first or use the default templates instead.
Now, just restart Nagios to make it work.
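
Before restarting, it is worth letting Nagios check the new definitions first. The sketch below wraps that in a small function. The paths are assumptions based on a source install under /usr/local/nagios (matching the libexec path used above) and a Debian-style init script, and the function name verify_then_restart is made up for illustration; adjust everything to your own layout.

```shell
#!/bin/bash
# ASSUMED paths for a source install; override them via the environment if yours differ
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Hypothetical helper: verify the configuration and restart only if it passes
verify_then_restart() {
    if [ ! -x "$NAGIOS_BIN" ]; then
        echo "nagios binary not found at $NAGIOS_BIN" >&2
        return 3
    fi
    # -v makes Nagios verify the configuration and exit without starting
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" restart
    else
        echo "configuration errors found, not restarting" >&2
        return 1
    fi
}
```

Run verify_then_restart as root once the three config files are in place; a typo in a define block then shows up in the verification output instead of taking the monitoring down.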

More info
In order to find the hops you wish to monitor, simply do a traceroute:
# traceroute -n -q 1 <final-dest>
-n = Numeric output
-q 1 = Only do a single query
In this example, I am tracing to one of Google’s servers; the output of the trace looks like this (NOTE: actual IPs have been changed)
1  0.554 ms
2  0.667 ms
3  1.026 ms
4  1.218 ms
5  1.488 ms
6  1.627 ms
7  1.542 ms
8  2.322 ms
9  3.075 ms
10  2.801 ms

So let's say you wish to trace the time between IP <startip> and IP 113.23.161.66; simply run the plugin with these values on the CLI (to test):
# ./trace-time <final-dest> <startip> 113.23.161.66 10 20
And the output will look like this:
OK(5.909): Time OK|'>'=5.909;10;20;;
* This is the typical output format expected by Nagios with PNP graphing enabled
Graphs will look like this: [figure: PNP graph of the measured time between the two hops]
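
For reference, the part after the | is what the grapher reads: a quoted label, then value;warning;critical, with two trailing empty fields for min and max. Here is a small sketch that pulls one output line apart, in case you want to sanity-check what the plugin emits. The function name parse_plugin_output is made up for illustration, and it assumes a single perfdata entry per line, which is all this plugin produces.

```shell
#!/bin/bash
# Hypothetical helper: split one line of plugin output into text and perfdata fields
parse_plugin_output() {
    local line=$1
    local text=${line%%|*}      # everything before the first "|"
    local perf=${line#*|}       # everything after it
    local label=${perf%%=*}     # the label, quotes included
    local rest=${perf#*=}       # value;warn;crit;min;max
    local value warn crit min max
    IFS=';' read -r value warn crit min max <<< "$rest"
    printf 'text=%s\nlabel=%s\nvalue=%s\nwarn=%s\ncrit=%s\n' \
        "$text" "$label" "$value" "$warn" "$crit"
}

# The sample output line from above
parse_plugin_output "OK(5.909): Time OK|'>'=5.909;10;20;;"
```

If value, warn, and crit come out as 5.909, 10, and 20, the line is in the shape PNP expects.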