Service availability monitoring with Nagios and BPI

Several times, senior management have asked Team Wireless to provide an uptime figure for eduroam. While we do have an awful lot of monitoring of systems and services, it has never been possible to give a single uptime figure because it needs some detailed knowledge to make sense of the many Nagios checks (currently 2704 of them).

From the point of view of a Bristol user on campus here, there are three services that must be up for eduroam to work: RADIUS authentication, DNS, and DHCP. For the purposes of resilience, the RADIUS service for eduroam is provided by 3 servers, DNS by 2 servers and DHCP by 2 servers. It’s hard to see the overall state of the eduroam service from a glance at which systems and services are currently up in Nagios.

Nagios gives us detailed performance monitoring and graphing for each system and service but has no built-in aggregation tools. I decided to use an addon called Business Process Intelligence (BPI) to do the aggregation. We built this as an RPM for easy deployment, and configured it with Puppet.

BPI lets you define meta-services which consist of other services that are currently in Nagios. I defined a BPI service called RADIUS which contains all three RADIUS servers. Any one RADIUS server must be up for the RADIUS group to be up. I did likewise for DNS and DHCP.

BPI also lets meta-services depend on other groups. To consider eduroam to be up, you need the RADIUS group and the DNS group and the DHCP group to be up. It’s probably easier to see what’s going on with a screenshot of the BPI control panel:

BPI control panel
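
For anyone who prefers code to screenshots, the rules described above can be sketched in a few lines of Python. This is only a model of the logic (any one member keeps a group up, and eduroam needs every group up). It is not BPI's configuration syntax or its implementation, and the host names are made up for illustration.

```python
# Illustrative model of the aggregation rules described in the text.
# NOT BPI's config format or code; all host names here are hypothetical.

def group_is_up(member_states):
    """A redundant group is up if at least one of its members is up."""
    return any(member_states.values())


def eduroam_is_up(groups):
    """eduroam is up only if every group it depends on is up."""
    return all(group_is_up(members) for members in groups.values())


# Hypothetical snapshot: two RADIUS servers and one DHCP server are down.
groups = {
    "RADIUS": {"radius1": True, "radius2": False, "radius3": False},
    "DNS": {"dns1": True, "dns2": True},
    "DHCP": {"dhcp1": False, "dhcp2": True},
}

if __name__ == "__main__":
    print("eduroam up:", eduroam_is_up(groups))
```

In that snapshot eduroam still counts as up, because each group has at least one working member. Take down both DNS servers and the whole service counts as down.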

So far, these BPI meta-services are only visible in the BPI control panel and not in the Nagios interface itself. The BPI project does, however, provide a Nagios plugin check_bpi which allows Nagios to monitor the state of BPI meta-services. As part of that, it will draw you a table of availability data.

eduroam uptime

So now we have a definitive uptime figure for the overall eduroam service. How many nines? An infinite number of them! 😉 (Also, I like the fact that “OK” is split into scheduled and unscheduled uptime…)

This availability report is still only visible to Nagios users though. It’s a few clicks deep in the web interface and provides a lot more information than is actually needed. We need a simpler way of obtaining this information.

So I wrote a script called nagios-report which runs on the same host as Nagios and generates custom availability reports with various options for output formatting. As an example:

$ sudo /usr/bin/nagios-report -h bpi -s eduroam -o uptime -v -d
Total uptime percentage for service eduroam during period lastmonth was 100.000%
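
The arithmetic behind a figure like that is simple. Here is a minimal Python sketch of it, assuming uptime is just time spent in the OK state divided by the length of the reporting period. This is not the nagios-report source, and the function names and the numbers are mine, for illustration only.

```python
# Sketch of the uptime arithmetic only; this is NOT the nagios-report script.
# Assumption: uptime % = seconds in OK state / seconds in the period * 100.

def uptime_percentage(ok_seconds, total_seconds):
    """Return the percentage of the period spent in the OK state."""
    if total_seconds <= 0:
        raise ValueError("reporting period must be longer than zero seconds")
    return 100.0 * ok_seconds / total_seconds


def format_report(service, period, ok_seconds, total_seconds):
    """Build a one-line summary in the same style as the output above."""
    pct = uptime_percentage(ok_seconds, total_seconds)
    return (f"Total uptime percentage for service {service} "
            f"during period {period} was {pct:.3f}%")


if __name__ == "__main__":
    month = 30 * 24 * 60 * 60  # a hypothetical 30-day month, in seconds
    print(format_report("eduroam", "lastmonth", month, month))
    print(format_report("eduroam", "lastmonth", month - 60, month))
```

With three decimal places, a single minute of downtime in a 30-day month is already enough to knock the figure off 100.000%.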

This can now be run as a cron job to automagically email availability reports to people. The one we were asked to provide is monthly, so this is our crontab entry to generate it on the first day of each month:

# Puppet Name: eduroam-availability
45 6 1 * * nagios-report -h bpi -s eduroam -t lastmonth -o uptime -v -d

It’s great that our work on resilience has paid off. Just last week (during the time covered by the eduroam uptime table) we experienced a temporary loss of about a third of our VMs, and yet users did not see a single second of downtime. That’s what we’re aiming for.

Making suexec work…

suexec is a useful way of getting Apache to run interactive magic (CGI scripts, PHP scripts etc) as a different user/group than the one that Apache itself is running as.

Most configuration guides tell you:

  • Add “SuexecUserGroup $OWNER $GROUP” to your apache config
  • Look in /var/log/httpd/suexec.log to see what’s going wrong

What they don’t tell you is that suexec makes some assumptions about where to find the things it will execute, and that the log location isn’t guaranteed to be consistent across distros (or even versions of the same distro). I’m setting this up on CentOS 7, so the examples below were produced in that environment.

You can get useful information about both of the above by running the following:

[myuser]$ sudo suexec -V
-D AP_DOC_ROOT="/var/www"
-D AP_GID_MIN=100
-D AP_HTTPD_USER="apache"
-D AP_LOG_SYSLOG
-D AP_SAFE_PATH="/usr/local/bin:/usr/bin:/bin"
-D AP_UID_MIN=500
-D AP_USERDIR_SUFFIX="public_html"
[myuser]$

AP_DOC_ROOT and AP_USERDIR_SUFFIX control which paths suexec will execute. In this case we’re restricted to running stuff that lives somewhere under “/var/www” or “/home/*/public_html”.

If you’ve got content elsewhere (for example, an application which expects to be installed under /usr/share/foo/cgi-bin) then it’s not sufficient to put a symlink from /var/www/foo/cgi-bin to /usr/share/foo/cgi-bin as suexec checks the actual location of the file, not where it was called from.

This is sensible, as it stops you putting a symlink in place which points at something nasty like /bin/sh.
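
To make the symlink point concrete, here’s a rough Python sketch of the idea: resolve the path to where the file really lives, then check that location against the document root. It’s my illustration of the principle, not how suexec itself is implemented, and the function name is made up.

```python
import os

# Illustration of the idea only; suexec's real checks are written in C and
# are stricter than this. The function name is hypothetical.

def is_under_docroot(path, doc_root="/var/www"):
    """Return True if the *resolved* location of path is inside doc_root."""
    real_path = os.path.realpath(path)      # follows any symlinks
    real_root = os.path.realpath(doc_root)
    return os.path.commonpath([real_path, real_root]) == real_root


if __name__ == "__main__":
    # If /var/www/foo/cgi-bin is a symlink to /usr/share/foo/cgi-bin, the
    # script resolves to somewhere outside /var/www and this prints False.
    print(is_under_docroot("/var/www/foo/cgi-bin/script.cgi"))
```

The important bit is that the check is made on the resolved path, so where the symlink itself sits makes no difference.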

AP_GID_MIN and AP_UID_MIN limit which users/groups suexec will run stuff as. In this case it won’t run anything with a GID < 100 or a UID < 500. This is sensible as it stops you running CGI scripts as privileged system users.
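
If you want to check a particular account against those limits, something like the following Python sketch would do it. It parses the settings out of the `suexec -V` text shown above and compares a UID/GID pair with the minimums. The function names are mine, the output is pasted in as a string rather than captured from the command, and this only covers the two minimum checks described here, not everything suexec does.

```python
import re

# The `suexec -V` output from above, pasted in as a string for illustration.
SUEXEC_V_OUTPUT = """\
-D AP_DOC_ROOT="/var/www"
-D AP_GID_MIN=100
-D AP_HTTPD_USER="apache"
-D AP_LOG_SYSLOG
-D AP_SAFE_PATH="/usr/local/bin:/usr/bin:/bin"
-D AP_UID_MIN=500
-D AP_USERDIR_SUFFIX="public_html"
"""


def parse_settings(text):
    """Turn '-D NAME=value' lines into a dict; bare flags map to True."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'\s*-D\s+(\w+)(?:=(.*))?$', line)
        if not match:
            continue
        name, value = match.groups()
        settings[name] = True if value is None else value.strip('"')
    return settings


def ids_allowed(uid, gid, settings):
    """True if uid and gid both meet the compiled-in minimums."""
    return (uid >= int(settings["AP_UID_MIN"])
            and gid >= int(settings["AP_GID_MIN"]))


if __name__ == "__main__":
    settings = parse_settings(SUEXEC_V_OUTPUT)
    print(ids_allowed(499, 100, settings))   # a UID just under the limit
    print(ids_allowed(1000, 1000, settings))
```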

The GID limit is probably not an issue, but the UID limit might cause wrinkles if you look after one of the 18 UoB users who have a centrally allocated unix UID that is under 500 (because they’ve been here since before that was a problem).[1]

AP_LOG_SYSLOG is a flag that says "send all log messages to syslog" – which is fine, and arguably an improvement over writing to a specific log file. It doesn’t immediately tell you where those messages end up, but I eventually found them in /var/log/secure… which seems a sensible place for them to end up.

Once you’ve got all that sorted, you’ll need to make SELinux happy. Thankfully, that’s dead easy and can be done by enabling the httpd_unified boolean. If you’re using the jfryman/selinux Puppet module, it’s as easy as:


selinux::boolean { 'httpd_unified': }

I think that’s all the bumps I’ve hit on this road so far, but if I find any more I’ll update this article.

-Paul
[1] but then again, if you look after them (or one of the 84 users whose UID is under 1000) you’re probably already used to finding odd things that don’t work!