Nagios – aggregating performance data


The Wireless team use Nagios to monitor our servers. As well as availability monitoring, we use pnp4nagios to collect and graph performance data. This works reasonably well for us, and we can easily draw graphs of everything from CPU temperature to how many queries/second our mariadb servers are handling.

However, the graphs are drawn on a per-host basis, which wasn’t a problem until now…

Like a lot of people at UoB, we’re migrating services to the f5 load balancers so that we can scale them out as we need to. Services which were previously single hosted are now fronted by several servers in a load balanced configuration.

It would be nice to be able to combine performance data from multiple nodes so we can get a picture of how many queries/second the entire pool is handling. As I’ve written about previously, this sort of information is very useful when it comes to capacity planning.

The f5 will tell us how many tcp/udp connections it’s handling for that pool, and the amount of traffic, but that’s not quite the same thing as the number of queries. Nagios has that information, it just can’t graph it easily.

I had a look around at a few nagios plugins that claimed to solve this problem. The best one I could find looked difficult to deploy without dragging in more dependencies than we wanted to maintain on a production box. It’s licence wasn’t particularly conducive to hacking it about to make it deployable in our environment, so I wrote my own from scratch.

It’s available from here:

The plugin works by scanning through the status.dat file on the nagios server itself, summarizing the checks/hosts which match. It then reports the sum (or average if that’s what you prefer) as perfdata for nagios to graph.

If you think it might be useful to you, please use it! If you spot something it doesn’t do (or doesn’t do as well as you like) we’re more than happy to accept pull requests or issues logged through github.