Service availability monitoring with Nagios and BPI

Several times, senior management have asked Team Wireless to provide an uptime figure for eduroam. While we do have an awful lot of monitoring of systems and services, it has never been possible to give a single uptime figure, because making sense of the many Nagios checks (currently 2704 of them) requires detailed knowledge of how they fit together.

From the point of view of a Bristol user on campus here, there are three services that must be up for eduroam to work: RADIUS authentication, DNS, and DHCP. For the purposes of resilience, the RADIUS service for eduroam is provided by 3 servers, DNS by 2 servers and DHCP by 2 servers. It’s hard to see the overall state of the eduroam service from a glance at which systems and services are currently up in Nagios.

Nagios gives us detailed performance monitoring and graphing for each system and service but has no built-in aggregation tools. I decided to use an addon called Business Process Intelligence (BPI) to do the aggregation. We built this as an RPM for easy deployment, and configured it with Puppet.

BPI lets you define meta-services which consist of other services that are currently in Nagios. I defined a BPI service called RADIUS which contains all three RADIUS servers. Any one RADIUS server must be up for the RADIUS group to be up. I did likewise for DNS and DHCP.

BPI also lets meta-services depend on other groups. To consider eduroam to be up, you need the RADIUS group and the DNS group and the DHCP group to be up. It’s probably easier to see what’s going on with a screenshot of the BPI control panel:

[Screenshot: BPI control panel]

So far, these BPI meta-services are only visible in the BPI control panel and not in the Nagios interface itself. The BPI project does, however, provide a Nagios plugin check_bpi which allows Nagios to monitor the state of BPI meta-services. As part of that, it will draw you a table of availability data.
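To surface a meta-service in the Nagios UI you define a command and a service that call check_bpi. Here’s a minimal sketch, assuming check_bpi is installed alongside your other plugins and takes the BPI group ID as its only argument (check the BPI documentation for the exact usage); the host and service names are placeholders:

define command {
  command_name check_bpi
  command_line $USER1$/check_bpi $ARG1$
}

define service {
  use                 generic-service
  host_name           nagios.example.com
  service_description BPI: eduroam
  check_command       check_bpi!eduroam
}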

[Screenshot: eduroam uptime]

So now we have a definitive uptime figure for the overall eduroam service. How many nines? An infinite number of them! 😉 (Also, I like the fact that “OK” is split into scheduled and unscheduled uptime…)

This availability report is still only visible to Nagios users though. It’s a few clicks deep in the web interface and provides a lot more information than is actually needed. We need a simpler way of obtaining this information.

So I wrote a script called nagios-report which runs on the same host as Nagios and generates custom availability reports with various options for output formatting. As an example:

$ sudo /usr/bin/nagios-report -h bpi -s eduroam -o uptime -v -d
Total uptime percentage for service eduroam during period lastmonth was 100.000%

This can now be run as a cron job to automagically email availability reports to people. The one we were asked to provide is monthly, so this is our crontab entry to generate it on the first day of each month:

# Puppet Name: eduroam-availability
45 6 1 * * nagios-report -h bpi -s eduroam -t lastmonth -o uptime -v -d

It’s great that our work on resilience has paid off. Just last week (during the time covered by the eduroam uptime table) we experienced a temporary loss of about a third of our VMs, and yet users did not see a single second of downtime. That’s what we’re aiming for.

Merging SELinux policies

We make extensive use of SELinux on all our systems. We manage SELinux config and policy with the jfryman/selinux Puppet module, which means we store SELinux policies in plain text .te format – the same format that audit2allow generates them in.

One of our SELinux policies that covers permissions for NRPE is a large file. When we generate new rules (e.g. for new Nagios plugins) with audit2allow it’s a tedious process to merge the new rules in by hand and mistakes are easy to make.

So I wrote semerge – a tool to merge SELinux policy files with the ability to mix and match stdin/stdout and reading/writing files.

This example accepts input from audit2allow and merges the new rules into an existing policy:

cat /var/log/audit/audit.log | audit2allow | semerge -i existingpolicy.pp -o existingpolicy.pp

And this example deduplicates and alphabetises an existing policy:

semerge -i existingpolicy.pp -o existingpolicy.pp

There are probably bugs so please do let me know if you find it useful and log an issue if you run into problems.

One year of ResNet Gitlab

Today, it has been one year since the first Merge Request (MR) was created and accepted on ResNet* Gitlab. During that time, about 250 working days, we have processed 462 MRs as part of our Puppet workflow. That’s almost two a day!

We introduced Git and Gitlab into our workflow to replace the ageing svn component which didn’t handle branching and merging well at all. Jumping to Git’s versatile branching model and more recently adding r10k into the mix has made it trivially easy to spin up ephemeral dev environments to work on features and fixes, and then to test and release them into the production environment safely.

We honestly can’t work out how on earth we used to cope without such a cool workflow.

Happy Birthday, ResNet Gitlab!

* 1990s ResNet brand for historical reasons only – this Gitlab installation is used mostly for managing eduroam and DNS. Maybe NetOps would have been a better name 🙂

Rethinking the wireless database architecture

The eduroam wireless network has a reliance on a database for the authorization and accounting parts of AAA (authentication, authorization and accounting – are you who you say you are, what access are you allowed, and what did you do while connected).

When we started dabbling with database-backed AAA in 2007 or so, we used a centrally-provided Oracle database. The volume of AAA traffic was low and high performance was not necessary. However (spoiler alert) demand for wireless connectivity grew, and within a few months we were placing more load on Oracle than it could handle. Query latency grew long enough that some wireless authentication requests would time out and fail.

First gen – MySQL (2007)

It was clear that we needed a dedicated database platform, and at the time that we asked, the DBAs were not able to provide a suitable platform. We went down the route of implementing our own. We decided to use MySQL as a low-complexity open source database server with a large community. The first iteration of the eduroam database hardware was a single second-hand server that was going spare. It had no resilience but was suitably snappy for our needs.

[Diagram: First gen database]

Second gen – MySQL MMM (2011)

Demand continued to grow but more crucially eduroam went from being a beta service that was “not to be relied upon” to being a core service that users routinely used for their teaching, learning and research. Clearly a cobbled-together solution was no longer fit for purpose, so we went about designing a new database platform.

The two key requirements were high query capacity, and high availability, i.e. resilience against the failure of an individual node. At the time, none of the open source database servers had proper clustering – only master-slave replication. We installed a clustering wrapper for MySQL called MMM (Multi-Master Replication Manager for MySQL). This gave us a resilient two-node cluster where either node could be queried for reads and one node was designated the “writer” at any one time. In the event of a node failure, the writer role would be automatically moved to the other node by the supervisor.

[Diagram: Second gen database]

Not only did this buy us resilience against hardware faults, for the first time it also allowed us to drop either node out of the cluster for patching and maintenance during the working day without affecting service for users.

The two-node MMM system served us well for several years, until the hardware came to its natural end of life. The size of the dataset had grown and exceeded the size of the servers’ memory (the 8GB that seemed generous in 2011 didn’t really go so far in 2015) meaning that some queries were quite slow. By this time, MMM had been discontinued so we set out to investigate other forms of clustering.

Third gen – MariaDB Galera (2015)

MySQL had been forked into MariaDB which was becoming the default open source database, replacing MySQL while retaining full compatibility. MariaDB came with an integrated clustering driver called Galera which was getting lots of attention online. Even the developer of MMM recommended using MariaDB Galera.

MariaDB Galera has no concept of “master” or “slave” – all the nodes are masters and are considered equal. Read and write queries can be sent to any of the nodes at will. For this reason, it is strongly recommended to have an odd number of nodes, so if a cluster has a conflict or goes split-brain, the nodes will vote on who is the “odd one out”. This node will then be forced to resync.

This approach lends itself naturally to load-balancing. After talking to Netcomms about the options, we placed all three MariaDB Galera nodes behind the F5 load balancer. This allows us to use one single IP address for the whole cluster, and the F5 will direct queries to the most appropriate backend node. We configured a probe so the F5 is aware of the state of the nodes, and will not direct queries to a node that is too busy, out of sync, or offline.
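The node state is easy to inspect by hand too, via the wsrep status variables that Galera exposes. A quick sketch (run on any node; our actual F5 probe may work differently):

# how many nodes this node can currently see in the cluster
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
# whether this node is fully in sync (should report 'Synced')
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"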

[Diagram: Third gen database]

Having three nodes that can be simultaneously queried gives us an unprecedented capacity which allows us to easily meet the demands of eduroam AAA today, with plenty of spare capacity for tomorrow. We are receiving more queries per second than ever before (240 per second, and we are currently in the summer vacation!).

We are required to keep eduroam accounting data for between 3 and 6 months – this means a large dataset. While disk is cheap these days and you can store an awful lot of data, you also need a lot of memory to hold the dataset twice over, for UPDATE operations which require duplicating a table in memory, making changes, merging the two copies back and syncing to disk. The new MariaDB Galera nodes have 192GB memory each while the size of the dataset is about 30GB. That should keep us going… for now.

Integrating Puppet and Git

Introduction

One of the main benefits of working with Git is the ability to split code into branches and work on those branches in parallel. By integrating your Git branches into your Puppet workflow, you can present each Git branch as a Puppet environment to separate your dev, test and prod environments in different code branches.

This guide assumes you already have a working Puppet master which uses directory environments, and a working Gitlab install. It assumes that your Puppet and Gitlab instances are on different servers. It’s recommended that you install Gitlab with Puppet as I described in a previous blog post.

Overview

The basic principle is to sync all Puppet code from Gitlab to Puppet, but it’s a little more complex than this. When code is pushed to any branch in Git, a post-receive hook is executed on the Gitlab server. This connects via ssh to the Puppet master and executes another script which checks out the code from the specific Git branch into a directory environment of the same name.

Deploy keys

First you need to create a deploy key for your Puppet Git repo. This grants the Puppet master read-only access to your code. Follow these instructions to create a deploy key. You’ll need to put the deploy key in the home directory of the puppet user on your Puppet master, which is usually /var/lib/puppet/.ssh/id_rsa.pub.

Puppet-sync

The puppet-sync script is the component on the Puppet master. This is available from Github and can be installed verbatim into any path on your system. I chose /usr/local/bin.

SSH keys

You’ll need to create a pair of SSH public/private keys to allow your Gitlab server to ssh to your Puppet server. This is a pretty standard thing to do, but instructions are available on the README for puppet-sync.

Post-receive hook

The post-receive hook must be installed on the Gitlab server. The source code is available from Github. You will need to modify the hook to point at the right locations. The following values are sensible defaults.

REPO="git@gitlab.yoursite.com:namespace/puppet.git"
DEPLOY="/etc/puppet/environments"
SSH_ARGS="-i /var/opt/gitlab/.ssh/git_id_rsa"
PUPPETMASTER="puppet@puppet.yoursite.com"
SYNC_COMMAND="/usr/local/bin/puppet-sync"
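For reference, the hook is roughly this shape. This is a simplified sketch rather than the real script (which is on Github), and it assumes puppet-sync accepts --branch, --repository and --deploy arguments – check the puppet-sync README for the exact interface:

#!/bin/bash
# (REPO, DEPLOY, SSH_ARGS, PUPPETMASTER and SYNC_COMMAND as above)

# Git feeds the hook one line per updated ref: <oldrev> <newrev> <refname>
while read oldrev newrev refname; do
  BRANCH=${refname#refs/heads/}   # e.g. refs/heads/production -> production
  ssh $SSH_ARGS "$PUPPETMASTER" \
    "$SYNC_COMMAND --branch \"$BRANCH\" --repository \"$REPO\" --deploy \"$DEPLOY\""
done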

If you are using the spuder/gitlab Puppet module to manage your Gitlab instance, installing a custom hook is as simple as doing:

gitlab::custom_hook { 'puppet-post-receive':
  namespace => 'namespace',
  project   => 'puppet',
  type      => 'post-receive',
  source    => 'puppet:///modules/site_gitlab/puppet-post-receive',
}

If you installed Gitlab by hand, you’ll need to manually install the hook into the hooks subdirectory in your raw Git repo. On our Gitlab installation, this is in /var/opt/gitlab/git-data/repositories/group/puppet.git/hooks

Workflow

Now that all the components are in place, you can start to use it (once you’ve tested it). Gitlab’s default branch is called master, whereas Puppet’s default environment is production. You need to create a new branch in Gitlab called production, set it as the default branch, and delete master.
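From a working copy, that branch shuffle looks something like this (switch the default branch in the Gitlab project settings between the two pushes):

git checkout -b production master    # create production from the current master
git push origin production           # push the new branch to Gitlab
# set production as the default branch in the Gitlab UI, then:
git push origin :master              # delete the old master branch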

  1. Make new feature branches on the Gitlab server using the UI
  2. Use git pull and git checkout to fetch these branches onto your working copy
  3. Make changes, commit them, and when you push them, those changes are synced to a Puppet environment with the same name as the branch
  4. Move a dev VM into the new environment by doing puppet agent -t --environment newfeature
  5. When you’re happy, send a merge request to merge your branch back into production. When the merge request is accepted, the code will automatically be pushed to the Puppet server.

Read about this in more detail: Gitlab Flow.

Packaging mod_auth_cas for CentOS 7

Mark wrote a useful post about building mod_auth_cas for CentOS 7. It works, but I prefer to build RPM packages on a build server and deploy them to production servers, rather than building on production servers.

Basically I took the spec file from the mod_auth_cas source package for CentOS 6 from the EPEL 6 repository, tweaked it, and replaced the source tarball with the forked copy Mark recommended. This built cleanly for CentOS 7.

I’ve sent a pull request back to the upstream project with the Red Hat build files and documentation, and for those who are keen, here are the EL7 RPM and source RPM:

Update

There’s an even better way of building this for CentOS 7. Fedora 21 includes Apache 2.4 and mod_auth_cas and it is really easy to backport this source package.

First grab the latest version of the source package from the Fedora mirror onto your CentOS 7 build server. Always make sure there isn’t a newer version available in the updates repo.
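If your build server is fairly minimal you’ll also want the build dependencies in place before rebuilding – something like this (assuming yum-utils is installed to provide yum-builddep; the srpm filename is the one used below):

sudo yum install rpm-build yum-utils
sudo yum-builddep mod_auth_cas-1.0.8.1-11.fc22.src.rpm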

Rebuilding is as simple as issuing:

rpmbuild --rebuild mod_auth_cas-1.0.8.1-11.fc22.src.rpm

It will spit out an RPM file suitable for deployment on CentOS 7. Add --sign if you routinely sign your packages with an RPM-GPG key.

Using Puppet to deploy code from Git

I’ve revisited the way that we at ResNet deploy our web applications to web servers. We decided to store the application code in a Git repository. As part of our release process, we create a tag in Gitlab.

Rather than check the code out manually, we are using a Forge module called puppetlabs/vcsrepo to clone a tagged release and deploy it. Our app repos do not permit anonymous cloning, so the Puppet deployment mechanism must be able to authenticate. I found the documentation for puppetlabs/vcsrepo to be a bit lacking and had to spend a while figuring out how to make it work properly.

I recommend you generate a separate SSH key for each app you want to deploy. I generated my key with ssh-keygen and added it to Gitlab as a deploy key which has read-only access to the repo – no need to make a phantom user.
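Generating the key is a one-liner; something like this (the filename and comment are just examples, chosen to match the Puppet code below):

ssh-keygen -t rsa -b 4096 -N '' -C 'app deploy key' -f id_rsa

The public half (id_rsa.pub) is what you paste into Gitlab as the deploy key; both halves then go into the Puppet module’s files directory so they can be deployed to the web server as shown below.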

Here’s a worked example with some extra detail about how to deploy an app from git:

# Define docroot
$docroot = '/var/www/app'

# Deploy SSH key to authenticate git
file { '/etc/pki/id_rsa':
  source => 'puppet:///modules/app/id_rsa',
  owner  => 'root',
  group  => 'root',
  mode   => '0600',
}
file { '/etc/pki/id_rsa.pub':
  source => 'puppet:///modules/app/id_rsa.pub',
  owner  => 'root',
  group  => 'root',
  mode   => '0644',
}

# Clone the app from git
vcsrepo { 'app':
  ensure   => present,
  path     => $docroot,
  provider => git,
  source   => 'git@gitlab.resnet.bris.ac.uk:resnet/app.git',
  identity => '/etc/pki/id_rsa', # matches the key deployed above
  revision => '14.0.01',
  owner    => 'apache', # User the local clone will be created as
  group    => 'apache',
  require  => File['/etc/pki/id_rsa', '/etc/pki/id_rsa.pub'],
}

# Configure Apache vhost
apache::vhost { 'app':
  servername    => 'app.resnet.bris.ac.uk',
  docroot       => $docroot,
  require       => Vcsrepo['app'],
  docroot_owner => 'apache',  # Must be the same as 'owner' above
  docroot_group => 'apache',
  ...
}

To deploy a new version of the app, you just need to create a new tagged release of the app in Git and update the revision parameter in your Puppet code. This also gives you easy rollback if you deploy a broken version of your app. But you’d never do that, right? 😉

Publish a Module on Puppet Forge

I’ve started publishing as many of my Puppet modules as possible on Puppet Forge. It isn’t hard to do but there are a few things to know. This guide is largely based on Puppetlabs’ own guide Publishing Modules on the Puppet Forge.

  1. For home-grown modules that have grown organically, you are likely to have at least some site-specific data mixed in with the code. Before publishing, you’ll need to abstract this out. I recommend using parametrised classes with sane defaults for your inputs. If necessary, you can have a local wrapper class to pass site-specific values into your module.
  2. The vast majority of Puppet modules are on GitHub, but this isn’t actually a requirement. GitHub offers public collaboration and issue tracking, but you can keep your code wherever you like.
  3. Before you can publish, you need to include some metadata with your module. Look at the output of puppet module generate. If you’re starting from scratch, this command is an excellent place to start. If you’re patching up an old module for publication, run it in a different location and selectively copy the useful files into your module. The mandatory files are metadata.json and README.md (a minimal metadata.json example follows this list).
  4. When you’re ready to publish, run puppet module build. This creates a tarball of your module and metadata which is ready to upload to Puppet Forge.
  5. Create an account on Puppet Forge and upload your tarball. It will automatically fill in the metadata.
  6. Install your module on your Puppetmaster by doing puppet module install myname/mymodule
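For reference, a minimal metadata.json looks something like this (names, versions and URLs are placeholders – adjust to suit your module):

{
  "name": "myname-mymodule",
  "version": "0.1.0",
  "author": "myname",
  "summary": "One-line description of what the module manages",
  "license": "Apache-2.0",
  "source": "https://github.com/myname/mymodule",
  "dependencies": [
    { "name": "puppetlabs-stdlib", "version_requirement": ">= 4.0.0" }
  ]
}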

Building a Gitlab server with Puppet

GitHub is an excellent tool for code-sharing, but it has the major disadvantage of being fully public. You probably don’t want to put your confidential stuff and shared secrets in there! You can pay for private repositories, but the issue still stands that we shouldn’t be putting confidential UoB things in a non-approved cloud provider.

I briefly investigated several self-hosted pointy-clicky Git interfaces, including Gitorious, Gitolite, GitLab, Phabricator and Stash. They all have their relative merits, but they all seem to be a total pain to install and run in a production environment, often requiring that we randomly git clone something into the webroot and then not providing a sane upgrade mechanism. Many of them also have dependencies on modules not included with the enterprise Linux distributions.

In the end, the easiest-to-deploy option seemed to be to use the GitLab Omnibus installer. This bundles the GitLab application with all its dependencies in a single RPM for ease of deployment. There’s also a Puppet Forge module called spuder/gitlab which makes it nice and easy to install on a Puppet-managed node.

After fiddling, my final solution invokes the Forge module like this:

class { 'gitlab':
  puppet_manage_config          => true,
  puppet_manage_backups         => true,
  puppet_manage_packages        => false,
  gitlab_branch                 => '7.4.3',
  external_url                  => "https://${::fqdn}",
  ssl_certificate               => '/etc/gitlab/ssl/gitlab.crt',
  ssl_certificate_key           => '/etc/gitlab/ssl/gitlab.key',
  redirect_http_to_https        => true,
  backup_keep_time              => 5184000, # 5184000 = 60 days
  gitlab_default_projects_limit => 100,
  gitlab_download_link          => 'https://downloads-packages.s3.amazonaws.com/centos-6.5/gitlab-7.4.3_omnibus.5.1.0.ci-1.el6.x86_64.rpm',
  gitlab_email_from             => 'gitlab@example.com',
  ldap_enabled                  => true,
  ldap_host                     => 'ldap.example.com',
  ldap_base                     => 'CN=Users,DC=example,DC=com',
  ldap_port                     => '636',
  ldap_uid                      => 'uid',
  ldap_method                   => 'ssl',
  ldap_bind_dn                  => 'uid=ldapuser,ou=system,dc=example,dc=com',
  ldap_password                 => '*********',
}

I also added a couple of resources to install the certificates and create a firewall exception, to make a complete working deployment.
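Those extra resources are nothing exotic. A sketch of roughly what they look like – the file sources are placeholders, and the firewall rule assumes the puppetlabs/firewall module:

file { '/etc/gitlab/ssl/gitlab.crt':
  source => 'puppet:///modules/site_gitlab/gitlab.crt',
  owner  => 'root',
  group  => 'root',
  mode   => '0644',
}

file { '/etc/gitlab/ssl/gitlab.key':
  source => 'puppet:///modules/site_gitlab/gitlab.key',
  owner  => 'root',
  group  => 'root',
  mode   => '0600',
}

firewall { '100 allow gitlab https':
  proto  => 'tcp',
  dport  => 443,
  action => 'accept',
}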

The upgrade path requires manual intervention, but is mostly automatic. You just need to change gitlab_download_link to point to a newer RPM and change gitlab_branch to match.

If anyone is interested, I’d be happy to write something about the experience of using GitLab after a while, when I’ve found out some of the quirks.

Update by DaveG! (in lieu of comments currently on this site)

Gitlab have changed their install process to require use of their repo, so this module doesn’t like it very much. They’ve also changed the package name to ‘gitlab-ce’ rather than just ‘gitlab’.

To work around this I needed to:

  • Add name => 'gitlab-ce' to the package { 'gitlab': ... } params in gitlab/manifests/install.pp
  • Find the package RPM for a new shiny version of Gitlab. 7.11.4 in this case, via https://packages.gitlab.com/gitlab/gitlab-ce?filter=rpms
  • Copy the RPM to a local web-accessible location as a mirror, and use this as the location for the gitlab_download_link class parameter

This seems to have allowed it to work fine!
(Caveat: I had some strange behaviour with whether it would run the gitlab instance correctly, but I’m not sure if that’s because of left-overs from a previous install attempt. Needs more testing!)

Nagios alerts via Google Talk

Every server built by my team in the last 3 years gets automatically added to our Nagios monitoring. We monitor the heck out of everything, and Nagios is configured to send us an alert email when something needs our attention.

Recently we experienced an issue with some servers which are used to authenticate a large volume of users on a high-profile service. Nagios spotted the issue and sent us an email – which was then delayed and didn’t actually hit our inboxes until 1.5 hours after the fault developed. That’s 1.5 hours of downtime on a service, in the middle of the day, because our alerting process failed us.

This is clearly Not Good Enough(tm) so we’re stepping it up a notch.

With help from David Gardner we’ve added Google Talk alerts to our Nagios instance. Now when something goes wrong, as well as getting an email about it, my instant messenger client goes “bong!” and starts flashing. Win!

If you squint at it in the right way, on a good day, Google Talk is pretty much Jabber or XMPP and that appears to be fairly easy to automate.

To do this, you need three things:

  1. An account for Nagios to send the alerts from
  2. A script which sends the alerts
  3. Some Nagios config to tie it all together

XMPP Account

We created a free gmail.com account for this step. If you prefer, you could use another provider such as jabber.org. I would strongly recommend you don’t re-use an account you use for email or anything else. Keep it self-contained (and certainly don’t use your UoB gmail account!)

Once you’ve created that account, sign in to it and send an invite to the accounts you want the alerts to be sent to. This establishes a trust relationship between the accounts so the messages get through.

The Script

This is a modified version of a perl script David Gardner helped us out with. It uses the Net::XMPP perl module to send the messages. I prefer to install my perl modules via yum or apt (so that they get updated like everything else) but you can install it from CPAN if you prefer. Whatever floats your boat. On CentOS it’s the perl-Net-XMPP package, and on Debian/Ubuntu it’s libnet-xmpp-perl
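For example:

# CentOS / Red Hat
yum install perl-Net-XMPP
# Debian / Ubuntu
apt-get install libnet-xmpp-perl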

#!/usr/bin/perl -T
use strict;
use warnings;
use Net::XMPP;
use Getopt::Std;

my %opts;
getopts('f:p:r:t:m:', \%opts);

my $from = $opts{f} or usage();
my $password = $opts{p} or usage();
my $resource = $opts{r} || "nagios";
my $recipients = $opts{t} or usage();
my $message = $opts{m} or usage();

unless ($from =~ m/(.*)@(.*)/gi) {
  usage();
}
my ($username, $componentname) = ($1,$2);

my $conn = Net::XMPP::Client->new;
my $status = $conn->Connect(
  hostname => 'talk.google.com',
  port => 443,
  componentname => $componentname,
  connectiontype => 'http',
  tls => 0,
  ssl => 1,
);

die "Connection failed: $!" unless defined $status;

# Point the XMPP stream at the account's own domain rather than talk.google.com
my $sid = $conn->{SESSION}->{id};
$conn->{STREAM}->{SIDS}->{$sid}->{hostname} = $componentname;
my ( $res, $msg ) = $conn->AuthSend(
  username => $username,
  password => $password,
  resource => $resource, # client name
);
die "Auth failed ", defined $msg ? $msg : '', " $!"
unless defined $res and $res eq 'ok';

foreach my $recipient (split(',', $recipients)) {
  $conn->MessageSend(
    to => $recipient,
    resource => $resource,
    subject => 'message via ' . $resource,
    type => 'chat',
    body => $message,
  );
}

sub usage {
print qq{
$0 - Usage

-f "from account" (eg nagios\@myhost.org)
-p password
-r resource (default is "nagios")
-t "to account" (comma separated list or people to send the message to)
-m "message"
};
exit;
}

The Nagios Config

You need to install the script in a place where Nagios can execute it. Usually the place where your Nagios check plugins are installed is fine – on our systems that’s /usr/lib64/nagios/plugins.
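Before wiring it into Nagios, it’s worth testing the script by hand from that directory – something like this, with the account details replaced by your own:

/usr/lib64/nagios/plugins/notify_by_xmpp -f nagios.alerts@gmail.com -p 'secret' -t you@gmail.com -m "test message from nagios"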

Now we need to make Nagios aware of this plugin by defining a command. In your command.cfg, add two blocks like this (one for service notifications, one for host notifications). Replace <username@gmail.com> with the Gmail account you registered, and <password> with the account password.

define command {
  command_name notify-by-xmpp
  command_line $USER1$/notify_by_xmpp -f <username@gmail.com> -p <password> -t $CONTACTEMAIL$ -m "$NOTIFICATIONTYPE$ $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$ $LONGDATETIME$"
}

define command {
  command_name host-notify-by-xmpp
  command_line $USER1$/notify_by_xmpp -f <username@gmail.com> -p <password> -t $CONTACTEMAIL$ -m "Host $HOSTALIAS$ is $HOSTSTATE$ - Info"
}

This command uses the Nagios user’s registered email address for XMPP as well as email. For this to work you will probably have to use the short (non-aliased) version of the email address, i.e. ab12345@bristol.ac.uk rather than alpha.beta@bristol.ac.uk. Change this in contact.cfg if necessary.

With the script in place and registered with Nagios, we just need to tell Nagios to use it. You can enable it per-user or just enable it for everyone. In this example we’ll enable it for everyone.

Look in templates.cfg and edit the generic contact template to look like this. The relevant changes are the xmpp commands added to the two notification command lines.

define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-service-by-email, notify-by-xmpp
        host_notification_commands      notify-host-by-email, host-notify-by-xmpp
        register                        0
}

Restart Nagios for the changes to take effect. Now break something, and wait for your IM client to receive the notification.