Rocks and /install directory on compute nodes

I noticed recently that some of our compute nodes were getting a bit short on disk space, as we have a fairly large set of (multiple versions of) large applications installed into /opt on each node rather than sharing those applications over NFS via /share/apps (which is apparently the usual Rocks way).

In looking at this I noticed that the /install directory on each compute node contained a complete copy of all packages installed during rebuild! In our case the /install directory was using about 35GB, mostly because of multiple versions of Matlab and Mathematica being installed (which are up to 10GB each now…).

Anyhow, to avoid this you can convert from using the <packages> XML tags in your extend-compute.xml file to using a post-install script (the <post> section) which calls yum install explicitly for the packages you want to install.
Be sure to also run yum clean packages regularly in the script, otherwise you’re just moving the packages from /install into the yum package cache in /var/cache/yum !

e.g. Convert this:


..into this:

yum install -y opt-matlab-R2016a
yum clean packages

This has allowed us to continue installing packages locally until we use up that extra 35GB 🙂

Users with UIDs < 1000 : workarounds and how RHEL 7.2 breaks some of these hacks

When RHEL/CentOS 7.2 was released there was a change in PAM configs which authconfig generates.
For most people this won’t have made any difference, but if you occasionally use entries in /etc/passwd to override user information from other sources (e.g. NIS, LDAP) then this can bite you.

The RHEL bug here shows the difference and discussion around it, which can be summarised in the following small change.

In CentOS 7.1 you see this line in the PAM configs:

auth sufficient nullok try_first_pass

…whilst in 7.2 it changes to this:

auth [success=done ignore=ignore default=die] nullok try_first_pass

The difference here is that the `pam_unix` entry essentially changes from (in PAM terms) “sufficient” to “required”, and any failure there means that authentication is denied.

“Well, my users are in LDAP so this won’t affect me!”

Probably, but if you happen to add an entry to /etc/passwd to override something for a user, such as their shell, home directory or (in this case) their UID (yes, hacky I know, but old NIS habits die hard…):

testuser:*:12345:12345:Test User:/home/testuser:/bin/bash

..then this means that your user is defined as being a target for the pam_unix module (since it’s a local user defined in the local passwd file), and from 7.2 you hit that modified pam_unix line and get auth failures. In 7.1 you’d get entries in the logs saying that pam_unix denied access but it would continue on through the subsequent possibilities (pam_ldap, pam_sss or whatever else you have in there) and check the password against those.

The bug referenced above suggests a workaround of using authconfig --enablenis, as this happens to set the pam_unix line back to the old version, but that has a bunch of other unwanted effects (like enabling NIS in /etc/nsswitch.conf).

Obviously the real fix for our particular case is to not change the UID (which was a terrible hack anyway) but to reduce the UID_MIN used in /etc/logins.def to below the minimum UID required, and hope that there aren’t any clashes between users in LDAP and users which have already been created by packages being added (possibly ages ago…).

Hopefully this saves someone else some trouble with this surprise change in behaviour when upgrades are applied to existing machines and authconfig runs!

Some additional notes:

  • This won’t change until you next run authconfig, which in this case was loooong after the 7.2 update…
  • Not recommended : Putting pam_sss or pam_ldap before pam_unix as a workaround for the pam_unix failures in 7.2. Terrible problems happen with trying to use local users (including root!) if the network goes away.
  • Adding the UID_MIN change as early as possible is a good idea, so in your bootstrap process would be sensible, to avoid package-created users getting added with UIDs near to 1000.
  • These Puppet modules were used in creation of this blog post:
    • joshbeard/login_defs 0.2.0
    • sgnl05/sssd 0.2.1

Merging SELinux policies

We make extensive use of SELinux on all our systems. We manage SELinux config and policy with the jfryman/selinux Puppet module, which means we store SELinux policies in plain text .te format – the same format that audit2allow generates them in.

One of our SELinux policies that covers permissions for NRPE is a large file. When we generate new rules (e.g. for new Nagios plugins) with audit2allow it’s a tedious process to merge the new rules in by hand and mistakes are easy to make.

So I wrote semerge – a tool to merge SELinux policy files with the ability to mix and match stdin/stdout and reading/writing files.

This example accepts input from audit2allow and merges the new rules into an existing policy:

cat /var/log/audit/audit.log | audit2allow | semerge -i existingpolicy.pp -o existingpolicy.pp

And this example deduplicates and alphabetises an existing policy:

semerge -i existingpolicy.pp -o existingpolicy.pp

There are probably bugs so please do let me know if you find it useful and log an issue if you run into problems.

DNS Internals: delegating a subdomain to a server listening on a non-standard port

I’m writing this up because it took me quite some time to get my head around how to do this, and I found answers around the internet varying from “not possible” through to “try this” (which didn’t work) and “switch off this security feature you really like having” (no)

I found a way to make it happen, but it’s not easy. I’ll walk you through the problem, and how each way I attempted to solve it failed.

All the names below are hypotheticals, and for the sake of argument we’re trying to make “foo.subdomain.local” resolve via the additional server.

Suppose you have two DNS servers. One which we’ll call “NS1” and one which we’ll call “NS-NEW”.

  • NS1 is a recursive server running bind, which all your clients point at to get their DNS information. It’s listening on port 53 as standard.
  • NS-NEW is an authoritative server which is listening on a non-standard port (8600) and for these purposes it’s a black box, we can’t change its behaviour.

You want your clients to be able to resolve the names that NS-NEW is authoritative for, but you don’t want to have to reconfigure the clients. So NS1 needs to know to pass those queries on to NS-NEW to get an answer.

Attempt 1 – “slave zone”
My first thought was to configure NS1 to slave the zone from NS-NEW.

zone "subdomain.local" {
        type slave;
        file "/var/named/slave/";
        masters { $IP_OF_NS-NEW port 8600; };

This didn’t work for me because NS-NEW isn’t capable of doing zone transfers. Pity, as that would have been really neat and easy to manage!

Attempt 2 – “forward zone”
Then I tried forwarding queries from NS1 to NS-NEW, by using binds “forward zone” features.

zone "subdomain.local" {
        type forward;
        forward only;
        forwarders { $IP_OF_NS-NEW port 8600; };

This didn’t work because NS1 is configured to check for valid DNSSEC signatures. The root zone says that all its children are signed, and bind takes that to mean that all the grandchildren of the root should be signed as well.

The software running on NS-NEW isn’t capable of signing its zone information.

It doesn’t appear to be possible to selectively turn off DNSSEC checking on a per-zone basis, and I didn’t want to turn that off for our whole infrastructure as DNSSEC is generally a Good Thing.

Attempt 3 – “delegation”
I did think I could probably work around it by making NS1 authoritative for the “local.” top level domain, then using NS records in the zonefile for “local.” to directly delegate the zone to NS-NEW.

Something like this:

$TTL 86400	; default TTL for this zone
$ORIGIN local.
@       IN  SOA (
                     2016031766 ; serial number
                     28800      ; refresh
                     7200       ; retry
                     604800     ; expire
                     3600       ; minimum
        IN  NS

; delegated zones
subdomain  IN  NS

Unfortunately that doesn’t work either, as it’s not possible to specify a port number in an NS record, and NS-NEW isn’t listening on a standard port.

Attempt 3 – “a little of option 2 and a little of option 3”
Hold on to your hats, this gets a little self referential.

I made NS1 authoritative for “local.”

zone "local" {
        type master;
        file "/var/named/data/zone.local";

I configured NS records in the “local.” zone file, which point back at NS1

$TTL 86400	; default TTL for this zone
$ORIGIN local.
@       IN  SOA (
                     2016031766 ; serial number
                     28800      ; refresh
                     7200       ; retry
                     604800     ; expire
                     3600       ; minimum
        IN  NS

; delegated zones
subdomain  IN  NS

I then configured a “subdomain.local.” forward zone on NS1 which forwards queries on to NS-NEW

zone "subdomain.local" {
        type forward;
        forward only;
        forwarders { $IP_OF_NS-NEW port 8600; };

To understand why this works, you need to understand how the recursion process for a query like “foo.subdomain.local.” happens.

When the query comes in NS1 does this:
– do I already know the answer from a previously cached query? Let’s assume no for now.
– do I know which DNS server is responsible for “subdomain.local.” from a previously cached query? Lets assume no for now.
– do I know which DNS server is responsible for “local.” – ooh! Yes! That’s me!
– now I can look in the zone file for “local.” and look to see how I resolve “subdomain.local.” – there’s an NS record which says I should ask NS1 in an authoritative way.
– now I ask NS1 for an answer to “foo.subdomain.local.”
– NS1 can then forward my query off to NS-NEW and fetch an answer.

Because we haven’t had to go all the way up to the root to get our answer, we avoid encountering the DNSSEC issue for this zone.

Did you really do it like *that*?
Yes and no.

The above is a simplified version of what I actually had to do, as our production equivalent of NS1 isn’t a single server – and I had to take account of our zone file management process, and all of that adds complexity which I don’t need to go into.

There are also a few extra hoops to jump through to make sure that the “local.” domain can only be accessed by clients on our network, and to make sure that our authoritative infrastructure doesn’t “leak” the “local.” zone to the outside world.

What would you have liked to have done?
If NS-NEW was able to listen on a standard port, I’d have used a straight delegation to do it.

If NS-NEW was able to sign it’s zone data with DNSSEC, I’d have used a simple forward zone to do it.

NS-NEW isn’t *quite* the black box I treated it as in this article, but the restriction about not being able to make it listen on port 53 is a real one.

The software running on NS-NEW does have a feature request in it’s issue tracker for DNSSEC, which I’ll watch with interest – as that would allow me to tidy up our config and might actually enable some other cool stuff further down the line…

Xen VM networking and tcpdump — checksum errors?

Whilst searching for reasons that a CentOS 6 samba gateway VM we run in ZD (as a “Fog VM” on a Xen hypervisor) was giving poor performance and seemingly dropping connections during long transfers, I found this sort of output from tcpdump -v host <IP_of_samba_client> on the samba server:

10:44:07.228431 IP (tos 0x0, ttl 64, id 64091, offset 0, flags [DF], proto TCP (6), length 91) > Flags [P.], cksum 0x3e38 (incorrect -> 0x67bc), seq 6920:6959, ack 4130, win 281, options [nop,nop,TS val 4043312256 ecr 1093389661], length 39SMB PACKET: SMBtconX (REPLY)

10:44:07.268589 IP (tos 0x0, ttl 60, id 18217, offset 0, flags [DF], proto TCP (6), length 52) > Flags [.], cksum 0x1303 (correct), ack 6959, win 219, options [nop,nop,TS val 1093389712 ecr 4043312256], length 0

Note that the cksum field is showing up as incorrect on all sent packets. Apparently this is normal when hardware checksumming is used, and we don’t see incorrect checksums at the other end of the connection (by the time this shows up on the client).

However, testing has shown that the reliability and performance of connections to the samba server on this VM is much greater when hardware checksumming is disabled with:

ethtool -K eth0 tx off

Perhaps our version of Xen on the hypervisor, or a combination of the version of Xen and versions of drivers on the hypervisor and/or client VM are causing networking problems?

My networking-fu is weak, so I couldn’t say more than what I have observed, even though this shouldn’t really make a difference…
(Please comment if you have info on why/what may be going on here!)

Ubuntu bug 31273 shows others with similar problems which were solved by disabling hardware checksum offloading.

Molly-guard for CentOS 7?

Since I was looking at this already and had a few things to investigate and fix in our systemd-using hosts, I checked how plausible it is to insert a molly-guard-like password prompt as part of the reboot/shutdown process on CentOS 7 (i.e. using systemd).

Problems encountered include:

  • Asking for a password from a service/unit in systemd — Use systemd-ask-password and needs some agent setup to reply to this correctly?
  • The reboot command always walls a message to all logged in users before it even runs the new reboot-molly unit, as it expects a reboot to happen. The argument --no-wall stops this but that requires a change to the reboot command. Hence back to the original problem of replacing packaged files/symlinks with RPM
  • The unit is a “systemd.special” unit, which means that it has some special behaviour and cannot be renamed. We can modify it, of course, by editing the file.
  • How do we get a systemd unit to run first and block anything later from running until it is complete? (In fact to abort the reboot but just for this time rather than being set as permanently failed. Reboot failing is a bit of a strange situation for it to be in…) The dependencies appear to work but the reboot target is quite keen on running other items from the dependency list — I’m more than likely doing something wrong here!

So for now this is shelved. It would be nice to have a solution though, so any hints from systemd experts are greatfully received!

(Note that CentOS 7 uses systemd 208, so new features in later versions which help won’t be available to us)

Capacity Planning for DNS

I’ve spent the last 6 months working on our DNS infrastructure, wrangling it into a more modern shape.

This is the first in a series of articles talking about some of the process we’ve been through and outlining some of the improvements we’ve made.

One of the exercises we try to go through when designing any new production infrastructure is capacity planning. There are four questions you need to be able to ask when you’re doing this:

  1. How much traffic do we need to handle today?
  2. How are we expecting traffic to grow?
  3. How much traffic can the infrastructure handle?
  4. How much headroom have we got?

We aim to be in a position where we can ask those four questions on a regular basis, and preferably get useful answers to them!

When it comes to DNS, the most useful metric would appear to be “queries/second” (which I’ll refer to as qps from here on in to save a load of typing!) and bind can give us that information fairly readily with it’s built in statistics gathering features.

With that in mind, lets look at those 4 questions.

1. How much traffic do we need to handle today?
The best way to get hold of that information is to collect the qps metrics from our DNS infrastructure and graph them.

This is quite a popular thing to do and most monitoring tools (eg nagios, munin or ganglia) have well worn solutions available, and for everything else there’s google

Unfortunately we weren’t able to collate these stats from the core of the legacy DNS infrastructure in a meaningful way (due to differences in bind version, lack of a sensible aggregation point etc)

Instead, we had to infer it from other sources that we can/do monitor, for example the caching resolvers we use for eduroam.

Our eduroam wireless network is used by over 30,000 client devices a week. We think this around 60% of the total devices on the network, so it’s a fairly good proxy for the whole university network.

We looked at what the eduroam resolvers were handling at peak time (revision season), doubled it and added a bit. Not a particularly scientific approach, but it’s likely to be over-generous which is no bad thing in this case!

That gave us a ballpark figure of “we need to handle around 4000qps”

2. How are we expecting traffic to grow?
We don’t really have long term trend information for the central DNS service due to the historical lack of monitoring.

Again inferring generalities from eduroam, the number of clients on the network goes up by 20-30% year on year (and has done since 2011) Taking 30% growth year on year as our growth rate, and expanding that over 5 years it looks like this:

dns growth

Or in 5 years time we think we’ll need around 15,000qps.

All estimates in this process being on the generous side, and due to the compound nature of the year-on-year growth calculations, that should be a significant overestimate.

It will certainly be an interesting figure to revisit in 5 years time!

3. How much traffic can the infrastructure handle?
To answer this one, we need some benchmarking tools. After a bit of research I settled on dnsperf. The mechanics of how to run dnsperf (and how to gather a realistic sample dataset) are best left for another time.

All tests were done against the pre-production infrastructure so as not to interfere with live traffic.

Lets look at the graphs we get out at the end.

The new infrastructure:

Interpreting this graph isn’t immediately obvious. The way dnsperf works is that it linearly scales the number of queries/second that it’s sending to your DNS server, and tracks how many responses it gets back per second.

So the red line is how many queries/second we’re testing against, and the green line is how the server is responding. Where the two lines diverge shows you where your infrastructure starts to struggle.

In this case, the new infrastructure appears to cope quite well with around 30,000qps – or about twice what we’re expecting to need in 5 years time. That’s with all (or rather, both!) the servers in the pool available, so do we still have n+1 redundancy?

A single node in the new infrastructure:

From this graph you can see we’re good up to around 14000qps, so we’re n+1 redundant for at least the next 3-4 years (the lifetime of the harware we’re using)

At present, we have 2 nodes in the pool, the implication from the two graphs is that it does indeed scale approximately linearly with the number of servers in the pool.

4. How much headroom have we got?
At this point, the answer to that looks like “plenty” and with the new infrastructure we should be able to scale out almost linearly by adding more servers to the pool.

Now that we know how much we can expect our infrastructure to handle, and how much it’s actually experiencing, we can make informed decisions about when we need add more resources in order to maintain at least n+1 redundancy.

What about the legacy infrastructure?
Well, the reason I’m writing this post today (rather than any other day) is that we retired the oldest of the servers in the legacy infrastructure today, and I wanted to fire dnsperf at it, after it’s stopped handling live traffic but before we switch it off completely!

So how many queries/second can a 2005 vintage Sun Fire V240 server cope with?


It seems the answer to that is “not really enough for 2015!”

No wonder its response times were atrocious…

F5 Big-IP, Apache logs and client IP

Due to the fact that the F5 Big-IP make use of SNAT to load balance traffic, your back-end node will see the traffic coming from the IP of the load balancer and not the true client.
To overcome this (for web traffic at least), the F5 injects the X-Forwarded-For header in to HTTP steams with the true clients IP.

In apache you may want to log this IP instead of the remote host if it has been set. Using the SetEnvIf, we can produce a suitable LogFormat line based on if the X-Forwarded-For header is set or not:

CustomLog "/path/o/log/dir/example.com_access.log" combined env=!forwarded
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy
SetEnvIf X-Forwarded-For "^.*\..*\..*\..*" forwarded
CustomLog "/path/to/log/dir/example.com_access.log" proxy env=forwarded

The above assumes that the “combined” LogFormat has already been defined.

If you use the ::apache::vhost puppet class from the puppetlabs/apache module, you can achieve the same result with the following parameters:

::apache::vhost { '':
  logroot => "/path/to/log/dir/",
  access_log_env_var => "!forwarded",
  custom_fragment => "LogFormat \"%{X-Forwarded-For}i %l %u %t \\\"%r\\\" %s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\"    proxy
  SetEnvIf X-Forwarded-For \"^.*\..*\..*\..*\" forwarded
  CustomLog \"/path/to/log/dir/${title}_access.log\" proxy env=forwarded"

Puppet future parser — what to expect that you’ll have to update in your manifests…

The Puppet Future Parser is the new implementation of the manifest parser which will become the default in 4.0, so I thought I’d take a look to see what I’d need to update.

Also, there are some fancy new features like iteration and that you can use [1,2] array notation or {a=>b} hash notation anywhere that you’d previously used a variable containing an array or hash.

The iteration and lambda features are intended to replace create_resources calls, as they are more flexible and can loop round repeatedly to create individual definitions.

For example, here’s a dumb “sudo” profile which uses the each construct to iterate over an array:

class profiles::sudo {
  # This is a particularly dumb version of use of sudo, to allow any commands:
  $admin_users = hiera_array('admin_users')
  # Additional users with special sudo rights, but no ssh access (e.g. root):
  $sudo_users  = hiera_array('sudo_users')

  class { ::sudo: }

  $all_sudo_users = concat($sudo_users, $admin_users)

  # Create a resource for each entry in the array:
  each($all_sudo_users) |$u| {
    sudo::entry { $u:
      comment  => "Allow ${u} to run anything as any user",
      username => $u,
      host     => 'ALL',
      as_user  => 'ALL',
      as_group => 'ALL',
      nopasswd => false,
      cmd      => 'ALL',

Making this work with create_resources and trying to splice in the the username for each user in the list into a hash looked like it would be messy, requiring at least an additional layer of define — this method is much neater.

This makes it much easier to create data abstractions over existing modules — you can programmatically massage the data you read from your hiera files and call definitions using that data in a much more flexible way than when passing hashes to create_resources. This “glue” can be separated into your roles and profiles (which could be the subject of another post but are described well in this blog post), creating a layer which separates the use of the module from the data which drives that use nicely.

So this all sounds pretty great, but there are a few changes you’ll possibly encounter when switching to the future parser:

  • Similar to the switch from puppet master to puppet server, the future parser is somewhat more strict about data formats. e.g. I found that my hiera data definitely needed to be properly quoted when I started using puppet server, so entries like mode : 644 in a file hash wouldn’t give the number you were expecting… (needs mode : 0644 or mode : '644' to avoid conversion from octal to decimal…). The future parser extends this to being more strict in your manifests, so a similarly-incorrect file { ... mode => 644 } declaration needs quoting or a leading zero. If you use puppet-lint you’ll catch this anyway — so use it! 🙂
  • It’s necessary to use {} instead of undef when setting default values for hiera_hash (and likewise [] instead of undef for hiera_array), to allow conditional expressions of the form if $var { ... } to work as intended. It seems that in terms of falseness for arrays and hashes that undef is in fact true… (could be a bug, as this page in the docs says: “When used as a boolean, undef is false”)
  • Dynamically-scoped variables (which are pretty mad and difficult to follow anyway, which is why most languages avoid them like the plague…) don’t pass between a class and any sub-classes which it creates. This is in the docs here, but it’s such a common pattern that it could well have made it through from your old (pre-Puppet 2.7) manifests and still have been working OK until the switch to the future parser. e.g.:
    class foo {
      $var = "x"
    class bar {
      include foo
      # $var isn't defined here, as dynamic scope rules don't allow it in Puppet >2.7

    Instead you need to explicitly qualify your variables to pull them out of the correct scope — $foo::var in this case. In your erb templates, as a common place where the dynamically-scoped variables might have ended up getting used, you can now use scope['::foo::var'] as a shorthand for the previously-longer scope.lookupvar('::foo::var') to explicitly qualify the lookup of variables. The actual scope rules for Puppet < 2.7 are somewhat more complicated and often led to confusing situations if you unintentionally used dynamic scoping, especially when combined with overriding variables from the parent scope…

  • I’m not sure that expressions of the form if "foo" in $arrayvar { ... } work how they should, but I’ve not had a chance to investigate this properly yet.

Most of these are technically the parser more strictly adhering to the specifications, but it’s easy to have accidentally had them creep into your manifests if you’re not being good and using puppet-lint and other tools to check them.

In conclusion : Start using the Future Parser soon! It adds excellent features for iteration which make abstracting data a whole lot easier than using the non-future (past?) parser allows. Suddenly the combination of roles, profiles and the iteration facilities in the future parser mean that abstraction using Puppet and hiera makes an awful lot more sense!

eduroam in Freshers’ Week: Some graphs

This year, eduroam is an interesting service. Not only has it been running on all-new Fog-based infrastructure since the mid-summer, but eduroam and ResNet Wireless have been merged to form one authenticated wireless network to rule them all with more users than ever before.

We’ve been watching how the servers are performing under load for the first time with great interest. Let’s have a look at some of the numbers. All of these graphs show the time from midnight on Saturday 21st September (UK freshers arrive) to midnight on Monday 30th September (end of weekend after freshers week).

First let’s have a look at the graph of RADIUS authentication requests – scaled to show Access-Request packets received per second (not necessarily authentications per second, as one authentication usually constitutes several Access-Requests).

The gentle swell of users on the weekend of the 28th/29th is mostly undergraduates using eduroam in residences. The taller spikes on weekdays is made up by University staff using eduroam on campus.

radius01 usually serves eduroam users on campus, at Bristol – both Bristol students and visitors from other eduroam institutions. radius02 usually serves Bristol users who are visiting other institutions, while radius03 usually authenticates any user authenticating at eduroam hotspots in Bristol City Council buildings, and certain local hospitals. However the servers have got each other’s backs, and will take on additional roles at will if there’s an outage.

There are a couple of large spikes on the graph of radius02 with matching notches on the graph of radius01. These are times when the servers decided to shuffle around and radius02 temporarily became the primary for campus users, which is by far the largest set of users.


Now for DNS. Pretty self explanatory – these graphs for the two DNS servers show the number of successful queries per second. At peak, times, we serve over 600 lookups per second.

dns5-success dns6-success

This graph shows the number of valid IPv4 DHCP leases currently in the lease file. We don’t currently graph IPv6 leases so there’s no way of knowing how many there are. Adding this is on my to-do list 🙂


Last but not least, this graph shows the number of queries per second for both nodes in the MySQL cluster that powers eduroam, ResNet and related services. It’s used for authentications, logging and infrastructure management. The two nodes are in a cluster and can rearrange themselves at will, but at any one time, there is one master and one slave. By default db1 is usually the master and handles mostly INSERTs and UPDATEs while db2 is usually the slave and handles only SELECTs.

db1-qps db2-qps