One year of ResNet Gitlab

Today, it has been one year since the first Merge Request (MR) was created and accepted on our ResNet* Gitlab. In that time, about 250 working days, we have processed 462 MRs as part of our Puppet workflow. That’s almost two a day!

We introduced Git and Gitlab into our workflow to replace the ageing svn component, which didn’t handle branching and merging well at all. Jumping to Git’s versatile branching model, and more recently adding r10k into the mix, has made it trivially easy to spin up ephemeral dev environments to work on features and fixes, then to test and release them into the production environment safely.
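For the curious, the branch-per-environment dance goes roughly like this (a minimal sketch, assuming a conventional r10k control repo where each git branch becomes a Puppet environment – the branch name here is made up):

$ git checkout -b my_feature                # new branch = new ephemeral environment
$ git push -u origin my_feature
$ r10k deploy environment my_feature -p     # deploy the branch and its Puppetfile modules
$ puppet agent -t --environment=my_feature  # test it against a dev box
# …review, merge into production, then redeploy:
$ r10k deploy environment production -p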

We honestly can’t work out how on earth we used to cope without such a cool workflow.

Happy Birthday, ResNet Gitlab!

* 1990s ResNet brand for historical reasons only – this Gitlab installation is used mostly for managing eduroam and DNS. Maybe NetOps would have been a better name 🙂

Interesting NFS4 failures on some CentOS 7 clients

We’ve had a couple of clients which use a combination of:

  • NFS4 (no encryption yet, just plain NFS4 with the idmapd domain set to bris.ac.uk)
  • automounted file servers on /net/<hostname>, with bind mounts to make these appear under /home/<username> (see the sketch below)
  • CentOS 7
  • kernel 3.10.0-229.20.1.el7.x86_64 (the newest available at the time)

…where, for some reason, they decided to lose the connection to the NFS server.
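For reference, the moving parts look something like this (a sketch only – hostnames and usernames are placeholders, and the exact autofs/fstab layout on the affected clients may differ):

# /etc/auto.master – automount any NFS server under /net/<hostname>
/net  -hosts

# /etc/fstab – bind mount the automounted export into /home
/net/fileserver1/users/fred  /home/fred  none  bind  0 0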

Normally you’d just restart a bunch of NFS daemons, ‘umount -l’ the stuck mounts and continue on your way. In this case, however, we got a kernel thread showing up in uninterruptible sleep with a name of [<IP-of-NFS-server>-ma]. (Square brackets because kernel threads show up in ps output in that format.)
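Something like this will show up any threads in uninterruptible sleep (state ‘D’ in ps) – a quick generic sketch, not NFS-specific:

$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'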

Looking at /proc/<PID>/stack for this thread showed that it was trying to recover a failed NFS mount, and was then waiting for an RPC response (stuck in the rpc_wait_bit_killable function, despite apparently not being killable).
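If you hit something similar, the kernel-side stack is simple enough to inspect (substitute the PID of the stuck thread):

$ sudo cat /proc/<PID>/stack
# in our case the top of the stack was rpc_wait_bit_killable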

Has anyone else seen this behaviour?

I’ve just had two servers with exactly the same install and kernel version do this, one at about 1am this morning and the other at about 1pm this afternoon.
A third machine with an identical install but a slightly older kernel version hasn’t hit this problem, so I’m erring toward it being a kernel NFS4 bug.

(We have rebooted both compute servers, as that seemed to be the only possible recovery method, and picked a previous kernel version for one of the hosts to see if the behaviour changes…)
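For what it’s worth, picking an older kernel on CentOS 7 goes roughly like this (a sketch – the menu index depends on which kernels are installed on your host):

# List the installed kernels, newest first (0, 1, 2, …)
$ sudo awk -F\' '/^menuentry/ {print i++": "$2}' /etc/grub2.cfg
# Default to the second entry (the previous kernel) and reboot
$ sudo grub2-set-default 1
$ sudo reboot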


Dell servers, warranty facts and refresh-mcollective-metadata

On our physical Dell servers we install the dell-omsa packages, which give us the ability to monitor the underlying hardware.

With that in place, you can use facter to report on all sorts of useful things about the hardware, including the state of the warranty.

The fact which checks warranty information uses dell-omsa to pull the service tag of the server and submits it to Dell’s API, which then returns info about the status of your warranty.

You can then use mcollective to report on this. This can be really useful if you can’t remember what you bought when!
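By way of illustration, checking it by hand looks something like this (a sketch – fact names as above, but the exact omreport output format may vary between OMSA versions):

# Service tag, as pulled from dell-omsa
$ sudo omreport chassis info | grep -i 'service tag'
# Evaluate the custom facts locally
$ facter -p warranty_start warranty_end warranty_days_left
# …or report across the whole fleet with mcollective
$ mco facts warranty_end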

Unfortunately, from time to time it breaks and we start getting cronjob output which looks like this:

/usr/libexec/mcollective/refresh-mcollective-metadata
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
Could not retrieve fact='warranty_start', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass

This happens just frequently enough to be a familiar problem for us, but not frequently enough for the fix to stick in my mind!

Googling for the error messages yields a couple of mailing list threads asking about this error and how to work around it – both started by my colleague Jonathan Gazeley the first time we hit the problem. [1]

There are no actual fixes in those threads, although one post did hint at the root cause being mcollective caching the result of the Dell API call – without actually stating where it gets cached.

So, it’s strace time!

sudo strace -e open /usr/libexec/mcollective/refresh-mcollective-metadata 2>&1 | less

Skip to the end and page back until you get to the bit where it starts complaining about the warranty fact, and you’ll find it’s trying to open /var/tmp/dell-warranty-XXXXXXX.json, where XXXXXXX is the service tag of the hardware.

...
open("/var/tmp/dell-warranty-XXXXXXX.json", O_RDONLY) = 3
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
...

In our most recent case, the contents of that file looked like this:

$ cat /var/tmp/dell-warranty-XXXXXXX.json 
{
  "GetAssetWarrantyResponse": {
    "GetAssetWarrantyResult": {
      "Response": null,
      "Faults": {
        "FaultException": {
          "Message": "The tag you sent is not present. Check your separator character and ensure it is |.",
          "Code": 4001
        }
      }
    }
  }
}

That looks a lot to me like the API call failed for some reason.

The fix is to remove that stale cache file and re-run the refresh-mcollective-metadata script.

$ sudo rm /var/tmp/dell-warranty-*.json
$ sudo /usr/libexec/mcollective/refresh-mcollective-metadata

Then inspect the cached file again. It should now contain a lot of warranty info.

If it doesn’t, well… then you need to start working out why, and that’s an exercise left for the reader!
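(If it bites you often enough, you could imagine cronning something like this to throw away any cached fault responses automatically – a hypothetical guard rather than something we actually run:)

# Remove any cached Dell API responses that contain a fault
$ grep -l FaultException /var/tmp/dell-warranty-*.json | xargs -r sudo rm -f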

-Paul
[1] https://groups.google.com/forum/#!msg/puppet-users/LsK3HbEBMGc/-DSIOMNCDzIJ

I freely admit that the intent behind this post is mostly about getting the “fix” into those Google search results – so I don’t have to resort to strace next time it happens!