25 years of Internet at University of Bristol

25 years ago today, the University of Bristol joined the Internet.

Well, that’s the headline – but it’s not entirely accurate. By 1991, the University had been connected to other universities around the UK for a while. JANET had been established in 1984 and by 1991 had gateways to ARPANET, so by the “small i” definition of internet we were already on the internet.

These days, when we talk about “the Internet” we’re mostly talking about the global TCP/IP network.

In 1991 JANET launched the JANET IP Service (JIPS) which signalled the changeover from Coloured Book software to TCP/IP within the UK academic network. [1]

On the 8th March 1991, the University of Bristol received its allocation of the block of public IPv4 address space which we’re still using today.

What follows is a copy of the confirmation email[2] we received from the NIC, the branch of the American Department of Defense which was responsible at the time for allocating address space. It confirms that the Class B network 137.222.0.0 had been assigned to us.


---------- Forwarded Message ----------
Date: 08 March 1991 12:46 -0800
From: HOSTMASTER@mil.ddn.nic
To: RICHARD.HOPKINS@uk.ac.bristol
Cc: hostmaster@mil.ddn.nic
Subject: BRISTOL-NET

Richard,

The new class and network number for BRISTOL-NET is:

Class B, #137.222.0.0

NIC Handle of technical POC is: RH438

The NIC handle is an internal record searching tool. If a new Technical
Point of Contact was registered with this application a new NIC handle
has been assigned. If the Technical POC was already registered at the
NIC but their handle was not provided in the application, it has been
listed here for your reference and for use in all future correspondence
with the NIC.

If you require the registration of any hosts or gateways on this
network in the DoD Internet Host Table maintained by the NIC, send the
names and network addresses of these hosts and gateways to
HOSTMASTER@NIC.DDN.MIL.

PLEASE NOTE: The DoD Internet Host Table has grown quite large and
is approaching the limits of manageability. The NIC strongly
discourages the registration of new hosts in the table except in
cases where interoperability with MILNET is essential.
At most, the NIC is prepared to accept no more than 10 initial
registrations from new networks. We encourage you to register any
new hosts or gateways with the domain name servers that will handle
the information your hosts.

It is suggested that host number zero in any network be reserved (not
used), and the host address of all ones (255 in class C networks) in any
network be used to indicate a broadcast datagram.

The association between addresses used in the particular network
hardware and the Internet addresses may be established and maintained by
any method you select. Use of the address resolution procedure
described in RFC 826 is encouraged.

Thanks again for your cooperation!
Linda Medina
-------
---------- End Forwarded Message ----------

So happy quarter-of-a-century-of-IPv4 everyone!

[1] Dates taken from Hobbes’ Internet Timeline http://www.zakon.org/robert/internet/timeline/
[2] The sharp-eyed amongst you will have noticed the format of the To: address that was in use at that time…

Fusion MPT SAS-2 / sas2ircu disk replacement

Just a quick tip for anyone confused about how to replace a failed disk on a Fusion MPT SAS-2 controller under Linux (shows up as 02:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03) via lspci).

The sas2ircu command-line tool is quite “light” on features, and it wasn’t at all obvious to me how to get a replacement disk to re-add to an array. There aren’t any options for replacing a disk in an array, and the server in question has a very minimal remote management console which doesn’t even mention storage at all…

The replacement disk showed up as “Ready (RDY)” in the output of the sas2ircu 0 DISPLAY command, but didn’t automatically replace the failed disk in the array and cause a rebuild.

The only available option for replacing the disk was to set it as a “hot spare” with:

sas2ircu 0 hotspare 2:10

— the disk in question was 2:10, as it was the tenth disk in what showed up as the second enclosure (for some reason, even though there’s only one!).

This gives a large warning about data loss or corruption, to which you must (after ensuring it’s the correct disk ID!) say YES. It then adds that disk as a hot spare and immediately uses it to rebuild the array containing the failed disk.
This adds it back into the array as though nothing had failed at all — which is what I wanted, but couldn’t see another way to do it!
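
For reference, the whole dance boils down to something like this (a sketch from memory – the exact fields in the DISPLAY output vary between firmware versions, and the enclosure:slot pair will obviously differ):

# Check the replacement disk shows up with a State of "Ready (RDY)"
sas2ircu 0 DISPLAY

# Add it as a hot spare (enclosure:slot); only answer YES to the
# warning once you're sure it's the right disk
sas2ircu 0 hotspare 2:10

# Watch the rebuild progress on the volume
sas2ircu 0 STATUS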

Fairly odd but easy to remember once you realise that there’s no other option with sas2ircu to allow you to replace a failed disk in an array! 🙂

(Maybe there are other tools which make this more obvious, but sas2ircu is the only one I had to hand)

One year of ResNet Gitlab

Today, it has been one year since the first Merge Request (MR) was created and accepted on ResNet* Gitlab. During that time, about 250 working days, we have processed 462 MRs as part of our Puppet workflow. That’s almost two a day!

We introduced Git and Gitlab into our workflow to replace the ageing svn component which didn’t handle branching and merging well at all. Jumping to Git’s versatile branching model and more recently adding r10k into the mix has made it trivially easy to spin up ephemeral dev environments to work on features and fixes, and then to test and release them into the production environment safely.
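
As a rough illustration of what that looks like day-to-day (the branch name is invented for the example, and the exact r10k invocation will depend on how your control repo is wired up):

# Spin up an ephemeral environment for a fix
git checkout -b fix_dhcp_options
# ...hack on the Puppet code, commit...
git push -u origin fix_dhcp_options

# On the Puppet master, r10k deploys a matching environment from the branch
r10k deploy environment fix_dhcp_options -p -v

# Test against that environment, raise an MR, merge, and deploy production
# from the main branch in the same way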

We honestly can’t work out how on earth we used to cope without such a cool workflow.

Happy Birthday, ResNet Gitlab!

* 1990s ResNet brand for historical reasons only – this Gitlab installation is used mostly for managing eduroam and DNS. Maybe NetOps would have been a better name 🙂

Interesting NFS4 failures on some CentOS 7 clients

We’ve had a couple of clients which use a combination of:

  • NFS4 (no encryption yet, just simple NFS4 with idmapd set as domain bris.ac.uk)
  • automounted file-servers on /net/<hostname>, with bind mounts to get these to appear in /home/<username>
  • CentOS 7
  • kernel 3.10.0-229.20.1.el7.x86_64 (newest at the moment)

…where for some reason they decided to lose the NFS server connection.

Normally you’d just restart a bunch of NFS daemons, along with a ‘umount -l’ of the stuck mounts and continue on your way. In this case, however, we got a kernel thread showing up in uninterruptible sleep
with a name of [<IP-of-NFS-server>-ma]. (Square brackets because kernel threads show up in ps output in that format)

Looking at /proc/<PID>/stack for this PID showed that it was trying to recover a failed NFS mount, and then was waiting for an RPC response
(stuck in the rpc_wait_bit_killable function, despite not being possible to kill).
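
For anyone hitting the same thing, this is roughly how we spotted it (standard tools, nothing NFS-specific):

# Look for tasks stuck in uninterruptible sleep (state D), including
# kernel threads such as the [<IP-of-NFS-server>-ma] one mentioned above
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Then see what the kernel is blocked on for a given PID
cat /proc/<PID>/stack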

Has anyone else seen this behaviour?

I’ve just had two servers with the exact same install and kernel version do this, one at about 1am this morning and then the other at about 1pm
this afternoon.
A third machine with an identical install but a slightly older kernel version hasn’t hit this problem, so I’m erring toward it being a kernel NFS4 bug.

(Have rebooted both compute servers as that was the only possible recovery method it seemed, and picked a previous kernel version for one of the hosts to see if the behaviour changes…)


Dell servers, warranty facts and refresh-mcollective-metadata

On our physical Dell servers we install the dell-omsa packages which give us the ability to monitor our underlying hardware.

With that in place, you can use facter to report on all sorts of useful things about the hardware, including the state of the warranty.

The fact which checks warranty information uses dell-omsa to pull the service tag of the server and submits it to Dell’s API – which then returns info about the status of your warranty.

You can then use mcollective to report on this. This can be really useful if you can’t remember what you bought when!
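
Something along these lines should do the trick (a sketch – the fact names are the ones our warranty fact provides, as seen in the errors below):

# Summarise warranty end dates across the fleet
mco facts warranty_end

# Or pick out hosts whose warranty expires within 90 days
mco find -F 'warranty_days_left<=90'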

Unfortunately, from time to time it breaks and we start getting cronjob output which looks like this:

/usr/libexec/mcollective/refresh-mcollective-metadata
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
Could not retrieve fact='warranty_start', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass

This happens just frequently enough to be a familiar problem for us, but not frequently enough for the fix to stick in my mind!

Googling for the error messages yields a couple of mailing list threads asking about this error and how to work around it – which were both started by my colleague Jonathan Gazeley the first time we hit the problem. [1]

There are no actual fixes in those threads, although one post did hint at the root cause being mcollective caching the result of the Dell API call – without actually stating where it gets cached.

So, it’s strace time!

sudo strace -e open /usr/libexec/mcollective/refresh-mcollective-metadata 2>&1 | less

Skip to the end, and page back until you get to the bit where it starts complaining about the warranty fact, and you find that it’s trying to open /var/tmp/dell-warranty-XXXXXXX.json where XXXXXXX is the service tag of the hardware.

...
open("/var/tmp/dell-warranty-XXXXXXX.json", O_RDONLY) = 3
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
...

In our most recent case, the contents of that file looked like this:

$ cat /var/tmp/dell-warranty-XXXXXXX.json 
{
  "GetAssetWarrantyResponse": {
    "GetAssetWarrantyResult": {
      "Response": null,
      "Faults": {
        "FaultException": {
          "Message": "The tag you sent is not present. Check your separator character and ensure it is |.",
          "Code": 4001
        }
      }
    }
  }
}

That looks a lot to me like the API call failed for some reason.

The fix is to remove that stale cache file and re-run the refresh-mcollective-metadata script.

$ sudo rm /var/tmp/dell-warranty-*.json
$ sudo /usr/libexec/mcollective/refresh-mcollective-metadata

Then inspect the cached file again. It should now contain a lot of warranty info.

If it doesn’t, well… then you need to start working out why, and that’s an exercise left for the reader!
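
If you’d rather not reach for strace at all next time, here’s a quick (untested) sketch which purges any cached fault responses before refreshing:

# Remove any cached Dell API responses which contain a fault...
sudo grep -l FaultException /var/tmp/dell-warranty-*.json 2>/dev/null | xargs -r sudo rm -f
# ...then regenerate the facts
sudo /usr/libexec/mcollective/refresh-mcollective-metadata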

-Paul
[1] https://groups.google.com/forum/#!msg/puppet-users/LsK3HbEBMGc/-DSIOMNCDzIJ

I freely admit that the intent behind this post is mostly about getting the “fix” into those google search results – so I don’t have to resort to strace next time it happens!

Striving to be agile in a non-agile environment

I recently read a post over at the Government Digital Service (GDS) blog which struck a chord with me. It was titled “How to be agile in a non agile environment”.

I work in the Wireless/DNS/ResNet team. We’re a small team and as such we’re keen to adopt technologies or working practices which enable us to deliver as stable a service as possible while still being able to respond to users’ requirements quickly.

Over the last few years, we’ve converged on an approach which could be described as “agile with a small a”.

We’re not software developers, we’re probably better described as an infrastructure operations team. So some of the concepts in Agile need a little translation, or don’t quite fit our small team – but we’ve cherry picked our way into something which seems to get a fair amount of bang-per-buck.

At its heart, the core beliefs of our approach are that shipping many small changes is less disruptive than shipping one big change, that our production environment should always be in a deployable state (even if that means it’s missing features), and that we collect metrics about the use of our services and use them to inform the direction we move in.

We’ve been using Puppet to manage our servers for almost 5 years now – so we’re used to “infrastructure as code” – we’ve got git (with gitlab) for our source code control, r10k to deploy ephemeral environments for developing/testing in, and a workflow which allows us to push changes through dev/test/production phases with every change being peer reviewed before it hits production.

However – we’re still a part of IT Services, and the University of Bristol as a whole. We have to work within the frameworks which are available, and play the same game as everyone else.

The University isn’t a particularly agile environment – it’s hard for an institution as big and as long-established as a university to be agile! There are governance processes to follow, working groups to involve, stakeholders to inform and engage, and standard tools used by the rest of the organisation which don’t tie in to our toolchain particularly nicely…

Using our approach, we regularly push 5-10 production changes a day in a controlled manner (not bad for a team of 2 sysadmins) with very few failures[1]. Every one of those changes is recorded in our systems with a full trail of who made what change, the technical detail of the change implementation and a record of who signed it off.

Obviously it’s not feasible to take every single one of those changes to the weekly Change Advisory Board – if we did, the meeting would take forever!

Instead, we take a pragmatic approach and ask ourselves some questions for every change we make:

  • Will anyone experience disruption if the change goes well?
  • Will anyone experience disruption if the change goes badly?

If the answer to either of those questions is yes, then we ask ourselves an additional question: “Will anyone experience disruption if we *don’t* make the change?”

The answers to those three questions inform our decision to either postpone or deploy the change, and help us to decide when it should be added to the CAB agenda. I think we strike about the right balance.

We’re keen to engage with the rest of the organisation as there are benefits to us in doing so (and well, it’s just the right thing to do!) and hopefully by combining the best of both worlds we can continue to deliver stable services in a responsive manner and still move the services forward.

I feel the advice in the GDS post pretty much mirrors what we’re already doing, and it’s working well for us.

Hopefully it could work well for others too!

[1] I say “very few failures” and I’m sure that probably scares some people – the notion that any failure of a system or change could be in any way acceptable.

I strongly believe that there is value in failure. Every failure is an opportunity to improve a system or a process, or to design in some missing resilience. Perhaps I’ll write more about that another time, as it’s a bit off-piste from what I intended to write about here!

Xen VM networking and tcpdump — checksum errors?

Whilst searching for reasons that a CentOS 6 samba gateway VM we run in ZD (as a “Fog VM” on a Xen hypervisor) was giving poor performance and seemingly dropping connections during long transfers, I found this sort of output from tcpdump -v host <IP_of_samba_client> on the samba server:

10:44:07.228431 IP (tos 0x0, ttl 64, id 64091, offset 0, flags [DF], proto TCP (6), length 91)
    server.example.org..microsoft-ds > client.example.org.44729: Flags [P.], cksum 0x3e38 (incorrect -> 0x67bc), seq 6920:6959, ack 4130, win 281, options [nop,nop,TS val 4043312256 ecr 1093389661], length 39SMB PACKET: SMBtconX (REPLY)

10:44:07.268589 IP (tos 0x0, ttl 60, id 18217, offset 0, flags [DF], proto TCP (6), length 52)
    client.example.org.44729 > server.example.org..microsoft-ds: Flags [.], cksum 0x1303 (correct), ack 6959, win 219, options [nop,nop,TS val 1093389712 ecr 4043312256], length 0

Note that the cksum field is showing up as incorrect on all sent packets. Apparently this is normal when hardware checksum offloading is in use – the checksum is filled in by the NIC after tcpdump has already captured the packet – and we don’t see incorrect checksums at the other end of the connection (by the time this shows up on the client).

However, testing has shown that the reliability and performance of connections to the samba server on this VM is much greater when hardware checksumming is disabled with:

ethtool -K eth0 tx off
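
The lowercase -k flag queries the current settings rather than changing them, which is handy for checking before and after (note that the -K change only lasts until the next reboot):

# Show the current checksum offload settings for the interface
ethtool -k eth0 | grep -i checksum
# e.g. "tx-checksumming: on" before the change, "tx-checksumming: off" after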

Perhaps our version of Xen on the hypervisor, or a combination of the Xen version and the driver versions on the hypervisor and/or the client VM, is causing these networking problems?

My networking-fu is weak, so I couldn’t say more than what I have observed, even though this shouldn’t really make a difference…
(Please comment if you have info on why/what may be going on here!)

Ubuntu bug 31273 shows others with similar problems which were solved by disabling hardware checksum offloading.

“KDC has no support for encryption type” with old ciphers against Active Directory

A single machine somehow managed to end up with a differently-configured /etc/krb5.conf file, and recently all logins (both ssh and on the console, except for root) stopped working. The messages in the logs were of the form:

Sep 29 15:04:58 test-host sshd[1433]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= host=host.example.com user=user12345
Sep 29 15:04:58 test-host sshd[1433]: pam_krb5[1433]: authentication fails for 'user12345' (user12345@REALM.EXAMPLE.COM): Authentication failure (KDC has no support for encryption type)
Sep 29 15:05:00 test-host sshd[1433]: Failed password for user12345 from 1.2.3.4 port 50432 ssh2

The reason for this was simple – the Kerberos config in /etc/krb5.conf contained the following lines:

[libdefaults]
        ... (other lines snipped)
        default_tkt_enctypes = des-cbc-crc
        default_tgs_enctypes = des-cbc-crc

These settings force the use of an older DES encryption type which is only 56-bit, and which has been disabled by default since Windows 7/Windows Server 2008 R2. Removing these lines allows the encryption type to be negotiated automatically, so a stronger type supported by the Active Directory servers is used and we can log in once more. Phew!
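
A quick way to confirm what actually gets negotiated after the change (the enctype names you see will depend on your AD configuration):

# Get a fresh ticket and show the encryption types in use
kinit user12345@REALM.EXAMPLE.COM
klist -e
# The Etype lines should now show something stronger than des-cbc-crc,
# e.g. aes256-cts-hmac-sha1-96 or arcfour-hmac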

(This is a legacy CentOS 5 server, all the newer ones have the same Kerberos config on them — thankfully the same config works on CentOS 5/6/7 and Debian/Ubuntu without modifications thus far!)

What’s using all my swap?

On a couple of occasions recently, we’ve noticed swap use getting out of hand on a server or two. There’s been no common cause so far, but the troubleshooting approach has been the same in each case.

To try and tell the difference between a VM which is generally “just a bit tight on resources” and a situation where a process has run away, it can sometimes be handy to work out which processes are hitting swap.

The approach I’ve been using isn’t particularly elegant, but it has proved useful so I’m documenting it here:

grep VmSwap /proc/*/status 2>&1 | perl -ne '/\/(\d+)\/[^\d]*(\d+) (.B)$/g;if($2>0){$name=`ps -p $1 -o comm=`;chomp($name);print "$name ($1) $2$3\n"}'

Let’s pick it apart one component at a time.

grep VmSwap /proc/*/status 2>&1

The first step is to pull out the VmSwap line from the PID status files held in /proc. There’s one of these files for each process on the system and it tracks all sorts of stuff. VmSwap is how much swap is currently being used by this process. The grep gives output like this:

...
/proc/869/status:VmSwap:	     232 kB
/proc/897/status:VmSwap:	     136 kB
/proc/9039/status:VmSwap:	    5368 kB
/proc/9654/status:VmSwap:	     312 kB
...

That’s got a lot of useful info in it (e.g. the PID is there, as is the amount of swap in use), but it’s not particularly friendly. The PID is part of the filename, and it would be more useful if we could have the name of the process as well as the PID.

Time for some perl…

perl -ne '/\/(\d+)\/[^\d]*(\d+) (.B)$/g;if($2>0){$name=`ps -p $1 -o comm=`;chomp($name);print "$name ($1) $2$3\n"}'

Dealing with the shell side of things first (before we dive into the perl code), “-ne” says to perl “I want you to run the following code against every line of input I pipe your way”.

The first thing we do in perl itself is run a regular expression across the line of input looking for three things: the PID, the amount of swap used and the units reported. When the regex matches, this info gets stored in $1, $2 and $3 respectively.

I’m pretty sure the units are always kB but matching the units as well seemed safer than assuming!

The if statement allows us to ignore processes which are using 0kB of swap because we don’t care about them, and they can cause problems for the next stage:

$name=`ps -p $1 -o comm=`;chomp($name)

To get the process name, we run a “ps” command in backticks, which allows us to capture the output. “-p $1” tells ps that we want information about a specific PID (which we matched earlier and stored in $1), and “-o comm=” specifies a custom output format which is just the process name.

chomp is there to strip the ‘\n’ off the end of the ps output.

print "$name ($1) $2$3\n"

Lastly we print out the $name of the process, its PID and the amount of swap it’s using.

So now, you get output like this:

...
automount (869) 232kB
cron (897) 136kB
munin-node (9039) 5364kB
exim4 (9654) 312kB
...

The output is a little untidy, and there is almost certainly a more elegant way to get the same information. If you have an improvement, let me know in the comments!
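
One small tweak which may help (assuming the process names contain no spaces, so the field numbering holds) is to pipe the output through sort so the biggest swap users end up at the bottom:

grep VmSwap /proc/*/status 2>&1 | perl -ne '/\/(\d+)\/[^\d]*(\d+) (.B)$/g;if($2>0){$name=`ps -p $1 -o comm=`;chomp($name);print "$name ($1) $2$3\n"}' | sort -n -k3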

Molly-guard for CentOS 7?

Since I was looking at this already and had a few things to investigate and fix in our systemd-using hosts, I checked how plausible it is to insert a molly-guard-like password prompt as part of the reboot/shutdown process on CentOS 7 (i.e. using systemd).

Problems encountered include:

  • Asking for a password from a service/unit in systemd — systemd-ask-password looks like the right tool, but it needs some password agent setup to reply to the prompt correctly (see the rough sketch after this list)
  • The reboot command always walls a message to all logged in users before it even runs the new reboot-molly unit, as it expects a reboot to happen. The argument --no-wall stops this but that requires a change to the reboot command. Hence back to the original problem of replacing packaged files/symlinks with RPM
  • The reboot.target unit is a “systemd.special” unit, which means that it has some special behaviour and cannot be renamed. We can modify it, of course, by editing the reboot.target file.
  • How do we get a systemd unit to run first and block anything later from running until it is complete? (In fact, to abort the reboot – but just for this occasion, rather than the unit being marked as permanently failed; a failed reboot is a bit of a strange situation for it to be in…) The dependencies appear to work, but the reboot target is quite keen on running other items from the dependency list — I’m more than likely doing something wrong here!
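
For the record, this is roughly the shape of the reboot-molly unit we were experimenting with. It is very much a non-working sketch (the prompt wording is invented for illustration) and the problems above still apply:

# /etc/systemd/system/reboot-molly.service
[Unit]
Description=Molly-guard style confirmation before reboot
DefaultDependencies=no
Before=reboot.target

[Service]
Type=oneshot
# systemd-ask-password needs a password agent around to relay the prompt,
# and something still has to compare the answer to the hostname
ExecStart=/usr/bin/systemd-ask-password "Type the hostname to confirm reboot:"

[Install]
WantedBy=reboot.target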

So for now this is shelved. It would be nice to have a solution though, so any hints from systemd experts are gratefully received!

(Note that CentOS 7 uses systemd 208, so new features in later versions which help won’t be available to us)