SysAdmin Appreciation Day

Blimey, the “last Friday in July” has rolled round again quickly, and today marks International SysAdmin Appreciation Day.

So, I’d like to say a heartfelt thank you to all the sysadmins I interact with in IT Services!

Not just the Linux/Unix guys I work directly with, but the Windows team who provide me with useful things like authentication and RDP services, the network team who keep us talking to each other, the storage and backups team who provide me with somewhere to keep my data, and anyone else who provides me with a service I consume!

You’re all doing great work which no one notices until it breaks, and then you unfailingly go above and beyond to restore service as soon as possible.

Thank you, my fellow sysadmins! I couldn’t do my job without you all doing yours.

RPM addsign fail on vendor-provided package (and a workaround)

We’ve been signing RPM packages in local repos for a while now, and this has been working nicely (see previous posts about rpm signing)… until today.

The Intel Fortran 2015 installer provides RPMs which are already signed by Intel and which install and work fine, so we push these out with Puppet from our local (private) repo. However, even though I’d signed them myself, they were failing to verify the signature…

Original RPM before re-signing locally:

  => rpm --checksig intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm: RSA sha1 ((MD5) PGP) md5 NOT OK (MISSING KEYS: (MD5) PGP#7a5a985f) 

  => rpm -qpi intel-fcompxe-187-15.0-3.noarch.rpm
  warning: intel-fcompxe-187-15.0-3.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID 7a5a985f: NOKEY
  Name        : intel-fcompxe-187            
  ...
  Signature   : RSA/SHA1, Fri 10 Apr 2015 13:23:59 BST, Key ID 27fbcd8d7a5a985f

So I go ahead and sign the package:

  => rpm --addsign intel-fcompxe-187-15.0-3.noarch.rpm 
  Enter pass phrase: [type passphrase]
  Pass phrase is good.
  intel-fcompxe-187-15.0-3.noarch.rpm:

All looks fine, so let’s check the signature again:

  => rpm --checksig intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm: RSA RSA sha1 sha1 ((MD5) PGP) ((MD5) PGP) md5 md5 NOT OK (MISSING KEYS: (MD5) PGP#7a5a985f (MD5) PGP#262a742e) 

  => rpm -qpi intel-fcompxe-187-15.0-3.noarch.rpm
  warning: intel-fcompxe-187-15.0-3.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID 7a5a985f: NOKEY
  Name        : intel-fcompxe-187            
  ...
  Signature   : RSA/SHA1, Fri 10 Apr 2015 13:23:59 BST, Key ID 27fbcd8d7a5a985f

Odd, it hasn’t changed… Let’s try removing the signature instead:

  => rpm --delsign intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm:

  => rpm --checksig intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm: RSA RSA sha1 sha1 sha1 ((MD5) PGP) ((MD5) PGP) md5 md5 md5 NOT OK (MISSING KEYS: (MD5) PGP#7a5a985f (MD5) PGP#262a742e)

That’s very odd: it’s added tags to the signature header. And if you try a few more times (just to be sure, right? :), it adds yet more tags to the header:

  => rpm --delsign intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm:
  ...
  Packager    : http://www.intel.com/software/products/support
  Summary     : Intel(R) Fortran Compiler XE 15.0 Update 3 for Linux*
  Description :
  Intel(R) Fortran Compiler XE 15.0 Update 3 for Linux*

  => rpm --checksig intel-fcompxe-187-15.0-3.noarch.rpm
  intel-fcompxe-187-15.0-3.noarch.rpm: RSA RSA sha1 sha1 sha1 sha1 sha1 sha1 ((MD5) PGP) ((MD5) PGP) md5 md5 md5 md5 md5 md5 NOT OK (MISSING KEYS: (MD5) PGP#7a5a985f (MD5) PGP#262a742e)

If you do this a few more times then rpm can’t read the package at all anymore!

  => rpm --checksig intel-fcompxe-187-15.0-3.noarch.rpm
  error: intel-fcompxe-187-15.0-3.noarch.rpm: rpmReadSignature failed: sigh tags: BAD, no. of tags(33) out of range

  => rpm -ql -v -p intel-fcompxe-187-15.0-3.noarch.rpm 
  error: intel-fcompxe-187-15.0-3.noarch.rpm: rpmReadSignature failed: sigh tags: BAD, no. of tags(33) out of range
  error: intel-fcompxe-187-15.0-3.noarch.rpm: not an rpm package (or package manifest)

Ooops!

Workaround: Add the vendor keys, as you should do anyway, rather than re-signing. Appending the public key to the relevant RPM-GPG-KEY-* file is all that’s required, and then you can install the packages just fine.
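
For example, something like this (the key filename and destination path here are placeholders; use whichever file the gpgkey setting in your repo configuration actually points at):

  # Hypothetical filenames: append Intel's public key to the key file your repo config references...
  => cat intel-gpg-pubkey.asc >> /etc/pki/rpm-gpg/RPM-GPG-KEY-local
  # ...or import it straight into the rpm keyring on the client:
  => rpm --import intel-gpg-pubkey.asc

After that, rpm --checksig on a fresh (un-mangled!) copy of the vendor package should report OK.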

Future work: Submit a bug report about this to the rpm-sign developers…

How old is this Solaris box?

Sometimes it’s useful to know how old a Solaris server is without having to dig out its serial number or documentation.

Turns out it’s really easy: “prtfru -c” will give you the build date of various bits of hardware in the system. For example, here’s a server that we’ve just retired (which is long overdue!):

oldserver:$ sudo prtfru -c | grep UNIX_Timestamp
Password:
      /ManR/UNIX_Timestamp32: Mon Aug 22 02:52:32 BST 2005
      /ManR/UNIX_Timestamp32: Fri Jun  3 19:48:16 BST 2005
      /ManR/UNIX_Timestamp32: Wed Aug  3 11:39:47 BST 2005
      /ManR/UNIX_Timestamp32: Fri Jun  3 19:46:50 BST 2005
oldserver:$ 

Capacity Planning for DNS

I’ve spent the last 6 months working on our DNS infrastructure, wrangling it into a more modern shape.

This is the first in a series of articles talking about some of the process we’ve been through and outlining some of the improvements we’ve made.

One of the exercises we try to go through when designing any new production infrastructure is capacity planning. There are four questions you need to be able to ask when you’re doing this:

  1. How much traffic do we need to handle today?
  2. How are we expecting traffic to grow?
  3. How much traffic can the infrastructure handle?
  4. How much headroom have we got?

We aim to be in a position where we can ask those four questions on a regular basis, and preferably get useful answers to them!

When it comes to DNS, the most useful metric would appear to be “queries/second” (which I’ll refer to as qps from here on in to save a load of typing!), and bind can give us that information fairly readily with its built-in statistics-gathering features.
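
For illustration, one way to get at those counters (a minimal sketch, assuming a BIND 9 build with statistics support; the address, port and ACL below are arbitrary placeholders) is to enable a statistics channel in named.conf and have your monitoring poll it over HTTP:

  # named.conf: expose bind's built-in counters over HTTP for monitoring to scrape
  statistics-channels {
      inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
  };

The exact URL paths and output formats on that channel vary between bind versions, and “rndc stats” will dump a similar set of counters to bind’s statistics file for ad-hoc use.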

With that in mind, let’s look at those four questions.

1. How much traffic do we need to handle today?
The best way to get hold of that information is to collect the qps metrics from our DNS infrastructure and graph them.

This is quite a popular thing to do, and most monitoring tools (e.g. Nagios, Munin or Ganglia) have well-worn solutions available; for everything else there’s Google.

Unfortunately, we weren’t able to collate these stats from the core of the legacy DNS infrastructure in a meaningful way (due to differences in bind version, the lack of a sensible aggregation point, etc.).

Instead, we had to infer it from other sources that we can/do monitor, for example the caching resolvers we use for eduroam.

Our eduroam wireless network is used by over 30,000 client devices a week. We think this is around 60% of the total devices on the network, so it’s a fairly good proxy for the whole university network.

We looked at what the eduroam resolvers were handling at peak time (revision season), doubled it and added a bit. Not a particularly scientific approach, but it’s likely to be over-generous, which is no bad thing in this case!

That gave us a ballpark figure of “we need to handle around 4000qps”.

2. How are we expecting traffic to grow?
We don’t really have long-term trend information for the central DNS service due to the historical lack of monitoring.

Again inferring generalities from eduroam, the number of clients on the network goes up by 20-30% year on year (and has done since 2011). Taking 30% year-on-year growth as our growth rate and expanding that over 5 years, it looks like this:

[Graph: projected DNS query growth over the next 5 years]

In other words, in 5 years’ time we think we’ll need around 15,000qps.
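
For reference, that figure is just compound growth applied to the ~4000qps from question 1:

  4000qps x 1.3^5 ≈ 4000qps x 3.71 ≈ 14,850qps

which we round up to 15,000qps.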

Since all the estimates in this process are on the generous side, and because of the compound nature of the year-on-year growth calculation, that should be a significant overestimate.

It will certainly be an interesting figure to revisit in 5 years’ time!

3. How much traffic can the infrastructure handle?
To answer this one, we need some benchmarking tools. After a bit of research I settled on dnsperf. The mechanics of how to run dnsperf (and how to gather a realistic sample dataset) are best left for another time.

All tests were done against the pre-production infrastructure so as not to interfere with live traffic.

Let’s look at the graphs we get out at the end.

The new infrastructure:
[Graph: dnsperf query and response rates for the full pool (20150624-1225.rate)]

Interpreting this graph isn’t immediately obvious. The way dnsperf works is that it linearly scales the number of queries/second that it’s sending to your DNS server, and tracks how many responses it gets back per second.

So the red line is how many queries/second we’re testing against, and the green line is how the server is responding. Where the two lines diverge shows you where your infrastructure starts to struggle.

In this case, the new infrastructure appears to cope quite well with around 30,000qps – or about twice what we’re expecting to need in 5 years’ time. That’s with all (or rather, both!) the servers in the pool available, so do we still have n+1 redundancy?

A single node in the new infrastructure:
[Graph: dnsperf query and response rates for a single node (20150622-1438.rate)]

From this graph you can see we’re good up to around 14,000qps, so we’re n+1 redundant for at least the next 3-4 years (the lifetime of the hardware we’re using).

At present we have 2 nodes in the pool, and the implication from the two graphs (one node handling ~14,000qps, two nodes handling ~30,000qps) is that capacity does indeed scale approximately linearly with the number of servers in the pool.

4. How much headroom have we got?
At this point, the answer to that looks like “plenty”, and with the new infrastructure we should be able to scale out almost linearly by adding more servers to the pool.

Now that we know how much we can expect our infrastructure to handle, and how much traffic it’s actually experiencing, we can make informed decisions about when we need to add more resources in order to maintain at least n+1 redundancy.

What about the legacy infrastructure?
Well, the reason I’m writing this post today (rather than any other day) is that we retired the oldest of the servers in the legacy infrastructure today, and I wanted to fire dnsperf at it, after it’s stopped handling live traffic but before we switch it off completely!

So how many queries/second can a 2005 vintage Sun Fire V240 server cope with?

[Graph: dnsperf query and response rates for the 2005-vintage Sun Fire V240 (20150727-0938.rate)]

It seems the answer to that is “not really enough for 2015!”

No wonder its response times were atrocious…

F5 Big-IP, Apache logs and client IP

Because the F5 Big-IP uses SNAT to load-balance traffic, your back-end node will see the traffic coming from the IP of the load balancer rather than that of the true client.
To overcome this (for web traffic at least), the F5 injects an X-Forwarded-For header into the HTTP stream containing the true client’s IP.

In Apache you may want to log this IP instead of the remote host when it has been set. Using SetEnvIf, we can choose a suitable LogFormat based on whether or not the X-Forwarded-For header is set:

CustomLog "/path/to/log/dir/example.com_access.log" combined env=!forwarded
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy
SetEnvIf X-Forwarded-For "^.*\..*\..*\..*" forwarded
CustomLog "/path/to/log/dir/example.com_access.log" proxy env=forwarded

The above assumes that the “combined” LogFormat has already been defined.
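
A quick way to sanity-check the Apache side is to send a request with the header already set, aimed straight at a back-end node rather than the load balancer (192.0.2.10 and example.com are placeholders here; adjust the URL and log path to suit):

  => curl -H "X-Forwarded-For: 192.0.2.10" http://example.com/
  => tail -1 /path/to/log/dir/example.com_access.log

The first field of the resulting log entry should be 192.0.2.10 rather than the address of the connecting host.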

If you use the ::apache::vhost defined type from the puppetlabs/apache module, you can achieve the same result with the following parameters:

::apache::vhost { 'example.com':
  logroot => "/path/to/log/dir/",
  access_log_env_var => "!forwarded",
  custom_fragment => "LogFormat \"%{X-Forwarded-For}i %l %u %t \\\"%r\\\" %s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\"    proxy
  SetEnvIf X-Forwarded-For \"^.*\..*\..*\..*\" forwarded
  CustomLog \"/path/to/log/dir/${title}_access.log\" proxy env=forwarded"
}