Striving to be agile in a non-agile environment

I recently read a post over at the Government Digital Service (GDS) blog which struck a chord with me. It was titled “How to be agile in a non-agile environment”.

I work in the Wireless/DNS/ResNet team. We’re a small team, and as such we’re keen to adopt technologies and working practices which enable us to deliver as stable a service as possible while still being able to respond to users’ requirements quickly.

Over the last few years, we’ve converged on an approach which could be described as “agile with a small a”.

We’re not software developers – we’re probably better described as an infrastructure operations team. So some of the concepts in Agile need a little translation, or don’t quite fit our small team – but we’ve cherry-picked our way into something which seems to get a fair amount of bang-per-buck.

At its heart, our approach rests on a few core beliefs: that shipping many small changes is less disruptive than shipping one big change; that our production environment should always be in a deployable state (even if that means it’s missing features); and that we should collect metrics about the use of our services and use them to inform the direction we move in.

We’ve been using Puppet to manage our servers for almost 5 years now – so we’re used to “infrastructure as code” – we’ve got git (with gitlab) for our source code control, r10k to deploy ephemeral environments for developing/testing in, and a workflow which allows us to push changes through dev/test/production phases with every change being peer reviewed before it hits production.
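As a rough sketch of what that workflow can look like day to day (the branch name and flags here are illustrative – exact invocations depend on your r10k version and control-repo setup):

```shell
# Hypothetical sketch of a branch-per-environment Puppet workflow.
# With r10k, each git branch of the control repo becomes a Puppet environment.

git checkout -b fix_resnet_dhcp       # topic branch = new ephemeral environment
# ...edit manifests/modules, commit...
git push origin fix_resnet_dhcp

# Deploy that single environment (including its Puppetfile modules)
r10k deploy environment fix_resnet_dhcp -p

# Dry-run the change against a test node before merging towards production
puppet agent --test --noop --environment fix_resnet_dhcp
```

Once the change has been peer reviewed, merging the branch towards the production branch and redeploying with r10k pushes it through the dev/test/production phases.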

However – we’re still a part of IT Services, and the University of Bristol as a whole. We have to work within the frameworks which are available, and play the same game as everyone else.

The University isn’t a particularly agile environment – it’s hard for an institution as big and as long-established as a University to be agile! There are governance processes to follow, working groups to involve, stakeholders to inform and engage, and standard tools used by the rest of the organisation which don’t tie in to our toolchain particularly nicely…

Using our approach, we regularly push 5-10 production changes a day in a controlled manner (not bad for a team of 2 sysadmins) with very few failures[1]. Every one of those changes is recorded in our systems with a full trail of who made what change, the technical detail of the change implementation and a record of who signed it off.

Obviously it’s not feasible to take every single one of those changes to the weekly Change Advisory Board; if we did, the meeting would take forever!

Instead, we take a pragmatic approach and ask ourselves some questions for every change we make:

  • Will anyone experience disruption if the change goes well?
  • Will anyone experience disruption if the change goes badly?

If the answer to either of those questions is yes, then we ask ourselves an additional question: “Will anyone experience disruption if we *don’t* make the change?”

The answers to those three questions inform our decision to either postpone or deploy the change, and help us to decide when it should be added to the CAB agenda. I think we strike about the right balance.
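To illustrate how those three answers might combine (the function name and the exact decision mapping are my own shorthand for the questions above, not a formal policy), the triage could be sketched as a tiny shell function:

```shell
# Hypothetical sketch of the three-question change triage.
# Arguments: disruption-if-it-goes-well, disruption-if-it-goes-badly,
#            disruption-if-we-don't-do-it (each "yes" or "no").
triage() {
  ok="$1"; bad="$2"; skip="$3"
  if [ "$ok" = "no" ] && [ "$bad" = "no" ]; then
    echo "deploy"             # no disruption either way: just ship it
  elif [ "$skip" = "yes" ]; then
    echo "deploy-and-record"  # disruptive, but not changing also disrupts
  else
    echo "postpone-to-CAB"    # disruptive and deferrable: CAB first
  fi
}

triage no no no    # -> deploy
triage yes no yes  # -> deploy-and-record
triage no yes no   # -> postpone-to-CAB
```

In practice the decision is a judgement call rather than a script, but the shape of the logic is roughly this.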

We’re keen to engage with the rest of the organisation as there are benefits to us in doing so (and well, it’s just the right thing to do!) and hopefully by combining the best of both worlds we can continue to deliver stable services in a responsive manner and still move the services forward.

I feel the advice in the GDS post pretty much mirrors what we’re already doing, and it’s working well for us.

Hopefully it could work well for others too!

[1] I say “very few failures” and I’m sure that probably scares some people – the notion that any failure of a system or change could be in any way acceptable.

I strongly believe that there is value in failure. Every failure is an opportunity to improve a system or a process, or to design in some missing resilience. Perhaps I’ll write more about that another time, as it’s a bit off-piste from what I intended to write about here!

Xen VM networking and tcpdump — checksum errors?

Whilst searching for reasons that a CentOS 6 samba gateway VM we run in ZD (as a “Fog VM” on a Xen hypervisor) was giving poor performance and seemingly dropping connections during long transfers, I found this sort of output from tcpdump -v host <IP_of_samba_client> on the samba server:

10:44:07.228431 IP (tos 0x0, ttl 64, id 64091, offset 0, flags [DF], proto TCP (6), length 91)
    server.example.org.microsoft-ds > client.example.org.44729: Flags [P.], cksum 0x3e38 (incorrect -> 0x67bc), seq 6920:6959, ack 4130, win 281, options [nop,nop,TS val 4043312256 ecr 1093389661], length 39SMB PACKET: SMBtconX (REPLY)

10:44:07.268589 IP (tos 0x0, ttl 60, id 18217, offset 0, flags [DF], proto TCP (6), length 52)
    client.example.org.44729 > server.example.org.microsoft-ds: Flags [.], cksum 0x1303 (correct), ack 6959, win 219, options [nop,nop,TS val 1093389712 ecr 4043312256], length 0

Note that the cksum field shows up as incorrect on every sent packet. Apparently this is normal when hardware checksum offloading is in use: tcpdump captures the packet before the NIC fills in the checksum, so outgoing packets appear to have bad checksums even though they’re correct on the wire. Sure enough, we don’t see incorrect checksums at the other end of the connection (by the time this traffic shows up on the client).

However, testing has shown that connections to the samba server on this VM are considerably more reliable, and perform much better, when hardware checksumming is disabled with:

ethtool -K eth0 tx off
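If you want to try the same workaround, a sketch of checking and persisting the setting (assuming the interface is eth0 and an ifcfg-style CentOS 6 network setup – adjust for your own system):

```shell
# Show the current offload settings; look for "tx-checksumming"
ethtool -k eth0 | grep -i checksum

# Disable TX checksum offload (takes effect immediately, lost on reboot)
ethtool -K eth0 tx off

# To persist across reboots on CentOS 6, add to the interface config:
#   ETHTOOL_OPTS="-K eth0 tx off"
# in /etc/sysconfig/network-scripts/ifcfg-eth0
```

Note that lowercase -k shows the current settings while uppercase -K changes them.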

Perhaps the version of Xen on the hypervisor, or some combination of the Xen version and the driver versions on the hypervisor and/or the client VM, is causing these networking problems?

My networking-fu is weak, so I couldn’t say more than what I have observed, even though this shouldn’t really make a difference…
(Please comment if you have info on why/what may be going on here!)

Ubuntu bug 31273 shows others with similar problems which were solved by disabling hardware checksum offloading.