I recently read a post over at the Government Digital Service (GDS) blog which struck a chord with me. It was titled "How to be agile in a non-agile environment".
I work in the Wireless/DNS/ResNet team. We're a small team and as such we're keen to adopt technologies or working practices which enable us to deliver as stable a service as possible while still being able to respond to users' requirements quickly.
Over the last few years, we've converged on an approach which could be described as "agile with a small a".
We're not software developers; we're probably better described as an infrastructure operations team. So some of the concepts in Agile need a little translation, or don't quite fit our small team – but we've cherry-picked our way into something which seems to give a fair amount of bang per buck.
At its heart, our approach rests on three core beliefs: that shipping many small changes is less disruptive than shipping one big change, that our production environment should always be in a deployable state (even if that means it's missing features), and that we collect metrics about the use of our services and use them to inform the direction we move in.
We've been using Puppet to manage our servers for almost 5 years now, so we're used to "infrastructure as code". We've got git (with GitLab) for our source code control, r10k to deploy ephemeral environments for developing and testing in, and a workflow which allows us to push changes through dev/test/production phases, with every change being peer reviewed before it hits production.
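For anyone unfamiliar with r10k, the heart of that kind of setup is a small config file mapping a git "control" repository onto Puppet environments. The sketch below is illustrative rather than our actual config (the repository URL and paths are placeholders), but it shows the idea: every branch in the control repo becomes an environment, so a short-lived feature branch gives you an ephemeral dev/test environment, and a peer-reviewed merge into the production branch is what actually deploys a change.

```yaml
# /etc/puppetlabs/r10k/r10k.yaml - illustrative sketch, not our real config
:cachedir: '/var/cache/r10k'
:sources:
  # Every branch of the control repository is deployed as a Puppet
  # environment under basedir (e.g. dev, test, production).
  :puppet:
    remote: 'https://gitlab.example.com/ops/puppet-control.git'
    basedir: '/etc/puppetlabs/code/environments'
```

Running `r10k deploy environment` after a merge then creates or refreshes the matching environment, and once a feature branch is deleted the next deploy removes its environment again.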
However – we’re still a part of IT Services, and the University of Bristol as a whole. We have to work within the frameworks which are available, and play the same game as everyone else.
The University isn't a particularly agile environment; it's hard for an institution as big and as long-established as a University to be agile! There are governance processes to follow, working groups to involve, stakeholders to inform and engage, and standard tools used by the rest of the organisation which don't tie into our toolchain particularly nicely…
Using our approach, we regularly push 5-10 production changes a day in a controlled manner (not bad for a team of 2 sysadmins) with very few failures[1]. Every one of those changes is recorded in our systems with a full trail of who made what change, the technical detail of the change implementation and a record of who signed it off.
Obviously it's not feasible to take every single one of those changes to the weekly Change Advisory Board; if we did, the meeting would take forever!
Instead, we take a pragmatic approach and ask ourselves some questions for every change we make:
- Will anyone experience disruption if the change goes well?
- Will anyone experience disruption if the change goes badly?
If the answer to either of those questions is yes, then we ask ourselves an additional question: "Will anyone experience disruption if we *don't* make the change?"
The answers to those three questions inform our decision to either postpone or deploy the change, and help us to decide when it should be added to the CAB agenda. I think we strike about the right balance.
We're keen to engage with the rest of the organisation, as there are benefits to us in doing so (and, well, it's just the right thing to do!). Hopefully, by combining the best of both worlds, we can continue to deliver stable services in a responsive manner and still move those services forward.
I feel the advice in the GDS post pretty much mirrors what we’re already doing, and it’s working well for us.
Hopefully it could work well for others too!
[1] I say “very few failures” and I’m sure that probably scares some people – the notion that any failure of a system or change could be in any way acceptable.
I strongly believe that there is value in failure. Every failure is an opportunity to improve a system or a process, or to design in some missing resilience. Perhaps I'll write more about that another time, as it's a bit off-piste from what I intended to write about here!