Striving to be agile in a non-agile environment

I recently read a post over at the Government Digital Service blog which struck a chord with me. It was titled “How to be agile in a non agile environment”.

I work in the Wireless/DNS/ResNet team. We’re a small team, and as such we’re keen to adopt technologies or working practices which enable us to deliver as stable a service as possible while still being able to respond to users’ requirements quickly.

Over the last few years, we’ve converged on an approach which could be described as “agile with a small a”.

We’re not software developers; we’re probably better described as an infrastructure operations team. So some of the concepts in Agile need a little translation, or don’t quite fit our small team, but we’ve cherry-picked our way into something which seems to give a fair amount of bang per buck.

At its heart, our approach rests on a few core beliefs: that shipping many small changes is less disruptive than shipping one big change, that our production environment should always be in a deployable state (even if that means it’s missing features), and that we should collect metrics about the use of our services and use them to inform the direction we move in.

We’ve been using Puppet to manage our servers for almost five years now, so we’re used to “infrastructure as code”. We’ve got Git (with GitLab) for source code control, r10k to deploy ephemeral environments for developing and testing in, and a workflow which allows us to push changes through dev/test/production phases, with every change being peer reviewed before it hits production.
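As a rough illustration of the branch-per-environment idea (and not a description of our actual tooling), each topic branch in the Puppet control repository becomes a short-lived Puppet environment which r10k deploys, and the same mechanism redeploys production once the change has been reviewed and merged. A minimal Python sketch of that sort of glue, assuming r10k is installed and on the PATH and that branch names map straight onto environment names:

```python
#!/usr/bin/env python3
"""Toy wrapper around 'r10k deploy environment'. Illustrative only:
assumes r10k is on the PATH and that every branch of the Puppet control
repository corresponds to a Puppet environment of the same name."""
import subprocess
import sys


def deploy_environment(branch):
    # Ask r10k to (re)deploy the Puppet environment named after the branch.
    subprocess.run(["r10k", "deploy", "environment", branch], check=True)


if __name__ == "__main__":
    # e.g. './deploy.py my_dhcp_fix' builds an ephemeral test environment;
    # running it for 'production' refreshes the live environment post-merge.
    deploy_environment(sys.argv[1] if len(sys.argv) > 1 else "production")
```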

However, we’re still a part of IT Services, and of the University of Bristol as a whole. We have to work within the frameworks which are available, and play the same game as everyone else.

The University isn’t a particularly agile environment; it’s hard for an institution as big and as long-established as a university to be agile! There are governance processes to follow, working groups to involve, stakeholders to inform and engage, and standard tools used by the rest of the organisation which don’t tie in to our toolchain particularly nicely…

Using our approach, we regularly push 5-10 production changes a day in a controlled manner (not bad for a team of 2 sysadmins) with very few failures[1]. Every one of those changes is recorded in our systems with a full trail of who made what change, the technical detail of the change implementation and a record of who signed it off.

Obviously it’s not feasible to take every single one of those changes to the weekly Change Advisory Board; if we did, the meeting would take forever!

Instead, we take a pragmatic approach and ask ourselves some questions for every change we make:

  • Will anyone experience disruption if the change goes well?
  • Will anyone experience disruption if the change goes badly?

If the answer to either of those questions is yes, then we ask ourselves an additional question: “Will anyone experience disruption if we *don’t* make the change?”

The answers to those three questions inform our decision to either postpone or deploy the change, and help us to decide when it should be added to the CAB agenda. I think we strike about the right balance.
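Purely for illustration, here’s how those three questions might look if you wrote them down as a triage rule. This is a toy Python sketch, not a tool we actually run, and the example changes and recommendations are made up; in practice it’s a judgement call made by a human.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """A proposed change, plus our answers to the three triage questions."""
    description: str
    disruptive_if_ok: bool       # disruption even if the change goes well?
    disruptive_if_fails: bool    # disruption if the change goes badly?
    disruptive_if_skipped: bool  # disruption if we *don't* make the change?


def triage(change: Change) -> str:
    """Rough recommendation: ship it quietly, or involve the CAB first."""
    risky = change.disruptive_if_ok or change.disruptive_if_fails
    if not risky:
        return "deploy"                          # nobody notices either way
    if change.disruptive_if_skipped:
        return "add to CAB agenda and schedule"  # disruptive whichever way we jump
    return "postpone and take to CAB"


if __name__ == "__main__":
    print(triage(Change("tweak a DHCP option", False, False, False)))
    print(triage(Change("replace a core DNS resolver", True, True, True)))
```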

We’re keen to engage with the rest of the organisation, as there are benefits to us in doing so (and, well, it’s just the right thing to do!). Hopefully, by combining the best of both worlds, we can continue to deliver stable services in a responsive manner while still moving those services forward.

I feel the advice in the GDS post pretty much mirrors what we’re already doing, and it’s working well for us.

Hopefully it could work well for others too!

[1] I say “very few failures” and I’m sure that probably scares some people – the notion that any failure of a system or change could be in any way acceptable.

I strongly believe that there is value in failure. Every failure is an opportunity to improve a system or a process, or to design in some missing resilience. Perhaps I’ll write more about that another time, as it’s a bit off-piste from what I intended to write about here!

Government Digital Service (GDS) Tidbits

I’ve recently started reading the blog run by the Government Digital Service (the team behind .gov.uk) and there have been a handful of particularly interesting posts over the last week or so.

In a lot of ways, the GDS team is a similar organisation to some parts of IT Services, and works rather like how we *could* be working in terms of agile/devops methodologies and workflows. They’re doing a lot of cool stuff, and in interesting ways.

Most of it isn’t directly Unix-related (it’s mostly web application development), but when I spot something interesting I’ll try to flag it up here. Starting with a couple of examples…

“Three cities, one alpha, one day”
http://digital.cabinetoffice.gov.uk/2013/07/23/three-cities-one-alpha-one-day/

This is a video they put up in a blog post about rapidly developing an alpha copy of the Land Registry property register service. This rapid, iterative, user-led development approach is an exciting way to build a service, and it’s interesting to watch another team go at something full pelt to see how far they can get in a day!

I’d have embedded it here, but for some reason WordPress is stripping out the embed code…

FAQs: why we don’t have them
http://digital.cabinetoffice.gov.uk/2013/07/25/faqs-why-we-dont-have-them/

Another article which jumped out at me was this one about why FAQs are probably not the great idea we all thought they were.

Like many teams, the team I work in produces web-based documentation about how to use our services, and yes, we’ve got our fair share of FAQs! Since I read this article, I’ve been thinking about working towards getting rid of them.

Instead of thinking about “what questions do we get asked a lot?”, perhaps we should really be asking “why do people ask that question a lot?”, and either eliminate the need to ask by making the service more intuitive, or make it easier for people to find the answer themselves by changing the flow of information we feed them.

I doubt we can eliminate our FAQs entirely; they’re useful as a way of storing canned answers to problems outside our domain of control, e.g. things like “how do I find out my wireless MAC address?” However, if we can fix the root cause where the problem is within our domain, we reduce the list of items on our FAQ, which makes it clearer and easier to use if people do stumble across it, and still gives us a place to store our canned answers.

Ideally, I think those canned answers would be better off in a knowledge base, or some kind of “smart answers” system. Which brings me on to my last example…

The Smart Answer Bug
http://digital.cabinetoffice.gov.uk/2013/07/31/the-smart-answer-bug/

“Smart Answers” is an example of an expert system which guides customers through a series of questions in order to narrow down their problem and offer them helpful advice quickly and easily.
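For anyone unfamiliar with the idea, a “smart answers” flow is essentially a decision tree. Here’s a minimal, made-up Python sketch of one; the questions and answers are invented for illustration and bear no relation to the real GOV.UK tool.

```python
# A tiny, hypothetical decision tree in the "smart answers" style.
# Each node is either a yes/no question or a final canned answer.
TREE = {
    "start": ("Can you see the university wireless network at all?",
              {"yes": "creds", "no": "coverage"}),
    "creds": ("Do your username and password work on other services?",
              {"yes": "register", "no": "password"}),
    "coverage": (None, "You may be outside wireless coverage; check the coverage maps."),
    "register": (None, "Your device probably needs registering; see the ResNet documentation."),
    "password": (None, "Try resetting your password via the self-service portal."),
}


def walk(node="start"):
    question, outcome = TREE[node]
    if question is None:
        print(outcome)                  # reached a leaf: give the answer
        return
    reply = input(question + " (yes/no) ").strip().lower()
    walk(outcome.get(reply, node))      # re-ask on unrecognised input


if __name__ == "__main__":
    walk()
```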

The gov.uk smart answers system had a fault recently, which was only noticed because their analytics setup showed up some anomalous behaviour.

It turned out to be a browser compatibility fault with the analytics code, but the article really shows up the power of gathering performance and usage data about your services.

Although the example in the post is about web analytics, we can gather a lot of similar data about our servers, and infer the existence of service faults by analysing the results.

If we do that analysis well (and in an automated way) we can pick up faults before our users even notice a problem.
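To make that concrete, here’s a deliberately simple sketch of the sort of automated check I mean. The metric (hourly DNS query counts), the window size and the threshold are all assumptions for the sake of the example, not something we actually run.

```python
"""Toy anomaly check: flag metric samples far outside a recent baseline."""
from statistics import mean, stdev
from typing import List


def anomalies(samples: List[float], window: int = 24, threshold: float = 3.0) -> List[int]:
    """Indices of samples more than `threshold` standard deviations away
    from the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # e.g. hourly DNS query counts: the final hour falls off a cliff, which
    # might mean a resolver has silently stopped answering queries.
    hourly_queries = [10000, 10200, 9800, 10100, 9900, 10050] * 4 + [1200]
    print(anomalies(hourly_queries))  # -> [24]
```

A real check would need to be smarter than this (think seasonality, weekends and planned maintenance), but even something this crude can catch the “silently broken” failure mode before a user reports it.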