Services being broken because the package manager automatically installed a bad update is a problem sysadmins have been mulling over for years. The knee-jerk reaction is to disable automatic updates, but that doesn’t make the problem go away: you still have to apply the updates sooner or later, and doing them manually gets ever more tedious as admins look after larger and larger fleets of virtual machines. Not patching at all isn’t an option either, for obvious security reasons and for compliance with ISP-11, which mandates that security updates be applied within 5 days.
The “gold standard” solution that everyone dreams of is to use a gated repo – i.e. you have a local “dirty” repo which syncs from upstream every night, and all of your dev/test machines update against this repo. Once “someone” has done “some” testing, they can allow the known-good updates into the clean repo, which the production machines update from. The trouble is, this is actually quite a lot of work to set up, and requires ongoing maintenance to test updates. It’s still very human-intensive, and sysadmins hate that.
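For what it’s worth, the plumbing for a gated repo isn’t exotic. A minimal sketch, assuming a RHEL/CentOS-style setup with reposync and createrepo available (the repo IDs, paths and package name below are invented for the example), might look like this:

#!/bin/bash
# Nightly cron job: mirror upstream into the "dirty" repo
reposync --repoid=upstream-base --download_path=/srv/repos/dirty
createrepo --update /srv/repos/dirty/upstream-base

# Later, once "someone" has done "some" testing, promote known-good
# packages into the "clean" repo that production points at
cp /srv/repos/dirty/upstream-base/somepackage-1.2-3.el7.x86_64.rpm \
   /srv/repos/clean/upstream-base/
createrepo --update /srv/repos/clean/upstream-base

The hard part isn’t this plumbing; it’s the testing-and-promotion step in the middle, which is exactly the human-intensive bit.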
So what else can be done? People often talk about a system where your dev servers patch nightly and your prod servers patch weekly, but this doesn’t help you if the broken update comes out on prod patching day.
We’ve recently started using a similar regime where dev/test servers patch nightly, but with a tweak for prod. All our servers are assigned a class (dev, test, prod, etc), and production servers are additionally split into one of three groups according to the numbers in their structured hostnames. We classify nodes using a Puppet module called uob_classifier. Here’s an example:
[ispms@db-mariadb-p0 ~]$ sudo facter -p uob_service_class
prod
[ispms@db-mariadb-p0 ~]$ sudo facter -p uob_service_tier
1
Anything that for some reason slips through the automatic classification isn’t forgotten – the classifier assumes it is production group 1.
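For the curious, Facter can pick this sort of thing up from executable “external facts” dropped into /etc/facter/facts.d/ (or /etc/puppetlabs/facter/facts.d/ on newer agents), which simply print key=value pairs. The sketch below is purely illustrative and is not the real uob_classifier (which is a Puppet module with its own rules); the hostname parsing here is invented for the example:

#!/bin/bash
# Illustrative external fact, e.g. /etc/facter/facts.d/uob_class.sh
# NOT the real uob_classifier; the parsing below is invented.
host=$(hostname -s)             # e.g. db-mariadb-p0

class=""
case "$host" in
  *-d[0-9]*) class="dev"  ;;
  *-t[0-9]*) class="test" ;;
  *-p[0-9]*) class="prod" ;;
esac

# The real module derives the group/tier from the hostname too;
# that mapping isn't shown in the post, so it's left defaulted here.
tier=""

# Anything we can't classify is assumed to be production group 1
[ -z "$class" ] && class="prod"
[ -z "$tier" ]  && tier="1"

echo "uob_service_class=$class"
echo "uob_service_tier=$tier"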
We then use these classes and groups to decide when to patch each type of box. Here’s the Puppet profile we use to manage yum updates with a Forge module, jgazeley/yumupdate.
# Profile for yum updates
class profile::yumupdate {

  # If it's prod, work out what day we should patch
  # to keep clusters patched on different days
  if ($::uob_service_class == 'prod') {
    $prodday = $::uob_service_tier ? {
      '1'     => ['1'],
      '2'     => ['2'],
      '3'     => ['3'],
      default => ['4'],
    }
  }

  # Work out what day(s) to patch for non-prod nodes
  $weekday = $::uob_service_class ? {
    'prod'  => $prodday,
    'dev'   => ['1-5'],
    'test'  => ['1-5'],
    'beta'  => ['1-5'],
    'demo'  => ['1-5'],
    default => ['1-5'],
  }

  # jgazeley/yumupdate
  class { '::yumupdate':
    weekday => $weekday,
  }
}
In our environment, production groups 1-3 patch on Mondays, Tuesdays or Wednesdays respectively. Anything that is designated production but somehow didn’t get a group number patches on Thursdays. Dev/test/etc servers patch every weekday.
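For illustration only (and not necessarily how the yumupdate module implements it internally), the net effect for a prod group-1 box and a dev box is roughly equivalent to cron entries like these, with an invented time of day and cron’s day-of-week numbering (1 = Monday):

# /etc/cron.d/yum-autoupdate (illustrative only)
# prod group 1: Mondays
0 4 * * 1    root  /usr/bin/yum -y update
# dev/test/etc: every weekday
0 4 * * 1-5  root  /usr/bin/yum -y update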
We don’t know what day of the week patches will be released upstream, so this is still no guarantee that dev servers will install a particular patch before production does. However, it does mean that not all production servers will break on the same day if there is a bad update, which gives us enough time to halt further updates if one server breaks.
It’s certainly not a perfect solution, but it avoids most of the risk surrounding automatic patching and is also very quick and simple to implement.
I wonder if it would be possible for a script to check the versions of installed packages on the dev/test machine, note when a version changes, and then, 5 days later, forward the corresponding package name to the prod server, asking it to update that package?
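Roughly speaking, something like this sketch, perhaps (the paths are invented, and the “forward to prod 5 days later” part is left out):

#!/bin/bash
# Rough sketch of the idea: snapshot installed package versions on a
# dev/test box and note which packages changed since the last snapshot.
SNAPDIR=/var/lib/pkg-snapshots
mkdir -p "$SNAPDIR"

today=$(date +%F)
rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n' | sort > "$SNAPDIR/$today"

prev=$(ls "$SNAPDIR" | sort | tail -n 2 | head -n 1)
if [ -n "$prev" ] && [ "$prev" != "$today" ]; then
    # Packages whose version changed (or which are new) since the
    # previous snapshot; these are the names you'd forward to prod.
    comm -13 "$SNAPDIR/$prev" "$SNAPDIR/$today" | awk '{print $1}' | sort -u
fi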
The problem with that approach is that we don’t have just one dev/test machine, and we don’t have just one production machine. It becomes a “many to many” communication problem, and that’s not trivial to solve.
Packages which behave “ok” on one class of test machine (eg a webserver) might not be safe for deployment on a different type of production server (eg a database server) so your “many to many” problem gets more complicated as you have to replicate all the relationships between dev and production.
Also with a per-package, static 5 day window between deployment to test and deployment to prod, you never really know what state your environment is in.
Sure, you could do all that: you could use some form of message bus for the communication, and you could have production only update the packages that dev has told it are “safe”…
but that all sounds like a lot of engineering work to mitigate the last 20% of the problem, when a simplistic approach like the one described above gets you 80% of the way there.