Getting to grips with GitLab CI

Background

Continuous Integration (CI) refers to the concept of automatically testing, building and deploying code as often as possible. This concept has been around in the world of software development for some time now, but it’s new to sysadmins like me.

While the deliverables produced by developers might be more tangible (a mobile app, a website, etc), with the rise of infrastructure as code, sysadmins and network admins are increasingly describing the state of their systems as code in a configuration management system. This is great, as it enables massive automation and scaling. It also opens the door for a more development-like workflow, including some of the tools and knowledge used by developers.

This article describes our progress using a CI workflow to save time, improve quality and reduce risk with our day-to-day infrastructure operations.

Testing, testing…

The Wireless team have used the Puppet configuration management system for several years, for managing server infrastructure, deploying applications and suchlike. We keep our code in GitLab and do our best to follow best practice when branching and merging. However, one thing we don’t do is automated testing. When a branch is ready for merging we test manually by moving a test server into that Puppet environment and seeing if it works properly.

GitLab CI

The IT Services GitLab server at git.services.bristol.ac.uk now provides the GitLab CI service, which at its simplest is a thing that executes a script against your repository to check some properties of it. I thought I would start off simple and write some CI tests to be executed against our Puppet repo to do syntax and style checking. There are already tools that do the actual checking (puppet parser validate for syntax, puppet-lint for style), so all I need to do is write CI tests that execute them.

There’s a snag, though. What is going to execute these tests, and where? How are we going to ensure the execution environment is suitable?

GitLab CI runs on the GitLab server itself, but it executes CI tests in CI runners. Runners can be hosted on the GitLab server, on a different server or in the cloud. To start off simple, I created a new VM to host a single CI runner. So far so good, but the simplest possible runner configuration just executes the CI tests in a shell on the system it is running on. Security concerns aside, this is a bad idea because the only environment available is the one the runner is hosted on – and what if a CI test changes the state of that environment? Will the next test execute in the same way?

Docker

This is where Docker steps in. Docker is a container platform which has the ability to create and destroy lightweight, yet self-contained containers on demand. To the uninitiated, you could kind-of, sort-of think of Docker containers as VMs. GitLab CI can make use of Docker containers to execute CI tests. Each CI test is executed in a factory-fresh Docker container which is destroyed after the test has completed, so you can be sure of consistent testing, and it doesn’t matter if you accidentally break the container. The user can specify which Docker image to use for each test.
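Registering such a runner is a one-off step on the runner host. With the Docker executor it looks roughly like this – note that the URL, registration token, description and default image below are placeholders rather than our real values, and the exact flags depend on which version of gitlab-ci-multi-runner you have:

# Register a runner that uses the Docker executor (values are placeholders)
gitlab-ci-multi-runner register \
  --non-interactive \
  --url "https://git.example.ac.uk/ci" \
  --registration-token "PROJECT_OR_SHARED_RUNNER_TOKEN" \
  --description "docker-runner-01" \
  --executor docker \
  --docker-image "centos:7"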

A real example

So far, this is all talk. Let me show you the components of the simple CI tests I’ve written for our Puppet control repo.

The CI config itself is stored in the root of your git repo, in a file called .gitlab-ci.yml. The presence of this file magically enables CI pipelines in your project. The file tells GitLab CI which Docker image to use, which tests to execute and in what order. Let’s have a look at the config file we’re using for our Puppet repo:

# Docker image to use for these tests
image: git.services.bristol.ac.uk:4567/resnet/netops-ci:master

# Different stages in which to run tests. Only proceed to the
# next stage if the current one passes
stages:
  # check: syntax checking
  - check
  # style: linting
  - style

# Check Puppet syntax
puppet-parser:
  stage: check
  script:
    - tests/check-puppet-parser.sh
  only:
    - branches

# Check ERB template syntax
check-erb:
  stage: check
  script:
    - tests/check-erb.sh
  only:
    - branches

# Check YAML (Hiera) syntax
check-yaml:
  stage: check
  script:
    - tests/check-yaml.sh
  only:
    - branches

# Check Puppet linting style
puppet-lint:
  stage: style
  script:
    - tests/style-puppet-lint.sh
  only:
    - branches

All of the tests are executed in the same way: by calling shell scripts that are in the tests subdirectory of the repo. They have been sorted into two stages – after all, there’s no point in proceeding to run style checks if the syntax isn’t valid. Each one of these tests runs in its own Docker container without fear of contamination.

To give an idea of how simple these CI test scripts are, here’s the one we use to check Puppet syntax – it’s just a one-liner that finds all Puppet manifests in the repo and executes puppet parser validate against each one:

#!/bin/bash
set -euo pipefail

find . -type f -name '*.pp' -print0 | xargs -0 /opt/puppetlabs/bin/puppet parser validate
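The ERB and YAML checks follow exactly the same pattern. For instance, a minimal check-yaml.sh might look something like this – a sketch rather than our exact script, and it assumes a ruby binary is available in the Docker image:

#!/bin/bash
set -euo pipefail

# Parse every YAML file in the repo; the job fails on the first file
# that doesn't parse cleanly.
find . -type f \( -name '*.yaml' -o -name '*.yml' \) -print0 \
  | xargs -0 -n1 ruby -e 'require "yaml"; YAML.load_file(ARGV[0])'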

How CI fits with our workflow

In the configuration we are using, the test suite is executed against the codebase for every commit on every branch. It can also be configured to run only when tags are created, or only on the master branch, etc. For us, this decision reflects the fact that we are using an interpreted language, that there is no “build” stage, and that every branch in the repo becomes a live Puppet environment.

The tests are always run in the background and if they succeed, you get a little green tick at various places throughout the GitLab interface to show you that your commit, branch or merge request is passing (has passed the most recent test).

[Screenshot: project summary showing CI status OK]

If, however, you push a bad commit that fails testing then you get an email, and all the green ticks turn to red crosses. You can drill down into the failed pipeline, see which specific tests failed, and what errors they returned.

[Screenshot: failed tests]

If you carry on regardless and create a merge request for a branch that is failing tests, it won’t let you accept that merge request without a dire warning.

[Screenshot: merge request which failed CI tests]

Combining the CI pipeline with setting your master or production branch to be a protected branch means it should be impossible to merge code that has syntax errors. Pretty cool, and a great way of decreasing risk when merging code to production.

I want to play!

Hopefully this article has shown how easy it is to get started running basic CI tests on GitLab CI with Docker. To make things even easier, I have created a repository of sample GitLab CI configs and tests. Have a wander over to the gitlab-ci repo and look at the examples I’ve shared. At the time of writing, there are configs and tests suitable for doing syntax checks on Puppet configs, Perl/Python/Ruby/Shell scripts and Dockerfiles.

The repo is open to all IT Services staff to read and contribute to, so please do share back any useful configs and tests you come up with.

N.B. At the time of writing, the GitLab CI service is provided by a small VM as a proof of concept, so tests may be slow if too many people jump on this cool bandwagon. We are in the process of acquiring some better hardware to host CI runners.

As ever, we recommend all GitLab users join the #gitlab-users channel on Slack for informal support and service notifications.

Looking ahead

These CI tests are a simple example of using Docker containers to execute trivial tests and return nothing but an error code. In the future we will be looking to create more complex CI pipelines, including:

  • Functional tests, which actually attempt to execute the code and make sure it works as designed rather than just checking the syntax
  • Tests that return artefacts, such as a pipeline that returns RPMs after running rpmbuild to build them
  • Tests that deploy the end product to a live environment after testing it, rather than just telling a human operator that it’s safe to deploy

Rocks and /install directory on compute nodes

I noticed recently that some of our compute nodes were getting a bit short on disk space, as we have a fairly large set of (multiple versions of) large applications installed into /opt on each node rather than sharing those applications over NFS via /share/apps (which is apparently the usual Rocks way).

In looking at this I noticed that the /install directory on each compute node contained a complete copy of all packages installed during rebuild! In our case the /install directory was using about 35GB, mostly because of multiple versions of Matlab and Mathematica being installed (which are up to 10GB each now…).

Anyhow, to avoid this you can convert from using the <packages> XML tags in your extend-compute.xml file to using a post-install script (the <post> section) which calls yum install explicitly for the packages you want to install.
Be sure to also run yum clean packages regularly in the script, otherwise you’re just moving the packages from /install into the yum package cache in /var/cache/yum!

e.g. Convert this:

<package>opt-matlab-R2016a</package>

..into this:

<post>
yum install -y opt-matlab-R2016a
yum clean packages
</post>
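To get a feel for how much space is at stake on a node, before and after the change, a quick check is:

# How much are the local package copies and the yum cache using?
du -sh /install /var/cache/yum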

This has allowed us to continue installing packages locally until we use up that extra 35GB 🙂

Users with UIDs < 1000: workarounds and how RHEL 7.2 breaks some of these hacks

When RHEL/CentOS 7.2 was released there was a change in the PAM configs which authconfig generates.
For most people this won’t have made any difference, but if you occasionally use entries in /etc/passwd to override user information from other sources (e.g. NIS, LDAP) then this can bite you.

The RHEL bug here shows the difference and discussion around it, which can be summarised in the following small change.

In CentOS 7.1 you see this line in the PAM configs:

auth sufficient pam_unix.so nullok try_first_pass

…whilst in 7.2 it changes to this:

auth [success=done ignore=ignore default=die] pam_unix.so nullok try_first_pass

The difference here is that the pam_unix entry effectively changes from (in PAM terms) “sufficient” to something closer to “requisite”: any failure from pam_unix now denies authentication immediately, instead of falling through to the later modules in the stack.
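If you want to check which variant a given host has ended up with, the authconfig-generated files are the place to look (paths as on a stock RHEL/CentOS 7 install):

# Show the control flag currently applied to pam_unix for auth
grep 'pam_unix.so' /etc/pam.d/system-auth-ac /etc/pam.d/password-auth-ac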

“Well, my users are in LDAP so this won’t affect me!”

Probably, but if you happen to add an entry to /etc/passwd to override something for a user, such as their shell, home directory or (in this case) their UID (yes, hacky I know, but old NIS habits die hard…):

testuser:*:12345:12345:Test User:/home/testuser:/bin/bash

..then this means that your user is defined as being a target for the pam_unix module (since it’s a local user defined in the local passwd file), and from 7.2 you hit that modified pam_unix line and get auth failures. In 7.1 you’d get entries in the logs saying that pam_unix denied access but it would continue on through the subsequent possibilities (pam_ldap, pam_sss or whatever else you have in there) and check the password against those.

The bug referenced above suggests a workaround of using authconfig --enablenis, as this happens to set the pam_unix line back to the old version, but that has a bunch of other unwanted effects (like enabling NIS in /etc/nsswitch.conf).

Obviously the real fix for our particular case is to not change the UID (which was a terrible hack anyway) but to reduce the UID_MIN used in /etc/login.defs to below the minimum UID required, and hope that there aren’t any clashes between users in LDAP and users which have already been created by packages being added (possibly ages ago…).

Hopefully this saves someone else some trouble with this surprise change in behaviour when upgrades are applied to existing machines and authconfig runs!

Some additional notes:

  • This won’t change until you next run authconfig, which in this case was loooong after the 7.2 update…
  • Not recommended: putting pam_sss or pam_ldap before pam_unix as a workaround for the pam_unix failures in 7.2. Terrible problems happen with trying to use local users (including root!) if the network goes away.
  • Adding the UID_MIN change as early as possible is a good idea, so in your bootstrap process would be sensible, to avoid package-created users getting added with UIDs near to 1000.
  • These Puppet modules were used in creation of this blog post:
    • joshbeard/login_defs 0.2.0
    • sgnl05/sssd 0.2.1

Merging SELinux policies

We make extensive use of SELinux on all our systems. We manage SELinux config and policy with the jfryman/selinux Puppet module, which means we store SELinux policies in plain text .te format – the same format that audit2allow generates them in.

One of our SELinux policies, which covers permissions for NRPE, is a large file. When we generate new rules (e.g. for new Nagios plugins) with audit2allow it’s a tedious process to merge the new rules in by hand, and mistakes are easy to make.

So I wrote semerge – a tool to merge SELinux policy files with the ability to mix and match stdin/stdout and reading/writing files.

This example accepts input from audit2allow and merges the new rules into an existing policy:

cat /var/log/audit/audit.log | audit2allow | semerge -i existingpolicy.pp -o existingpolicy.pp

And this example deduplicates and alphabetises an existing policy:

semerge -i existingpolicy.pp -o existingpolicy.pp

There are probably bugs so please do let me know if you find it useful and log an issue if you run into problems.

DNS Internals: delegating a subdomain to a server listening on a non-standard port

I’m writing this up because it took me quite some time to get my head around how to do this, and I found answers around the internet varying from “not possible” through to “try this” (which didn’t work) and “switch off this security feature you really like having” (no).

I found a way to make it happen, but it’s not easy. I’ll walk you through the problem, and how each way I attempted to solve it failed.

All the names below are hypotheticals, and for the sake of argument we’re trying to make “foo.subdomain.local” resolve via the additional server.

Problem:
Suppose you have two DNS servers: one which we’ll call “NS1” and one which we’ll call “NS-NEW”.

  • NS1 is a recursive server running bind, which all your clients point at to get their DNS information. It’s listening on port 53 as standard.
  • NS-NEW is an authoritative server which is listening on a non-standard port (8600); for these purposes it’s a black box and we can’t change its behaviour.

You want your clients to be able to resolve the names that NS-NEW is authoritative for, but you don’t want to have to reconfigure the clients. So NS1 needs to know to pass those queries on to NS-NEW to get an answer.

Attempt 1 – “slave zone”
My first thought was to configure NS1 to slave the zone from NS-NEW.

zone "subdomain.local" {
        type slave;
        file "/var/named/slave/priv.zone";
        masters { $IP_OF_NS-NEW port 8600; };
};

This didn’t work for me because NS-NEW isn’t capable of doing zone transfers. Pity, as that would have been really neat and easy to manage!

Attempt 2 – “forward zone”
Then I tried forwarding queries from NS1 to NS-NEW, using BIND’s “forward zone” feature.

zone "subdomain.local" {
        type forward;
        forward only;
        forwarders { $IP_OF_NS-NEW port 8600; };
};

This didn’t work because NS1 is configured to validate DNSSEC signatures. The signed root zone can cryptographically prove that there is no “local.” delegation at all, so bind treats answers for names under it as bogus rather than as legitimately unsigned.

The software running on NS-NEW isn’t capable of signing its zone information.

It doesn’t appear to be possible to selectively turn off DNSSEC checking on a per-zone basis, and I didn’t want to turn that off for our whole infrastructure as DNSSEC is generally a Good Thing.

Attempt 3 – “delegation”
I did think I could probably work around it by making NS1 authoritative for the “local.” top level domain, then using NS records in the zonefile for “local.” to directly delegate the zone to NS-NEW.

Something like this:

$TTL 86400	; default TTL for this zone
$ORIGIN local.
@       IN  SOA  NS1.my.domain. hostmaster.my.domain. (
                     2016031766 ; serial number
                     28800      ; refresh
                     7200       ; retry
                     604800     ; expire
                     3600       ; minimum
                     )
        IN  NS  NS1.my.domain.

; delegated zones
subdomain  IN  NS NS-NEW.my.domain.

Unfortunately that doesn’t work either, as it’s not possible to specify a port number in an NS record, and NS-NEW isn’t listening on a standard port.

Attempt 4 – “a little of attempt 2 and a little of attempt 3”
Hold on to your hats, this gets a little self referential.

I made NS1 authoritative for “local.”

zone "local" {
        type master;
        file "/var/named/data/zone.local";
};

I configured NS records in the “local.” zone file, which point back at NS1

$TTL 86400	; default TTL for this zone
$ORIGIN local.
@       IN  SOA  NS1.my.domain. hostmaster.my.domain. (
                     2016031766 ; serial number
                     28800      ; refresh
                     7200       ; retry
                     604800     ; expire
                     3600       ; minimum
                     )
        IN  NS  NS1.my.domain.

; delegated zones
subdomain  IN  NS NS1.my.domain.

I then configured a “subdomain.local.” forward zone on NS1 which forwards queries on to NS-NEW

zone "subdomain.local" {
        type forward;
        forward only;
        forwarders { $IP_OF_NS-NEW port 8600; };
};

To understand why this works, you need to understand how the recursion process for a query like “foo.subdomain.local.” happens.

When the query comes in, NS1 does this:

  • Do I already know the answer from a previously cached query? Let’s assume no for now.
  • Do I know which DNS server is responsible for “subdomain.local.” from a previously cached query? Let’s assume no for now.
  • Do I know which DNS server is responsible for “local.”? Ooh! Yes! That’s me!
  • Now I can look in the zone file for “local.” to see how I resolve “subdomain.local.” – there’s an NS record which says I should ask NS1 in an authoritative way.
  • Now I ask NS1 for an answer to “foo.subdomain.local.”.
  • NS1 can then forward my query off to NS-NEW and fetch an answer.

Because we haven’t had to go all the way up to the root to get our answer, we avoid encountering the DNSSEC issue for this zone.
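As a quick sanity check from the client side (names hypothetical, as above), querying NS1 directly should now return whatever NS-NEW holds for the name:

# Resolve a delegated name via the recursive server
dig @NS1.my.domain foo.subdomain.local +short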

Did you really do it like *that*?
Yes and no.

The above is a simplified version of what I actually had to do, as our production equivalent of NS1 isn’t a single server – and I had to take account of our zone file management process, and all of that adds complexity which I don’t need to go into.

There are also a few extra hoops to jump through to make sure that the “local.” domain can only be accessed by clients on our network, and to make sure that our authoritative infrastructure doesn’t “leak” the “local.” zone to the outside world.

What would you have liked to have done?
If NS-NEW was able to listen on a standard port, I’d have used a straight delegation to do it.

If NS-NEW was able to sign its zone data with DNSSEC, I’d have used a simple forward zone to do it.

NS-NEW isn’t *quite* the black box I treated it as in this article, but the restriction about not being able to make it listen on port 53 is a real one.

The software running on NS-NEW does have a feature request in its issue tracker for DNSSEC, which I’ll watch with interest – as that would allow me to tidy up our config and might actually enable some other cool stuff further down the line…

Xen VM networking and tcpdump — checksum errors?

Whilst searching for reasons that a CentOS 6 samba gateway VM we run in ZD (as a “Fog VM” on a Xen hypervisor) was giving poor performance and seemingly dropping connections during long transfers, I found this sort of output from tcpdump -v host <IP_of_samba_client> on the samba server:

10:44:07.228431 IP (tos 0x0, ttl 64, id 64091, offset 0, flags [DF], proto TCP (6), length 91)
    server.example.org..microsoft-ds > client.example.org.44729: Flags [P.], cksum 0x3e38 (incorrect -> 0x67bc), seq 6920:6959, ack 4130, win 281, options [nop,nop,TS val 4043312256 ecr 1093389661], length 39SMB PACKET: SMBtconX (REPLY)

10:44:07.268589 IP (tos 0x0, ttl 60, id 18217, offset 0, flags [DF], proto TCP (6), length 52)
    client.example.org.44729 > server.example.org..microsoft-ds: Flags [.], cksum 0x1303 (correct), ack 6959, win 219, options [nop,nop,TS val 1093389712 ecr 4043312256], length 0

Note that the cksum field is showing up as incorrect on all sent packets. Apparently this is normal when hardware checksumming is used, and we don’t see incorrect checksums at the other end of the connection (by the time this shows up on the client).

However, testing has shown that the reliability and performance of connections to the samba server on this VM is much greater when hardware checksumming is disabled with:

ethtool -K eth0 tx off
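To see what the offload settings are before and after changing them (the interface name will vary), ethtool can report them:

# Lower-case -k reports the current offload settings
ethtool -k eth0 | grep -i checksum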

Perhaps the version of Xen on our hypervisor, or some combination of the Xen version and the network drivers on the hypervisor and/or guest VM, is causing these networking problems?

My networking-fu is weak, so I couldn’t say more than what I have observed, even though this shouldn’t really make a difference…
(Please comment if you have info on why/what may be going on here!)

Ubuntu bug 31273 shows others with similar problems which were solved by disabling hardware checksum offloading.

Molly-guard for CentOS 7?

Since I was looking at this already and had a few things to investigate and fix in our systemd-using hosts, I checked how plausible it is to insert a molly-guard-like password prompt as part of the reboot/shutdown process on CentOS 7 (i.e. using systemd).

Problems encountered include:

  • Asking for a password from a service/unit in systemd – systemd-ask-password looks like the right tool, but it needs a password agent set up to answer the prompt correctly (see the sketch after this list).
  • The reboot command always walls a message to all logged-in users before it even runs the new reboot-molly unit, as it expects a reboot to happen. The --no-wall argument stops this, but that requires a change to how reboot is invoked – which brings us back to the original problem of replacing packaged files/symlinks owned by an RPM.
  • The reboot.target unit is a “systemd.special” unit, which means that it has some special behaviour and cannot be renamed. We can modify it, of course, by editing the reboot.target file.
  • How do we get a systemd unit to run first and block anything later from running until it is complete? (And in fact to abort the reboot, but only for this occasion rather than leaving the target permanently failed – a failed reboot is a bit of a strange state for it to be in…) The dependencies appear to work, but the reboot target is quite keen on running the other items from its dependency list – I’m more than likely doing something wrong here!
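For reference, the password-prompt part on its own is simple enough. A minimal sketch – assuming a password agent such as systemd-tty-ask-password-agent is around to service the request, and mirroring molly-guard’s “type the hostname to confirm” behaviour – might look like this; the hard part is wiring it into the reboot path cleanly:

#!/bin/bash
# Sketch only: prompt for the hostname and refuse to continue on a mismatch.
REPLY=$(systemd-ask-password "Type the hostname ($(hostname -s)) to confirm reboot:")
if [ "$REPLY" != "$(hostname -s)" ]; then
    echo "Hostname mismatch - not rebooting." >&2
    exit 1
fi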

So for now this is shelved. It would be nice to have a solution though, so any hints from systemd experts are gratefully received!

(Note that CentOS 7 uses systemd 208, so new features in later versions which help won’t be available to us)

Capacity Planning for DNS

I’ve spent the last 6 months working on our DNS infrastructure, wrangling it into a more modern shape.

This is the first in a series of articles talking about some of the process we’ve been through and outlining some of the improvements we’ve made.

One of the exercises we try to go through when designing any new production infrastructure is capacity planning. There are four questions you need to be able to ask when you’re doing this:

  1. How much traffic do we need to handle today?
  2. How are we expecting traffic to grow?
  3. How much traffic can the infrastructure handle?
  4. How much headroom have we got?

We aim to be in a position where we can ask those four questions on a regular basis, and preferably get useful answers to them!

When it comes to DNS, the most useful metric would appear to be “queries/second” (which I’ll refer to as qps from here on in, to save a load of typing!) and bind can give us that information fairly readily with its built-in statistics-gathering features.
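(For the curious: the quickest way to get at those counters on a single bind server is to have named dump its statistics file – something like the following, with the file path being whatever statistics-file is set to in named.conf; the path below is the stock RHEL one.)

# Append the current counters to the statistics file, then inspect them
rndc stats
grep -A 5 'Incoming Queries' /var/named/data/named_stats.txt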

With that in mind, let’s look at those four questions.

1. How much traffic do we need to handle today?
The best way to get hold of that information is to collect the qps metrics from our DNS infrastructure and graph them.

This is quite a popular thing to do, and most monitoring tools (e.g. Nagios, Munin or Ganglia) have well-worn solutions available – and for everything else, there’s Google.

Unfortunately we weren’t able to collate these stats from the core of the legacy DNS infrastructure in a meaningful way (due to differences in bind versions, the lack of a sensible aggregation point, etc.).

Instead, we had to infer it from other sources that we can/do monitor, for example the caching resolvers we use for eduroam.

Our eduroam wireless network is used by over 30,000 client devices a week. We think this is around 60% of the total devices on the network, so it’s a fairly good proxy for the whole university network.

We looked at what the eduroam resolvers were handling at peak time (revision season), doubled it and added a bit. Not a particularly scientific approach, but it’s likely to be over-generous which is no bad thing in this case!

That gave us a ballpark figure of “we need to handle around 4000qps”.

2. How are we expecting traffic to grow?
We don’t really have long term trend information for the central DNS service due to the historical lack of monitoring.

Again inferring generalities from eduroam, the number of clients on the network goes up by 20-30% year on year (and has done since 2011). Taking 30% year-on-year growth as our growth rate and extending that over 5 years, it looks like this:

[Graph: projected dns growth]

Or: in 5 years’ time we think we’ll need around 15,000qps (4000qps growing at 30% a year for 5 years is roughly 4000 × 1.3⁵ ≈ 14,900qps).

With all the estimates in this process being on the generous side, and given the compound nature of the year-on-year growth calculation, that should be a significant overestimate.

It will certainly be an interesting figure to revisit in 5 years time!

3. How much traffic can the infrastructure handle?
To answer this one, we need some benchmarking tools. After a bit of research I settled on dnsperf. The mechanics of how to run dnsperf (and how to gather a realistic sample dataset) are best left for another time.

All tests were done against the pre-production infrastructure so as not to interfere with live traffic.

Let’s look at the graphs we get out at the end.

The new infrastructure:
[Graph: 20150624-1225.rate]

Interpreting this graph isn’t immediately obvious. The way dnsperf works is that it linearly scales the number of queries/second that it’s sending to your DNS server, and tracks how many responses it gets back per second.

So the red line is how many queries/second we’re testing against, and the green line is how the server is responding. Where the two lines diverge shows you where your infrastructure starts to struggle.

In this case, the new infrastructure appears to cope quite well with around 30,000qps – or about twice what we’re expecting to need in 5 years time. That’s with all (or rather, both!) the servers in the pool available, so do we still have n+1 redundancy?

A single node in the new infrastructure:
[Graph: 20150622-1438.rate]

From this graph you can see we’re good up to around 14,000qps, so we’re n+1 redundant for at least the next 3-4 years (the lifetime of the hardware we’re using).

At present we have 2 nodes in the pool, and the implication of the two graphs is that capacity does indeed scale approximately linearly with the number of servers in the pool.

4. How much headroom have we got?
At this point, the answer to that looks like “plenty” and with the new infrastructure we should be able to scale out almost linearly by adding more servers to the pool.

Now that we know how much traffic we can expect our infrastructure to handle, and how much it’s actually experiencing, we can make informed decisions about when we need to add more resources in order to maintain at least n+1 redundancy.

What about the legacy infrastructure?
Well, the reason I’m writing this post today (rather than any other day) is that we retired the oldest of the servers in the legacy infrastructure today, and I wanted to fire dnsperf at it, after it’s stopped handling live traffic but before we switch it off completely!

So how many queries/second can a 2005 vintage Sun Fire V240 server cope with?

[Graph: 20150727-0938.rate]

It seems the answer to that is “not really enough for 2015!”

No wonder its response times were atrocious…

F5 Big-IP, Apache logs and client IP

Because the F5 Big-IP makes use of SNAT to load balance traffic, your back-end node will see the traffic coming from the IP of the load balancer and not that of the true client.
To overcome this (for web traffic at least), the F5 injects an X-Forwarded-For header into the HTTP stream containing the true client’s IP.

In Apache you may want to log this IP instead of the remote host when the header has been set. Using SetEnvIf, we can choose a suitable LogFormat depending on whether the X-Forwarded-For header is present or not:

CustomLog "/path/o/log/dir/example.com_access.log" combined env=!forwarded
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy
SetEnvIf X-Forwarded-For "^.*\..*\..*\..*" forwarded
CustomLog "/path/to/log/dir/example.com_access.log" proxy env=forwarded

The above assumes that the “combined” LogFormat has already been defined.
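A quick way to check the switching logic is to send a test request with the header set by hand and confirm the logged line starts with that address rather than the connecting IP (host and address below are placeholders):

curl -H 'X-Forwarded-For: 192.0.2.10' http://example.com/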

If you use the ::apache::vhost puppet class from the puppetlabs/apache module, you can achieve the same result with the following parameters:

::apache::vhost { 'example.com':
  logroot => "/path/to/log/dir/",
  access_log_env_var => "!forwarded",
  custom_fragment => "LogFormat \"%{X-Forwarded-For}i %l %u %t \\\"%r\\\" %s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\"    proxy
  SetEnvIf X-Forwarded-For \"^.*\..*\..*\..*\" forwarded
  CustomLog \"/path/to/log/dir/${title}_access.log\" proxy env=forwarded"
}

Puppet future parser – what you’ll need to update in your manifests…

The Puppet Future Parser is the new implementation of the manifest parser which will become the default in 4.0, so I thought I’d take a look to see what I’d need to update.
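If you want to experiment, the future parser can be enabled per-run or globally – flag and setting names as per Puppet 3.x, since from 4.0 onwards it simply becomes the parser:

# One-off test of a manifest with the future parser
puppet apply --parser future manifest.pp

# Or enable it permanently in puppet.conf:
#   [main]
#   parser = future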

There are also some fancy new features, like iteration, and the ability to use [1,2] array notation or {a=>b} hash notation anywhere that you’d previously have needed a variable containing an array or hash.

The iteration and lambda features are intended to replace create_resources calls, as they are more flexible and can loop round repeatedly to create individual definitions.

For example, here’s a dumb “sudo” profile which uses the each construct to iterate over an array:

class profiles::sudo {
  # This is a particularly dumb version of use of sudo, to allow any commands:
  $admin_users = hiera_array('admin_users')
  # Additional users with special sudo rights, but no ssh access (e.g. root):
  $sudo_users  = hiera_array('sudo_users')

  class { '::sudo': }

  $all_sudo_users = concat($sudo_users, $admin_users)

  # Create a resource for each entry in the array:
  each($all_sudo_users) |$u| {
    sudo::entry { $u:
      comment  => "Allow ${u} to run anything as any user",
      username => $u,
      host     => 'ALL',
      as_user  => 'ALL',
      as_group => 'ALL',
      nopasswd => false,
      cmd      => 'ALL',
    }
  }
}

Making this work with create_resources, by trying to splice the username for each user in the list into a hash, looked like it would be messy, requiring at least an additional layer of define – this method is much neater.

This makes it much easier to create data abstractions over existing modules – you can programmatically massage the data you read from your hiera files and call definitions using that data in a much more flexible way than when passing hashes to create_resources. This “glue” can be separated into your roles and profiles (which could be the subject of another post, but are described well in this blog post), creating a layer which neatly separates the use of the module from the data which drives that use.

So this all sounds pretty great, but there are a few changes you’ll possibly encounter when switching to the future parser:

  • Similar to the switch from puppet master to puppet server, the future parser is somewhat more strict about data formats. e.g. I found that my hiera data definitely needed to be properly quoted when I started using puppet server, so entries like mode : 644 in a file hash wouldn’t give the number you were expecting… (needs mode : 0644 or mode : '644' to avoid conversion from octal to decimal…). The future parser extends this to being more strict in your manifests, so a similarly-incorrect file { ... mode => 644 } declaration needs quoting or a leading zero. If you use puppet-lint you’ll catch this anyway — so use it! 🙂
  • It’s necessary to use {} instead of undef when setting default values for hiera_hash (and likewise [] instead of undef for hiera_array), to allow conditional expressions of the form if $var { ... } to work as intended. It seems that in terms of falseness for arrays and hashes that undef is in fact true… (could be a bug, as this page in the docs says: “When used as a boolean, undef is false”)
  • Dynamically-scoped variables (which are pretty mad and difficult to follow anyway, which is why most languages avoid them like the plague…) don’t pass between a class and any sub-classes which it creates. This is in the docs here, but it’s such a common pattern that it could well have made it through from your old (pre-Puppet 2.7) manifests and still have been working OK until the switch to the future parser. e.g.:
    class foo {
      $var = "x"
    }
    
    class bar {
      include foo
      # $var isn't defined here, as dynamic scope rules don't allow it in Puppet >2.7
    }
    

    Instead you need to explicitly qualify your variables to pull them out of the correct scope — $foo::var in this case. In your erb templates, as a common place where the dynamically-scoped variables might have ended up getting used, you can now use scope['::foo::var'] as a shorthand for the previously-longer scope.lookupvar('::foo::var') to explicitly qualify the lookup of variables. The actual scope rules for Puppet < 2.7 are somewhat more complicated and often led to confusing situations if you unintentionally used dynamic scoping, especially when combined with overriding variables from the parent scope…

  • I’m not sure that expressions of the form if "foo" in $arrayvar { ... } work how they should, but I’ve not had a chance to investigate this properly yet.

Most of these are technically the parser more strictly adhering to the specifications, but it’s easy to have accidentally had them creep into your manifests if you’re not being good and using puppet-lint and other tools to check them.

In conclusion : Start using the Future Parser soon! It adds excellent features for iteration which make abstracting data a whole lot easier than using the non-future (past?) parser allows. Suddenly the combination of roles, profiles and the iteration facilities in the future parser mean that abstraction using Puppet and hiera makes an awful lot more sense!