Getting to grips with GitLab CI

Background

Continuous Integration (CI) refers to the concept of automatically testing, building and deploying code as often as possible. This concept has been around in the world of software development for some time now, but it’s new to sysadmins like me.

While the deliverables produced by developers might be more tangible (a mobile app, a website, etc), with the rise of infrastructure as code, sysadmins and network admins are increasingly describing the state of their systems as code in a configuration management system. This is great, as it enables massive automation and scaling. It also opens the door for a more development-like workflow, including some of the tools and knowledge used by developers.

This article describes our progress using a CI workflow to save time, improve quality and reduce risk with our day-to-day infrastructure operations.

Testing, testing…

The Wireless team have used the Puppet configuration management system for several years, for managing server infrastructure, deploying applications and the suchlike. We keep our code in GitLab and do our best to follow best practice when branching/merging. However, one thing we don’t do is automatic testing. When a branch is ready for merging we test manually by moving a test server into that Puppet environment, and seeing if it works properly.

GitLab CI

The IT Services GitLab server at git.services.bristol.ac.uk now provides the GitLab CI service, which at its simplest is a thing that executes a script against your repository to check some properties of it. I thought I would start off simple and write some CI tests to be executed against our Puppet repo to do syntax checking. There are already tools that can do the syntax checking (such as puppet-lint), so all I need to do is write a CI test that executes them.

There’s a snag, though. What is going to execute these tests, and where? How are we going to ensure the execution environment is suitable?

GitLab CI runs on the GitLab server itself, but it executes CI tests in CI runners. Runners can be hosted on the GitLab server, on a different server or in the cloud. To start off simple, I created a new VM to host a single CI runner. So far so good, but the simplest possible runner configuration simply executes the CI tests in a shell on the system it is running on. Security concerns aside, this is also a bad idea because the only environment available is the one the runner is hosted on, and what if a CI test changes the state of the environment? Will the second test execute in the same way?

Docker

This is where Docker steps in. Docker is a container platform which has the ability to create and destroy lightweight, yet self-contained containers on demand. To the uninitiated, you could kind-of, sort-of think of Docker containers as VMs. GitLab CI can make use of Docker containers to execute CI tests. Each CI test is executed in a factory-fresh Docker container which is destroyed after the test has completed, so you can be sure of consistent testing, and it doesn’t matter if you accidentally break the container. The user can specify which Docker image to use for each test.

A real example

So far, this is all talk. Let me show you the components of the simple CI tests I’ve written for our Puppet control repo.

The CI config itself is stored in the root of your git repo, in a file called.gitlab-ci.yml. The presence of this file magically enables CI pipelines in your project. The file tells GitLab CI how to find a runner, which Docker image to use and what tests to execute. Let’s have a look at the config file we’re using for our Puppet repo:

# Docker image to use for these tests
image: git.services.bristol.ac.uk:4567/resnet/netops-ci:master

# Different stages in which to run tests. Only proceed to the
# next stage if the current one passes
stages:
  # check: syntax checking
  - check
  # style: linting
  - style

# Check Puppet syntax
puppet-parser:
  stage: check
  script:
    - tests/check-puppet-parser.sh
  only:
    - branches

# Check ERB template syntax
check-erb:
  stage: check
  script:
    - tests/check-erb.sh
  only:
    - branches

# Check YAML (Hiera) syntax
check-yaml:
  stage: check
  script:
    - tests/check-yaml.sh
  only:
    - branches

# Check Puppet linting style
puppet-lint:
  stage: style
  script:
    - tests/style-puppet-lint.sh
  only:
    - branches

All of the tests are executed in the same way: by calling shell scripts that are in the tests subdirectory of the repo. They have been sorted into two stages – after all, there’s no point in proceeding to run style checks if the syntax isn’t valid. Each one of these tests runs in its own Docker container without fear of contamination.

To give an idea of how simple these CI test scripts are, here’s the one we use to check Puppet syntax – it’s just a one-liner that finds all Puppet manifests in the repo and executes puppet parser validate against each one:

#!/bin/bash
set -euo pipefail

find . -type f -name '*.pp' -print0 | xargs -0 /opt/puppetlabs/bin/puppet parser validate

How CI fits with our workflow

In the configuration we are using, the test suite is executed against the codebase for every commit on every branch. It can also be configured only to run when tags are created, or only on the master branch, etc. For us, this decision is a reflection that we are using an interpreted language, there is no “build” stage and that every branch in the repo becomes a live Puppet environment.

The tests are always run in the background and if they succeed, you get a little green tick at various places throughout the GitLab interface to show you that your commit, branch or merge request is passing (has passed the most recent test).

Project summary showing CI status OK

If, however, you push a bad commit that fails testing then you get an email, and all the green ticks turn to red crosses. You can drill down into the failed pipeline, see which specific tests failed, and what errors they returned.

Failed tests

If you carry on regardless and create a merge request for a branch that is failing tests, it won’t let you accept that merge request without a dire warning.

Merge request which failed CI tests

Combining the CI pipeline with setting your master or production branch to be a protected branch means it should be impossible to merge code that has syntax errors. Pretty cool, and a great way of decreasing risk when merging code to production.

I want to play!

Hopefully this article has shown how easy it is to get started running basic CI tests on GitLab CI with Docker. To make things even easier, I have created a repository of sample GitLab CI configs and tests. Have a wander over to the gitlab-ci repo and look at the examples I’ve shared. At the time of writing, there are are configs and tests suitable for doing syntax checks on Puppet configs, Perl/Python/Ruby/Shell scripts and Dockerfiles.

The repo is open to all IT Services staff to read and contribute to, so please do share back any useful configs and tests you come up with.

N.B At the time of writing, the GitLab CI service is provided by a small VM as a proof of concept so tests may be slow if too many people jump on this cool bandwagon. We are in the process of acquiring some better hardware to host CI runners.

As ever, we recommend all GitLab users join the #gitlab-users channel on Slack for informal support and service notifications.

Looking ahead

These CI tests are a simple example of using Docker containers to execute trivial tests and return nothing but an error code. In the future we will be looking to create more complex CI pipelines, including:

  • Functional tests, which actually attempt to execute the code and make sure it works as designed rather than just checking the syntax
  • Tests that return artefacts, such as a pipeline that returns RPMs after running rpmbuild to build them
  • Tests that deploy the end product to a live environment after testing it, rather than just telling a human operator that it’s safe to deploy

Migrating gitlab projects

If you’re migrating a gitlab project from one server to another, unless the two gitlab instances are the same major revision you may run into a couple of problems with the export/import procedure.

The first error you’re likely to hit is something like:

The repository could not be imported.
Error importing repository into pp-computing/todo-list - Import version mismatch: Required 0.1.8 but was 0.1.6

This is because whenever there is a potentially “dangerous” change to the import script, gitlab “fails safe” and refuses to import the project. If the two numbers are reasonably close together (and your project is straight forward enough that you can carefully check the users, permissions, wiki pages, issues and milestones etc then you can try this to pretend that your export tarball is newer than it really is:


mkdir project_export
tar xfv old_export_file.tar.gz -C project_export
cd project_export
echo '0.1.8' > VERSION
tar czf experimental_new_project_export.tar.gz *

If you have milestones in your project, you may hit another error if you’re migrating from a gitlab instance that is older than 9.5 is:


Error importing repository into my-group/my-project - Validation failed: Group milestone should belong either to a project or a group.

The workaround for this one appears to be to import your project into your personal gitlab space, and then “move” it to your group space.

If you hit any errors not covered in the above, let us know below!

(And don’t forget you’ll need to update your remotes in any checked out working copies you have!)

Rocks Clusters – the httpd update that breaks your cluster and how to fix it

I’ve had a cluster running Rocks 6.2 (Sidewinder) for a few months and it has been working well. I recently had a request to add a new user, so I created the account with a minimal useradd command specifying only the comment, the uid, the group and the username, then I ran the ‘rocks sync users’ command which copies various files, including /etc/passwd to the nodes and restarts some daemons.

A few hours later the user got back to me to say his jobs were queued, but not running. So I used the checkjob command to what the problem was, and found that his uid was unknown on the node. Indeed looking at the password file on the node, I saw that his account was not there. So I rebooted the node, and ran rocks sync users again, with no joy. So I set the node to rebuild on boot and rebooted it, and it came up with no user accounts at all.

There were errors like this in the log:

Jul 27 17:39:43 compute-0-8 411-alert-handler[13333]: Error: http://10.1.1.1:372/411.d/etc.auto..masterupdating Could not get file ‘http://10.1.1.1:372/411.d/etc.auto..master’: 400 Bad

The nodes get the password files amongst other things from the head node using the 411 service. So running the command below on the node should get all the files.

411get –all

however all I got was

Error: Could not get file ‘http://10.1.1.1:372/411.d//’: 400 Bad

I could ssh to a node and use wget to get the files successfully which caused me more confusion.

I had updated the head node recently, and this turned out to be my problem. I asked on the Rocks mailing list, and the answer I got was:

The latest CentOS 6 httpd update breaks 411.  To fix, add this to the
end of /etc/httpd/conf/httpd.conf and reload httpd:

HttpProtocolOptions Unsafe

So I did that, and now rocks sync users is working again. The version of http which caused the problem was httpd-2.2.15-60.el6.centos.4.x86_64

I’m putting this here in case anyone else gets hit by this.

Service availability monitoring with Nagios and BPI

Several times, senior management have asked Team Wireless to provide an uptime figure for eduroam. While we do have an awful lot of monitoring of systems and services, it has never been possible to give a single uptime figure because it needs some detailed knowledge to make sense of the many Nagios checks (currently 2704 of them).

From the point of view of a Bristol user on campus here, there are three services that must be up for eduroam to work: RADIUS authentication, DNS, and DHCP. For the purposes of resilience, the RADIUS service for eduroam is provided by 3 servers, DNS by 2 servers and DHCP by 2 servers. It’s hard to see the overall state of the eduroam service from a glance at which systems and services are currently up in Nagios.

Nagios gives us detailed performance monitoring and graphing for each system and service but has no built-in aggregation tools. I decided to use an addon called Business Process Intelligence (BPI) to do the aggregation. We built this as an RPM for easy deployment, and configured it with Puppet.

BPI lets you define meta-services which consist of other services that are currently in Nagios. I defined a BPI service called RADIUS which contains all three RADIUS servers. Any one RADIUS server must be up for the RADIUS group to be up. I did likewise for DNS and DHCP.

BPI also lets meta-services depend on other groups. To consider eduroam to be up, you need the RADIUS group and the DNS group and the DHCP group to be up. It’s probably easier to see what’s going on with a screenshot of the BPI control panel:

BPI control panel

BPI control panel

So far, these BPI meta-services are only visible in the BPI control panel and not in the Nagios interface itself. The BPI project does, however, provide a Nagios plugin check_bpi which allows Nagios to monitor the state of BPI meta-services. As part of that, it will draw you a table of availability data.

eduroam uptime

eduroam uptime

So now we have a definitive uptime figure to the overall eduroam service. How many nines? An infinite number of them! 😉 (Also, I like the fact that “OK” is split into scheduled and unscheduled uptime…)

This availability report is still only visible to Nagios users though. It’s a few clicks deep in the web interface and provides a lot more information than is actually needed. We need a simpler way of obtaining this information.

So I wrote a script called nagios-report which runs on the same host as Nagios and generates custom availability reports with various options for output formatting. As an example:

$ sudo /usr/bin/nagios-report -h bpi -s eduroam -o uptime -v -d
Total uptime percentage for service eduroam during period lastmonth was 100.000%

This can now be run as a cron job to automagically email availability reports to people. The one we were asked to provide is monthly, so this is our crontab entry to generate it on the first day of each month:

# Puppet Name: eduroam-availability
45 6 1 * * nagios-report -h bpi -s eduroam -t lastmonth -o uptime -v -d

It’s great that our work on resilience has paid off. Just last week (during the time covered by the eduroam uptime table) we experienced a temporary loss of about a third of our VMs, and yet users did not see a single second of downtime. That’s what we’re aiming for.

Making suexec work…

suexec is a useful way of getting apache to run interactive magic (cgi scripts, php scripts etc) with a different user/group than the one that apache is running as.

Most configuration guides tell you:

  • Add “SuexecUserGroup $OWNER $GROUP” to your apache config
  • Look in /var/log/httpd/suexec.log to see what’s going wrong

What they don’t tell you, is that suexec makes some assumptions about where to find things it will execute, and that you can’t guarantee that log location is consistent across distros (or even versions of the same distro). I’m setting this up on CentOS 7, so the examples below were produced in that environment.

You can get useful information about both of the above by running the following:

[myuser]$ sudo suexec -V
-D AP_DOC_ROOT="/var/www"
-D AP_GID_MIN=100
-D AP_HTTPD_USER="apache"
-D AP_LOG_SYSLOG
-D AP_SAFE_PATH="/usr/local/bin:/usr/bin:/bin"
-D AP_UID_MIN=500
-D AP_USERDIR_SUFFIX="public_html"
[myuser]$

AP_DOC_ROOT and AP_USERDIR_SUFFIX control which paths suexec will execute. In this case we’re restricted to running stuff that lives somewhere under “/var/www” or “/home/*/public_html”

If you’ve got content elsewhere (for example, an application which expects to be installed under /usr/share/foo/cgi-bin) then it’s not sufficient to put a symlink from /var/www/foo/cgi-bin to /usr/share/foo/cgi-bin as suexec checks the actual location of the file, not where it was called from.

This is sensible, as it stops you putting a symlink in place which points at something nasty like /bin/sh.

AP_GID_MIN and AP_UID_MIN limit which users/groups suexec will run stuff as. In this case it won’t run anything with a GID < 100 or a UID < 500. This is sensible as it stops you running CGI scripts as privileged system users.

The GID limit is probably not an issue, but the UID limit might cause wrinkles if you look after one of the 18 UoB users who have a centrally allocated unix UID that is under 500 (because they’ve been here since before that was a problem)[1]

AP_LOG_SYSLOG is a flag that says "send all log messages to syslog" – which is fine, and arguably an improvement over writing to a specific log file. It doesn’t immediately tell you where those messages end up, but I eventually found them in /var/log/secure… which seems a sensible place for them to end up.

Once you’ve got all that sorted, you’ll need to make selinux happy. Thankfully, that’s dead easy and can be done by enabling the httpd_unified boolean. If you’re using the jfryman/selinux puppet module, it’s as easy as:


selinux::boolean { 'httpd_unified': }

I think that’s all the bumps I’ve hit on this road so far, but if I find any more I’ll update this article.

-Paul
[1] but then again, if you look after them (or one of the 84 users whose UID is under 1000) you’re probably already used to finding odd things that don’t work!

RHEL 7.2 Authconfig follow up — don’t mix local user info with sssd!

Quick follow-up on my previous post about authconfig with more info.

So it turns out that this was intentional, and the change was made because 2-facter authententication support was added to SSSD.
This was added as a fix for RHEL bug 1204864, with the following comment:

With the current configuration pam_unix will always prompt the user for a password. Letting SSSD ask users of 2FA again for the password will lead to a bad user experience. Letting SSSD only ask for the second factor will make it hard for applications like gdm to show specific 2FA dialogs.

This means that if you use a mix of local (/etc/passwd or /etc/shadow) and remote (via sssd) user information for a particular user, then the user in question will only auth against their local password.
If they don’t have a local password then they will be unable to authenticate.

This seems a particularly odd thing to change during a point-release of RHEL, as I would expect that using a mix of local and remote user information is more common than using 2FA with sssd…

I thought this was worth stating separately from the previous post, as it’s more general than just when performing hackery to change UIDs — any local user entry will cause this to happen when used in conjunction with sssd.

Additional info:

The code in sssd which enforces this is as follows (from authconfig-6.2.8/authinfo.py in the current CentOS 7.x git sources, line 3812):

  # do not continue to following modules if authentication fails
  if name == "unix" and stack == "auth" and (self.enableSSSDAuth or
    self.implicitSSSDAuth or self.enableIPAv2) and (not self.enableNIS):
    logic = LOGIC_FORCE_PKCS11 # make it or break it logic

..so this is specifically for when you are using SSSD and not NIS, not any other remote authn/authz methods such as KRB5 without SSSD.

Rocks and /install directory on compute nodes

I noticed recently that some of our compute nodes were getting a bit short on disk space, as we have a fairly large set of (multiple versions of) large applications installed into /opt on each node rather than sharing those applications over NFS via /share/apps (which is apparently the usual Rocks way).

In looking at this I noticed that the /install directory on each compute node contained a complete copy of all packages installed during rebuild! In our case the /install directory was using about 35GB, mostly because of multiple versions of Matlab and Mathematica being installed (which are up to 10GB each now…).

Anyhow, to avoid this you can convert from using the <packages> XML tags in your extend-compute.xml file to using a post-install script (the <post> section) which calls yum install explicitly for the packages you want to install.
Be sure to also run yum clean packages regularly in the script, otherwise you’re just moving the packages from /install into the yum package cache in /var/cache/yum !

e.g. Convert this:

<package>opt-matlab-R2016a</package>

..into this:

<post>
yum install -y opt-matlab-R2016a
yum clean packages
</post>

This has allowed us to continue installing packages locally until we use up that extra 35GB 🙂

Users with UIDs < 1000 : workarounds and how RHEL 7.2 breaks some of these hacks

When RHEL/CentOS 7.2 was released there was a change in PAM configs which authconfig generates.
For most people this won’t have made any difference, but if you occasionally use entries in /etc/passwd to override user information from other sources (e.g. NIS, LDAP) then this can bite you.

The RHEL bug here shows the difference and discussion around it, which can be summarised in the following small change.

In CentOS 7.1 you see this line in the PAM configs:

auth sufficient pam_unix.so nullok try_first_pass

…whilst in 7.2 it changes to this:

auth [success=done ignore=ignore default=die] pam_unix.so nullok try_first_pass

The difference here is that the `pam_unix` entry essentially changes from (in PAM terms) “sufficient” to “required”, and any failure there means that authentication is denied.

“Well, my users are in LDAP so this won’t affect me!”

Probably, but if you happen to add an entry to /etc/passwd to override something for a user, such as their shell, home directory or (in this case) their UID (yes, hacky I know, but old NIS habits die hard…):

testuser:*:12345:12345:Test User:/home/testuser:/bin/bash

..then this means that your user is defined as being a target for the pam_unix module (since it’s a local user defined in the local passwd file), and from 7.2 you hit that modified pam_unix line and get auth failures. In 7.1 you’d get entries in the logs saying that pam_unix denied access but it would continue on through the subsequent possibilities (pam_ldap, pam_sss or whatever else you have in there) and check the password against those.

The bug referenced above suggests a workaround of using authconfig --enablenis, as this happens to set the pam_unix line back to the old version, but that has a bunch of other unwanted effects (like enabling NIS in /etc/nsswitch.conf).

Obviously the real fix for our particular case is to not change the UID (which was a terrible hack anyway) but to reduce the UID_MIN used in /etc/logins.def to below the minimum UID required, and hope that there aren’t any clashes between users in LDAP and users which have already been created by packages being added (possibly ages ago…).

Hopefully this saves someone else some trouble with this surprise change in behaviour when upgrades are applied to existing machines and authconfig runs!

Some additional notes:

  • This won’t change until you next run authconfig, which in this case was loooong after the 7.2 update…
  • Not recommended : Putting pam_sss or pam_ldap before pam_unix as a workaround for the pam_unix failures in 7.2. Terrible problems happen with trying to use local users (including root!) if the network goes away.
  • Adding the UID_MIN change as early as possible is a good idea, so in your bootstrap process would be sensible, to avoid package-created users getting added with UIDs near to 1000.
  • These Puppet modules were used in creation of this blog post:
    • joshbeard/login_defs 0.2.0
    • sgnl05/sssd 0.2.1

Merging SELinux policies

We make extensive use of SELinux on all our systems. We manage SELinux config and policy with the jfryman/selinux Puppet module, which means we store SELinux policies in plain text .te format – the same format that audit2allow generates them in.

One of our SELinux policies that covers permissions for NRPE is a large file. When we generate new rules (e.g. for new Nagios plugins) with audit2allow it’s a tedious process to merge the new rules in by hand and mistakes are easy to make.

So I wrote semerge – a tool to merge SELinux policy files with the ability to mix and match stdin/stdout and reading/writing files.

This example accepts input from audit2allow and merges the new rules into an existing policy:

cat /var/log/audit/audit.log | audit2allow | semerge -i existingpolicy.pp -o existingpolicy.pp

And this example deduplicates and alphabetises an existing policy:

semerge -i existingpolicy.pp -o existingpolicy.pp

There are probably bugs so please do let me know if you find it useful and log an issue if you run into problems.