Migrating gitlab projects

If you’re migrating a gitlab project from one server to another, you may run into a couple of problems with the export/import procedure unless the two gitlab instances are running the same major revision.

The first error you’re likely to hit is something like:

The repository could not be imported.
Error importing repository into pp-computing/todo-list - Import version mismatch: Required 0.1.8 but was 0.1.6

This is because whenever there is a potentially “dangerous” change to the import script, gitlab “fails safe” and refuses to import the project. If the two numbers are reasonably close together (and your project is straightforward enough that you can carefully check the users, permissions, wiki pages, issues, milestones etc afterwards) then you can try the following to pretend that your export tarball is newer than it really is:


# unpack the old export, bump the version number, and repack it
mkdir project_export
tar xvf old_export_file.tar.gz -C project_export
cd project_export
echo '0.1.8' > VERSION
tar czf experimental_new_project_export.tar.gz *

If you have milestones in your project and you’re migrating from a gitlab instance older than 9.5, you may hit another error:


Error importing repository into my-group/my-project - Validation failed: Group milestone should belong either to a project or a group.

The workaround for this one appears to be to import your project into your personal gitlab space, and then “move” it to your group space.

If you hit any errors not covered in the above, let us know below!

(And don’t forget you’ll need to update your remotes in any checked out working copies you have!)
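For a typical working copy that just means pointing origin at the new server, something like this (the URL is a made-up example, so substitute your own):

git remote set-url origin git@new-gitlab.example.com:pp-computing/todo-list.git
git remote -v    # confirm the new URL took effect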

Rocks Clusters – the httpd update that breaks your cluster and how to fix it

I’ve had a cluster running Rocks 6.2 (Sidewinder) for a few months and it has been working well. I recently had a request to add a new user, so I created the account with a minimal useradd command specifying only the comment, the uid, the group and the username, then I ran the ‘rocks sync users’ command which copies various files, including /etc/passwd to the nodes and restarts some daemons.

A few hours later the user got back to me to say his jobs were queued, but not running. So I used the checkjob command to see what the problem was, and found that his uid was unknown on the node. Indeed, looking at the password file on the node, I saw that his account was not there. So I rebooted the node and ran rocks sync users again, with no joy. So I set the node to rebuild on boot and rebooted it, and it came up with no user accounts at all.

There were errors like this in the log:

Jul 27 17:39:43 compute-0-8 411-alert-handler[13333]: Error: http://10.1.1.1:372/411.d/etc.auto..masterupdating Could not get file ‘http://10.1.1.1:372/411.d/etc.auto..master’: 400 Bad

The nodes get the password files, amongst other things, from the head node using the 411 service. So running the command below on the node should fetch all the files:

411get --all

However, all I got was:

Error: Could not get file ‘http://10.1.1.1:372/411.d//’: 400 Bad

I could ssh to a node and fetch the files successfully with wget, which caused me more confusion.

I had updated the head node recently, and this turned out to be my problem. I asked on the Rocks mailing list, and the answer I got was:

The latest CentOS 6 httpd update breaks 411.  To fix, add this to the
end of /etc/httpd/conf/httpd.conf and reload httpd:

HttpProtocolOptions Unsafe

So I did that, and now rocks sync users is working again. The version of httpd which caused the problem was httpd-2.2.15-60.el6.centos.4.x86_64.
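For reference, applying the fix is a one-liner plus a reload (this appends to the main config file; if your httpd config is under configuration management, add the directive there instead):

echo 'HttpProtocolOptions Unsafe' >> /etc/httpd/conf/httpd.conf
service httpd reload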

I’m putting this here in case anyone else gets hit by this.

Service availability monitoring with Nagios and BPI

Several times, senior management have asked Team Wireless to provide an uptime figure for eduroam. While we do have an awful lot of monitoring of systems and services, it has never been possible to give a single uptime figure because it needs some detailed knowledge to make sense of the many Nagios checks (currently 2704 of them).

From the point of view of a Bristol user on campus here, there are three services that must be up for eduroam to work: RADIUS authentication, DNS, and DHCP. For the purposes of resilience, the RADIUS service for eduroam is provided by 3 servers, DNS by 2 servers and DHCP by 2 servers. It’s hard to see the overall state of the eduroam service from a glance at which systems and services are currently up in Nagios.

Nagios gives us detailed performance monitoring and graphing for each system and service but has no built-in aggregation tools. I decided to use an addon called Business Process Intelligence (BPI) to do the aggregation. We built this as an RPM for easy deployment, and configured it with Puppet.

BPI lets you define meta-services which consist of other services that are currently in Nagios. I defined a BPI service called RADIUS which contains all three RADIUS servers. Any one RADIUS server must be up for the RADIUS group to be up. I did likewise for DNS and DHCP.
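Our BPI config is generated by Puppet so I won’t reproduce it verbatim, but a group definition in bpi.conf looks roughly like the sketch below. The hostnames are made up, and the field names are from memory of the BPI docs, so check them against your own install:

[radius]
title=RADIUS
members=radius1:RADIUS,radius2:RADIUS,radius3:RADIUS
warning_threshold=2
critical_threshold=3

With three members and a critical threshold of 3, the group only goes critical when all three RADIUS servers are down at once.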

BPI also lets meta-services depend on other groups. To consider eduroam to be up, you need the RADIUS group and the DNS group and the DHCP group to be up. It’s probably easier to see what’s going on with a screenshot of the BPI control panel:

[Screenshot: BPI control panel]

So far, these BPI meta-services are only visible in the BPI control panel and not in the Nagios interface itself. The BPI project does, however, provide a Nagios plugin check_bpi which allows Nagios to monitor the state of BPI meta-services. As part of that, it will draw you a table of availability data.
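Wiring check_bpi into Nagios is a normal command/service pair, along these lines (the plugin path and arguments are illustrative rather than definitive, so check your own install):

define command {
    command_name  check_bpi
    command_line  $USER1$/check_bpi $ARG1$
}

define service {
    use                  generic-service
    host_name            bpi
    service_description  eduroam
    check_command        check_bpi!eduroam
}

The host_name and service_description here are what the nagios-report invocation further down refers to with -h bpi -s eduroam.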

[Screenshot: eduroam uptime table]

So now we have a definitive uptime figure for the overall eduroam service. How many nines? An infinite number of them! 😉 (Also, I like the fact that “OK” is split into scheduled and unscheduled uptime…)

This availability report is still only visible to Nagios users though. It’s a few clicks deep in the web interface and provides a lot more information than is actually needed. We need a simpler way of obtaining this information.

So I wrote a script called nagios-report which runs on the same host as Nagios and generates custom availability reports with various options for output formatting. As an example:

$ sudo /usr/bin/nagios-report -h bpi -s eduroam -o uptime -v -d
Total uptime percentage for service eduroam during period lastmonth was 100.000%

This can now be run as a cron job to automagically email availability reports to people. The one we were asked to provide is monthly, so this is our crontab entry to generate it on the first day of each month:

# Puppet Name: eduroam-availability
45 6 1 * * nagios-report -h bpi -s eduroam -t lastmonth -o uptime -v -d

It’s great that our work on resilience has paid off. Just last week (during the time covered by the eduroam uptime table) we experienced a temporary loss of about a third of our VMs, and yet users did not see a single second of downtime. That’s what we’re aiming for.

Making suexec work…

suexec is a useful way of getting apache to run interactive magic (cgi scripts, php scripts etc) with a different user/group than the one that apache is running as.

Most configuration guides tell you:

  • Add “SuexecUserGroup $OWNER $GROUP” to your apache config
  • Look in /var/log/httpd/suexec.log to see what’s going wrong

What they don’t tell you is that suexec makes some assumptions about where to find the things it will execute, and that you can’t guarantee the log location is consistent across distros (or even versions of the same distro). I’m setting this up on CentOS 7, so the examples below were produced in that environment.

You can get useful information about both of the above by running the following:

[myuser]$ sudo suexec -V
-D AP_DOC_ROOT="/var/www"
-D AP_GID_MIN=100
-D AP_HTTPD_USER="apache"
-D AP_LOG_SYSLOG
-D AP_SAFE_PATH="/usr/local/bin:/usr/bin:/bin"
-D AP_UID_MIN=500
-D AP_USERDIR_SUFFIX="public_html"
[myuser]$

AP_DOC_ROOT and AP_USERDIR_SUFFIX control which paths suexec will execute scripts from. In this case we’re restricted to running stuff that lives somewhere under “/var/www” or “/home/*/public_html”.

If you’ve got content elsewhere (for example, an application which expects to be installed under /usr/share/foo/cgi-bin) then it’s not sufficient to put a symlink from /var/www/foo/cgi-bin to /usr/share/foo/cgi-bin as suexec checks the actual location of the file, not where it was called from.

This is sensible, as it stops you putting a symlink in place which points at something nasty like /bin/sh.
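If you really do need to run something that lives outside those paths, a bind mount (rather than a symlink) may do the trick, since the scripts then genuinely live under the document root. I haven’t needed this myself, so treat it as an untested sketch:

mkdir -p /var/www/foo/cgi-bin
mount --bind /usr/share/foo/cgi-bin /var/www/foo/cgi-bin
# and an fstab entry to make it survive a reboot:
# /usr/share/foo/cgi-bin  /var/www/foo/cgi-bin  none  bind  0 0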

AP_GID_MIN and AP_UID_MIN limit which users/groups suexec will run stuff as. In this case it won’t run anything with a GID < 100 or a UID < 500. This is sensible as it stops you running CGI scripts as privileged system users.

The GID limit is probably not an issue, but the UID limit might cause wrinkles if you look after one of the 18 UoB users who have a centrally allocated unix UID under 500 (because they’ve been here since before that was a problem)[1].

AP_LOG_SYSLOG is a flag that says "send all log messages to syslog" – which is fine, and arguably an improvement over writing to a specific log file. It doesn’t immediately tell you where those messages end up, but I eventually found them in /var/log/secure… which seems a sensible place for them to end up.

Once you’ve got all that sorted, you’ll need to make selinux happy. Thankfully, that’s dead easy and can be done by enabling the httpd_unified boolean. If you’re using the jfryman/selinux puppet module, it’s as easy as:


selinux::boolean { 'httpd_unified': }
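If you’re not using Puppet, the equivalent one-off command is:

setsebool -P httpd_unified on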

I think that’s all the bumps I’ve hit on this road so far, but if I find any more I’ll update this article.

-Paul
[1] but then again, if you look after them (or one of the 84 users whose UID is under 1000) you’re probably already used to finding odd things that don’t work!

RHEL 7.2 Authconfig follow up — don’t mix local user info with sssd!

Quick follow-up on my previous post about authconfig with more info.

So it turns out that this was intentional, and the change was made because 2-factor authentication (2FA) support was added to SSSD.
This was added as a fix for RHEL bug 1204864, with the following comment:

With the current configuration pam_unix will always prompt the user for a password. Letting SSSD ask users of 2FA again for the password will lead to a bad user experience. Letting SSSD only ask for the second factor will make it hard for applications like gdm to show specific 2FA dialogs.

This means that if you use a mix of local (/etc/passwd or /etc/shadow) and remote (via sssd) user information for a particular user, then the user in question will only auth against their local password.
If they don’t have a local password then they will be unable to authenticate.
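A quick way to spot affected users is to check whether they have a local entry at all, and compare that with what the name service switch resolves (“testuser” is obviously a placeholder):

grep '^testuser:' /etc/passwd    # is there a local override?
getent passwd testuser           # what NSS actually returns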

This seems a particularly odd thing to change during a point-release of RHEL, as I would expect that using a mix of local and remote user information is more common than using 2FA with sssd…

I thought this was worth stating separately from the previous post, as it’s more general than just when performing hackery to change UIDs — any local user entry will cause this to happen when used in conjunction with sssd.

Additional info:

The code in authconfig which enforces this is as follows (from authconfig-6.2.8/authinfo.py in the current CentOS 7.x git sources, line 3812):

  # do not continue to following modules if authentication fails
  if name == "unix" and stack == "auth" and (self.enableSSSDAuth or
    self.implicitSSSDAuth or self.enableIPAv2) and (not self.enableNIS):
    logic = LOGIC_FORCE_PKCS11 # make it or break it logic

..so this applies specifically when you are using SSSD and not NIS; other remote authn/authz methods, such as KRB5 without SSSD, are unaffected.

Rocks and /install directory on compute nodes

I noticed recently that some of our compute nodes were getting a bit short on disk space, as we have a fairly large set of applications (often in multiple versions) installed into /opt on each node rather than sharing them over NFS via /share/apps (which is apparently the usual Rocks way).

In looking at this I noticed that the /install directory on each compute node contained a complete copy of all packages installed during rebuild! In our case the /install directory was using about 35GB, mostly because of multiple versions of Matlab and Mathematica being installed (which are up to 10GB each now…).

Anyhow, to avoid this you can convert from using the <packages> XML tags in your extend-compute.xml file to using a post-install script (the <post> section) which calls yum install explicitly for the packages you want to install.
Be sure to also run yum clean packages regularly in the script, otherwise you’re just moving the packages from /install into the yum package cache in /var/cache/yum!

e.g. Convert this:

<package>opt-matlab-R2016a</package>

..into this:

<post>
yum install -y opt-matlab-R2016a
yum clean packages
</post>
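To see how much space this approach reclaims on an already-built node, compare the usual suspects before and after:

du -sh /install /var/cache/yum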

This has allowed us to continue installing packages locally until we use up that extra 35GB 🙂

Users with UIDs < 1000 : workarounds and how RHEL 7.2 breaks some of these hacks

When RHEL/CentOS 7.2 was released there was a change in PAM configs which authconfig generates.
For most people this won’t have made any difference, but if you occasionally use entries in /etc/passwd to override user information from other sources (e.g. NIS, LDAP) then this can bite you.

The RHEL bug here shows the difference and discussion around it, which can be summarised in the following small change.

In CentOS 7.1 you see this line in the PAM configs:

auth sufficient pam_unix.so nullok try_first_pass

…whilst in 7.2 it changes to this:

auth [success=done ignore=ignore default=die] pam_unix.so nullok try_first_pass

The difference here is that the `pam_unix` entry essentially changes from (in PAM terms) “sufficient” to something closer to “requisite”: any failure there means that authentication is denied immediately.

“Well, my users are in LDAP so this won’t affect me!”

Probably, but if you happen to add an entry to /etc/passwd to override something for a user, such as their shell, home directory or (in this case) their UID (yes, hacky I know, but old NIS habits die hard…):

testuser:*:12345:12345:Test User:/home/testuser:/bin/bash

..then this means that your user is defined as being a target for the pam_unix module (since it’s a local user defined in the local passwd file), and from 7.2 you hit that modified pam_unix line and get auth failures. In 7.1 you’d get entries in the logs saying that pam_unix denied access but it would continue on through the subsequent possibilities (pam_ldap, pam_sss or whatever else you have in there) and check the password against those.

The bug referenced above suggests a workaround of using authconfig --enablenis, as this happens to set the pam_unix line back to the old version, but that has a bunch of other unwanted effects (like enabling NIS in /etc/nsswitch.conf).

Obviously the real fix for our particular case is to not change the UID (which was a terrible hack anyway) but to reduce the UID_MIN used in /etc/login.defs to below the minimum UID required, and hope that there aren’t any clashes between users in LDAP and users which have already been created by packages being added (possibly ages ago…).
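That change is a single line in /etc/login.defs (the value of 500 here is an assumption; use whatever the lowest UID you need to support is):

# /etc/login.defs
UID_MIN 500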

Hopefully this saves someone else some trouble with this surprise change in behaviour when upgrades are applied to existing machines and authconfig runs!

Some additional notes:

  • This won’t change until you next run authconfig, which in this case was loooong after the 7.2 update…
  • Not recommended: putting pam_sss or pam_ldap before pam_unix as a workaround for the pam_unix failures in 7.2. Terrible problems happen with trying to use local users (including root!) if the network goes away.
  • Adding the UID_MIN change as early as possible is a good idea, so in your bootstrap process would be sensible, to avoid package-created users getting added with UIDs near to 1000.
  • These Puppet modules were used in creation of this blog post:
    • joshbeard/login_defs 0.2.0
    • sgnl05/sssd 0.2.1

Merging SELinux policies

We make extensive use of SELinux on all our systems. We manage SELinux config and policy with the jfryman/selinux Puppet module, which means we store SELinux policies in plain text .te format – the same format that audit2allow generates them in.

One of our SELinux policies that covers permissions for NRPE is a large file. When we generate new rules (e.g. for new Nagios plugins) with audit2allow it’s a tedious process to merge the new rules in by hand and mistakes are easy to make.

So I wrote semerge – a tool to merge SELinux policy files with the ability to mix and match stdin/stdout and reading/writing files.

This example accepts input from audit2allow and merges the new rules into an existing policy:

cat /var/log/audit/audit.log | audit2allow | semerge -i existingpolicy.pp -o existingpolicy.pp

And this example deduplicates and alphabetises an existing policy:

semerge -i existingpolicy.pp -o existingpolicy.pp
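semerge only deals with the text form of the policy; if you’re not letting the Puppet module compile and load the result for you, the usual manual steps for a .te file are:

checkmodule -M -m -o mypolicy.mod mypolicy.te
semodule_package -o mypolicy.pp -m mypolicy.mod
semodule -i mypolicy.pp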

There are probably bugs so please do let me know if you find it useful and log an issue if you run into problems.

git – deleting local branches that were merged upstream

Like most people, we’re using git right at the centre of our puppet config management workflow. As I’ve mentioned previously, it features prominently in my top 10 most frequently used commands.

Our workflow is based around feature branches, and quite often we end up in a situation where we have a lot of local branches which have already been merged in the copy held upstream on github/gitlab/etc.

Today I looked and noticed that, while we only had 4 active branches on the gitlab server, I had 41 branches locally, most of which related to features fixed a long time ago.

This doesn’t cause much of a problem, although it can get confusing (especially if you’re likely to re-use a branch name in the future), and 41 branches is enough that deleting them one at a time by hand is tedious.

It looks like some gui tools/IDEs will take care of this for you, but I’m a command line kinda guy, and the git command line tools don’t seem to quite have this functionality baked in.

After a bit of poking about, I came up with the following approach which deletes any branch which no longer exists upstream.


# Delete all stale remote-tracking branches in origin.
git remote prune origin

# "git branch -vv" now includes the word "gone" against branches which the previous command removed, so
# use awk to identify those branches and plumb the list into "git branch -d" which will delete them locally
git branch -vv | awk '/: gone\]/ { print $1 }' | xargs git branch -D

The above seemed to do the right thing for the two repos I tested it on, but well… you might want to try it on something unimportant before you trust it!
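If you’d rather see the list before pulling the trigger, drop the xargs stage and just print the candidates:

git branch -vv | awk '/: gone\]/ { print $1 }'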

NB if a branch has only ever existed locally (and never appeared under origin), it should leave it alone. But I’ve not tested that bit either.