Rocks Clusters – the httpd update that breaks your cluster and how to fix it

I’ve had a cluster running Rocks 6.2 (Sidewinder) for a few months and it has been working well. I recently had a request to add a new user, so I created the account with a minimal useradd command specifying only the comment, the uid, the group and the username, then I ran the ‘rocks sync users’ command which copies various files, including /etc/passwd to the nodes and restarts some daemons.

A few hours later the user got back to me to say his jobs were queued, but not running. So I used the checkjob command to what the problem was, and found that his uid was unknown on the node. Indeed looking at the password file on the node, I saw that his account was not there. So I rebooted the node, and ran rocks sync users again, with no joy. So I set the node to rebuild on boot and rebooted it, and it came up with no user accounts at all.

There were errors like this in the log:

Jul 27 17:39:43 compute-0-8 411-alert-handler[13333]: Error: http://10.1.1.1:372/411.d/etc.auto..masterupdating Could not get file ‘http://10.1.1.1:372/411.d/etc.auto..master’: 400 Bad

The nodes get the password files amongst other things from the head node using the 411 service. So running the command below on the node should get all the files.

411get –all

however all I got was

Error: Could not get file ‘http://10.1.1.1:372/411.d//’: 400 Bad

I could ssh to a node and use wget to get the files successfully which caused me more confusion.

I had updated the head node recently, and this turned out to be my problem. I asked on the Rocks mailing list, and the answer I got was:

The latest CentOS 6 httpd update breaks 411.  To fix, add this to the
end of /etc/httpd/conf/httpd.conf and reload httpd:

HttpProtocolOptions Unsafe

So I did that, and now rocks sync users is working again. The version of http which caused the problem was httpd-2.2.15-60.el6.centos.4.x86_64

I’m putting this here in case anyone else gets hit by this.