Moving a Rocks 6 cluster and changing the IP address of the head node

We have a Rocks 6 cluster which was definitely outgrowing the server room/air-conditioned broom cupboard in which it was housed, in terms of cooling, power consumption and physical space.
There were also many new nodes waiting to be added, which couldn’t happen without considerable upgrades to the room or a move to a more suitable space.

Moving it to a better home meant a change of upstream router and hence a change of IP address. All signs pointed to this being a terrible idea and a potential disaster, and all replies to related questions on the Rocks mailing list said that reinstalling the head node was the only viable solution. (Possibly using a “restore roll”)

We weren’t convinced that it would be less effort (or time taken) to reinstall than hack the config…

Internally, others had changed the IP address and hostname of a Rocks 5 cluster and had ended up with a long, long list of required changes. This was clearly going to be error-prone, and made the whole thing look like an even worse idea than it had at first.

But then we found a page which claimed that, on Rocks 6, a few “rocks set” commands were all that was needed!
( see http://davidmnoriega.wordpress.com/2012/06/22/rocks-cluster-changing-the-external-ip-address/ )

It did mention that there would be a couple of files in /etc to change too, but didn’t explicitly list everything.

Testing on VMs before the move…
We installed a VM on VirtualBox as a test head node, with a default Rocks 6 config from the same install image used for the cluster (Rocks 6.0 “Mamba”). We then made a couple of compute node VMs for it to install, and set about testing whether changing the IP using those instructions actually worked.

First snag: VirtualBox Open-Source Edition doesn’t allow PXE boot from Intel ethernet adapters, because there is some non-GPL-compatible code which cannot be bundled in with it (ref: https://forums.virtualbox.org/viewtopic.php?f=9&t=34681 ).

Just pick a non-Intel network adapter for the compute nodes and all is fine with the PXE boot, once the right boot order is set (PXE then HDD).
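For reference, both of those settings can be made from the command line rather than the GUI; a minimal VBoxManage sketch, assuming a hypothetical compute-node VM named compute-0-0:

    # Use an AMD PCnet adapter (Am79C973) rather than an Intel one, so the
    # open-source PXE ROM is available, and boot from the network before the disk
    VBoxManage modifyvm "compute-0-0" --nictype1 Am79C973 --boot1 net --boot2 disk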

To ensure that we caught all references to the IP address, we used this:

find /etc -type f -print0 | xargs -0 grep a.b.c

…where a.b.c is the first three octets of the IP address, as we’re part of a /24 subnet.
(This might be a bit more complicated if your subnet isn’t a /8, /16 or /24.)

This finds:
/etc/sysconfig/network
/etc/sysconfig/network-scripts/ifcfg-em2
/etc/sysconfig/static-routes
/etc/yum.repos.d/Rocks-6.0.repo
/etc/hosts

Likewise we searched /opt as well as /etc, as the maui config and a couple of other bits and bobs are stored there; we found nothing which needed changing in /opt.
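One small gotcha with the grep above: the dots in a.b.c are regex wildcards, so the pattern can match slightly more than intended. A fixed-string search keeps it literal; a minimal sketch, using a made-up old prefix of 192.0.2:

    # -F treats the pattern as a literal string (so the dots only match dots);
    # -r recurses into the directories and -l prints just the matching file names
    grep -rlF '192.0.2.' /etc /opt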

Then simply:

  • change all these IPs to the updated one (including changing static gateways)
  • run the few “rocks set” commands to update the ethernet interface and kickstart settings (a worked example with illustrative values follows this list):


    rocks set host interface ip xxxxxxx ethX x.x.x.x
    rocks set attr Kickstart_PublicAddress x.x.x.x
    rocks set attr Kickstart_PublicNetwork x.x.x.x
    rocks set attr Kickstart_PublicBroadcast x.x.x.x
    rocks set attr Kickstart_PublicGateway x.x.x.x
    rocks set attr Kickstart_PublicNetmask x.x.x.x

  • physically move everything
  • (re)start the head node
  • rebuild all the nodes with:
    rocks set host boot compute action=install
    and then (re)boot them. They need this to pick up the newly-configured head node IP, as the node configuration uses the external IP for static routing. (No idea why it doesn’t use the internal IP, but that’s for another post…)
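Putting the rocks commands together, this is roughly what the database update and rebuild steps look like, with purely illustrative values (a hypothetical head node called frontend, public interface eth1 and a new address of 192.0.2.10/24; substitute your own):

    # Illustrative values only: substitute your own hostname, interface and addresses
    rocks set host interface ip frontend eth1 192.0.2.10
    rocks set attr Kickstart_PublicAddress 192.0.2.10
    rocks set attr Kickstart_PublicNetwork 192.0.2.0
    rocks set attr Kickstart_PublicBroadcast 192.0.2.255
    rocks set attr Kickstart_PublicGateway 192.0.2.1
    rocks set attr Kickstart_PublicNetmask 255.255.255.0

    # Check what the Rocks database now holds
    rocks list host interface frontend
    rocks list attr | grep Kickstart_Public

    # After the physical move: flag every compute node for reinstall, then (re)boot them
    rocks set host boot compute action=install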

And everything worked rather nicely!
This was then repeated on the physical machines with renewed confidence that it would work just fine.

Remember to check that everything is working at this point — submit jobs to the queues, ensure NFS is happy and that the nodes can talk to the required external services such as licence servers.
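Most of those checks can be driven from the head node with “rocks run host”; a rough sketch, assuming the Torque/Maui scheduler from above, the standard /share/apps NFS automount and made-up node and licence-server names:

    # Check every compute node is up, and that touching /share/apps still
    # triggers the NFS automount from the head node
    rocks run host compute command="uptime"
    rocks run host compute command="ls /share/apps"

    # Check a node can reach an external licence server (hypothetical host and port)
    rocks run host compute-0-0 command="nc -z -w 5 licence-server.example.com 27000"

    # Submit a trivial test job through the queue
    echo hostname | qsub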

Problems encountered along the way, and glossed over above:

  • Remember to make the configuration changes before you move the head node, otherwise it can take an age to boot whilst it times out on
    remote NFS mounts and other network operations
  • Ensure you have all sockets enabled (network and power) for the racks you are moving to, otherwise this could add another unwanted delay
  • Physically moving machines requires a fair bit of downtime, a van and some careful moving to avoid damage to disks (especially the head node)
  • Networking in the destination room was not terribly structured, and required us to thread a long cable between the head node and the internal
    switch, as the head node is in a UPS’d rack on the opposite side of the (quite large) room to the compute nodes. This will be more of a pain if any compute nodes also need to be on UPS (e.g. they
    have additional storage).
  • Don’t try to be optimistic on the time it will take — users will be happier if you allocate 3 days and it takes 2 than if you say it will take 1 and it takes 2.
  • Don’t have just broken your arm before attempting to move a cluster — it makes moving nodes or doing cabling rather more awkward/impossible! (Unlikely to be a problem for most people 🙂)