cluster | UoB Unix

First gen – MySQL (2007)

It was clear that we needed a dedicated database platform, and at the time that we asked, the DBAs were not able to provide a suitable platform. We went down the route of implementing our own. We decided to use MySQL as a low-complexity open source database server with a large community. The first iteration of the eduroam database hardware was a single second-hand server that was going spare. It had no resilience but was suitably snappy for our needs.

First gen database

Second gen – MySQL MMM (2011)

Demand continued to grow but more crucially eduroam went from being a beta service that was “not to be relied upon” to being a core service that users routinely used for their teaching, learning and research. Clearly a cobbled-together solution was no longer fit for purpose, so we went about designing a new database platform.

The two key requirements were high query capacity, and high availability, i.e. resilience against the failure of an individual node. At the time, none of the open source database servers had proper clustering – only master-slave replication. We installed a clustering wrapper for MySQL, called MMM (MySQL Multi Master). This gave us a resilient two-node cluster whether either node could be queried for reads and one node was designated the “writer” at any one time. In the event of a node failure, the writer role would be automatically moved around by the supervisor.

Second gen database

Not only did this buy us resilience against hardware faults, for the first time it also allowed us to drop either node out of the cluster for patching and maintenance during the working day without affecting service for users.

The two-node MMM system served us well for several years, until the hardware came to its natural end of life. The size of the dataset had grown and exceeded the size of the servers’ memory (the 8GB that seemed generous in 2011 didn’t really go so far in 2015) meaning that some queries were quite slow. By this time, MMM had been discontinued so we set out to investigate other forms of clustering.

Third gen – MariaDB Galera (2015)

MySQL had been forked into MariaDB which was becoming the default open source database, replacing MySQL while retaining full compatibility. MariaDB came with an integrated clustering driver called Galera which was getting lots of attention online. Even the developer of MMM recommended using MariaDB Galera.

MariaDB Galera has no concept of “master” or “slave” – all the nodes are masters and are considered equal. Read and write queries can be sent to any of the nodes at will. For this reason, it is strongly recommended to have an odd number of nodes, so if a cluster has a conflict or goes split-brain, the nodes will vote on who is the “odd one out”. This node will then be forced to resync.

This approach lends itself naturally to load-balancing. After talking to Netcomms about the options, we placed all three MariaDB Galera nodes behind the F5 load balancer. This allows us to use one single IP address for the whole cluster, and the F5 will direct queries to the most appropriate backend node. We configured a probe so the F5 is aware of the state of the nodes, and will not direct queries to a node that is too busy, out of sync, or offline.

Third gen database

Having three nodes that can be simultaneously queried gives us an unprecedented capacity which allows us to easily meet the demands of eduroam AAA today, with plenty of spare capacity for tomorrow. We are receiving more queries per second than ever before (240 per second, and we are currently in the summer vacation!).

We are required to keep eduroam accounting data for between 3 and 6 months – this means a large dataset. While disk is cheap these days and you can store an awful lot of data, you also need a lot of memory to hold the dataset twice over, for UPDATE operations which require duplicating a table in memory, making changes, merging the two copies back and syncing to disk. The new MariaDB Galera nodes have 192GB memory each while the size of the dataset is about 30GB. That should keep us going… for now.

We have a Rocks 6 cluster which was definitely out-growing the server room/air-conditioned broom-cupboard in which it was housed, in terms of cooling, power consumption and a growing lack of physical space.
There were also many new nodes which wanted to be added but couldn’t without considerable upgrades to the room or a move to a more suitable space.

Moving it to a better home meant a change of upstream router and hence a change of IP address. All signs pointed to this being a terrible idea and a potential disaster, and all replies to related questions on the Rocks mailing list said that reinstalling the head node was the only viable solution. (Possibly using a “restore roll”)

We weren’t convinced that it would be less effort (or time taken) to reinstall than hack the config…

Internally others had moved IP and changed the hostname of a Rocks 5 cluster and had a long, long list of changes required. This was clearly going to be error-prone and looked like a worse idea than initially.

But then we found a page which claimed it was only a few “rocks set” commands and that’s all that was needed on Rocks 6!
( see http://davidmnoriega.wordpress.com/2012/06/22/rocks-cluster-changing-the-external-ip-address/ )

It did mention that there would be a couple of files in /etc to change too, but didn’t explicitly list everything.

Testing on VMs before the move…
We set about installing a VM on Virtualbox as a test head node, just with a default Rocks 6 config from the same install image which was used for the cluster (Rocks 6.0 “Mamba”). We then made a couple of compute node VMs for it to install and set about testing whether changing the IP using those instructions works fine.

First snag: VirtualBox Open-Source Edition doesn’t allow PXE boot from Intel ethernet adapters, because there is some non-GPL-compatible code which cannot be bundled in with it (ref: https://forums.virtualbox.org/viewtopic.php?f=9&t=34681 ).

Just pick a non-intel network adapter for the compute nodes and all is fine with the PXE boot, once the right boot order is set (PXE then HDD).

To ensure that we caught all references to the IP address, we used this:

find /etc -type f -print0 | xargs -0 grep a.b.c

..where a.b.c is the first three octets of the IP address as this is a /24 subnet we’re part of.
(This might be a bit more complicated if your subnet isn’t a /8, /16 or /24)

This finds:
/etc/sysconfig/network /etc/sysconfig/network-scripts/ifcfg-em2 /etc/sysconfig/static-routes /etc/yum.repos.d/Rocks-6.0.repo /etc/hosts

Likewise we searched /opt as well as /etc, as the maui config and a couple of other bits and bobs are stored there; we found nothing which needed changing in /opt.

Then simply :

change all these IPs to the updated one (including changing static gateways)
run the few “rocks set” commands to update the ethernet interface and kickstart settings :
rocks set host interface ip xxxxxxx ethX x.x.x.x rocks set attr Kickstart_PublicAddress x.x.x.x rocks set attr Kickstart_PublicNetwork x.x.x.x rocks set attr Kickstart_PublicBroadcast x.x.x.x rocks set attr Kickstart_PublicGateway x.x.x.x rocks set attr Kickstart_PublicNetmask x.x.x.x
physically move everything
(re)start the head node
rebuild all the nodes with:
rocks set host boot compute action=install and (re)boot the nodes.
They need this to point at the newly-configured head node IP, as it uses the external IP for static routing. (No idea why it doesn’t use the internal IP, but that’s for another post…)

And everything worked rather nicely!
This was then repeated on the physical machines with renewed confidence that it would work just fine.

Remember to check that everything is working at this point — submit jobs to the queues, ensure NFS is happy and that the nodes can talk to the required external services such as licence servers.

Problems encountered and glossed over by this:

Remember to make the configuration changes before you move the head node, otherwise it can take an age to boot whilst it times out on
remote NFS mounts and other network operations
Ensure you have all sockets enabled (network and power) for the racks you are moving to, otherwise this could add another unwanted delay
Physically moving machines requires a fair bit of downtime, a van and some careful moving to avoid damage to disks (especially the head node)
Networking in the destination room was not terribly structured, and required us to thread a long cable between the head node and internal
switch, as the head node is in a UPS’d rack which is the opposite side of the (quite large) room to the compute nodes. This will be more of a pain if any compute nodes need to also be on UPS (e.g. they
have additional storage).
Don’t try to be optimistic on the time it will take — users will be happier if you allocate 3 days and it takes 2 than if you say it will take 1 and it takes 2.
Don’t have just broken your arm before attempting to move a cluster — it makes moving nodes or doing cabling rather more awkward/impossible! (Unlikely to be a problem for most people 🙂

UoB Unix

Linux and Unix at the University of Bristol

Tag Archives: cluster

Rethinking the wireless database architecture

First gen – MySQL (2007)

Second gen – MySQL MMM (2011)

Third gen – MariaDB Galera (2015)

Moving a Rocks 6 cluster and changing the IP address of the head node