Rethinking the wireless database architecture

The eduroam wireless network relies on a database for the authorization and accounting parts of AAA (authentication, authorization and accounting – are you who you say you are, what are you allowed to access, and what did you do while connected?).

When we started dabbling with database-backed AAA in 2007 or so, we used a centrally-provided Oracle database. The volume of AAA traffic was low and high performance was not necessary. However (spoiler alert) demand for wireless connectivity grew, and within months we were placing more load on Oracle than it could handle. Query latency grew to the point that some wireless authentication requests would time out and fail.

First gen – MySQL (2007)

It was clear that we needed a dedicated database platform, and at the time we asked, the DBAs were unable to provide a suitable one, so we went down the route of implementing our own. We chose MySQL as a low-complexity open source database server with a large community. The first iteration of the eduroam database hardware was a single second-hand server that was going spare. It had no resilience, but it was suitably snappy for our needs.

[Figure: First gen database]

Second gen – MySQL MMM (2011)

Demand continued to grow, but more crucially eduroam went from being a beta service that was “not to be relied upon” to being a core service that users routinely depended on for their teaching, learning and research. A cobbled-together solution was clearly no longer fit for purpose, so we set about designing a new database platform.

The two key requirements were high query capacity and high availability, i.e. resilience against the failure of an individual node. At the time, none of the open source database servers had proper clustering – only master-slave replication. We installed a clustering wrapper for MySQL called MMM (Multi-Master Replication Manager for MySQL). This gave us a resilient two-node cluster where either node could be queried for reads, while one node was designated the “writer” at any one time. In the event of a node failure, the supervisor would automatically move the writer role to the surviving node.

[Figure: Second gen database]

Not only did this buy us resilience against hardware faults, but for the first time it also allowed us to drop either node out of the cluster for patching and maintenance during the working day without affecting service for users.

The two-node MMM system served us well for several years, until the hardware came to its natural end of life. The dataset had grown to exceed the servers’ memory (the 8GB that seemed generous in 2011 didn’t go so far in 2015), meaning that some queries were quite slow. By this time MMM had been discontinued, so we set out to investigate other forms of clustering.

Third gen – MariaDB Galera (2015)

MySQL had been forked into MariaDB, which was becoming the default open source database, replacing MySQL while retaining full compatibility. MariaDB came with an integrated clustering layer called Galera, which was getting lots of attention online. Even the developer of MMM recommended using MariaDB Galera.

MariaDB Galera has no concept of “master” or “slave” – all the nodes are masters and are considered equal. Read and write queries can be sent to any of the nodes at will. For this reason, it is strongly recommended to run an odd number of nodes: if the cluster suffers a conflict or a split-brain, the nodes vote on which one is the “odd one out”, and that node is forced to resync.
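To see why an odd number of nodes matters, consider what happens when the cluster is partitioned. Galera’s actual quorum algorithm is more sophisticated (it supports weighted votes, for instance), but the essence is simple majority arithmetic, sketched here in Python:

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition keeps serving queries only if it holds a strict majority."""
    return reachable > cluster_size / 2

# Three-node cluster partitioned 2/1: the pair keeps quorum and carries on,
# while the isolated node stops serving and resyncs when it rejoins.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False

# Two-node cluster partitioned 1/1: neither side has a majority,
# so the whole cluster stops accepting queries.
print(has_quorum(1, 2))  # False
```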

This approach lends itself naturally to load balancing. After talking to Netcomms about the options, we placed all three MariaDB Galera nodes behind the F5 load balancer. This allows us to use one single IP address for the whole cluster, and the F5 directs queries to the most appropriate backend node. We configured a health probe so that the F5 is aware of the state of each node and will not direct queries to a node that is too busy, out of sync, or offline.
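The exact probe is F5-specific, but the idea is easy to sketch. The snippet below is a minimal illustration (assuming the pymysql client library and made-up hostnames and credentials, not our production probe) that marks a node healthy only when Galera reports it as fully synced – wsrep_local_state 4 means “Synced”:

```python
import pymysql  # assumed MySQL client library; any connector would do

def node_is_healthy(host: str) -> bool:
    """Return True only if the Galera node is reachable and fully synced."""
    try:
        conn = pymysql.connect(host=host, user="probe", password="secret",
                               connect_timeout=2)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW STATUS LIKE 'wsrep_local_state'")
                row = cur.fetchone()
                # State 4 is "Synced"; anything else (joining, donor,
                # desynced) means the node should not receive queries.
                return row is not None and int(row[1]) == 4
        finally:
            conn.close()
    except pymysql.Error:
        return False  # unreachable or refusing connections: mark it down

for host in ("db1.example.org", "db2.example.org", "db3.example.org"):
    print(host, "up" if node_is_healthy(host) else "down")
```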

[Figure: Third gen database]

Having three nodes that can be queried simultaneously gives us unprecedented capacity, allowing us to easily meet the demands of eduroam AAA today, with plenty of spare capacity for tomorrow. We are receiving more queries per second than ever before (240 per second, and we are currently in the summer vacation!).

We are required to keep eduroam accounting data for between 3 and 6 months, which means a large dataset. While disk is cheap these days and you can store an awful lot of data, you also need enough memory to hold the dataset roughly twice over: large UPDATE and table-maintenance operations can duplicate a table in memory, make the changes, merge the two copies back together and sync to disk. The new MariaDB Galera nodes have 192GB of memory each, while the dataset is currently about 30GB. That should keep us going… for now.
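A quick way to keep an eye on how the dataset compares with memory is to ask information_schema for the data and index size per schema. A minimal sketch, again assuming pymysql and an illustrative hostname:

```python
import pymysql  # assumed client library, as in the probe sketch above

conn = pymysql.connect(host="db1.example.org", user="probe", password="secret")
with conn.cursor() as cur:
    # Data + index size per schema, in GB, biggest first.
    cur.execute("""
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS gb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY gb DESC
    """)
    for schema, gb in cur.fetchall():
        print(f"{schema}: {gb} GB")
conn.close()
```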

eduroam in Freshers’ Week: Some graphs

eduroam is an interesting service this year. Not only has it been running on all-new Fog-based infrastructure since mid-summer, but eduroam and ResNet Wireless have been merged to form one authenticated wireless network to rule them all, with more users than ever before.

We’ve been watching with great interest how the servers perform under load for the first time. Let’s have a look at some of the numbers. All of these graphs cover the period from midnight on Saturday 21st September (when UK freshers arrive) to midnight on Monday 30th September (the end of the weekend after freshers’ week).

First, let’s have a look at the graph of RADIUS authentication requests, scaled to show Access-Request packets received per second (not necessarily authentications per second, as one authentication usually comprises several Access-Requests).

The gentle swell of users on the weekend of the 28th/29th is mostly undergraduates using eduroam in residences. The taller weekday spikes are made up of University staff using eduroam on campus.

radius01 usually serves eduroam users on campus at Bristol – both Bristol students and visitors from other eduroam institutions. radius02 usually serves Bristol users who are visiting other institutions, while radius03 usually authenticates users at eduroam hotspots in Bristol City Council buildings and certain local hospitals. However, the servers have got each other’s backs, and will take on additional roles at will if there’s an outage.

There are a couple of large spikes on the graph of radius02, with matching notches on the graph of radius01. These are times when the servers decided to shuffle around and radius02 temporarily became the primary for campus users, by far the largest group of users.

[Graphs: radius01-total-access-challenges, radius02-total-access-challenges, radius03-total-access-challenges]

Now for DNS. Pretty self-explanatory – these graphs for the two DNS servers show the number of successful queries per second. At peak times, we serve over 600 lookups per second.

[Graphs: dns5-success, dns6-success]

This graph shows the number of valid IPv4 DHCP leases currently in the lease file. We don’t currently graph IPv6 leases, so there’s no way of knowing how many there are. Adding this is on my to-do list 🙂

[Graph: dhcp1-eduroam-totals]
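For anyone wondering where that number comes from: counting the active IPv4 leases is just a matter of parsing the lease file. A minimal sketch, assuming ISC dhcpd’s dhcpd.leases format (the path may differ on your system):

```python
import re

def count_active_leases(path: str = "/var/lib/dhcp/dhcpd.leases") -> int:
    """Count leases whose most recent binding state is 'active'.

    dhcpd appends updated records to the file, so the last block seen
    for each IP address wins; we track state per address, then count.
    """
    state_by_ip = {}
    current_ip = None
    with open(path) as f:
        for line in f:
            m = re.match(r"lease (\S+) \{", line)
            if m:
                current_ip = m.group(1)
            elif current_ip and line.strip().startswith("binding state"):
                # e.g. "  binding state active;" (skips "next binding state")
                state_by_ip[current_ip] = line.strip().rstrip(";").split()[-1]
    return sum(1 for s in state_by_ip.values() if s == "active")

print(count_active_leases())
```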

Last but not least, this graph shows the number of queries per second for both nodes in the MySQL cluster that powers eduroam, ResNet and related services. It’s used for authentication, logging and infrastructure management. The two nodes are in a cluster and can rearrange themselves at will, but at any one time there is one master and one slave. By default, db1 is usually the master, handling mostly INSERTs and UPDATEs, while db2 is usually the slave, handling only SELECTs.

[Graphs: db1-qps, db2-qps]
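That read/write split is straightforward to picture in code. The sketch below is purely illustrative (not our production code), assuming pymysql and made-up hostnames and credentials: writes go to the master, reads go to the slave, and the master acts as a fallback if the slave is unavailable:

```python
import pymysql  # assumed client library

MASTER, SLAVE = "db1.example.org", "db2.example.org"

def run(query: str, params=None):
    """Route writes to the master, reads to the slave (master as fallback)."""
    is_write = query.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    hosts = [MASTER] if is_write else [SLAVE, MASTER]
    for host in hosts:
        try:
            conn = pymysql.connect(host=host, user="app", password="secret",
                                   database="eduroam", connect_timeout=2)
            try:
                with conn.cursor() as cur:
                    cur.execute(query, params)
                    conn.commit()
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.Error:
            continue  # node unavailable: try the next one in the list
    raise RuntimeError("no database node available")

# Example: a read lands on db2 if it is healthy, otherwise on db1.
print(run("SELECT NOW()"))
```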