On Friday 29th November at 09:02 we experienced a database failure on our master node, hosted on AWS Aurora. We are still investigating the root cause of this failure. While we operate multiple database nodes, hosted across 3 different availability zones (data centres), the master node is required for transactional processing, and therefore its failure caused instability across the platform.
We operate hot replicas of our database at all times, allowing us to switch to them should an issue occur. This switch should have been triggered automatically by our cloud monitoring system, but in this instance it was not. We manually intervened and forced a failover to the nominated replica server.
This failover process has been rehearsed and performed on multiple occasions, and usually takes 1-2 minutes at most.
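For illustration, the detection step that should have preceded an automatic failover can be sketched as counting consecutive failed health checks of the master node. The function name and threshold below are hypothetical, not our actual monitoring logic; on Aurora, the failover itself can be forced with the `aws rds failover-db-cluster` CLI command, which is essentially what a manual intervention amounts to.

```python
# Hypothetical sketch: decide when to trigger a failover based on
# consecutive failed health checks of the master node. The threshold
# and function name are illustrative only.

def should_failover(health_history, max_failures=3):
    """Return True once `max_failures` consecutive health checks fail."""
    consecutive = 0
    for healthy in health_history:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= max_failures:
            return True  # monitoring would trigger replica promotion here
    return False
```

A single failed check (a transient blip) does not trigger a failover under this scheme; only a sustained outage does.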
Within a few minutes, the new node was promoted and 2 additional replicas were brought up to replace the failed node. At this point we realised our Web and API clusters were not connecting to the new database nodes, even though they were available and online.
We had more than enough capacity to serve all the traffic we were receiving. However, our Web and API servers were failing because they persisted in holding connections to the old node rather than connecting to the new one.
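One way to avoid holding connections to a dead node is a client wrapper that discards its connection on failure and reconnects, picking up the new endpoint on the next attempt. The sketch below is illustrative, not our production client code, and assumes a hypothetical `connect` factory that resolves the database endpoint freshly on each call.

```python
# Illustrative sketch (hypothetical API, not our actual client):
# drop a failed connection and reconnect before retrying, instead of
# persisting with a socket that points at the old master node.

class ReconnectingClient:
    def __init__(self, connect):
        self._connect = connect  # factory: resolves endpoint fresh each call
        self._conn = None

    def query(self, sql, retries=1):
        for attempt in range(retries + 1):
            if self._conn is None:
                self._conn = self._connect()  # fresh lookup -> new master
            try:
                return self._conn.execute(sql)
            except ConnectionError:
                self._conn = None  # discard the stale connection
                if attempt == retries:
                    raise
```

With this pattern a failover costs each client one failed query and a reconnect, rather than requiring a full server restart.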
To rectify this issue, we initiated a rolling restart of all servers to force them to drop their stale connections and reconnect to the new database nodes. Unfortunately this took far too long: during the restart, a configuration file we receive from AWS contained an error, which prevented the servers from rebooting.
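A defensive pattern against exactly this failure is to validate an externally supplied configuration file before applying it during a restart, falling back to the last-known-good copy if it is malformed. The sketch below assumes a JSON config purely for illustration; the file format and names are hypothetical.

```python
import json

# Illustrative sketch: never let a malformed third-party config file
# block a restart. If the new config does not parse, keep the
# last-known-good one. (JSON is assumed here for illustration.)

def load_config(new_text, last_known_good):
    """Return the parsed new config if valid, else the fallback."""
    try:
        return json.loads(new_text)
    except (TypeError, ValueError):  # ValueError covers JSONDecodeError
        return last_known_good
```

The trade-off is that servers may boot with a slightly stale configuration, but they do boot, which matters most during an incident.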
We rectified the issue with the configuration file manually and reinitiated the restart, which brought our Web servers and API back online by 09:35.
At this point our website and app were back up, sales were flowing, and response times were normal.
A few minutes later, we received notification that web checkout transactions were failing. One of our API clusters, which was responsible for some web transactions, had not fully recovered. We initiated another rolling restart of these servers, but it did not resolve the issue.
In the case of a total failure of a server cluster, we have several routes to recover, such as pushing traffic to other clusters, which we initiated. These do take some time, as we have to wait for DNS updates, load balancers bringing up new clusters, and so on. Due to the time taken to initiate several of these backup routes, plus the time taken to troubleshoot the issue, we finally restored full processing at 10:32.
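What pushing traffic to other clusters accomplishes, whether through DNS weighted routing or load balancer target changes, can be sketched as reassigning a failed cluster's traffic weight to a healthy one. The cluster names below are illustrative, and the sketch deliberately omits the propagation delays (DNS TTLs, health checks) that made this slow in practice.

```python
# Illustrative sketch of shifting traffic away from a failed cluster,
# as a DNS weighted-routing or load balancer change would.
# Cluster names and weights are hypothetical.

def shift_traffic(weights, failed, target):
    """Return new weights with the failed cluster's share moved to target."""
    new = dict(weights)
    new[target] = new.get(target, 0) + new.pop(failed, 0)
    return new
```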
Skiddle host our website and infrastructure on Amazon Web Services. We are very experienced in handling large onsales for many high-profile clients, and had maintained 100% site uptime for over 3 months prior to this outage.
We pride ourselves on ensuring our platform is always available, and invest heavily in this area. We understand that if our platform isn’t available, your events can’t sell.
On this occasion we let you down. While we had multiple recovery and backup systems running and ready to go, unforeseen issues caused these to fail. Our capacity planning was good, but we failed to predict the issue that occurred, and it took us too long to recover from it.
What went well
What we need to improve
Since our downtime
We have already improved our status page, which is available at https://skiddle.statuspage.io/, and we’ve improved our internal processes to ensure it is kept up-to-date at all times. We’ve also built new emergency restart systems which are much faster and are not susceptible to errors from third parties. We are continuing to investigate the issue with clusters not respecting the new database configuration.