Downtime post-mortem report, Feb 26th, 2020

Incident Report for TrekkSoft

Resolved

Summary:
On the morning of February 26th we migrated the TrekkSoft servers from our hosting provider Cloudscale in Zurich to Amazon Web Services (AWS) in Ireland.
After the migration was complete, usage increased as merchants began to take bookings and work in the system.
We then spotted a major drop in performance. The root cause was one database host that could not keep up with the number of requests per second. This database slowdown caused a significant drop in performance across our applications: merchant landing pages (CMS), Backoffice, the public and private APIs, and the mobile apps. In some cases it rendered them inoperable.

What Happened
6:45am - 8:29am CET - We completed the AWS migration.
We tested all the main use cases and monitored all hosts; the preliminary results were satisfactory.
9:00am CET - Our applications began handling an increasing number of requests as the system came back online and usage scaled up.
One of the main database hosts (MySQL) began struggling with the number of requests. This affected the performance of our applications and prevented normal functionality.
Contributing Factors
Uncertainty about how the new AWS infrastructure would perform compared to Cloudscale.
We compared all hosts in Cloudscale and AWS to make sure the hardware specifications matched.
However, the two infrastructures are different, so matching specifications did not guarantee matching performance.
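As an illustration of the host comparison described above, here is a minimal sketch of a per-resource check between an old and a new host. The resource names and all figures are hypothetical and are not taken from the actual TrekkSoft hosts. The point is only that such a comparison is as good as the list of resources it covers.

```python
from typing import Dict, List

HostSpec = Dict[str, float]


def spec_shortfalls(old: HostSpec, new: HostSpec) -> List[str]:
    """Return the resources where the new host offers less than the old one.

    A resource missing from the new host's spec is treated as 0,
    so it is always reported as a shortfall.
    """
    return [
        name
        for name, old_value in old.items()
        if new.get(name, 0.0) < old_value
    ]


# Hypothetical figures, for illustration only.
old_db_host = {"vcpus": 8, "memory_gb": 32, "disk_iops": 10000}
new_db_host = {"vcpus": 8, "memory_gb": 32, "disk_iops": 3000}

print(spec_shortfalls(old_db_host, new_db_host))
```

With these made-up numbers, a check limited to CPU and memory would pass, while one that also covers storage throughput would flag a difference.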
Steps Taken
Phase 1:
We started increasing the size of the database in AWS to improve performance (no downtime was required at this point).
We contacted AWS support for more information about how long the resize would take.
The resize was going to take AWS too long to deploy, so we decided to apply a different workaround, described below.

Phase 2:
We put all the web apps in maintenance mode (downtime).
We created a new, larger database (downtime was required to avoid data loss).
We copied all data from the old database to the new one, this time using a migration system in AWS.
The newly created database failed.
This required a new approach, described below.

Phase 3:
We created a new empty database (again, downtime was required to avoid data loss).
We proceeded with a manual dump of the data from the old database to the new one. The process took 4 hours and was successful.
Approx. 3:20pm CET - The new infrastructure was ready to be released.
We have been monitoring and tweaking the system over the last 24 hours to improve performance.
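One way to gain confidence in a manual dump like the one in Phase 3 is to compare per-table row counts between the old and the new database before releasing. The sketch below is an assumption about how such a check could look, not a description of what was actually run. The table names and counts are hypothetical, and in practice the counts would come from a `SELECT COUNT(*)` per table on each database.

```python
from typing import Dict, Optional, Tuple

Counts = Dict[str, int]
Mismatch = Tuple[Optional[int], Optional[int]]


def row_count_mismatches(source: Counts, target: Counts) -> Dict[str, Mismatch]:
    """Map each table whose row count differs to (source_count, target_count).

    A table that exists on only one side is reported with None
    for the side where it is missing.
    """
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(source) | set(target)):
        src, dst = source.get(table), target.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches


# Hypothetical counts, for illustration only.
old_counts = {"bookings": 1200, "customers": 450, "payments": 1180}
new_counts = {"bookings": 1200, "customers": 450, "payments": 1175}

print(row_count_mismatches(old_counts, new_counts))
```

An empty result means the counts match; it does not prove the row contents are identical, so it is a quick sanity check rather than a full verification.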

Impact
Low number of bookings from 7:00am to 4:30pm CET (about 9.5 hours). Some merchants were unable to process any bookings, while others still managed to take some. The result was a financial loss for all parties.

Benefits
The objectives behind the migration that caused the issue were:
Overall long-term increase in performance.
Up-to-date, industry-standard infrastructure.
More direct control over our infrastructure.
Infrastructure ready for autoscaling in case of a peak in requests per second.

Lessons learned
We will time operations of this scale strategically, so that we avoid peak booking hours and have more time to react.
Triple-check hardware and settings specifications.
Prepare a bulletproof checklist.
Replicate the system and run stress tests.
Build our infrastructure with extra capacity and resources, or keep a larger infrastructure available as a backup.
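The stress-test item above could be approached roughly as follows. This is a minimal sketch under stated assumptions: `run_stress_test` and `fake_query` are hypothetical names, and `fake_query` merely sleeps to stand in for a real request against a replica of the system. It shows the basic shape of a load test: fire many concurrent requests, then report throughput and tail latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_stress_test(
    handler: Callable[[], None], total_requests: int, concurrency: int
) -> Dict[str, float]:
    """Call `handler` total_requests times from `concurrency` threads
    and report throughput and latency figures."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        handler()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))
    wall_time = time.perf_counter() - wall_start

    # Index of the 95th-percentile latency in the sorted list.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": float(len(latencies)),
        "requests_per_second": len(latencies) / wall_time,
        "p95_latency_s": latencies[p95_index],
        "max_latency_s": latencies[-1],
    }


def fake_query() -> None:
    """Hypothetical stand-in for one request to the replicated system."""
    time.sleep(0.005)


report = run_stress_test(fake_query, total_requests=200, concurrency=20)
print(report)
```

Running the same test while stepping up `concurrency` would show at what load throughput stops growing and latency starts to climb, which is the kind of signal that was missing before this migration.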

We apologize deeply for this incident.

Posted Feb 27, 2020 - 16:49 CET

This incident affected: TrekkSoft Application, TrekkSoft API, Backend Mobile Applications, POS Desk, Payyo, and Channel Manager.