On March 16, 2023, high internal load on the database services in the APAC region caused instability in the platform. The root cause was identified as requests being sent to the affected services before they were ready to receive traffic during a routine database failover. To address the issue, the operations team programmatically reduced the number of running services across the region, which isolated the load and allowed the services to initialize stably. Health-check tuning and auto-scaling processes were also implemented to provide additional stability across the region. After hours of work, the team fully recovered the region and services returned to a healthy state.
Vasion is now reviewing the load-balancing model used within the platform to identify areas where further tuning is required. The team is implementing a scheme of run levels to allow the environments to come online in stages instead of all at once, so services will be up, running, and stable before network traffic is sent to them. We are also optimizing the startup routines for speed and efficiency.
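As a rough illustration of the run-level idea described above, the sketch below starts services one level at a time and only moves to the next level once every service in the current level reports ready. This is not Vasion's implementation: the `Service` type, the `start_in_stages` function, and the `max_checks` and `poll_interval` parameters are hypothetical names chosen for the example, and a real system would likely route traffic through its load balancer's own readiness checks.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    # Hypothetical description of a service; not taken from the platform.
    name: str
    run_level: int                 # lower levels start first
    start: Callable[[], None]      # launches the service
    is_ready: Callable[[], bool]   # readiness probe


def start_in_stages(services: List[Service],
                    max_checks: int = 5,
                    poll_interval: float = 0.0) -> List[str]:
    """Start services level by level and return names in the order they became ready."""
    ready_order: List[str] = []
    for level in sorted({s.run_level for s in services}):
        batch = [s for s in services if s.run_level == level]
        for service in batch:
            service.start()

        # Poll readiness; do not advance until the whole level is ready.
        pending = list(batch)
        for _ in range(max_checks):
            still_pending = []
            for service in pending:
                if service.is_ready():
                    ready_order.append(service.name)
                else:
                    still_pending.append(service)
            pending = still_pending
            if not pending:
                break
            time.sleep(poll_interval)

        if pending:
            names = ", ".join(s.name for s in pending)
            raise RuntimeError(f"run level {level} not ready: {names}")
    return ready_order
```

In this sketch, a service that never becomes ready stops the rollout at its level instead of letting later levels start and receive traffic, which is the failure mode the staged approach is meant to avoid.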
We have identified that the primary factor in this incident was a bug in the Ubuntu operating system, which was introduced into our environment on March 9, 2023, and has since been removed. For more information about this bug, please refer to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2009325