Overview
On March 2nd at 20:23 UTC, an incident was declared for the us-2 region. On-call engineers received alerts that multiple sites were down and immediately began investigating.
What Happened
A maintenance operation applied to the storage layer unintentionally degraded performance for most of the grid hosts in the region. The maintenance was intended to improve the storage layer by switching to an upgraded back end. However, the migration to the new system elevated CPU consumption, which caused a major performance impact on any environment running on an affected host.
Resolution
To mitigate the impact of the incident, additional storage layer resources were provisioned and deployed to the region. The migration was able to continue as planned, and grid hosts began to recover as expected.
Next Steps
We have paused this maintenance in all regions while we assess how to move forward safely.