At 14:18 UTC on 2021-02-19, we observed connection failures in our Tokyo PoP related to our underlying cloud provider hardware failure, which they declared as an incident. The failures at the cloud provider caused multiple alerts, which buried the alert that pointed to the core issue, which would have shown this failure much earlier to enable a faster time to resolution. Once the issue was identified we were able to quickly route around the provider at 17:05 UTC.
There was no further customer impact after 18:25 UTC. And at 21:01 UTC the underlying hardware issue recovered and we were able to return our system to its normal operating posture and we resolved the incident.
To reduce the impact of a similar issue in the future, we are going to create better alerts for these kinds of edge failures so we can more quickly identify the issue and take action.