On March 13, 2023 at 06:01 UTC, our Support team alerted our administrators to issues with job processing on hosts in our Mumbai data center. After an initial investigation by our admin team and further reports from our customers, we identified a preliminary cause and launched our formal incident response procedures at 07:28 UTC. At 07:51 UTC, we began mitigation, and at 08:15 UTC we took further action to address jobs that were stuck on a subset of host machines in the data center. At 09:07 UTC, we launched a status page to keep customers informed of the progress of the investigation.
At 12:17 UTC, we determined that the issue was also affecting all of our other data centers. At 12:25 UTC, we identified a misconfiguration in our storage infrastructure as the central cause and immediately began work to resolve it. At 13:24 UTC, a fix was found, and the incident status was moved to monitoring at 14:14 UTC. We continued to implement and monitor the fix, and at 17:00 UTC we declared the incident resolved.
Moving forward, our administrators are examining ways to ensure that changes to our infrastructure configurations are properly reviewed and that unnecessary configurations are removed, so we can proactively address potential concerns and prevent this type of incident from recurring. We are also reviewing how we monitor and collect data on infrastructure configuration state, and how that data can be used to detect and minimize similar incidents in the future.
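As an illustration of the kind of configuration-state check described above, the sketch below compares a host's live configuration against a reviewed baseline and flags any unreviewed, missing, or changed settings. The file paths, key names, and alerting behavior here are hypothetical examples, not a description of our actual tooling.

```python
# Minimal sketch: detect drift between a reviewed baseline config and the
# config actually present on a host. All paths and keys are hypothetical.
import json
from pathlib import Path


def load_config(path: str) -> dict:
    """Load a JSON config file into a dict of setting -> value."""
    return json.loads(Path(path).read_text())


def diff_configs(baseline: dict, actual: dict) -> dict:
    """Return settings that were added, removed, or changed vs. the baseline."""
    added = sorted(set(actual) - set(baseline))    # unreviewed settings
    removed = sorted(set(baseline) - set(actual))  # expected settings missing
    changed = sorted(
        k for k in set(baseline) & set(actual) if baseline[k] != actual[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    baseline = load_config("configs/storage-baseline.json")
    actual = load_config("/etc/storage/current.json")
    drift = diff_configs(baseline, actual)
    if any(drift.values()):
        # In practice this would page an operator or fail a CI check
        # before the change reaches production hosts.
        print(f"Configuration drift detected: {drift}")
    else:
        print("Configuration matches reviewed baseline.")
```

Run periodically against each host (or as a gate in a deployment pipeline), a check like this would surface both unreviewed configuration changes and leftover, unnecessary settings before they can affect production workloads.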