On January 5, 2024 at approximately 19:56 UTC, internal alerting notified our administrators that Object Storage was down in our Frankfurt datacenter. An investigation began immediately to determine the root cause of the failure.
Customers were unable to create new buckets at the Frankfurt datacenter, and received HTTP 504 (Gateway Timeout) errors when attempting to connect to or modify existing buckets.
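During the outage window, client applications hitting intermittent 504s could have shielded batch operations with a retry-and-backoff wrapper. A minimal sketch follows; the helper name and the `TransientError` class are illustrative stand-ins, not part of any Linode SDK:

```python
import time


class TransientError(Exception):
    """Stands in for a transient 5xx response (e.g. a 504 Gateway Timeout)."""


def retry_with_backoff(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff.

    Delays double after each failed attempt (0.5s, 1s, 2s, ...); the final
    failure is re-raised so callers can surface the error.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in production the default `time.sleep` applies the backoff.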
During the investigation, our administrators found that two legacy routers had failed, affecting Object Storage and other infrastructure within the Frankfurt datacenter. At 20:33 UTC, our administrators rebooted the two routers to restore connectivity to the Object Storage service. Once the reboots were complete, they found that all ports on the routers were in a suspended state, which required additional investigation. That investigation concluded around 21:22 UTC, when our administrators successfully un-suspended the ports. The last step to full mitigation was waiting for a backend dependency of the Object Storage service to become active, which occurred at approximately 21:50 UTC on January 5, 2024. After network connectivity was restored, we expected roughly 12 hours of intermittent latency while the service recovered. Once that 12-hour window had elapsed, the Object Storage service fully recovered and all customer traffic returned to normal.
From that point, we monitored the routers for about two weeks to verify their stability. As we received no further reports or alerts, we resolved the incident on the Linode status page on January 19, 2024.
On February 6, 2024, we detached the last remaining Object Storage nodes from the legacy routers as part of our ongoing effort to build a more resilient infrastructure that our customers can rely on.