Starting at approximately 21:41 UTC on Sunday, January 14, 2024, Akamai was first alerted by the Chicago datacenter provider of an equipment failure. This failure resulted in the Akamai Compute support team receiving connectivity error reports from customers. During the incident, customers were receiving timeout errors and were unable to access their compute instances.
During the Akamai incident response team’s investigation, they found that the issue was caused by the third-party datacenter’s cooling system failure. This caused Akamai and other peer network equipment in the datacenter to reach temperatures beyond operating levels. Any Akamai equipment affected by the rise in temperature was safely removed from operation. We worked with the datacenter provider and restored the connectivity temporarily by rerouting traffic from the equipment that needed to be removed. These actions were completed around 05:35 UTC on Monday, January 15, 2024.
As of 14:12 UTC on January 16, 2024, the failed equipment was restored to operational status, as the cooling system within the datacenter had been fixed and temperatures returned to normal equipment operating levels. At this point connectivity within the Chicago datacenter was fully restored.
Moving forward to aid in building more resilient infrastructure, we will review this incident to take action on any process improvements to prevent this from reoccurring. Along with reviewing our temperature alerting for any deficiencies within the various components used for early warning.