On both February 11th and February 21st, we experienced networking outages in our Fremont and Santa Clara data centers.
At 17:45 UTC on February 11th, an individual link, part of a multi-fiber bundle of circuits connecting the two data center facilities in the US-West region, went down.
This multi-fiber circuit is one side of a geographically diverse and redundant pair, referred to here as the A-side and B-side. These circuits connect network nodes consisting of border routers and core switches between the Santa Clara and Fremont sites. Their purpose is to handle peer Linode traffic, as well as to share and route internet transit between the sites. The failed link on the A-side interconnected two border routers, and internet traffic rerouted over the remaining B-side link. No immediate customer impact was observed, and the core switch links were still active. Failures like this are relatively common, and the response team began investigation and repair.
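To make the redundancy model concrete, the following is a minimal sketch of how traffic between the sites fails over from one side to the other when a member link drops, and why losing both sides isolates the facilities. The link names, counts, and data structures are illustrative assumptions, not the actual US-West topology or Linode tooling.

```python
# Hypothetical model of the inter-site paths described above; link names and
# roles are illustrative assumptions, not the real US-West topology.

links = {
    "A-side": {"border-link": "up", "core-link": "up"},
    "B-side": {"border-link": "up", "core-link": "up"},
}

def path_for(role: str) -> str:
    """Return which side can carry traffic for a given link role (border or core).

    Traffic uses whichever side still has that link up; if neither side does,
    the sites are isolated for that class of traffic.
    """
    for side in ("A-side", "B-side"):
        if links[side][role] == "up":
            return side
    raise RuntimeError(f"No {role} path between sites: traffic is black-holed")

# February 11th, 17:45 UTC: the A-side link between the border routers fails.
links["A-side"]["border-link"] = "down"
print(path_for("border-link"))  # -> "B-side": internet traffic reroutes, no customer impact
print(path_for("core-link"))    # -> "A-side": core switch links are still active
```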
At 2:45 UTC on February 12th, the entire multi-fiber circuit on the B-side went down. This isolated the border routers from each other. Customer internet traffic rerouted and continued to flow via the links between the core switches on the A-side; however, network performance was suboptimal.
At 8:45 UTC on February 12th, network engineers attempted to stabilize the network by removing all internet traffic from the border routers in the Santa Clara facility. The impact of this change was not immediately observed, but it led to a loss of internet connectivity for customers in Santa Clara.
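In hindsight, a change like this benefits from a pre-flight check that every site retains at least one internet egress after the drain. The sketch below is a hypothetical illustration of such a check; the site names, router names, and function are assumptions for illustration, not an actual Linode tool.

```python
# Hypothetical pre-change validation: confirm each site keeps an internet egress
# if the proposed drain is applied. Topology and names are illustrative only.

egress_paths = {
    "santa-clara": {"scl-border-1", "scl-border-2"},
    "fremont": {"fmt-border-1", "fmt-border-2"},
}

def validate_drain(drained_routers: set) -> list:
    """Return the sites that would lose all internet egress after the drain."""
    return [
        site for site, routers in egress_paths.items()
        if not (routers - drained_routers)
    ]

# Proposed change: remove internet traffic from the Santa Clara border routers.
impacted = validate_drain({"scl-border-1", "scl-border-2"})
if impacted:
    print(f"Drain would isolate: {', '.join(impacted)}")  # -> "Drain would isolate: santa-clara"
```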
At 10:13 UTC on February 12th, the change was reverted and service was restored. Shortly after, network engineers began work to restore the down circuits on both the A-side and B-side.
Initial troubleshooting of the failed link on the A-side did not reveal a discernible cause, so the issue was escalated by engaging the hardware vendor and opening a support case. During the initial discovery and troubleshooting with the vendor, a non-intrusive command triggered a rare bug that caused a total failure of all links on the A-side. This completely severed connectivity between the two data centers, resulting in degraded service or a full-scale outage in both data centers.
Network engineers were advised that a code upgrade of both circuits was required to mitigate the bug and restore connectivity. The team prepared to perform an emergency code upgrade; these upgrades were completed at 16:25 UTC, restoring connectivity, and traffic returned to the A-side path by 17:29 UTC. At this point, the acute connection issues to US-West were resolved, though the B-side path between the sites was still down.
Work to restore the B-side continued with both the hardware vendor and the fiber provider in order to troubleshoot and locate the fault. At 18:30 UTC on February 14th, we restored service to the B-side path and regained full redundancy between the US-West sites. The fault was determined to be a unidirectional loss of light across an intermediary segment of the circuit connecting the two data centers. This is a very rare failure type and is very difficult to identify in multi-kilometer fiber spans. We are waiting for a final report from the vendor as to the cause.
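For context on why this failure mode is hard to catch: a unidirectional fault means light is still received in one direction, so the interface on one end can appear healthy. The sketch below shows the kind of asymmetry check that can surface it by comparing transmit and receive optical power at each end of a circuit. The readings, threshold, and function are hypothetical; real monitoring would poll transceiver diagnostics (DOM/DDM) from the devices.

```python
# Hypothetical check for a unidirectional fiber fault using optical power readings
# (dBm) from the transceivers at each end. Values and threshold are illustrative.

LOS_THRESHOLD_DBM = -30.0  # below this, treat receive power as loss of light

def unidirectional_fault(end_a: dict, end_b: dict):
    """Flag a one-way loss of light between two ends of a circuit.

    Each reading is {"tx_dbm": float, "rx_dbm": float} for one end.
    """
    a_to_b_down = end_a["tx_dbm"] > LOS_THRESHOLD_DBM and end_b["rx_dbm"] <= LOS_THRESHOLD_DBM
    b_to_a_down = end_b["tx_dbm"] > LOS_THRESHOLD_DBM and end_a["rx_dbm"] <= LOS_THRESHOLD_DBM
    if a_to_b_down and not b_to_a_down:
        return "Loss of light from A toward B only"
    if b_to_a_down and not a_to_b_down:
        return "Loss of light from B toward A only"
    return None

# Example: both ends transmit normally and end A still receives light,
# but end B receives nothing -- a unidirectional fault in the A-to-B direction.
print(unidirectional_fault(
    {"tx_dbm": -2.0, "rx_dbm": -3.1},
    {"tx_dbm": -2.3, "rx_dbm": -40.0},
))
```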