On both February 11th and February 21st, we experienced networking outages in our Fremont and Santa Clara data centers.
At 17:45 UTC on February 11th, an individual link, part of a multi-fiber bundle of circuits connecting the two data center facilities in the US-West region, went down.
This multi-fiber circuit is one side of a geographically diverse and redundant pair, referred to here as the A-side and B-side. These circuits connect network nodes consisting of border routers and core switches between the Santa Clara and Fremont sites. Their purpose is to handle peer Linode traffic, as well as to share and route internet transit between the sites. The failed link on the A-side interconnected two border routers, and internet traffic rerouted over the remaining B-side link. No immediate customer impact was observed, and the core switch links were still active. Failures like this are relatively common, and the response team began investigation and repair.
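To make the redundancy model concrete, the following is a minimal sketch of how traffic between the sites fails over from one side to the other when a member link drops, and why losing both sides isolates the facilities. The link names, counts, and data structures are illustrative assumptions, not the actual US-West topology or Linode tooling.

```python
# Hypothetical model of the inter-site paths described above; link names and
# roles are illustrative assumptions, not the real US-West topology.

links = {
    "A-side": {"border-link": "up", "core-link": "up"},
    "B-side": {"border-link": "up", "core-link": "up"},
}

def path_for(role: str) -> str:
    """Return which side can carry traffic for a given link role (border or core).

    Traffic uses whichever side still has that link up; if neither side does,
    the sites are isolated for that class of traffic.
    """
    for side in ("A-side", "B-side"):
        if links[side][role] == "up":
            return side
    raise RuntimeError(f"No {role} path between sites: traffic is black-holed")

# February 11th, 17:45 UTC: the A-side link between the border routers fails.
links["A-side"]["border-link"] = "down"
print(path_for("border-link"))  # -> "B-side": internet traffic reroutes, no customer impact
print(path_for("core-link"))    # -> "A-side": core switch links are still active
```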
At 2:45 UTC on February 12th, the entire multi-fiber circuit on the B-side went down. This isolated the border routers from each other. Customer internet traffic rerouted and continued to flow via the links between the core switches on the A-side; however, network performance was suboptimal.
At 8:45 UTC on February 12th, network engineers attempted to stabilize the network by removing all internet traffic from the border routers in the Santa Clara facility. The impact of this change was not immediately observed, but it led to a loss of internet connectivity for customers in Santa Clara.
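In hindsight, a change like this benefits from a pre-flight check that every site retains at least one internet egress after the drain. The sketch below is a hypothetical illustration of such a check; the site names, router names, and function are assumptions for illustration, not an actual Linode tool.

```python
# Hypothetical pre-change validation: confirm each site keeps an internet egress
# if the proposed drain is applied. Topology and names are illustrative only.

egress_paths = {
    "santa-clara": {"scl-border-1", "scl-border-2"},
    "fremont": {"fmt-border-1", "fmt-border-2"},
}

def validate_drain(drained_routers: set) -> list:
    """Return the sites that would lose all internet egress after the drain."""
    return [
        site for site, routers in egress_paths.items()
        if not (routers - drained_routers)
    ]

# Proposed change: remove internet traffic from the Santa Clara border routers.
impacted = validate_drain({"scl-border-1", "scl-border-2"})
if impacted:
    print(f"Drain would isolate: {', '.join(impacted)}")  # -> "Drain would isolate: santa-clara"
```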
At 10:13 UTC on February 12th, the change was reverted and service was restored. Shortly after, network engineers began work to restore the down circuits on both the A-side and B-side.
Initial troubleshooting of the failed link on the A-side did not reveal a discernible cause, so the issue was escalated by engaging the hardware vendor and opening a support case. During the initial discovery and troubleshooting with the vendor, a non-intrusive command triggered a rare bug that caused a total failure of all links on the A-side. This completely severed connectivity between the two data centers, resulting in degraded service or a full-scale outage in both data centers.
Network engineers were advised that a code upgrade of both circuits was required to mitigate the bug and restore connectivity. The team prepared to perform an emergency code upgrade; these upgrades were completed at 16:25 UTC, restoring connectivity, and traffic returned to the A-side path by 17:29 UTC. At this point, the acute connection issues to US-West were resolved, though the B-side path between the sites was still down.
Work to restore the B-side continued with both the hardware vendor and the fiber provider in order to troubleshoot and locate the fault. At 18:30 UTC on February 14th, we restored service to the B-side path and regained full redundancy between the US-West sites. The fault was determined to be a unidirectional loss of light across an intermediary segment of the circuit connecting the two data centers. This is a very rare failure type and is very difficult to identify in multi-kilometer fiber spans. We are waiting for a final report from the vendor as to the cause.
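For context on why this failure mode is hard to catch: a unidirectional fault means light is still received in one direction, so the interface on one end can appear healthy. The sketch below shows the kind of asymmetry check that can surface it by comparing transmit and receive optical power at each end of a circuit. The readings, threshold, and function are hypothetical; real monitoring would poll transceiver diagnostics (DOM/DDM) from the devices.

```python
# Hypothetical check for a unidirectional fiber fault using optical power readings
# (dBm) from the transceivers at each end. Values and threshold are illustrative.

LOS_THRESHOLD_DBM = -30.0  # below this, treat receive power as loss of light

def unidirectional_fault(end_a: dict, end_b: dict):
    """Flag a one-way loss of light between two ends of a circuit.

    Each reading is {"tx_dbm": float, "rx_dbm": float} for one end.
    """
    a_to_b_down = end_a["tx_dbm"] > LOS_THRESHOLD_DBM and end_b["rx_dbm"] <= LOS_THRESHOLD_DBM
    b_to_a_down = end_b["tx_dbm"] > LOS_THRESHOLD_DBM and end_a["rx_dbm"] <= LOS_THRESHOLD_DBM
    if a_to_b_down and not b_to_a_down:
        return "Loss of light from A toward B only"
    if b_to_a_down and not a_to_b_down:
        return "Loss of light from B toward A only"
    return None

# Example: both ends transmit normally and end A still receives light,
# but end B receives nothing -- a unidirectional fault in the A-to-B direction.
print(unidirectional_fault(
    {"tx_dbm": -2.0, "rx_dbm": -3.1},
    {"tx_dbm": -2.3, "rx_dbm": -40.0},
))
```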