Postmortem owners | Mark Ryan, Network Engineer ⎮ Michael Weir, Systems Engineer ⎮ Michael Rhoades, Manager, Service Center |
---|---|
Incident | Infrastructure router crash/module failure |
Priority | P1 |
Affected services | Partial network connectivity at MacStadium’s Dublin data center |
At 20:25 UTC on Sunday, September 13, 2020 MacStadium’s Dublin data center experienced a partial network outage affecting some customers. This outage lasted for 2 hours and was due to a Supervisor module crash on a key infrastructure router (and the hardware failure of one of its modular components). The supervisor module was manually rebooted and the majority of traffic was restored. A sub-module had experienced a failure that additionally required a hardware replacement. Service to all customers was restored at 22:20 UTC.
Instructions | Report |
---|---|
Timeline | 2020-08-13, 9:12 UTC – a MacStadium Technician escalates an unusual pattern of alerts to the Incident Manager. The Engineering Team begins an immediate investigation. ⎮ 19:25 UTC – MacStadium Engineers confirm there is a partial outage affecting some customers in the Dublin data center and continue with troubleshooting. ⎮ 19:35 UTC – It was determined a key infrastructure router had crashed. Engineers reboot the router, which restored traffic to the majority of affected servers. ⎮ 19:50 UTC – One of the router sub-modules fails to power back on, and Engineers are dispatched to the Dublin data center to continue the investigation. ⎮ 19:54 UTC – The Incident Manager updates MacStadium’s Status Page with investigation status. ⎮ 20:20 UTC – Engineers arrive on site. They test and confirm the failure of the module. They locate, install, and test a replacement module. Once the cabling is fully reconnected to the new module, all traffic is restored. ⎮ 21:26 UTC – Status Page is updated that the incident cleared. |
Root cause | The primary supervisor module on a key infrastructure router crashed, necessitating a reboot of the router. After reboot, it was found that a sub-module had failed and required hardware replacement. |
Follow-up Actions | Approximately one year ago MacStadium set out to upgrade its Dublin data center to a new network architecture in order to provide greater east-west bandwidth, resiliency, and scale. The final piece of that plan will be executed by December 2020. Unfortunately, this incident occurred due to a single point of failure in the Dublin network that our network upgrade will address. |