[DUBLIN] Network Equipment Failure
Incident Report for MacStadium
Postmortem

Postmortem summary

Postmortem owners Mark Ryan, Network Engineer ⎮ Michael Weir, Systems Engineer ⎮ Michael Rhoades, Manager, Service Center
Incident Infrastructure router crash/module failure
Priority P1
Affected services Partial network connectivity at MacStadium’s Dublin data center

Executive summary

At 20:25 UTC on Sunday, September 13, 2020 MacStadium’s Dublin data center experienced a partial network outage affecting some customers.  This outage lasted for 2 hours and was due to a Supervisor module crash on a key infrastructure router (and the hardware failure of one of its modular components). The supervisor module was manually rebooted and the majority of traffic was restored. A sub-module had experienced a failure that additionally required a hardware replacement. Service to all customers was restored at 22:20 UTC.

Postmortem report

Instructions Report
Timeline 2020-08-13, 9:12 UTC – a MacStadium Technician escalates an unusual pattern of alerts to the Incident Manager. The Engineering Team begins an immediate investigation. ⎮ 19:25 UTC – MacStadium Engineers confirm there is a partial outage affecting some customers in the Dublin data center and continue with troubleshooting. ⎮ 19:35 UTC – It was determined a key infrastructure router had crashed. Engineers reboot the router, which restored traffic to the majority of affected servers. ⎮ 19:50 UTC – One of the router sub-modules fails to power back on, and Engineers are dispatched to the Dublin data center to continue the investigation. ⎮ 19:54 UTC – The Incident Manager updates MacStadium’s Status Page with investigation status. ⎮ 20:20 UTC – Engineers arrive on site. They test and confirm the failure of the module. They locate, install, and test a replacement module. Once the cabling is fully reconnected to the new module, all traffic is restored. ⎮ 21:26 UTC – Status Page is updated that the incident cleared.
Root cause The primary supervisor module on a key infrastructure router crashed, necessitating a reboot of the router. After reboot, it was found that a sub-module had failed and required hardware replacement.
Follow-up Actions Approximately one year ago MacStadium set out to upgrade its Dublin data center to a new network architecture in order to provide greater east-west bandwidth, resiliency, and scale. The final piece of that plan will be executed by December 2020. Unfortunately, this incident occurred due to a single point of failure in the Dublin network that our network upgrade will address.
Posted Sep 21, 2020 - 17:51 UTC

Resolved
Monitoring will continue, no further issues anticipated. Incident is resolved.
Posted Sep 14, 2020 - 14:32 UTC
Monitoring
Failed equipment has been replaced and the network issue should be stabilized. On site Engineers are currently monitoring the replacement equipment.
Posted Sep 13, 2020 - 21:26 UTC
Investigating
We have received alerts of a down network device that could affect several customers in the Dublin Datacenter, Engineers are onsite investigating the issue currently.
Posted Sep 13, 2020 - 19:54 UTC