On June 14 we experienced an outage due to a bug in 4 of the 6 hardware switches in the core of the CM Cloud Network.
The bug caused these 4 switches to lose their connection to our cloud nodes in 2 separate data centers at the same moment.
Our CM Cloud Network is designed to be triple redundant, but losing virtual servers in 2 of our 3 core data locations at the same time disrupted some of our geo-redundant database clusters.
Our Network Operation Center detected the connectivity issue and escalated the case within minutes.
After the switches restarted, the CM Cloud Infrastructure was fully up and running again within 45 minutes.
After about 1 hour, the majority of our services were fully operational again.
Timeline (CEST):
15:21 – Connectivity issues detected
16:24 – First services recovering
16:27 – Majority of services operational
20:31 – All services operational
What’s next?
Together with our supplier, we are investigating the failure of the network switches.
In the meantime, we are taking extra measures within our DNS and database clusters to improve resilience and speed up recovery.
This outage affected most of our services, and we’re sorry for the impact on our customers and everyone who relies on them.