Security upgrades causing impaired instances.
Incident Report for CloudAMQP
Postmortem

On Wednesday around 6 AM UTC, about 1% of our Ubuntu servers applied upstream updates using unattended-upgrade. Two of those packages were netplan.io and libnetplan0. The default configuration in unattended-upgrade is to upgrade one package at a time (Unattended-Upgrade::MinimalSteps "true";). netplan.io depends on libnetplan0, but the two packages were installed separately. When netplan.io was installed and reloaded after the update, netplan generate segfaulted due to the mismatched versions. That left the systemd-networkd configuration invalid, which resulted in the instance releasing its IP and not acquiring a new one.
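
The mismatch described above could be detected by comparing the installed versions of the two packages. The following is a minimal sketch, not the tooling used during the incident. It assumes that netplan.io and libnetplan0 should always carry identical version strings, and the function names are hypothetical.

```python
import subprocess

# The two packages from the incident that must stay on the same version.
PAIRED_PACKAGES = ("netplan.io", "libnetplan0")


def parse_dpkg_versions(output):
    """Parse 'package<TAB>version' lines into a {package: version} dict."""
    versions = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, version = line.partition("\t")
        versions[name.strip()] = version.strip()
    return versions


def find_version_skew(versions, packages=PAIRED_PACKAGES):
    """Return {package: version} for the paired packages if their versions
    differ, or an empty dict if they agree (or fewer than two are installed)."""
    found = {pkg: versions[pkg] for pkg in packages if pkg in versions}
    if len(set(found.values())) > 1:
        return found
    return {}


def installed_versions(packages=PAIRED_PACKAGES):
    """Ask dpkg-query for installed versions (Debian/Ubuntu hosts only)."""
    result = subprocess.run(
        # dpkg-query itself expands the \t and \n escapes in the format string.
        ["dpkg-query", "-W", r"-f=${Package}\t${Version}\n", *packages],
        capture_output=True,
        text=True,
        check=False,  # a missing package gives a non-zero exit; parse what we got
    )
    return parse_dpkg_versions(result.stdout)
```

On a server, find_version_skew(installed_versions()) returns an empty dict when both packages agree and a non-empty dict naming both versions when they do not.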

When we figured out that netplan was segfaulting and that this was why we lost connectivity, we applied apt-mark hold netplan.io on the remaining servers. What we didn’t realize at the time was that it was the version conflict between netplan.io and libnetplan0 that was the problem. So the next morning another 1% of servers upgraded only libnetplan0, again causing a version mismatch which led to a SEGFAULT in netplan generate. This time it was more problematic: in the first wave libnetplan0 was eventually updated, so a restart of the server restored connectivity, but for the servers in the second wave the version mismatch was permanent, since netplan.io was on hold. We then updated the cloud-init script on affected servers and forced it to manually run dhclient on boot. That restored connectivity, and we could update the netplan.io version to resolve the version mismatch.
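
In hindsight, the second wave came from holding only one half of a package pair. As an illustration only, and not what was run during the incident, the sketch below builds an apt-mark hold command that covers both packages. The helper names are hypothetical; apt-mark hold does accept several package names in one call.

```python
import subprocess

# Hold both packages so neither can be upgraded without the other.
PAIRED_PACKAGES = ("netplan.io", "libnetplan0")


def hold_command(packages=PAIRED_PACKAGES):
    """Build the apt-mark command that holds every package in the group."""
    return ["apt-mark", "hold", *packages]


def hold_packages(packages=PAIRED_PACKAGES):
    """Run the hold command (needs root on a Debian/Ubuntu host)."""
    subprocess.run(hold_command(packages), check=True)
```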

We’ve changed unattended-upgrade to apply updates in one go so that similar version mismatch bugs can’t hit us again: Unattended-Upgrade::MinimalSteps "false";
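
A setting like this can be audited across a fleet by inspecting the effective APT configuration. The following is a minimal sketch. It assumes the configuration text comes from something like the output of apt-config dump, where each setting is printed as Key "value"; on one line. The function names are hypothetical, and only the literal value "false" is treated as disabling minimal steps.

```python
import re

SETTING = "Unattended-Upgrade::MinimalSteps"

# Matches lines of the form: Some::Key "value";
# Comment lines starting with // do not match because of the leading anchor.
_SETTING_LINE = re.compile(r'^\s*([\w:.-]+)\s+"([^"]*)"\s*;')


def minimal_steps_value(config_text):
    """Return the last value set for MinimalSteps, or None if it is unset."""
    value = None
    for line in config_text.splitlines():
        match = _SETTING_LINE.match(line)
        if match and match.group(1) == SETTING:
            value = match.group(2)  # later entries override earlier ones
    return value


def applies_updates_in_one_go(config_text):
    """True only if MinimalSteps is explicitly set to "false"."""
    value = minimal_steps_value(config_text)
    return value is not None and value.lower() == "false"
```

An unset value returns None from minimal_steps_value and is therefore reported as not applying updates in one go, which matches the default behaviour described above.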

All affected customers are eligible for refunds according to the SLA.

The netplan.io version mismatch bug is reported here: https://bugs.launchpad.net/netplan/+bug/1922898

Posted Apr 09, 2021 - 20:05 UTC

Resolved
This incident has been resolved.
Posted Apr 08, 2021 - 13:42 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 08, 2021 - 09:25 UTC
Investigating
We are currently investigating this issue.
Posted Apr 08, 2021 - 07:38 UTC
This incident affected: Shared servers and Dedicated servers.