Issue: On April 7, 2021, the systems engineering team promoted a kernel patch to core systems. The patch had been tested for an extended period against staged servers with no observed impact; however, 2.5 hours into the patching process, monitoring alerts began to come in for several of our core services. The patch caused a kernel panic that left the affected servers in a locked-up state, requiring manual intervention to reboot them and roll back patch levels.
Timeline/Impact Analysis: The initial patching of the staging environment occurred in the final week of March and was completed and in testing as of March 31, 2021. The environment was observed for one week before internal approval was given to begin patching our production environment. On April 7, 2021, the patch-level promotion was pushed to our fleet of servers. 2.5 hours into the patching process, alerting began, and the team immediately acted to determine impact and identify the issue. Once the patching event was identified as the cause, the team rolled back the promoted patch level for the fleet. The rollback occurred 18 minutes after initial alerting began; however, several core services were already impacted, and bringing them back online is what caused the extended downtime. Specifically, manual intervention was required to reboot the affected VMs and servers that had received the patch-level promotion and experienced a kernel panic.
Corrective Action/Prevention Plan: The process for rolling out patch-level promotions has been reevaluated, as has the order in which critical core services receive promotions. Specifically, the patch-level promotion will be exposed to a growing list of servers across several additional testing periods, and all critical core services will be patched in a phased approach at the tail end of the patching process. A minimal sketch of this phased ordering follows.
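The sketch below illustrates the "growing list" promotion order described above: progressively larger groups of non-critical servers, each followed by an observation period, with critical core services held for the final phases. All names, group sizes, and observation windows are hypothetical and are not the team's actual tooling, assuming only the phased approach stated in the plan.

    # Hypothetical sketch of a phased patch-level promotion plan.
    # Hosts, batch sizes, and soak times are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Phase:
        name: str
        servers: list[str]
        observation_days: int  # soak time before starting the next phase

    def build_rollout_plan(noncritical: list[str], critical: list[str]) -> list[Phase]:
        """Expose the patch to a growing list of non-critical servers,
        then patch critical core services last, one at a time."""
        plan: list[Phase] = []
        batch_size = 1
        remaining = list(noncritical)
        while remaining:
            batch, remaining = remaining[:batch_size], remaining[batch_size:]
            plan.append(Phase(f"canary-{len(plan) + 1}", batch, observation_days=2))
            batch_size *= 2  # grow the exposed set each round
        # Critical core services come at the tail end, with longer soak times.
        for host in critical:
            plan.append(Phase(f"critical-{host}", [host], observation_days=3))
        return plan

    if __name__ == "__main__":
        plan = build_rollout_plan(
            noncritical=["app-01", "app-02", "app-03", "batch-01", "batch-02"],
            critical=["db-01", "auth-01"],
        )
        for phase in plan:
            print(f"{phase.name}: {phase.servers} (observe {phase.observation_days}d)")

Running the sketch prints each phase in order, doubling the number of non-critical servers exposed per round before any critical host is touched, which mirrors the intent of the prevention plan without prescribing a specific implementation.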