On May 31st, at approximately 1600 UTC, a change was introduced to the API that caused LKE clusters to improperly scale. A gap in our automated testing did not capture this failure case, and this change was pushed to production. This caused some customer clusters to resize in unexpected ways and introduced additional billable resources to affected customer accounts.
Once identified, a hotfix was implemented to correct this behavior at 09:40 UTC June 1st. Additional worker nodes were left in place in order to prevent impacting customer workload. Affected customers were alerted via ticket for credit adjustments and instructions for further mitigation.
As a result of this incident, we will be auditing our tests and have created additional pre-deployment checklist items to ensure that this does not happen again.