Introduction
On 05/09/2022 Tink experienced a major performance degradation between 11:28 and 11:46 (CET). It affected the authentication flow, which is present in the majority of our products.
We know how important it is to our customers that we provide a highly available and reliable service and we sincerely apologize for the disturbances this incident caused.
Root Cause Analysis
The root cause of this outage was a configuration change that one of our engineering teams deployed at 11:20 (CET). The change increased traffic towards an internal component that was not configured to scale to the level this traffic required. As a result, the component became overloaded, which increased response times for the services that depend on it.
Because the staging environment carries less traffic than production, the problem was not caught while the change was live in staging.
Remediation
Alerts fired at 11:31 (CET). The incident was called a minute later, and we found the root cause within a few minutes after that. At 11:38 we removed the cause of the extra load and systems started to recover. By 11:46 the systems had fully recovered and performance was back to normal.
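For readers who want the intervals between these timestamps, the arithmetic can be sketched as follows. This is a minimal illustration that uses only the times stated in this report; the event names and the `minutes_between` helper are hypothetical labels chosen for the example and do not reflect any Tink tooling.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"

# Timestamps (CET) as stated in this report. The keys are illustrative names.
EVENTS = {
    "change_deployed": "11:20",
    "degradation_started": "11:28",
    "alerts_fired": "11:31",
    "load_removed": "11:38",
    "fully_recovered": "11:46",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two named events."""
    start_time = datetime.strptime(EVENTS[start], TIME_FORMAT)
    end_time = datetime.strptime(EVENTS[end], TIME_FORMAT)
    return int((end_time - start_time).total_seconds() // 60)


if __name__ == "__main__":
    intervals = [
        ("degradation_started", "alerts_fired"),
        ("alerts_fired", "load_removed"),
        ("load_removed", "fully_recovered"),
        ("degradation_started", "fully_recovered"),
    ]
    for start, end in intervals:
        print(f"{start} -> {end}: {minutes_between(start, end)} min")
```

With the times above, this gives 3 minutes from the start of the degradation to the alerts, 7 minutes from the alerts to the removal of the extra load, 8 minutes from that point to full recovery, and 18 minutes of degraded performance in total.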
Follow-up actions
As a consequence of this incident we will focus on the following actions to prevent similar scenarios from occurring again: