Major outage for aggregation success rates

Incident Report for Tink

Postmortem

Introduction

On May 9, 2022, Tink experienced a major performance degradation between 11:28 and 11:46 (CET). It affected the authentication flow, which is present in the majority of our products.

We know how important it is to our customers that we provide a highly available and reliable service, and we sincerely apologize for the disruption this incident caused.

Root Cause Analysis

The root cause of this outage was a configuration change that one of our engineering teams deployed at 11:20 (CET). The change increased traffic towards an internal component that was not configured to scale to the level that traffic required. As a result, the component became overloaded, which increased response times for the services that depend on it.

Because the staging environment receives less traffic than production, the problem was not caught while the change was live in staging.

Remediation

Alerts fired at 11:31 (CET). The incident was declared a minute later, and we identified the root cause within a few minutes after that. At 11:38 we removed the cause of the extra load and systems started to recover. By 11:46 the systems had fully recovered and performance was back to normal.
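
For readers who want the intervals spelled out, the timestamps in this report can be turned into durations with a few lines of Python. This is only an illustrative sketch: the times are taken from the text above, while the dictionary keys and the helper name minutes_between are labels chosen here for the example, not terms used by Tink.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps as stated in this report (same day, same time zone).
# The key names are illustrative labels, not official terminology.
timeline = {
    "change_deployed": "11:20",
    "degradation_started": "11:28",
    "alerts_fired": "11:31",
    "cause_removed": "11:38",
    "fully_recovered": "11:46",
}

durations = {
    "deploy_to_degradation": minutes_between(
        timeline["change_deployed"], timeline["degradation_started"]
    ),
    "degradation_to_alert": minutes_between(
        timeline["degradation_started"], timeline["alerts_fired"]
    ),
    "alert_to_cause_removed": minutes_between(
        timeline["alerts_fired"], timeline["cause_removed"]
    ),
    "cause_removed_to_recovery": minutes_between(
        timeline["cause_removed"], timeline["fully_recovered"]
    ),
    "total_degradation": minutes_between(
        timeline["degradation_started"], timeline["fully_recovered"]
    ),
}

for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
```

Read this way, the degradation lasted 18 minutes in total, with 3 minutes from the start of the degradation to the first alert.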

Follow-up actions

As a consequence of this incident we will focus on the following actions to prevent similar scenarios from occurring again:

Ensure service quotas are sufficient to handle 2X traffic, with additional dedicated alerting when quotas are exceeded
Ensure we have an improved mechanism in place to control the flow of batch traffic through the system
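
The two actions above can be illustrated with a minimal Python sketch. It is not a description of Tink's implementation, which this report does not detail. It assumes a quota expressed in requests per second and uses a token bucket, one common way to cap the rate of batch traffic. The names has_headroom and TokenBucket, and all numbers, are hypothetical.

```python
import time


def has_headroom(quota_rps: float, peak_rps: float, factor: float = 2.0) -> bool:
    """Return True if the quota covers `factor` times the observed peak.

    A False result is the kind of condition a dedicated alert could fire on.
    """
    return quota_rps >= factor * peak_rps


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self._clock = clock         # injectable clock, useful for testing
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Hypothetical numbers: a quota of 1000 rps against a 400 rps peak.
    print(has_headroom(quota_rps=1000, peak_rps=400))

    # Hypothetical batch limiter: at most 5 requests in a burst, 5 per second.
    bucket = TokenBucket(rate=5.0, capacity=5.0)
    admitted = sum(bucket.try_acquire() for _ in range(20))
    print(f"admitted {admitted} of 20 batch requests in the first burst")
```

A batch job that receives False from try_acquire would wait or re-queue its request rather than send it, so a surge of batch work cannot push a downstream component past the rate it is sized for.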

Posted May 11, 2022 - 11:37 CEST

Resolved

This issue has been resolved. We apologize for this inconvenience.

Posted May 09, 2022 - 14:02 CEST

Update

We are continuing to monitor for any further issues.

Posted May 09, 2022 - 13:05 CEST

Monitoring

The service is currently recovering and we are actively monitoring the situation.

Posted May 09, 2022 - 11:53 CEST

Identified

The issue has been identified and we are currently working on a solution.

Posted May 09, 2022 - 11:51 CEST

Investigating

We are seeing a major outage for our aggregation service. This affects all products.
We will provide a progress update every 20 minutes.

Posted May 09, 2022 - 11:46 CEST

This incident affected: Payments, Account Check, Transactions, Business Account Check, Business Transactions, Income Check, Risk Insights, Money Manager, and Business Manager.