At 14:58 UTC on 14th February 2023, a pending infrastructure upgrade was triggered in the EU region. The change was still undergoing testing and was applied to production due to human error. Shortly after, we began failing over to a secondary cluster; the failover encountered unexpected problems, which extended the period of instability.
API and SDK endpoints in the EU region were intermittently unavailable between 15:20 and 17:50 UTC. As a result, approximately 30% fewer Checks were performed than normal during that period.
A pending infrastructure upgrade was scheduled for testing in pre-production. Due to human error, the operation was instead triggered in production.
The nature of the upgrade did not allow the change to be rolled back, forcing a failover to a secondary cluster.
Unexpected problems during the failover, related to inaccurate replication of cluster configuration between the primary and secondary clusters, extended the incident.
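The inaccurate replication of cluster configuration described above is the kind of drift a pre-failover parity check can surface before traffic moves. The sketch below is a minimal illustration of such a check, not the actual tooling involved in the incident; the endpoints and names (CLUSTERS, fetch_cluster_config, diff_configs) are all hypothetical.

```python
# Minimal sketch of a pre-failover configuration parity check.
# All endpoints and names here are hypothetical; the real clusters
# and config sources from the incident are not public.
import json
import sys
from urllib.request import urlopen

# Hypothetical endpoints exposing each cluster's effective configuration.
CLUSTERS = {
    "eu-primary": "https://eu-primary.internal.example/config",
    "eu-secondary": "https://eu-secondary.internal.example/config",
}


def fetch_cluster_config(url: str) -> dict:
    """Fetch a cluster's effective configuration as a JSON document."""
    with urlopen(url) as resp:
        return json.load(resp)


def diff_configs(primary: dict, secondary: dict) -> list[str]:
    """Return the keys whose values differ between the two clusters."""
    keys = set(primary) | set(secondary)
    return sorted(k for k in keys if primary.get(k) != secondary.get(k))


if __name__ == "__main__":
    primary = fetch_cluster_config(CLUSTERS["eu-primary"])
    secondary = fetch_cluster_config(CLUSTERS["eu-secondary"])
    drift = diff_configs(primary, secondary)
    if drift:
        print(f"Refusing failover: configuration drift in {drift}")
        sys.exit(1)
    print("Secondary configuration matches primary; safe to fail over.")
```

Running a check like this as a gate in the failover procedure turns silent configuration drift into an explicit, pre-traffic failure.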
14:58 UTC: An unplanned infrastructure upgrade was triggered in our primary EU cluster. This resulted in intermittent issues routing external traffic to the cluster.
15:07 UTC: We began to progressively fail over to a secondary cluster.
15:20 UTC: Error rates increased, leading to a period of service instability.
15:30-16:25 UTC: We took corrective actions to address issues in the failover process and stabilise the service.
16:25-16:55 UTC: Service stability improved, with some residual degradation. Monitoring continued.
16:55 UTC: We identified residual issues specific to manual processing.
17:55 UTC: Fixes completed to resolve residual issues. Service stable.
Amend administrator access permissions to further restrict who can trigger production infrastructure upgrades (in progress; see the sketch after this list).
Enhance cluster failover process to ensure reliable failover under exceptional conditions (ETA: March 2023).
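As an illustration of the first remediation item, the sketch below shows one possible shape for the restriction: triggering an upgrade against a production environment requires an explicitly granted role, separate from ordinary administrator access. The Operator type, role strings, and environment names are hypothetical, not drawn from the actual access-control system.

```python
# Hypothetical sketch of a two-tier permission check for upgrades:
# ordinary admin access suffices for pre-production, but production
# upgrades require a separate, narrowly granted role.
from dataclasses import dataclass


PRODUCTION_ENVS = {"eu-production", "us-production"}


@dataclass
class Operator:
    name: str
    roles: set[str]


def can_trigger_upgrade(operator: Operator, environment: str) -> bool:
    """Allow production upgrades only for holders of 'prod-upgrade';
    'infra-admin' alone is enough for pre-production environments."""
    if environment in PRODUCTION_ENVS:
        return "prod-upgrade" in operator.roles
    return "infra-admin" in operator.roles


# Example: an administrator without the production role is blocked.
alice = Operator("alice", {"infra-admin"})
assert can_trigger_upgrade(alice, "eu-staging")
assert not can_trigger_upgrade(alice, "eu-production")
```

Separating the production-upgrade role from day-to-day administration means a command mistakenly aimed at production fails closed instead of executing.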