Increased latency on check creation

Incident Report for Onfido

Postmortem

Summary

On June 9th, the public API experienced degraded performance between 09:45 UTC and 10:05 UTC. A deployment of an internal system introduced unexpected behaviour that prevented document reports from being processed.

During this timeframe, no data was lost. Checks that were created experienced a longer turnaround time until the incident was resolved.

The deployment performed a database maintenance operation which, despite the security checks and evaluations, created an exclusive lock that had to be fixed manually.

Root Causes

  • A planned deployment was carried out on an internal service
  • A database migration was released that led to a database lock
  • The on-call incident team manually stopped the migration and fixed the underlying issue

Timeline

(Times are in UTC)

  • 09:45: The migration is deployed in production;
  • 09:53: The monitoring system alerts the team that the database is locked;
  • 10:00: The incident team manually stops the migration and fixes the underlying issue;
  • 10:05: Database load returns to normal.

Remedies

  • To prevent similar incidents, programmatic health checks will be put in place to validate the operations before similar migrations are applied. A timeout will also be enforced on database operations, so that any operation that takes longer than the established limit stops executing.
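
The timeout remedy can be illustrated with a minimal sketch. This report does not name the database engine or the migration tooling involved, so the example uses Python's built-in sqlite3 module purely as a stand-in. The function name run_with_lock_timeout, the reports table, and the 0.2 second limit are hypothetical and are not taken from the actual system.

```python
import os
import sqlite3
import tempfile


def run_with_lock_timeout(db_path, statement, timeout_seconds):
    """Run one write statement; return False instead of waiting indefinitely
    when the database stays locked for longer than timeout_seconds.

    Hypothetical helper for illustration only, not the production fix.
    """
    # `timeout` is how long sqlite3 waits on a lock held elsewhere before raising.
    conn = sqlite3.connect(db_path, timeout=timeout_seconds)
    try:
        conn.execute(statement)
        conn.commit()
        return True
    except sqlite3.OperationalError as exc:
        if "locked" not in str(exc):
            raise  # unrelated failure: surface it to the caller
        conn.rollback()
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE reports (id INTEGER)")
    setup.commit()
    setup.close()

    # Simulate a long-running operation that holds an exclusive lock.
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("BEGIN EXCLUSIVE")
    blocked = run_with_lock_timeout(path, "INSERT INTO reports VALUES (1)", 0.2)

    # Release the lock, then retry the same statement.
    holder.execute("ROLLBACK")
    holder.close()
    retried = run_with_lock_timeout(path, "INSERT INTO reports VALUES (1)", 0.2)
    print("while locked:", blocked, "after release:", retried)
```

With a limit like this in place, a statement that cannot obtain its lock fails fast and can be retried or escalated, rather than queueing behind the lock. Server-based databases generally offer comparable lock or statement timeout settings; which one applies here depends on the engine in use.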
Posted Jun 17, 2022 - 09:16 UTC

Resolved

This incident has been resolved. We have confirmed that delays have subsided.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again.
Posted Jun 09, 2022 - 11:19 UTC

Monitoring

Our team has identified the source of delays. Thank you for your patience while we continue to work towards a resolution. We will update you once delays have completely subsided.
Posted Jun 09, 2022 - 10:23 UTC

Update

We are currently investigating and we'll be back with an update shortly.
Posted Jun 09, 2022 - 10:13 UTC

Investigating

We're currently experiencing issues that are negatively impacting latency on check creation.

We are currently investigating and we'll be back with an update in 15 minutes. Thank you for your patience.
Posted Jun 09, 2022 - 10:08 UTC
This incident affected: Europe (onfido.com) (API), USA (us.onfido.com) (API), and Canada (ca.onfido.com) (API).