Increased latency on accessing dashboard
Incident Report for Onfido
Postmortem

Summary

Users were unable to log in and access the Dashboard.

Root Causes

A long-running ad-hoc production job placed abnormal load on our Elasticsearch cluster, which caused latency in the Dashboard application and affected overall access. Our provisioned Elasticsearch cluster infrastructure was not prepared to handle the load this job generated.

Timeline

  • 07:30 GMT: Production job execution started.
  • 10:20 GMT: Dashboard alerts triggered.
  • 11:05 GMT: Job was interrupted by the on-call team.

Remedies

Immediately:

  • Scale up the provisioned Elasticsearch cluster infrastructure.

In addition, we will:

  • Improve our process for executing long-running production jobs.
  • Improve the scalability and reliability of the Dashboard application.
Posted Jun 24, 2021 - 13:36 UTC

Resolved
This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again.
Posted May 28, 2021 - 11:42 UTC
Monitoring
We have implemented a fix for this issue.

We are monitoring closely to make sure the issue has been resolved and everything is working as expected. Please bear with us while we get back on our feet. We appreciate your patience during this incident.

We will update at 12:00 UTC
Posted May 28, 2021 - 11:33 UTC
Identified
The issue has been identified and a fix has been implemented.

We will provide a further update at 11:30 UTC
Posted May 28, 2021 - 11:16 UTC
Investigating
We're currently experiencing issues that are increasing latency on the client Dashboard.

Investigation is ongoing and we will update at 11:30 UTC
Posted May 28, 2021 - 11:06 UTC
This incident affected: Europe (onfido.com) (Dashboard).