Network Connectivity Issue
Incident Report for Onfido
Postmortem

Summary

An outage of some AWS network services in one Availability Zone (AZ) of the eu-west-1 region caused intermittent check processing failures from 22:28 UTC on 31/08 until the issue was fully resolved at 00:04 UTC on 01/09. See the Appendix below for AWS's description of the outage.

We rely on a number of the impacted AWS components, including ELB, Kinesis, SQS and RDS. The networking failure caused sporadic connectivity issues and instability in Onfido services, as follows:

  • Customers experienced intermittent API timeouts when their requests were routed through the affected AZ. We estimate a 30% reduction in traffic due to this issue.
  • Completion of up to 50% of processed checks was disrupted as internal calls to various AWS services timed out. Most Checks were eventually completed through retries, but turnaround times (TaTs) were extended by the higher internal error rates and elevated latency on some services.

Timeline

31/08 22:28 UTC - The on-call team was alerted to elevated error rates with the Onfido API and other internal Onfido services. Investigation began.

31/08 22:39 UTC - AWS reported issues processing network packets for Network Load Balancers, NAT Gateway and PrivateLink endpoints in one of the AZs in the eu-west-1 region.

31/08 22:58 UTC - AWS published a suggested temporary fix, and the Onfido response team started working on applying this fix.

31/08 23:24 UTC - AWS networking started to normalize, and internal traffic started to return to normal. The response team had by then disabled 'cross zone load balancing'; further work on the AWS temporary fix was aborted.

31/08 23:35 UTC - AWS reported recovery of the networking component. Onfido API latencies started to return to normal, and internal error rates fell.

01/09 00:04 UTC - AWS reported that the issue was fully resolved.

01/09 09:30 UTC - Emergency changes rolled back: 'cross zone load balancing' re-enabled.

Root Causes

An outage of some networking components in an AWS AZ in the eu-west-1 region led to network instability. This led to intermittent connection failures between Onfido services within the impacted AZ and other AWS services. There was disruption to the availability of a number of AWS services, including ELB, Kinesis, SQS, RDS and S3.

Onfido uses AWS load balancing to handle incoming traffic for our public API. There was an increase in timeouts for requests to the Onfido API where traffic touched the affected AZ.

Various internal services experienced timeouts as they tried to call affected AWS services. We use automatic retry mechanisms, so internal services were generally able to continue to process Checks, but with delays due to elevated error rates and latency.
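As an illustration of the kind of automatic retry described above, here is a minimal sketch in Python. It is not Onfido's implementation: the function name call_with_retries, its parameters and the default values are all assumptions chosen for the example. It retries a call that fails with a timeout or connection error, waiting a randomised, exponentially growing delay between attempts.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.2, max_delay=5.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(), retrying transient failures with capped
    exponential backoff and full jitter. All names and defaults here
    are hypothetical, chosen only for illustration."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The backoff ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random duration between 0 and the ceiling,
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

A design like this explains both observations in this section: work usually completes once a retry lands on a healthy path, but each failed attempt adds its timeout plus the backoff delay to the overall processing time.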

Remedies

  1. Update runbooks to better handle AZ network outages. ETA: 26-Nov.
  2. Simplify network routing between services to remove any unnecessary dependencies on external AWS internet networking. ETA: 03-Dec.
  3. (Already planned) Improve cross-AZ traffic segregation and infrastructure resiliency in the event of AZ impairment: design (ETA: 31-Dec); implementation Q1-Q2 2022.

Appendix - AWS Notice

According to AWS (https://status.aws.amazon.com/rss/elb-eu-west-1.rss):

“We can confirm network connectivity issues affecting a single Availability Zone (euw1-az2) in the EU-WEST-1 Region and are actively working on mitigation. Some other AWS services, including Lambda, ELB, Kinesis, SQS, RDS, CloudWatch and ECS, may also see impact as a result of this issue. A component within the subsystem responsible for the processing of network packets for Network Load Balancer, NAT Gateway and PrivateLink services became impaired and was no longer processing health checks successfully. This resulted in other components no longer accepting new connection requests, as well as elevated packet loss for Network Load Balancer, NAT Gateway and PrivateLink endpoints. For immediate mitigation for NLB, customers should (1) disable 'cross zone load balancing' on Network Load Balancer, and then (2) deregister any targets that are in euw1-az2. For NAT Gateway/PrivateLink, you may modify your route tables to direct traffic to NAT Gateways in other Availability Zones or you may disable PrivateLink endpoints in euw1-az2.”
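For runbook purposes, the two NLB mitigation steps in the notice above could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure Onfido ran: the function name and its arguments are hypothetical, the caller is assumed to already know which target IDs sit in the impaired AZ, and elbv2 is assumed to be a boto3 "elbv2" client (or any object exposing the same two methods).

```python
def mitigate_nlb_az_impairment(elbv2, load_balancer_arn, target_group_arn,
                               target_ids_in_az):
    """Apply the two NLB mitigation steps from the AWS notice, in order.

    elbv2 is assumed to be a boto3 "elbv2" client; the function name and
    arguments are hypothetical and exist only for this illustration.
    """
    # Step 1: disable cross-zone load balancing on the Network Load Balancer.
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[
            {"Key": "load_balancing.cross_zone.enabled", "Value": "false"},
        ],
    )
    # Step 2: deregister the targets located in the impaired AZ.
    # The caller supplies their IDs; nothing is sent if the list is empty.
    if target_ids_in_az:
        elbv2.deregister_targets(
            TargetGroupArn=target_group_arn,
            Targets=[{"Id": target_id} for target_id in target_ids_in_az],
        )
```

With boto3 installed, the client would typically be created with boto3.client("elbv2", region_name="eu-west-1"). Reverting the emergency change means setting the same attribute back to "true" and re-registering the targets.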

Posted Nov 02, 2021 - 13:48 UTC

Resolved
This incident has been resolved. We have confirmed that the fix has solved the underlying error and that our service is back to normal in the EU region. Sorry for any inconvenience this has caused.
Posted Aug 31, 2021 - 23:25 UTC
Monitoring
The network issue has been identified and a fix has been implemented. Our service is returning to normal, and we will keep monitoring to ensure that the incident is resolved.
Posted Aug 31, 2021 - 22:55 UTC
Investigating
We're investigating a network connectivity issue in our AWS EU region, and some users may see intermittent issues accessing our service. We apologize for the inconvenience this causes, and we'll be back with an update shortly.
Posted Aug 31, 2021 - 22:10 UTC
This incident affected: Europe (onfido.com) (API, Dashboard, Applicant Form, Document Verification, Facial Similarity, Watchlist, Identity Enhanced, Right To Work).