AWS had an outage of some network services in one Availability Zone (AZ) in the eu-west-1 region. This led to intermittent Check processing failures starting at 22:28 UTC on 31/08 until the issue was fully resolved at 00:04 UTC on 01/09. See the Appendix below for a detailed description of the outage.
We rely on a number of the impacted AWS components, including ELB, Kinesis, SQS and RDS. The networking failure caused Onfido services to experience sporadic connectivity issues and instability, as follows:
31/08 22:28 UTC - The on-call team was alerted to elevated error rates with the Onfido API and other internal Onfido services. Investigation began.
31/08 22:39 UTC - AWS reported issues processing network packets for Network Load Balancers, NAT Gateway and PrivateLink endpoints in one of the AZs in the eu-west-1 region.
31/08 22:58 UTC - AWS published a suggested temporary fix, and the Onfido response team started working on applying this fix.
31/08 23:24 UTC - AWS networking started to normalize, and internal traffic began returning to normal. With 'cross zone load balancing' already disabled (a sketch of this change appears after the timeline), the team aborted further work on the AWS temporary fix.
31/08 23:35 UTC - AWS reported a recovery of the networking component. Onfido API latencies started to return to normal, with lower internal error rates.
01/09 00:04 UTC - AWS reported the issue fully resolved.
01/09 09:30 UTC - Emergency changes rolled back: 'cross zone load balancing' re-enabled.
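For reference, the emergency change and its later rollback amount to toggling a single Network Load Balancer attribute. Below is a minimal sketch using boto3; the load balancer ARN (nlb_arn) is a placeholder, not one of our real resources.

```python
import boto3

# Hypothetical ARN of the affected Network Load Balancer.
nlb_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/example/abc123"

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

def set_cross_zone_load_balancing(load_balancer_arn: str, enabled: bool) -> None:
    """Enable or disable 'cross zone load balancing' on an NLB."""
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[{
            "Key": "load_balancing.cross_zone.enabled",
            "Value": "true" if enabled else "false",
        }],
    )

# Emergency change during the incident:
set_cross_zone_load_balancing(nlb_arn, enabled=False)

# Rollback once AWS confirmed recovery:
set_cross_zone_load_balancing(nlb_arn, enabled=True)
```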
An outage of some networking components in an AWS AZ in the eu-west-1 region caused network instability, resulting in intermittent connection failures between Onfido services within the impacted AZ and other AWS services. The availability of a number of AWS services was disrupted, including ELB, Kinesis, SQS, RDS and S3.
Onfido uses AWS load balancing to handle incoming traffic for our public API. There was an increase in timeouts for requests to the Onfido API where traffic touched the affected AZ.
Various internal services experienced timeouts as they tried to call affected AWS services. We use automatic retry mechanisms, so internal services were generally able to continue processing Checks, though with delays due to elevated error rates and latency.
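To illustrate the kind of retry behaviour described above (this is a sketch, not our exact configuration), boto3 clients can be configured to retry transient failures automatically before surfacing an error to the calling service:

```python
import boto3
from botocore.config import Config

# Illustrative retry configuration: up to 10 attempts with
# client-side adaptive rate limiting and backoff.
retry_config = Config(
    region_name="eu-west-1",
    retries={"max_attempts": 10, "mode": "adaptive"},
)

sqs = boto3.client("sqs", config=retry_config)

# Transient connection errors and throttling are retried with backoff
# before an exception reaches the calling service, which is why Check
# processing continued during the incident, albeit with added latency.
response = sqs.receive_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue",
    WaitTimeSeconds=20,
)
```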
According to AWS (https://status.aws.amazon.com/rss/elb-eu-west-1.rss):
“We can confirm network connectivity issues affecting a single Availability Zone (euw1-az2) in the EU-WEST-1 Region and are actively working on mitigation. Some other AWS services, including Lambda, ELB, Kinesis, SQS, RDS, CloudWatch and ECS, may also see impact as a result of this issue. A component within the subsystem responsible for the processing of network packets for Network Load Balancer, NAT Gateway and PrivateLink services became impaired and was no longer processing health checks successfully. This resulted in other components no longer accepting new connection requests, as well as elevated packet loss for Network Load Balancer, NAT Gateway and PrivateLink endpoints. For immediate mitigation for NLB, customers should (1) disable 'cross zone load balancing' on Network Load Balancer, and then (2) deregister any targets that are in euw1-az2. For NAT Gateway/PrivateLink, you may modify your route tables to direct traffic to NAT Gateways in other Availability Zones or you may disable PrivateLink endpoints in euw1-az2.”
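Step (2) of AWS's suggested NLB mitigation requires mapping the AZ ID (euw1-az2) to the account-specific AZ name, since AZ IDs map to different zone names in each AWS account. A hedged sketch of how this could be done with boto3 follows; the target group ARN is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Resolve the impaired AZ ID to this account's zone name
# (e.g. euw1-az2 might map to eu-west-1a in one account and
# eu-west-1c in another).
zones = ec2.describe_availability_zones()["AvailabilityZones"]
impaired_az = next(z["ZoneName"] for z in zones if z["ZoneId"] == "euw1-az2")

# Hypothetical target group behind the affected NLB.
target_group_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/abc123"

# Deregister any targets registered in the impaired zone.
health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
targets = [
    {"Id": d["Target"]["Id"], "Port": d["Target"]["Port"]}
    for d in health["TargetHealthDescriptions"]
    if d["Target"].get("AvailabilityZone") == impaired_az
]
if targets:
    elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=targets)
```

The NAT Gateway mitigation would be an analogous route-table change, e.g. ec2.replace_route pointing the default route at a NAT Gateway in a healthy AZ.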