Degraded live_videos service
Incident Report for Onfido
Postmortem

Summary

An abnormally high number of 422 errors was returned for live videos uploaded via the Safari web browser.

Root Causes

A change was introduced to close a known attack vector around file uploads. Instead of relying only on the "Content-Type" header provided in the upload request, the API now also inspects the file's own content metadata to identify the video MIME type, which is then matched against a whitelist of supported video formats (MP4, for example). This lets us ensure that when a user uploads a video, it really is a video and not a spoof (such as an HTML page with malicious content).
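The check described above can be illustrated with a minimal sketch. It is not the production validation: the names ALLOWED_VIDEO_TYPES, sniff_video_type and validate_upload are hypothetical, the whitelist contents are assumed (the report names only MP4), and the detection is reduced to two well-known file signatures.

```python
from typing import Optional

# Hypothetical whitelist; the report names only MP4 as an example.
ALLOWED_VIDEO_TYPES = {"video/mp4", "video/webm"}


def sniff_video_type(data: bytes) -> Optional[str]:
    """Guess a video MIME type from the leading bytes; None if unrecognised."""
    # ISO base media files (the MP4 family) normally start with a box whose
    # type field, at bytes 4..8, is "ftyp". This is a simplification.
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "video/mp4"
    # Matroska/WebM files start with the EBML magic number.
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "video/webm"
    return None


def validate_upload(declared_type: str, data: bytes) -> int:
    """Return an HTTP status: 201 when accepted, 422 when rejected."""
    detected = sniff_video_type(data)
    # Ignore parameters such as "; codecs=..." in the declared header.
    declared = declared_type.split(";")[0].strip().lower()
    if detected is None or detected not in ALLOWED_VIDEO_TYPES:
        return 422  # content is not a recognised, supported video
    if detected != declared:
        return 422  # header and content disagree: possible spoof
    return 201
```

Any detector of this kind rejects content it cannot identify, so a recording that is valid but not recognised by the detector is refused in the same way as a spoofed file.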

This caused a surge of validation errors on video uploads made with the Safari browser. At Onfido, we use the MediaRecorder web API to record the video content. We found that, although the content recorded in Safari is valid MP4, it is stripped of any metadata. As a result, the video format was not identified correctly, the verification described above failed, and the uploads were rejected with validation errors.

Timeline

  • Before 2021-05-31: The security team warns the Biometrics team about the possibility of a Content-Type spoofing attack on the live video upload API endpoint. The Biometrics team prepares a fix by introducing stricter upload validations that verify that the Content-Type of the multipart upload is consistent with the actual file contents. At this stage, the team is unaware that there is an edge case for videos uploaded via the Safari web browser.
  • 2021-05-31 @ 12:00 UTC: New release of the Onfido API with the security fix (and regression).
  • 2021-05-31 @ 23:45 UTC: Internal monitor triggers for higher-than-usual error rates in video uploads (around 12h after the release).
    It is ignored, since the elevated error rates are not seen on other endpoints and the monitor only triggered long after the release.
    The team fails to recognise it as an issue specific to video uploads that has been happening since the release.
  • 2021-06-01 @ 15:37 UTC: Onfido SDK team is alerted via our Web SDK GitHub issue tracker that some clients are facing issues with video uploads.
  • 2021-06-01 @ 16:04 UTC: Status page is updated to mention incident affecting a subset of video uploads.
    The team continues to investigate what the issue may be, but it's unclear at this point since it seems to affect all SDKs equally.
    We are confident it must be related to the validations recently introduced in the live videos endpoint, so we prepare a rollback of the set of stricter validations.
  • 2021-06-01 @ 19:28 UTC: Validation rollback is tested in our testing environment.
  • 2021-06-01 @ 20:28 UTC: Validation rollback is live in Production.
  • 2021-06-01 @ 21:20 UTC: Status page is updated to indicate incident is over.
  • 2021-06-01 @ 21:20 UTC: Customers confirm incident is over.
  • After 2021-06-01: Issue is reproduced in a local development environment. We narrow the problem down to the Safari MediaRecorder API recording videos stripped of content metadata (as inspected with the Unix file command), even though the videos are valid MP4.
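
The inspection mentioned in the last timeline entry can be approximated as follows. This is a sketch rather than the team's tooling: the function name detect_mime_with_file is hypothetical, and the exact file invocation used during the investigation is not stated in this report.

```python
import shutil
import subprocess
from typing import Optional


def detect_mime_with_file(path: str) -> Optional[str]:
    """Return the MIME type reported by the Unix `file` command.

    Returns None when `file` is not installed or exits with an error.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # --brief omits the file name; --mime-type prints only the type.
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

Comparing the value returned for a recording from each browser against the expected video type shows which recordings a content-based check would reject.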

Remedies

Immediately:

  • Dashboards and alerts for 422s and other error rate fluctuations were added, specifically for media upload endpoints for team Biometrics
  • New dashboards include annotation for when deployments happened, for easy correlation of release issues
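
To illustrate why per-endpoint error rates matter here, the sketch below computes the share of 422 responses per endpoint and lists the endpoints above a threshold. It makes no claim about the actual dashboards or alerting provider: the function names, the endpoint paths and the threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rate_by_endpoint(
    requests: Iterable[Tuple[str, int]], status: int = 422
) -> Dict[str, float]:
    """Map each endpoint to the fraction of its responses with `status`."""
    total: Counter = Counter()
    matched: Counter = Counter()
    for endpoint, code in requests:
        total[endpoint] += 1
        if code == status:
            matched[endpoint] += 1
    return {endpoint: matched[endpoint] / total[endpoint] for endpoint in total}


def endpoints_over_threshold(rates: Dict[str, float], threshold: float) -> List[str]:
    """Return the endpoints whose rate exceeds the threshold, sorted by name."""
    return sorted(ep for ep, rate in rates.items() if rate > threshold)


# Illustrative traffic: one low-volume endpoint fails often, while the
# aggregate 422 rate across all requests stays at only 3%.
sample = (
    [("/live_videos", 422)] * 3
    + [("/live_videos", 201)]
    + [("/documents", 201)] * 96
)
rates = rate_by_endpoint(sample)
alerting = endpoints_over_threshold(rates, threshold=0.10)
```

In this sample the aggregate rate would not cross a 10% threshold, while the per-endpoint view flags the single affected endpoint.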

In addition, we will:

  • Assess why alarm only triggered around 12h after the release (support ticket with our third-party provider)
  • Review our process of following up on triggered alerts, to try and identify patterns and correlation with releases
  • End-to-end test SDK and API together after each release is ready to go out
  • Add ability to filter error rates by endpoint in release dashboards
  • Change the release process for one of our larger services to notify everyone involved in the code changes about to go out, so that each team member is on the lookout for specific issues
Posted Jun 16, 2021 - 15:50 UTC

Resolved
This issue is now resolved.

We take a lot of pride in running a robust, reliable service and we're working hard to make sure this does not happen again. A detailed postmortem will follow once we've concluded our investigation.
Posted Jun 01, 2021 - 21:20 UTC
Identified
The changes which caused the unexpected behaviour are being reverted.
Posted Jun 01, 2021 - 17:26 UTC
Investigating
Affected clients are getting their live videos rejected incorrectly. We are currently investigating the underlying issue.
Posted Jun 01, 2021 - 16:24 UTC
This incident affected: Europe (onfido.com) (API).