We understand how important our service is to your business, and we take reliability and performance seriously.
On August 23, 2022, beginning at 7:55 a.m. EDT (11:55 a.m. UTC), an outage affecting one of our primary databases caused session capture failures and degraded the application UI.
Our goal is to avoid interruptions, and when they do occur, we use the opportunity to analyze the cause, learn from it, and reduce the chance of a recurrence.
This post-mortem details the customer impact, the root cause, how we addressed the problem, and the steps we are taking to prevent it from happening again.
Approximately 50% of new session requests failed for the duration of the incident. As a result, customers may notice a significant drop in the number of sessions recorded during this period. Because of the missing data, conversion rates and revenue attribution may also appear understated across the affected window. During this time, we also observed delays in loading session replays and in accessing other parts of the platform. Finally, delayed indexing meant that sessions we did capture took longer than usual to become available.
Customers using our mobile SDK may have experienced build errors that prevented their apps from compiling.
The root cause of the incident was database errors that blocked the creation of new device and session identifiers during session capture. A database infrastructure migration being performed by our cloud provider added contention to identifier creation, leading to unrecoverable failures when starting the capture process. The overhead of managing identifiers during the migration, combined with the volume of existing identifiers and the provisioning of additional query indexes, produced contention that exceeded the processing capacity planned for the migration. Our advance plans to mitigate the migration's impact and limit contention during the process proved insufficient, so the migration had to be terminated and rolled back, which delayed restoration.
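To make the failure mode concrete, here is a deliberately simplified sketch. The names here (IdentifierStore, start_session_capture, and so on) are hypothetical and are not our production code; the point is that without a fresh identifier, the capture process cannot start at all, so contention-induced timeouts on identifier creation surface directly as failed sessions.

```python
import random
import time


class IdentifierAllocationError(Exception):
    """Raised when a new device/session identifier cannot be created."""


class IdentifierStore:
    """Toy stand-in for the database table that issues identifiers."""

    def __init__(self, migration_in_progress: bool):
        self.migration_in_progress = migration_in_progress

    def create_identifier(self, timeout_s: float = 1.0) -> str:
        # During a migration, index rebuilds and row copying hold locks
        # longer, so inserts queue up and blow past the timeout far more
        # often than under normal load. (Delays here are invented numbers.)
        delay = random.uniform(0.5, 3.0) if self.migration_in_progress else random.uniform(0.0, 0.1)
        if delay > timeout_s:
            raise IdentifierAllocationError("identifier insert timed out under contention")
        time.sleep(delay)
        return f"id-{random.getrandbits(32):08x}"


def start_session_capture(store: IdentifierStore) -> str:
    # Session capture cannot begin without a fresh identifier, so an
    # allocation failure here is visible to the client as a failed session.
    try:
        return store.create_identifier()
    except IdentifierAllocationError as err:
        raise RuntimeError(f"session capture failed: {err}") from err
```

With migration_in_progress=True, a large fraction of calls to start_session_capture fail, analogous to the elevated session failure rate described above.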
Once we identified the root cause, we canceled and reverted the infrastructure migration, beginning at 9:56 a.m. and completing at 11:35 a.m., which restored the allocation of new identifiers. Sessions initiated after this point were captured but not immediately processed. To recover from the canceled migration, our cloud provider provisioned additional dedicated database capacity, which was activated at 1:53 p.m. to eliminate processing delays and allow the backlog of captured sessions to be fully processed. All sessions were processed and available as of 3:45 p.m., and processing fully returned to normal.
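As a simplified illustration of that last recovery step (again hypothetical code, not our production pipeline): once indexing capacity exceeds the capture rate, the backlog of captured-but-unprocessed sessions can be drained in order until the system is caught up.

```python
from collections import deque
from typing import Callable


def drain_backlog(backlog: deque, index_session: Callable[[dict], None]) -> int:
    """Index sessions that were captured during the incident but not yet processed."""
    processed = 0
    while backlog:
        index_session(backlog.popleft())  # oldest captured session first
        processed += 1
    return processed


# Sessions captured while processing was delayed sit in the backlog until
# the additional capacity lets them be indexed.
backlog = deque({"session_id": i} for i in range(3))
assert drain_backlog(backlog, index_session=lambda s: None) == 3
assert not backlog
```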
We are committed to preventing this type of incident from recurring. We've completed the following action items:
Here are additional steps we’re taking: