We understand how important our service is to your business, and we take reliability and performance seriously.
On August 23, 2022, beginning at 7:55 a.m. EDT (11:55 a.m. UTC), an outage affecting one of our primary databases caused session capture failures and degraded the application UI.
Our goal is to avoid interruptions, and when they do occur, we use the opportunity to analyze the cause, learn from it, and reduce the chance of a recurrence.
This post-mortem details the customer impact, the root cause, how we addressed the problem, and the steps we are taking to prevent it from happening again.
Approximately 50% of new session requests failed for the duration of the incident. As a result, customers may notice a significant drop in the number of sessions recorded during this period. Because of the missing data, conversion rates and revenue attribution may also appear understated across the affected window. During this time, we also observed delays in loading session replays and in accessing other parts of the platform. Finally, delayed indexing meant that sessions we did capture took longer than usual to become available.
Customers using our mobile SDK may have experienced build errors that prevented their apps from compiling.
The root cause of the incident was database errors that blocked the creation of new device and session identifiers during session capture. A database infrastructure migration being performed by our cloud provider added contention to identifier creation, leading to unrecoverable failures when starting the capture process. The overhead of managing identifiers during the migration, combined with the volume of existing identifiers and the provisioning of additional query indexes, produced contention that exceeded the processing capacity planned for the migration. Our advance plans to mitigate the migration's impact and limit contention during the process proved insufficient, so the migration had to be terminated and rolled back, which delayed restoration.
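To make the failure mode concrete, here is a deliberately simplified sketch. The names here (IdentifierStore, start_session_capture, and so on) are hypothetical and are not our production code; the point is that without a fresh identifier, the capture process cannot start at all, so contention-induced timeouts on identifier creation surface directly as failed sessions.

```python
import random
import time


class IdentifierAllocationError(Exception):
    """Raised when a new device/session identifier cannot be created."""


class IdentifierStore:
    """Toy stand-in for the database table that issues identifiers."""

    def __init__(self, migration_in_progress: bool):
        self.migration_in_progress = migration_in_progress

    def create_identifier(self, timeout_s: float = 1.0) -> str:
        # During a migration, index rebuilds and row copying hold locks
        # longer, so inserts queue up and blow past the timeout far more
        # often than under normal load. (Delays here are invented numbers.)
        delay = random.uniform(0.5, 3.0) if self.migration_in_progress else random.uniform(0.0, 0.1)
        if delay > timeout_s:
            raise IdentifierAllocationError("identifier insert timed out under contention")
        time.sleep(delay)
        return f"id-{random.getrandbits(32):08x}"


def start_session_capture(store: IdentifierStore) -> str:
    # Session capture cannot begin without a fresh identifier, so an
    # allocation failure here is visible to the client as a failed session.
    try:
        return store.create_identifier()
    except IdentifierAllocationError as err:
        raise RuntimeError(f"session capture failed: {err}") from err
```

With migration_in_progress=True, a large fraction of calls to start_session_capture fail, analogous to the elevated session failure rate described above.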
Once we identified the root cause, we canceled and reverted the infrastructure migration, beginning at 9:56 a.m. and completing at 11:35 a.m., which restored the allocation of new identifiers. Sessions initiated after this point were captured but not immediately processed. To recover from the canceled migration, our cloud provider provisioned additional dedicated database capacity, which was activated at 1:53 p.m. to eliminate processing delays and allow the backlog of captured sessions to be fully processed. All sessions were processed and available as of 3:45 p.m., and processing fully returned to normal.
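As a simplified illustration of that last recovery step (again hypothetical code, not our production pipeline): once indexing capacity exceeds the capture rate, the backlog of captured-but-unprocessed sessions can be drained in order until the system is caught up.

```python
from collections import deque
from typing import Callable


def drain_backlog(backlog: deque, index_session: Callable[[dict], None]) -> int:
    """Index sessions that were captured during the incident but not yet processed."""
    processed = 0
    while backlog:
        index_session(backlog.popleft())  # oldest captured session first
        processed += 1
    return processed


# Sessions captured while processing was delayed sit in the backlog until
# the additional capacity lets them be indexed.
backlog = deque({"session_id": i} for i in range(3))
assert drain_backlog(backlog, index_session=lambda s: None) == 3
assert not backlog
```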
We are committed to preventing this type of incident from recurring. We've completed the following action items:
Here are additional steps we’re taking: