Increased occurrence processing pipeline latency
Incident Report for Rollbar
Postmortem

Beginning at 4:10 pm PT on 9/30/21, Rollbar's processing pipeline stalled and our marketing site was partially unavailable. The stall did not affect our ingestion service, but it did delay processing by up to 3 hours and 14 minutes. Because we process occurrences with LIFO semantics, the delay most customers observed was likely shorter than that.

During this delay there was potential data loss, both from API tier unavailability and for errors that had already been retried the maximum number of times. The window for this potential data loss was between 4:10 pm and 5:52 pm PT.

The partial outage of the marketing site and processing pipeline was caused by one of our non-critical database tables reaching the maximum file size for our filesystem (16 TB for ext4). Once we could no longer insert rows into that table, the processing pipeline halted. The failed inserts were reported internally to api.rollbar.com, which increased load on our API tier and slowed API response times. Those slow API response times, in turn, left our Web tier unable to serve the marketing site.
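
To make the feedback loop concrete, here is a minimal, hypothetical sketch (not Rollbar's actual code) of a pipeline step whose own error reporting feeds back into the same pipeline once the dedup table stops accepting inserts. The names db.insert_dedup_row and report_error are illustrative assumptions:

    def process_occurrence(occurrence, db, report_error):
        """Simplified pipeline step: record the occurrence for later
        duplicate checks, and report any failure to api.rollbar.com.

        Once the table's data file hits the filesystem's maximum file
        size, every insert raises, every failure is reported back into
        the same pipeline, and the backlog grows instead of shrinking.
        """
        try:
            db.insert_dedup_row(occurrence)   # fails once the file is full
        except Exception as exc:
            report_error(exc)                 # internal report to api.rollbar.com
            raise                             # occurrence will be retried later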

The database table in question is primarily used to check for duplicate occurrences, and errors encountered during this check are reported to api.rollbar.com. Because the check runs inside our processing pipeline, we ended up in an internal loop of processing and reporting errors. This type of behavior is rare thanks to various safeguards (e.g. limited retries and exponential backoff), but it does happen from time to time.
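
For illustration, here is a minimal sketch of the kind of safeguard mentioned above: bounded retries with exponential backoff, which normally keep an error-reporting loop from running away. The names (send, payload) are hypothetical and not Rollbar's actual API:

    import random
    import time

    def report_with_backoff(send, payload, max_retries=5, base_delay=0.5):
        """Try to deliver an error payload, retrying a bounded number of
        times with exponential backoff plus jitter. Returns False (and the
        payload may be lost) once max_retries is exhausted."""
        for attempt in range(max_retries):
            try:
                send(payload)  # e.g. an HTTP POST to api.rollbar.com
                return True
            except Exception:
                # Wait base_delay * 2^attempt seconds, plus jitter so many
                # workers don't retry in lockstep.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
        return False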

In order to guarantee that this will not happen again, we will be implementing the following:

  • Monitoring to alert our team when DB files approach maximum filesystem limits (a sketch of such a check follows this list)
  • Partitioning for very large DB tables, which will make maintenance such as partition deletion easier
  • Decoupling our marketing site from our web application
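
As a rough illustration of the first item, the check could be as simple as comparing each table's data file size against the filesystem's per-file limit and alerting above a threshold. A minimal sketch, with a hypothetical file path and an assumed 80% alert threshold:

    import os

    EXT4_MAX_FILE_BYTES = 16 * 1024**4   # ext4 per-file limit (~16 TB)
    ALERT_THRESHOLD = 0.80               # alert at 80% of the limit

    def files_near_limit(paths):
        """Return (path, fraction_of_limit) for any file above the threshold."""
        results = []
        for path in paths:
            fraction = os.path.getsize(path) / EXT4_MAX_FILE_BYTES
            if fraction >= ALERT_THRESHOLD:
                results.append((path, fraction))
        return results

    if __name__ == "__main__":
        # Hypothetical per-table data file; substitute your own paths.
        for path, fraction in files_near_limit(["/var/lib/mysql/rollbar/dedup_table.ibd"]):
            print(f"ALERT: {path} is at {fraction:.0%} of the ext4 file-size limit")
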
Posted Oct 04, 2021 - 22:37 PDT

Resolved
Processing has fully caught up and this incident is resolved. We will write up a postmortem on this and the incident from yesterday and publish it to this status page early next week.
Posted Sep 30, 2021 - 19:24 PDT
Update
Processing delays are almost resolved and we will update once the pipeline is caught up.
Posted Sep 30, 2021 - 19:03 PDT
Monitoring
We've implemented a fix for the DB issue and we are now monitoring the backlog while processing catches back up.
Posted Sep 30, 2021 - 17:55 PDT
Update
We have identified the issue with one of our databases and are working to remediate it. We are also experiencing intermittent website outages for the marketing site.
Posted Sep 30, 2021 - 17:40 PDT
Update
We are continuing to investigate this issue.
Posted Sep 30, 2021 - 16:46 PDT
Investigating
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Posted Sep 30, 2021 - 16:34 PDT
This incident affected: Web App (rollbar.com) and Processing pipeline (Core Processing Pipeline).