Beginning at 4:10 pm PT on 9/30/21, Rollbar's processing pipeline stalled and our marketing site was partially unavailable. The stall did not affect our ingestion service, but it delayed processing by up to 3 hours and 14 minutes. Because we process occurrences with LIFO semantics, the delay was likely perceived to be shorter than that.
During this delay, there was potential data loss due to API tier unavailability, as well as for errors that had already been retried the maximum number of times. The total window for this potential data loss was between 4:10 pm and 5:52 pm PT.
The partial outage of the marketing site and processing pipeline was caused by one of our non-critical database tables reaching the maximum file size for our filesystem (16 TB for ext4). Once we could no longer insert rows into that table, our processing pipeline halted. The halt in turn increased load on our API tier, because the pipeline reports its own internal errors to api.rollbar.com, and the resulting slow API response times left our Web tier unable to serve the marketing site.
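A failure mode like this can be caught before the hard limit is reached by alerting on data-file size headroom. The sketch below is purely illustrative and not Rollbar's actual tooling; the threshold, warning ratio, and function name are our own assumptions:

```python
import os

# ext4's maximum size for a single file (16 TiB). Illustrative constant;
# the real limit depends on filesystem and block-size configuration.
EXT4_MAX_FILE_SIZE = 16 * 1024 ** 4

# Warn well before the hard limit so there is time to react.
WARN_RATIO = 0.8


def file_size_headroom(path, limit=EXT4_MAX_FILE_SIZE, warn_ratio=WARN_RATIO):
    """Return (size_bytes, near_limit) for a data file, e.g. an InnoDB .ibd.

    `near_limit` is True once the file has consumed `warn_ratio` of the
    filesystem's maximum file size, which is the point where an alert
    should fire.
    """
    size = os.path.getsize(path)
    return size, size >= limit * warn_ratio
```

A periodic job running this check against the table's data file would have surfaced the problem hours or days before inserts began to fail.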
The database table in question is primarily used to check for duplicate occurrences; errors from that check are reported to api.rollbar.com. Since this check is performed in our processing pipeline, we ran into an internal loop while processing and reporting errors. This type of behavior is rare thanks to various safeguards in place (e.g., limited retries and exponential backoff), but it does happen from time to time.
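The safeguards mentioned above can be sketched as a bounded retry loop with exponential backoff. This is a minimal illustration, not Rollbar's implementation; the names and retry limits are assumptions:

```python
import time


def with_retries(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying failures with exponential backoff.

    Gives up after `max_retries` attempts, so a persistently failing step
    (such as an INSERT into a table that has hit its maximum file size)
    cannot retry forever.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            # Back off exponentially: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

The cap on retries is what eventually breaks the loop in an incident like this one, at the cost of dropping occurrences that could not be processed within the retry budget.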
To guarantee that this will not happen again, we will be implementing the following: