Processing pipeline halt
Incident Report for Rollbar
Postmortem

What Happened

  • Several months ago we made some changes to our partitioning model for our raw item storage. These changes enabled us to migrate onto much larger disks, reducing our operational overhead and improving our stability.
  • One of these partitioning changes resulted in the leading shard’s active partition reaching the maximum file size for the filesystem.
  • At 1:41 AM PDT on 10/26, the partition hit the maximum file size for the filesystem.
  • MySQL was unable to write to the partition, and our ingestion service began to buffer incoming data in Kafka (a simplified sketch of this buffering pattern follows the list).
  • We attempted to reorganize the leading partition (split it into multiple smaller partitions; a sketch of that operation also follows the list), but this process was going to take a prohibitively long time.
  • Instead, we determined that the fastest and safest path forward was to roll out a new shard.
  • By 5:10 AM PDT we were ingesting traffic again, and by 7:28 AM PDT we were fully caught up.
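For readers less familiar with this kind of pipeline: the reason data buffered safely is the common Kafka pattern of only committing consumer offsets after the downstream write succeeds, so a database write failure halts consumption without losing anything. The sketch below shows that pattern in simplified form; it is an assumption about the general shape of such a pipeline, not our actual ingestion code, and the topic, consumer group, host, and table names are placeholders.

```python
# Simplified sketch: a pipeline consumer that leaves data buffered in Kafka
# when the database is unwritable. Offsets are only committed after a
# successful MySQL insert, so a write failure stops consumption without data
# loss. Topic, group, host, and table names are placeholders.
import time

import mysql.connector
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "raw-items",                       # hypothetical ingestion topic
    bootstrap_servers=["kafka:9092"],
    group_id="item-processor",
    enable_auto_commit=False,          # commit manually, only after the DB write
)
db = mysql.connector.connect(host="db-shard-1", user="pipeline",
                             password="...", database="rollbar")

for message in consumer:
    while True:
        try:
            cur = db.cursor()
            cur.execute("INSERT INTO raw_items (payload) VALUES (%s)",
                        (message.value,))
            db.commit()
            consumer.commit()          # only now is the message marked consumed
            break
        except mysql.connector.Error:
            # e.g. the partition has hit the filesystem's max file size:
            # keep retrying; unconsumed messages simply accumulate in Kafka.
            time.sleep(5)
```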
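The reorganization we attempted is the kind of operation MySQL exposes as ALTER TABLE ... REORGANIZE PARTITION, which rewrites one partition's rows into several new partitions. The sketch below is illustrative only: the table name, partition names, range boundary, and connection details are placeholders, not our actual schema.

```python
# Illustrative sketch of splitting an oversized trailing RANGE partition with
# MySQL's REORGANIZE PARTITION. All identifiers here (raw_items, p_max, the
# range boundary) are placeholders, not our production schema.
import mysql.connector  # assumes the mysql-connector-python package

REORGANIZE_SQL = """
ALTER TABLE raw_items
REORGANIZE PARTITION p_max INTO (
    PARTITION p_2020_10 VALUES LESS THAN (1603695600),
    PARTITION p_max     VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(host="db-shard-1", user="admin",
                               password="...", database="rollbar")
cur = conn.cursor()
# REORGANIZE PARTITION physically copies every row in the affected partition,
# which is why it was prohibitively slow on a partition that had already grown
# to the filesystem's maximum file size.
cur.execute(REORGANIZE_SQL)
conn.close()
```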

Impact

  • No data was lost, and there was no downtime for the API or Web App.
  • The processing pipeline was fully halted from 1:41 AM PDT to 5:08 AM PDT.

Resolution

We provisioned a new shard and moved processing onto that shard. We modified the partitioning scheme going forward so that this issue would never happen again, and added additional monitoring to ensure we are notified when partition file sizes approach the filesystem’s maximum file size.
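The monitoring itself is internal, but a minimal sketch of the idea follows: periodically query information_schema.PARTITIONS for per-partition sizes and alert when any partition approaches the filesystem’s maximum file size (for example, 16 TiB on ext4 with the default 4 KiB block size). The hostnames, credentials, threshold, and alerting hook below are assumptions for illustration, not our production configuration.

```python
# Illustrative sketch: warn when any MySQL partition approaches the
# filesystem's maximum file size. Hostnames, credentials, and the threshold
# are placeholders, not our production configuration.
import mysql.connector

FS_MAX_FILE_BYTES = 16 * 2**40   # e.g. 16 TiB for ext4 with 4 KiB blocks
ALERT_RATIO = 0.80               # notify well before the hard limit

# DATA_LENGTH + INDEX_LENGTH approximates the size of each partition's
# on-disk data file when innodb_file_per_table is enabled.
QUERY = """
SELECT TABLE_SCHEMA, TABLE_NAME, PARTITION_NAME,
       DATA_LENGTH + INDEX_LENGTH AS bytes_used
FROM information_schema.PARTITIONS
WHERE PARTITION_NAME IS NOT NULL
"""

def oversized_partitions(conn):
    """Return a warning string for every partition over the alert threshold."""
    cur = conn.cursor()
    cur.execute(QUERY)
    warnings = []
    for schema, table, partition, bytes_used in cur.fetchall():
        if bytes_used >= ALERT_RATIO * FS_MAX_FILE_BYTES:
            warnings.append(
                f"{schema}.{table} partition {partition} is at "
                f"{bytes_used / FS_MAX_FILE_BYTES:.0%} of the filesystem max file size"
            )
    return warnings

if __name__ == "__main__":
    conn = mysql.connector.connect(host="db-shard-1", user="monitor",
                                   password="...", database="information_schema")
    for warning in oversized_partitions(conn):
        print(warning)   # in production this would feed the paging system
```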

Timeline

October 26 (All Times in PDT):

  • 1:41 AM: Leading partition hits filesystem max and ingestion halts.
  • 3:17 AM: After unsuccessful attempts to reorganize the partition on the fly, it is determined that the best path forward is to provision a new shard.
  • 5:08 AM: New shard is in service and pipeline is processing again.
  • 7:28 AM: Pipeline is fully recovered and all components are operational again.
Posted Nov 02, 2020 - 13:19 PST

Resolved
This incident has been resolved. Incoming events are processed and notifications are sent in real time, prioritized ahead of the backlog of events that accumulated during the incident.
Posted Oct 26, 2020 - 05:14 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 26, 2020 - 04:46 PDT
Update
We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 20 minutes; after that, the processing pipeline will be operational again.
Posted Oct 26, 2020 - 04:33 PDT
Update
We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 15 minutes.
Posted Oct 26, 2020 - 04:10 PDT
Update
We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 30 minutes.
Posted Oct 26, 2020 - 03:37 PDT
Update
We are continuing to work on a fix for this issue. We estimate the fix to take effect in about one hour; until then, new events will not show up in the item list and notifications will not be triggered.
Posted Oct 26, 2020 - 03:17 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 26, 2020 - 02:36 PDT
Investigating
We are currently investigating this issue.
Posted Oct 26, 2020 - 02:27 PDT
This incident affected: Processing pipeline (Core Processing Pipeline).