[Major] Issues with Metadata Queries
Incident Report for Box
Postmortem

We recently addressed issues affecting the Metadata Query API. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

 

Between 9am and 2pm PT on May 5th, 2023, some users may have experienced difficulties while working in Box. During this time, some requests to the Metadata Query API returned stale results. The issue was caused by a delay in replicating instance create and update events to the query datastore. We resolved the issue by filtering out a set of events that were triggering retries and dead-lettering, which in turn delayed replication. In addition, we identified the underlying pattern of events that was delaying the replication pipeline and patched our replication processing code to handle such events gracefully, to prevent similar issues from occurring in the future.

Analysis 

When Metadata instances are mutated using the Box API or one of our applications, events containing the details of these mutations are emitted to a replication stream, which a pipeline then processes without delay. This ensures the latest state is consistently available to query through the Metadata Query API. We employ multiple optimizations to maintain the near-real-time characteristics of this query functionality, while ensuring all data is replicated consistently.
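To make the relationship between mutations, the replication stream, and query results concrete, here is a minimal toy model. It is only an illustrative sketch and does not reflect Box's actual implementation: the class `ReplicatedMetadata`, its methods, and the in-memory dictionaries standing in for the datastores are all hypothetical.

```python
from collections import deque


class ReplicatedMetadata:
    """Toy model: writes hit a primary store and emit events; queries read a replica."""

    def __init__(self):
        self.primary = {}      # authoritative metadata instances, keyed by id
        self.replica = {}      # stand-in for the query datastore
        self.stream = deque()  # replication stream of pending mutation events

    def mutate(self, instance_id, fields):
        # Apply the mutation and emit an event describing it.
        self.primary[instance_id] = dict(fields)
        self.stream.append((instance_id, dict(fields)))

    def process(self, max_events=None):
        # Drain up to max_events events from the stream into the replica.
        processed = 0
        while self.stream and (max_events is None or processed < max_events):
            instance_id, fields = self.stream.popleft()
            self.replica[instance_id] = fields
            processed += 1
        return processed

    def query(self, instance_id):
        # Queries only see what has been replicated so far.
        return self.replica.get(instance_id)


store = ReplicatedMetadata()
store.mutate("file-1", {"status": "draft"})
store.process()
store.mutate("file-1", {"status": "approved"})
stale = store.query("file-1")  # pipeline has not processed the second event yet
store.process()
fresh = store.query("file-1")
```

In this model, any event still sitting in the stream is invisible to queries, which is why delays in processing surface to callers as stale results.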

During this incident, one of the optimizations we leverage encountered an unexpected pattern of data, causing it to return errors and trigger retries. These retries slowed down processing of the replication stream, causing lag to grow and resulting in queries returning outdated information for some customers. After identifying and filtering out the events that were causing errors, we were able to get the replication pipeline to process events successfully again without adding delay. We also increased the rate at which replication events are processed, in order to work through the lag that had built up and fully catch up.
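The general failure mode described here, where events that always fail consume retries and hold up the events behind them, can be sketched as follows. This is a generic illustration of bounded retries with a dead-letter list, not Box's replication code: `run_pipeline`, `apply_event`, the event shape, and the retry budget of 3 are all assumptions made for the example.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry budget per event


def run_pipeline(events, apply_event, max_attempts=MAX_ATTEMPTS):
    """Process events in order; park events that keep failing in a dead-letter list."""
    pending = deque(events)
    dead_letters = []
    attempts_made = 0
    while pending:
        event = pending.popleft()
        for attempt in range(1, max_attempts + 1):
            attempts_made += 1
            try:
                apply_event(event)
                break  # success, move on to the next event
            except ValueError:
                if attempt == max_attempts:
                    # Give up on this event so later events are not blocked.
                    dead_letters.append(event)
    return dead_letters, attempts_made


replica = {}


def apply_event(event):
    # Hypothetical handler: rejects an event shape it does not expect.
    if event.get("fields") is None:
        raise ValueError("unexpected event shape")
    replica[event["id"]] = event["fields"]


events = [
    {"id": "a", "fields": {"v": 1}},
    {"id": "bad", "fields": None},
    {"id": "b", "fields": {"v": 2}},
]
dead, attempts = run_pipeline(events, apply_event)
```

Each failing event costs the full retry budget before it is set aside, so a burst of such events multiplies the work per event and lets lag accumulate. Handling the unexpected shape inside the handler avoids that cost entirely.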

Corrective Actions

The following corrective actions have been completed or are planned:

  • Identified the event pattern that caused errors and retries, and patched the replication code to handle unexpected data patterns gracefully.
  • Increased batch size for replication stream processing to prevent lag from building up and to enable faster recovery from any backlogs in the future.
  • Improved observability and increased the sensitivity of alerts, to ensure faster detection of errors and lag buildup.
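As a rough illustration of why throughput and alerting both matter for recovery, the sketch below estimates how long a backlog takes to drain and checks lag against a threshold. The function names, the rates, the backlog size, and the threshold are invented for the example and are not figures from this incident.

```python
def catch_up_seconds(backlog, arrival_rate, processing_rate):
    """Seconds needed to drain `backlog` events, or None if the backlog never shrinks."""
    net_rate = processing_rate - arrival_rate  # events per second actually gained
    if net_rate <= 0:
        return None
    return backlog / net_rate


def lag_alert(newest_event_ts, last_replicated_ts, threshold_seconds):
    """True when replication lag exceeds the alert threshold."""
    return (newest_event_ts - last_replicated_ts) > threshold_seconds


# Illustrative numbers only: 60,000 queued events, 100 new events per second.
slow = catch_up_seconds(backlog=60_000, arrival_rate=100, processing_rate=120)
fast = catch_up_seconds(backlog=60_000, arrival_rate=100, processing_rate=300)
```

With these made-up numbers, raising throughput from 120 to 300 events per second shortens the drain time from 3000 seconds to 300 seconds, because only the margin above the arrival rate reduces the backlog. A tighter alert threshold shortens the time during which a backlog can grow unnoticed.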

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

 

Sincerely,

The Box Team

Posted May 17, 2023 - 07:57 PDT

Resolved
This incident has been resolved and Metadata Query functionality has been restored.
Posted May 05, 2023 - 14:16 PDT
Update
We are continuing to investigate this issue.
Posted May 05, 2023 - 12:22 PDT
Investigating
We are actively investigating customer reports of an issue that may impact Metadata queries for our Content API. We will provide updates here as soon as possible.
Posted May 05, 2023 - 12:22 PDT
This incident affected: Box Platform / API (Content API).