Experiencing intermittent unavailability from several regions

Incident Report for signageOS

Postmortem

Date: March 17, 2024

Authors: Václav Boch, DevOps Engineer

Summary
A critical failure in one of our MongoDB telemetry clusters resulted in a regional system-wide outage, impacting core services. Recovery time 10 mins.

Impact

  • Complete regional outage of Box and API.
  • Both services returned 503 errors, preventing customer access.
  • Devices in the field were not impacted.

Trigger
The incident was triggered by a segmentation fault (SegFault) occurring in rapid succession across multiple servers in the MongoDB platform cluster.

Detection

  • The issue was initially identified through internal monitoring.
  • Customer support tickets confirmed the impact shortly after detection.

Root Cause

  • A segmentation fault (SegFault) in the MongoDB server led to instability and service disruption.

Remediation Actions

  • Schedule an upgrade to a newer MongoDB version across all clusters to enhance stability.
  • Improve microservice dependencies to ensure that future incidents result in only partial outages rather than full system failures.
Posted 2 months ago. Mar 17, 2025 - 19:04 CET

Resolved

This incident has been resolved.
Posted 2 months ago. Mar 13, 2025 - 20:18 CET

Monitoring

A fix has been implemented and we are monitoring the results.
Posted 2 months ago. Mar 13, 2025 - 20:13 CET

Identified

The issue has been identified and a fix is being implemented.
Posted 2 months ago. Mar 13, 2025 - 20:11 CET

Investigating

We are currently investigating this issue.
Posted 2 months ago. Mar 13, 2025 - 20:01 CET
This incident affected: API and Box.