Partial unavailability of telemetry data in Box
Incident Report for signageOS
Postmortem

Date

2023-03-07

Authors

Lukas Danek, CPO

Michael Zabka, CTO

Michal Artazov, DevOps Lead

Summary:

On 7 March 2023, telemetry data in the signageOS Box service was partially unavailable. The cause was a partial outage of the third-party service InfluxDB, which affected the retrieval and storage of telemetry data. Our internal monitoring tool detected the issue, and it was promptly addressed. This report covers the impact, trigger, detection, root causes, and remediation steps.

Impact:

The partial unavailability of telemetry data in Box had minimal impact on the overall functionality of our system. Although the issue affected the retrieval and storage of telemetry data, it did not impact any devices connected to signageOS, and no data was lost. However, without real-time telemetry data, system performance could not be analyzed and monitored accurately, which may have affected troubleshooting and diagnostics during the incident.

Trigger:

The issue was triggered by a partial outage of the third-party service InfluxDB, which disrupted the normal processing of telemetry data. The interruption hindered the retrieval and storage of telemetry data, resulting in partial unavailability.

Detection:

Our internal monitoring tool detected the partial unavailability of telemetry data by continuously monitoring the data flow and storage in the Box service. It raised alerts when it identified a deviation from the expected behavior, indicating a disruption in the telemetry data pipeline. The tool provided real-time visibility into the issue, enabling us to respond promptly and investigate the root cause.
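
As an illustration of the kind of check described above, here is a minimal Python sketch. It is not the actual signageOS monitoring tool, whose implementation this report does not describe. The sketch assumes the monitor compares the number of telemetry points stored in each time window against an expected baseline and alerts only after several degraded windows in a row. The names `FlowSample`, `is_degraded`, and `should_alert`, the 0.8 ratio, and the three-window rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FlowSample:
    """Telemetry points observed in one monitoring window (hypothetical shape)."""
    written: int   # points successfully stored during the window
    expected: int  # points expected, based on a recent baseline


def is_degraded(sample: FlowSample, min_ratio: float = 0.8) -> bool:
    """Return True when stored points fall below min_ratio of the baseline."""
    if sample.expected <= 0:
        return False  # no baseline yet, so there is nothing to compare against
    return sample.written / sample.expected < min_ratio


def should_alert(history: Sequence[FlowSample], consecutive: int = 3) -> bool:
    """Alert only after `consecutive` degraded windows in a row, to avoid flapping."""
    if len(history) < consecutive:
        return False
    return all(is_degraded(s) for s in history[-consecutive:])
```

Requiring several consecutive degraded windows is one way to keep a single slow window from raising an alert, at the cost of slightly later detection.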

Root Causes:

After thorough investigation, the following root cause was identified:

Partial outage of InfluxDB: The third-party service InfluxDB experienced a partial outage that disrupted data retrieval and storage. This interrupted the flow of telemetry data into the system, resulting in partial unavailability.

Remediation:

To address the issue and prevent its recurrence, the following steps were taken:

Restoring InfluxDB service: As the root cause was traced to the partial outage of InfluxDB, we worked closely with the service provider to resolve the underlying issues and restore full functionality. The InfluxDB service was reinstated, ensuring the seamless retrieval and storage of telemetry data.

Additional caching mechanism: To mitigate the impact of future service disruptions, we implemented an additional caching mechanism within our system. This caching layer temporarily stores telemetry data, allowing for limited availability even during service interruptions.

Stability checks and communication with InfluxDB support: We engaged with the InfluxDB support team to ensure stability and reliability moving forward. We requested stability checks and collaborated on potential preventive measures to mitigate the risk of similar issues in the future.
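
The report does not describe how the caching layer is built. Purely as an illustration, the Python sketch below shows one common pattern that fits the description of the caching step above: a read path that remembers the last successful result per device and serves it, marked as stale, when the backend cannot be reached. `CachedTelemetryReader`, the `fetch` callable, and the `Point` record shape are hypothetical and do not reflect the signageOS or InfluxDB APIs.

```python
from typing import Callable, Dict, List, Tuple

Point = Dict[str, object]  # hypothetical telemetry record


class CachedTelemetryReader:
    """Serves the last known telemetry when the backend is unreachable."""

    def __init__(self, fetch: Callable[[str], List[Point]]) -> None:
        self._fetch = fetch  # hypothetical call into the time-series backend
        self._last_good: Dict[str, List[Point]] = {}  # unbounded in this sketch

    def read(self, device_id: str) -> Tuple[List[Point], bool]:
        """Return (points, is_fresh); is_fresh is False for cached results."""
        try:
            points = self._fetch(device_id)
        except ConnectionError:
            # Backend unavailable: fall back to the last successful result.
            return self._last_good.get(device_id, []), False
        self._last_good[device_id] = points
        return points, True
```

Returning a freshness flag alongside the data lets a caller show that the values may be out of date instead of presenting cached telemetry as live. A production cache would also need a size limit and an expiry policy, both omitted here.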

By implementing these remediation steps, we have improved the resilience of our system, enabling continued availability of telemetry data and minimizing the impact of third-party service disruptions.

We apologize for any inconvenience caused by this issue and appreciate your patience and understanding as we worked to resolve it promptly. Our team remains committed to ensuring the highest level of service reliability and continuous improvement.

If you have any further questions or concerns, please feel free to reach out to our support team.

Posted May 26, 2023 - 22:25 CEST

Resolved
This incident has been resolved.
Posted Mar 07, 2023 - 22:09 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 07, 2023 - 17:14 CET
Update
We are continuing to work on a fix for this issue.
Posted Mar 07, 2023 - 11:50 CET
Identified
InfluxDB, a service that Box depends on, is experiencing an incident: https://status.influxdata.com/
Posted Mar 07, 2023 - 11:49 CET
This incident affected: Box.