Analytics API is unreachable and data ingestion delayed, analytics may falsely appear empty
Incident Report for Swapcard
Postmortem

Please see our post-mortem below regarding a service disruption that affected the Analytics API and related services on May 1, 2023, from 4:29 AM UTC to 7:42 PM UTC. Affected customers may have been impacted to varying degrees and for a shorter period; this timeline runs from the first events to the official resolution time, which followed a period of close monitoring.

It is our goal in this post-mortem to provide details on our initial assessment of the incident communicated on the Swapcard status page and to describe the remediation actions that we have taken to restore service.

Incident summary

On May 1, 2023 at 4:29 AM UTC, we saw an increase in reports of issues with the “Lead Board” page in the Exhibitor Center. These reports marked the onset of the Analytics API issues.

On May 1, 2023, our Analytics database triggered an automatic scale-up of its disk size after reaching a usage threshold (used/free space). The auto-scaling altered our database indexes, which led to long-running queries. These caused a cascading failure of the Analytics API, which in turn caused issues with the “Lead Board” feature and the Developer API (Analytics endpoints only).

Swapcard monitoring took time to detect the database disruption, mostly because the database was not completely unreachable; the only reports came through support and concerned the “Lead Board” page. Following several reports of malfunctioning analytics-related features, the Swapcard Incident Response team was triggered and worked to triage and restore services to mitigate customer impact. In parallel, the cause of the issue was investigated and mitigations were put in place.
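
To illustrate the detection gap described above, where a database stays reachable but slow, the following minimal sketch classifies database health from query latency and the number of in-flight queries rather than from connectivity alone. It is not Swapcard's actual monitoring: the function name, the thresholds and the sample values are hypothetical.

```python
from statistics import quantiles


def classify_database_health(latencies_ms, in_flight_queries,
                             p95_limit_ms=2000.0, in_flight_limit=50):
    """Return "unreachable", "degraded" or "healthy".

    All thresholds are hypothetical and only serve the illustration.
    """
    if not latencies_ms:
        # No query completed in the sampling window: treat as unreachable.
        return "unreachable"
    if len(latencies_ms) == 1:
        p95 = latencies_ms[0]
    else:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(latencies_ms, n=20, method="inclusive")[-1]
    # A reachable database can still be degraded: slow or stacking queries.
    if p95 > p95_limit_ms or in_flight_queries > in_flight_limit:
        return "degraded"
    return "healthy"
```

With a check of this kind, a database that still answers connection probes would be flagged as "degraded" as soon as queries start stacking up, instead of only when it becomes fully unreachable.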

Mitigation deployment

To restore the Analytics services, our first mitigation was to attempt a hard restart of the database, as documented in our response plan, in order to clear the stacked queries (forced query termination). We noticed that the restart did not have the expected effect: queries were still stacking up and not completing.

Some Analytics queries were still resolved properly at that time, thanks to the caching system tempering the issue. The number of successful queries varied, depending on the freshness of the events performing API requests.
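
The partial success described above is consistent with how a read-through cache behaves when its backing database is failing: keys that were requested recently are still answered from the cache, while other keys hit the database and fail. The sketch below illustrates that general behaviour only; the class name, the time-to-live value and the loader callback are hypothetical and do not describe Swapcard's actual caching system.

```python
import time


class FreshnessCache:
    """Read-through cache with a time-to-live (hypothetical illustration)."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader      # function that queries the database
        self._ttl = ttl_seconds    # how long a cached entry stays fresh
        self._clock = clock
        self._entries = {}         # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self._ttl:
            # Fresh entry: served without touching the database.
            return entry[1]
        # Missing or expired entry: the database is queried and may fail.
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

In this sketch, a request for a recently cached key keeps succeeding during a database outage, whereas a request for an uncached or expired key propagates the database error to the caller.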

At this time the underlying issue had not yet been discovered, and the correlation with the earlier events had not been made. After several attempts at inspecting incoming traffic to rule out a slow DDoS, and at ruling out a malfunction of the circuit breaker, the Swapcard Incident Response team discovered a gap in the Analytics database indexes (in fact, the indexes existed but had been altered by the earlier events).

Once the underlying issue was discovered, and in order to restore the “Lead Board” and especially the lead export button as fast as possible, the team switched the Analytics API to an empty database for the time needed to restore the indexes (which is why analytics may have falsely appeared empty). Due to the load on the database, restoring the indexes while still receiving long and costly queries was not possible and would have greatly extended the resolution of the incident.
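
The temporary switch described above can be pictured as a router that sends reads to an empty placeholder store while the primary database is under maintenance. This is a minimal sketch under stated assumptions, not Swapcard's implementation: the class, the `maintenance` flag and the dictionary-based stores are hypothetical stand-ins.

```python
class AnalyticsRouter:
    """Routes reads to the primary store or to an empty placeholder store.

    Hypothetical illustration: plain dictionaries stand in for databases.
    """

    def __init__(self, primary, placeholder):
        self._primary = primary
        self._placeholder = placeholder
        self.maintenance = False  # True while indexes are being restored

    def query(self, event_id):
        backend = self._placeholder if self.maintenance else self._primary
        # An unknown event yields an empty result rather than an error.
        return backend.get(event_id, [])


primary = {"event-1": ["lead-a", "lead-b"]}
router = AnalyticsRouter(primary, placeholder={})
```

While `maintenance` is set, queries return empty results quickly and the primary store receives no read load; the data in the primary store is untouched and becomes visible again once the flag is cleared.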

Once the indexes were restored to their proper state, the Analytics API was switched back to the normal database and the pipeline was restored to ensure analytics metrics were properly computed and delivered. No data was lost or altered during the process.

At 7:42 PM UTC, Swapcard confirmed that the restoration was complete and that the API and the features depending on it were restored.

Event Outline

Events of 2023 May 1st (UTC)

(4:10 AM UTC) | Automatic disk auto-scaling on our Analytics database after reaching a usage threshold (used/free space).

(4:29 AM UTC) | Increase in reports of issues with the “Lead Board” page in the Exhibitor Center.

(3:00 PM UTC) | Swapcard Engineering found the underlying issue and started to establish a plan to restore the indexes.

(6:40 PM UTC) | The Analytics API is back online and the “Lead Board” page is reachable, while Swapcard Engineering monitors internal systems and the recovery of the database indexes.

(7:42 PM UTC) | Status post marked as resolved.

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Forward Planning

In accordance with our high standards of deliverability, Swapcard will conduct an internal audit of the autoscaling and failover capabilities used by the Analytics database. Automatic capacity upgrades and a failover replica were already in place, but this incident highlights the need for improvement.

We also found a gap in the monitoring and automatic issue detection on our Analytics database. It was resolved on May 2, following a post-mortem audit and a mitigation plan prepared on May 1.

We consider the likelihood of a recurrence of this issue to be low and will further reduce the risk as we make future interventions and improvements to our infrastructure and procedures.

Posted May 02, 2023 - 14:16 UTC

Resolved
This incident has been resolved. A more detailed post-mortem will be published later, as soon as short-, mid- and long-term mitigations are in place and planned.
Posted May 01, 2023 - 18:19 UTC
Monitoring
A fix has been implemented. We are working on restoring the analytics data; no data has been lost during the process.
Posted May 01, 2023 - 17:49 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 01, 2023 - 17:23 UTC
This incident affected: Studio, Exhibitor Center, and Developer API.