US Intermittent outages

Incident Report for CC Status Page

Postmortem

Issue: Intermittent outages in the US region for community and control

At seemingly random times the platform would slow down, and pages would load slowly or time out completely.

Cause:

We experienced a targeted, massively distributed, high-traffic load. Although the platform firewall eventually blocked it, the load was still high enough to take the US region down for around 15 minutes.

Due to an overwhelming volume of requests, our system experienced capacity overload. Within a span of minutes, we recorded hundreds of thousands of GET requests originating from thousands of unique IP addresses. These requests were primarily directed towards the root URL ("/") of a specific community.

Once capacity was overwhelmed, end users saw load timeouts or error messages on screen.
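
The traffic pattern described above (a very large number of GET requests, spread across many IP addresses, aimed at a single URL) can be spotted by grouping access-log entries by target and counting both hits and unique client IPs. The following is only a minimal sketch of that idea, not the detection logic used on the platform: the `Request` shape, the function names, and the thresholds are all hypothetical, and the sample data is synthetic.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One access-log entry (hypothetical, simplified shape)."""
    timestamp: float  # seconds since the start of the log
    ip: str
    method: str
    path: str


def summarize_window(requests, start, end):
    """Return {(method, path): (hit_count, unique_ip_count)} for [start, end)."""
    hits = Counter()
    ips = defaultdict(set)
    for r in requests:
        if start <= r.timestamp < end:
            key = (r.method, r.path)
            hits[key] += 1
            ips[key].add(r.ip)
    return {key: (hits[key], len(ips[key])) for key in hits}


def looks_distributed(summary, key, min_hits, min_ips):
    """Flag a target that is both high-volume and spread over many IPs."""
    count, unique_ips = summary.get(key, (0, 0))
    return count >= min_hits and unique_ips >= min_ips


# Synthetic sample: 1000 GET "/" requests from 200 distinct addresses,
# plus one ordinary request to another (made-up) page.
sample = [
    Request(float(i % 60), f"10.0.{i % 50}.{i % 200}", "GET", "/")
    for i in range(1000)
]
sample.append(Request(5.0, "192.0.2.1", "GET", "/topic/1"))

summary = summarize_window(sample, 0.0, 60.0)
root_flagged = looks_distributed(summary, ("GET", "/"), min_hits=500, min_ips=100)
other_flagged = looks_distributed(summary, ("GET", "/topic/1"), min_hits=500, min_ips=100)
```

Counting unique IPs alongside raw hits matters here because a per-address limit alone would not catch traffic that is spread thinly across thousands of sources.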

Resolution:

We mitigated the issue quickly, which limited both the duration of the outage and its effect on the platform. Our security and engineering teams stopped the attack and restored normal service operations within 15 minutes.

We can confirm that 100% of our customers' data remains secure. The impact of the incident was confined to occasional sluggish performance and intermittent timeouts during page loading, and the integrity and security of data were not compromised.

Mitigation:

We are implementing a set of measures to reduce the likelihood of similar incidents and to mitigate any disruptions more quickly if they do occur. These measures cover both technical enhancements and organizational improvements to our response capabilities.

Technical Measures:

Enhancing our infrastructure scalability to accommodate sudden spikes in traffic and prevent system overloads.
Implementing more robust monitoring and alerting systems to detect anomalies and proactively address potential issues before they escalate.
Improving our caching mechanisms and load balancing strategies to optimize performance and minimize the risk of service degradation.
Conducting thorough reviews and updates to our security protocols to bolster resilience against malicious attacks and unauthorized access.
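
As an illustration of the caching measure listed above, the sketch below shows how a short-lived cache in front of an expensive page render can absorb a burst of identical requests, such as repeated GETs for one community's root URL. It is a simplified example under stated assumptions, not a description of the platform's actual caching layer: `TTLCache`, `FakeClock`, `render_home`, the cache key, and the 5-second TTL are all hypothetical.

```python
import time


class TTLCache:
    """Cache rendered responses for a short time-to-live (in seconds)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_render(self, key, render):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        value = render()
        self._entries[key] = (now + self.ttl, value)
        return value


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
cache = TTLCache(ttl=5.0, clock=clock)
render_times = []


def render_home():
    # Stand-in for an expensive page render; records when it was called.
    render_times.append(clock.now)
    return "<html>home</html>"


# A burst of 1000 identical requests inside the TTL renders the page once.
for _ in range(1000):
    cache.get_or_render(("community-a", "/"), render_home)
renders_during_burst = len(render_times)

# After the TTL has passed, the next request triggers a fresh render.
clock.now = 6.0
cache.get_or_render(("community-a", "/"), render_home)
renders_after_expiry = len(render_times)
```

The trade-off in a design like this is freshness: a longer TTL shields the backend more but serves older content, so the value would need tuning per page type.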

Organizational Improvements:

Enhancing our incident response procedures to streamline communication and coordination across teams, ensuring a more efficient and effective response to disruptions.
Providing ongoing training and awareness programs to our staff to strengthen their understanding of potential risks and best practices for mitigating them.
Establishing clearer escalation paths and decision-making frameworks to facilitate quicker resolution of critical issues.

These proactive measures underscore our unwavering commitment to safeguarding the stability and reliability of our services, and we remain dedicated to continually improving our processes to better serve customers.

Timeline (CEST):

25th April 17:09 - First automated alarm triggered, indicating that page load times for communities had exceeded the allowed threshold.

17:11 - Incident response team was mobilized internally

17:14 - Incident was escalated due to severity

17:16 - The malicious traffic was identified and isolated from the normal traffic. A firewall adjustment was made to counter it.

17:24 - The traffic on the platform returned to normal parameters and the disruption period was over
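
The response intervals implied by the timeline above can be worked out directly from its timestamps. The short sketch below does that arithmetic; the timestamps are taken from the timeline, while the shortened labels and the helper name `minutes_between` are introduced here for illustration only.

```python
from datetime import datetime

# (time, shortened label) pairs copied from the timeline entries.
timeline = [
    ("17:09", "first automated alarm"),
    ("17:11", "incident response team mobilized"),
    ("17:14", "incident escalated"),
    ("17:16", "malicious traffic isolated, firewall adjusted"),
    ("17:24", "traffic back to normal"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_alarm = timeline[0][0]
elapsed = {label: minutes_between(first_alarm, t) for t, label in timeline}

for label, minutes in elapsed.items():
    print(f"+{minutes:2d} min  {label}")
```

Measured from the first alarm, the team was mobilized after 2 minutes, the firewall adjustment came after 7 minutes, and traffic returned to normal after 15 minutes, which matches the "around 15 minutes" of disruption stated in the report.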

We greatly appreciate your patience and understanding throughout this incident. If you have any further questions or concerns, please don't hesitate to contact our support team at ccsupport@gainsight.com

Posted May 03, 2024 - 14:36 CEST

Resolved

Our team worked swiftly to mitigate the impact, and we are pleased to confirm that the incident is now resolved.

Incident Overview: A large-scale DDoS attack temporarily disrupted our services, causing intermittent downtime and degraded performance for some users. We apologize for any inconvenience this may have caused. We can confirm that 100% of customer data is safe; the impact was limited to some slowness and timeouts in page loading.

Resolution: Our security and engineering teams successfully mitigated the attack and restored normal service operations within 15 minutes, after which all systems were functioning as expected.

Preventive Measures: We are conducting a thorough investigation into the root cause of the attack to strengthen our defenses and mitigate future incidents. Additionally, we continuously monitor our network for any signs of suspicious activity to safeguard your data and ensure uninterrupted service.

We greatly appreciate your patience and understanding throughout this incident. If you have any further questions or concerns, please don't hesitate to contact our support team at ccsupport@gainsight.com

Posted Apr 26, 2024 - 09:09 CEST

Monitoring

The issue has subsided and we are continuing to monitor the platform.

Posted Apr 25, 2024 - 17:46 CEST

Identified

The issue has been identified. The platform is returning to normal and we are monitoring it closely.

Posted Apr 25, 2024 - 17:32 CEST

Investigating

We are currently investigating this issue with the highest priority.

Posted Apr 25, 2024 - 17:17 CEST

This incident affected: Status of our US Community Infrastructure.