Issue Discovered - Service disruption in North American Region – Web User Interface
Incident Report for xMatters
Postmortem

What happened?

On June 21, 2021, at approximately 10:05 AM Pacific, the xMatters monitoring tools alerted Customer Support to an issue in which the web user interface was unresponsive or slow. During the incident, some customers may have seen "Instance Unavailable" errors or experienced longer page load times when accessing the web user interface. This issue affected only the web user interface; events continued to be accepted and created, and notifications and responses were processed normally.

Why did it happen?

This issue was caused by a single instance attempting to load approximately 140,000 user records into memory. This eventually increased memory usage to 100%, resulting in an unresponsive service. Although the condition correctly triggered an automated restart of the web user interface service, the service could not fully recover until the underlying issue was mitigated.

How did we respond?

As soon as Customer Support received the alert from the monitoring tools and confirmed the issue, they initiated a Severity-1 incident and gathered the major incident response team. The team identified the instance responsible for consuming resources and isolated it within a dedicated resource stack to prevent any potential recurrence. The team then manually cleared the cache and restarted the web user interface service, confirming that it had resumed normal operation. 

What are we doing to prevent it from happening again?

The Engineering team has isolated the source of the memory usage and reconfigured it with dedicated CPU and separate resources to eliminate future incidents of this type. They are currently developing additional memory cleanup routines to further improve automated recovery, and are investigating how the single instance was able to consume the available memory. Until these improvements are in place, the team will continue to isolate the source of the memory consumption.

Timeline:

Date/Time (Pacific) - Action
Monday, June 21, 2021
10:05 AM - xMatters monitoring alerts on slow or unresponsive customer instances
10:17 AM - Severity-1 incident initiated
10:20 AM - Source of memory usage identified
10:22 AM - Instance isolated and web user interface service restarted
10:30 AM - Web user interface service declared stable
10:45 AM - Incident resolved

If you have any questions, please visit http://support.xmatters.com

Posted Jun 25, 2021 - 14:31 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jun 21, 2021 - 10:45 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Jun 21, 2021 - 10:35 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jun 21, 2021 - 10:32 PDT
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Jun 21, 2021 - 10:22 PDT
This incident affected: North America (Web Interface).