Increased failure rate for the US Infinity Portal
Incident Report for Check Point Services Status
Postmortem

Summary

Between Friday, Sep 09, 2022, 15:31 UTC to 17:40 UTC, most users of the Infinity Portal in the US region have experienced a partial outage of the login and usage of cloud applications hosted in the portal.

The event was triggered by an AWS Systems Manager incident in the N. Virginia region affecting a critical component in our system that consumes its configuration and credentials from the AWS SSM. Our internal alerting and client reports were clear to point on a major issue.

We have managed to quickly work around that dependency and recovered our system.

Incident Timeline

Friday, Sep 09, 2022, 15:30 UTC – Reports start to gather on issues with using the portal.checkpoint.com in the US region. Also an alert is triggered. A war room is created to diagnose the issues.

Friday, Sep 09, 2022, 15:36 UTC – AWS posts for the first time about an SSM incident.

Friday, Sep 09, 2022, 16:55 UTC – Root cause and scope of the incident are clear. The status page is updated.

Friday, Sep 09, 2022, 17:25 UTC – We are applying a work around to avoid the dependency with SSM and recover the system. The system shows signs of recovery.

Friday, Sep 09, 2022, 18:01 UTC – It’s clear the system is back to being fully operational. No reported issues by clients for meaningful time. Closing the incident.

Root Cause Analysis

The Infinity Portal uses several managed services by AWS. One of them is AWS Systems Manager which is used in one critical component of the system to load configuration and secrets upon startup. Once it became degraded that critical component malfunctioned.

Next Steps

  1. Defining a clear playbook to being able to recover from an SSM outage in minutes.
  2. Improve alerting of malfunctions with SSM
Posted Sep 09, 2022 - 19:48 UTC

Resolved
This incident has been resolved.
Posted Sep 09, 2022 - 18:01 UTC
Monitoring
A fix has been implemented to workaround the dependency with AWS System Manager. The portal seem to be back to normal. We are monitoring client reports to make sure.
Posted Sep 09, 2022 - 17:48 UTC
Identified
The specific components affected by the AWS System Manager incident have been identified and we are applying a workaround to avoid being affected. Hoping to resolve in the next hour.
Posted Sep 09, 2022 - 16:56 UTC
Investigating
Due to an AWS Systems Manager outage in the US, serving our configuration management system of the infinity portal, we are experiencing a partial service disruption and increased failure rate serving the Infinity Portal to the US clients.
RSS of the AWS incident: https://health.aws.amazon.com/health/status
Posted Sep 09, 2022 - 16:55 UTC
This incident affected: Infinity Portal (Infinity Portal US Region).