Degraded performance and unavailability of the Infinity Portal
Incident Report for Check Point Services Status
Postmortem

Summary

Between Thursday, June 16, 2022, 06:53 UTC and 11:52 UTC, all users of the Infinity Portal (EU and US regions) experienced degraded performance when logging in and when using the cloud applications hosted in the portal. The initial high latency, timeouts, and slowness quickly escalated into a complete outage.

The event was triggered by extremely high load (10x normal) on the portal's internal services, generated internally by other applications within Check Point. Our internal alerting and client reports clearly pointed to a major issue.

The high load caused database issues (Redis), and our initial efforts to identify or stop the load failed. The incident was eventually mitigated by wiping our cache, which resolved its memory problems and allowed the services to become healthy again and cope with the load. The system then returned to a stable state.

Incident Timeline

Thursday, June 16, 2022, 06:00 UTC – Reports of issues using portal.checkpoint.com begin to accumulate and an alert is triggered. A war room is created to diagnose the issues.

Thursday, June 16, 2022, 06:53 UTC – It’s clear an incident has started. The status page is updated.

Thursday, June 16, 2022, 10:50 UTC – We identify wiping our Redis cache and applying a fix as a possible mitigation. The system shows signs of recovery.

Thursday, June 16, 2022, 11:52 UTC – The system is clearly back to fully operational, with no client-reported issues for a meaningful period of time. The incident is closed.

Root Cause Analysis

The Infinity Portal uses several authentication mechanisms. One of them, the “/auth/external” route, is heavily used by many of Check Point’s services. Increased usage of this route over several days over-utilized the memory of our Redis cache, a critical component of the portal’s functionality. The cache became degraded, causing the major outage. Its memory alert had been unknowingly disabled, so the problem took a long time to identify. Wiping the cache data allowed our systems to recover.
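For illustration only, the sketch below (using the standard redis-py client) shows the kind of memory check that would have surfaced the problem earlier. The host name and the 80% threshold are hypothetical and are not taken from our actual configuration.

    import redis

    # Hypothetical connection details; real deployment values are internal.
    r = redis.Redis(host="redis.internal.example", port=6379)

    # INFO MEMORY reports current usage; used_memory and maxmemory are in bytes.
    mem = r.info("memory")
    used = mem["used_memory"]
    limit = mem.get("maxmemory", 0)  # 0 means no maxmemory limit is configured

    # Hypothetical alert threshold: warn once usage crosses 80% of the limit.
    if limit and used / limit > 0.8:
        print(f"ALERT: Redis memory at {used / limit:.0%} of maxmemory")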

Next Steps

  1. Defining tasks for a major effort of broad stability and operability improvements, to be implemented over the coming weeks
  2. Fixing the alerting for critical areas of our system
  3. Improving the handling of the Redis data lifecycle (see the illustrative sketch below)
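As an illustrative sketch of what better handling of the data lifecycle could look like (the key name, TTL value, and eviction policy here are assumptions, not our actual configuration), cache writes can carry an explicit expiry and the instance can enforce a bounded-memory eviction policy:

    import redis

    r = redis.Redis(host="redis.internal.example", port=6379)  # hypothetical host

    # Write cache entries with an explicit TTL (ex=seconds) so stale data
    # expires on its own instead of accumulating until memory is exhausted.
    r.set("auth:external:session:abc123", "token-payload", ex=3600)

    # A bounded-memory eviction policy is a further safeguard: evict the
    # least-recently-used keys once the configured maxmemory is reached.
    r.config_set("maxmemory-policy", "allkeys-lru")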
Posted Jun 17, 2022 - 10:45 UTC

Resolved
The incident has been resolved and the system is back to normal. We will learn from this and improve. Thank you for your patience.
Posted Jun 16, 2022 - 11:52 UTC
Update
We are continuing to monitor for any further issues.
Posted Jun 16, 2022 - 11:26 UTC
Monitoring
A fix has been implemented and we are monitoring the results. Metrics have improved and we appear to be back to normal.
Posted Jun 16, 2022 - 11:04 UTC
Update
The Infinity Portal (and, as a result, all cloud applications) is experiencing slow performance and unavailability due to internal high load.
Posted Jun 16, 2022 - 07:00 UTC
Investigating
Customers may experience tunnel issues and slow response times. This stems from an issue in the Infinity Portal.
Posted Jun 16, 2022 - 06:53 UTC
This incident affected: Infinity Portal (Infinity Portal EU Region, Infinity Portal US Region) and Quantum Smart-1 Cloud (Quantum Smart-1 Cloud - EU Region).