Between Thursday, June 16, 2022, 06:53 UTC and 11:52 UTC, all users of the Infinity Portal (EU and US regions) experienced degraded performance when logging in and when using any of the cloud applications hosted in the portal. The initial high latency, timeouts, and slowness quickly escalated into a complete outage.
The event was triggered by an extremely high load (10x the normal volume) on our internal portal services, generated internally by other applications within Check Point. Our internal alerting and client reports clearly pointed to a major issue.
The high load caused issues in our database layer (Redis), and initial efforts to identify or stop the source of the load failed. The incident was eventually mitigated by wiping our cache, which resolved the memory exhaustion and allowed the services to become healthy again and cope with the load. The system then returned to a stable state.
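A load spike like the one described above can be flagged early by comparing incoming request rates against a rolling baseline. The sketch below is illustrative only, not our actual alerting code; the window size and the 10x `SPIKE_FACTOR` are assumptions chosen to mirror the incident:

```python
from collections import deque

SPIKE_FACTOR = 10  # alert when load reaches 10x the rolling baseline (assumption)


class LoadSpikeDetector:
    """Keeps a rolling window of requests-per-second samples and flags spikes."""

    def __init__(self, window: int = 60, factor: float = SPIKE_FACTOR):
        self.samples = deque(maxlen=window)
        self.factor = factor

    def observe(self, rps: float) -> bool:
        """Record a sample; return True if it exceeds factor x the current baseline."""
        if self.samples:
            baseline = sum(self.samples) / len(self.samples)
            spike = rps >= self.factor * baseline
        else:
            spike = False  # no baseline yet
        self.samples.append(rps)
        return spike


detector = LoadSpikeDetector()
for _ in range(10):
    detector.observe(100.0)      # normal traffic: ~100 req/s
print(detector.observe(1000.0))  # a 10x spike -> True
```

A detector of this shape fires while the spike is still building, rather than waiting for downstream symptoms (latency, timeouts) to surface in client reports.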
Thursday, June 16, 2022, 06:00 UTC – Reports start to gather about issues with using portal.checkpoint.com. An alert is also triggered. A war room is created to diagnose the issue.
Thursday, June 16, 2022, 06:53 UTC – It is clear an incident is in progress. The status page is updated.
Thursday, June 16, 2022, 10:50 UTC – We identify a possible mitigation: wiping our Redis cache and applying a fix. The system shows signs of recovery.
Thursday, June 16, 2022, 11:52 UTC – It is clear the system is fully operational again, with no client-reported issues for a meaningful period. The incident is closed.
The Infinity Portal uses several authentication mechanisms. One of them, the "/auth/external" route, is heavily used by many of Check Point's services. An increase in usage of this route over several days over-utilized the memory of our Redis cache, a critical component of the portal's functionality. The cache became degraded, causing the major outage. Its memory alert had unknowingly been disabled, which is why the problem took a long time to identify. Wiping the cached data allowed our systems to recover.
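The disabled alert would essentially have compared Redis's used memory against its configured limit. A minimal sketch of such a check, parsing the text that Redis's `INFO memory` command returns; the sample values and the 80% threshold are illustrative assumptions, while the `used_memory` and `maxmemory` field names match real Redis output:

```python
ALERT_THRESHOLD = 0.80  # alert above 80% of maxmemory (illustrative assumption)


def parse_info(info_text: str) -> dict:
    """Parse the key:value lines of a Redis INFO section into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value
    return fields


def memory_alert(info_text: str, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when Redis memory usage crosses the alert threshold."""
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields["maxmemory"])
    return limit > 0 and used / limit >= threshold


# Fabricated sample of an "INFO memory" response at 90% utilization:
sample_info = """# Memory
used_memory:900000000
maxmemory:1000000000
maxmemory_policy:noeviction
"""
print(memory_alert(sample_info))  # 90% used -> True
```

Had a check like this been active, the cache's memory pressure would have been visible hours before the "/auth/external" traffic pushed it into a degraded state.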