OpenText’s postmortem below:
Incident Summary:
On 7 March 2022, Central 1’s external websites became inaccessible due to an increased load on the LiveSite application. OpenText support teams restored service approximately 75 minutes after receiving alerts of the incident.
How was the client impacted?
Customer websites were not available to visitors.
What services were impacted?
LiveSite
Why did the incident occur?
Code inefficiencies (recursive logic) produced a large increase in requests, overloading the LiveSite nodes.
How was the incident resolved?
The support team restarted the system multiple times, but services went down again immediately after each restart. The team then increased the number of LiveSite nodes from 4 to 8. With this increase in resources, the LiveSite application was restored.
Preventative Actions
Implement auto-scaling for LiveSite nodes. This will automatically add resources to handle increased loads. Target: April 29, 2022
Eliminate recursive logic in custom code. (Completed, waiting to be deployed to PROD.) Target: April 1, 2022
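As an illustration of the auto-scaling action above, here is a minimal sketch of a target-tracking scaling rule. It is not OpenText's implementation: the function name, the load metric, the target value, and the node bounds are all hypothetical and chosen only to show how a node count could be derived from observed load.

```python
import math


def desired_node_count(current_nodes: int,
                       avg_load_per_node: float,
                       target_load_per_node: float,
                       min_nodes: int = 4,
                       max_nodes: int = 16) -> int:
    """Return how many nodes would bring average load back to the target.

    All parameters are hypothetical; a real autoscaler would also apply
    cooldown periods and smoothing before acting on a single reading.
    """
    if current_nodes <= 0 or target_load_per_node <= 0:
        raise ValueError("node count and target load must be positive")

    # Total load currently being served by the pool.
    total_load = current_nodes * avg_load_per_node

    # Nodes needed so that each one carries at most the target load.
    needed = math.ceil(total_load / target_load_per_node)

    # Clamp to the configured bounds so the pool never shrinks or grows without limit.
    return max(min_nodes, min(max_nodes, needed))
```

With these assumed numbers, a pool of 4 nodes running at 90% utilisation against a 50% target would be scaled to 8 nodes, while a lightly loaded pool would stay at the configured minimum.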
Notes
There was a similar incident affecting LiveSite availability on Feb. 1, 2022. After that event, we modified custom code to limit the depth of recursive rendering requests, but this limit was not adequate for the 4 LiveSite nodes in use at the time of the March 7 incident. We believe the 8 nodes now in place will support any recursive requests, because the limit is still in place and we observed at most 6 simultaneous recursive requests, not infinite recursion. This recursion will be eliminated from the code in the next deployment.
~OpenText
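The depth limit described in the Notes can be pictured with a small sketch. This is not the actual LiveSite custom code: the `render` function, the component map, and the limit value are hypothetical, and the postmortem does not state what limit was configured. The sketch only shows why a depth cap turns unbounded recursion into a bounded, but still multiplied, number of rendering requests.

```python
from typing import Dict, List

# Hypothetical cap on nested rendering requests; the real value is not given in the postmortem.
MAX_DEPTH = 6


def render(component: str,
           children: Dict[str, List[str]],
           depth: int = 0,
           max_depth: int = MAX_DEPTH) -> int:
    """Render a component and everything it embeds.

    Returns the number of rendering requests issued, which stands in for
    the load placed on the rendering nodes.
    """
    if depth >= max_depth:
        # Limit reached: stop recursing instead of issuing another request.
        return 0

    requests = 1  # the request for this component itself
    for child in children.get(component, []):
        requests += render(child, children, depth + 1, max_depth)
    return requests


# A component that embeds itself would recurse forever without the cap.
self_referencing = {"page": ["page"]}
print(render("page", self_referencing))  # bounded by MAX_DEPTH

# A component tree without recursion issues one request per component.
flat = {"page": ["header", "footer"]}
print(render("page", flat))
```

Under this sketch, one visitor request to a self-referencing component costs up to `max_depth` rendering requests rather than one, so the cap prevents infinite recursion but still multiplies the load each visitor generates. That is consistent with the report that the limit alone was not enough for 4 nodes.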