Postmortem: February 24th, RSA Service Degradation | INC153691 P2
Central 1 published a postmortem for RSA latency and service degradation on January 21st, February 1st and February 7th (INC153074). The RSA latency experienced on Friday, February 24th (INC153691) between 9:50 a.m. and 1 p.m. PT (12:50 p.m. to 4 p.m. ET) and again on Monday, February 27th (INC153775) from 8:55 a.m. to 12:50 p.m. PT (11:55 a.m. to 3:50 p.m. ET) had a different point of failure but similar contributing factors: database size and query efficiency. Services recovered on their own on both days.
Our RSA services never experienced an outage, but they did experience intermittent service degradation:
On Friday, we believed the point of failure was a Network Interface Card (NIC) on a SQL Virtual Machine (VM) that had become saturated, slowing RSA queries to the database. On Friday, February 24th, we completed an urgent change to guarantee network capacity for the backend database server. As the incident resurfaced on Monday, we continued with a series of actions to help alleviate the load on RSA. Our diagnosis linked the point of failure to RSA volumes, which were driven by persistent brute force attacks, systemic growth in 2SV, and the additional controls within RSA that FIs are implementing to combat increased cyber fraud activity.
On Monday, February 27th, Central 1 began to reduce the RSA database (DB) size (by decreasing the DB retention policy) while improving query efficiency. Central 1 also delayed new 2SV launches for 3 weeks to ensure key resources were freed up to assist with the pending RSA stability actions. These actions included a full review of RSA rulesets for effectiveness and systemic impacts, an evaluation of transitioning RSA to an independent SQL structure, and the addition of new Web Application Firewall (WAF) rules on OAuth to reduce the volumes that can reach the RSA service.
With the RSA upgrade, which was completed on March 19th (https://www.secure.central1.com/News/Pages/Cyber%20Security/Increased-Authentication-RSA-Outseer-Application-Upgrade-March-2023.aspx), and our RSA service changes, we have seen an overall increase in performance and the ability to sustain high peak traffic without latency. Central 1 will continue our review of the service as we look for additional efficiencies in our roadmap.
Actions:
PRB011066 – Root cause analysis for RSA service degradation
Assigned to: Product
Due date: COMPLETED
RITM331022 – Review of RSA services to independent SQL cluster
Assigned to: Platform
Due date: by end of April
RITM331024 – RSA Policy/Rule Cleanup
Assigned to: Product/Cyber Fraud Support
Due date: by end of Q2
PRB011075 – Ongoing RSA Performance Analysis by Product Management
Assigned to: Product
Due date: COMPLETED
Central 1 recently experienced a service disruption that caused inconvenience and frustration for some of our customers. We want to assure all of our customers that we are fully committed to improving our service delivery and taking the necessary steps to prevent similar disruptions in the future. We have conducted a thorough postmortem analysis of the incident and identified several areas where we can make improvements.
If you have any questions about this postmortem, please reach out to me to discuss.
Jason Seale
Director, Client Support Services
jseale@central1.com | 778.558.5627