Alert! INC153074 - OAuth/2SV Login Unavailable
Incident Report for Central 1
Postmortem

Postmortem: RSA OAuth and 2SV Recent Degradation | INC153074

Central 1’s RSA service, which is used for our Increased Authentication and 2SV services, experienced degradation on January 21st for 49 minutes (INC152479), on February 1st for 2 hours and 11 minutes (INC152835), and on February 7th for 51 minutes (INC153074). During the service degradation, features that rely on the RSA risk score and/or require the RSA database for decision-making were unavailable. The impact to members was:

  • The fail-open configuration allows members to log in without the RSA service, so there was no availability impact for desktop login clients configured to fail open. However, high-risk logins were not prompted for step-up authentication during this time. Clients configured to fail closed would not have been able to log in to online banking. (A simplified sketch of the fail-open versus fail-closed behaviour follows this list.)
  • Other services dependent on the risk scoring would have been impacted as follows: 

    • The primary impact was to e-Transfers (v3.4), where we experienced a 90% outage on the send and add-contact functions. e-Transfer receive was not impacted, and e-Transfers for Business (v3.5) was not impacted during the service degradation.
    • Bill Payments experienced a 60% degradation in sending payments. The impact was specifically on payments to high-risk billers (such as credit cards, which are risk-scored).
    • Biometric login experienced an outage, although members could still have completed a standard non-biometric login.
    • Clients using transactional step-up would have seen e-Transfer and Bill Payments fail, with members receiving an error message.
    • Forge 2.0 clients would have seen their left-hand navigation fail, as a dependent call to the RSA database is required to load the correct navigation configuration.
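
The following is a minimal, illustrative sketch of the fail-open versus fail-closed behaviour described in the first bullet above. The class name, method names, and risk threshold are hypothetical and are not Central 1's or RSA's actual implementation:

    // Hypothetical sketch only: shows how a per-client fail-open/fail-closed
    // setting changes login behaviour when the RSA risk-scoring call fails.
    public class LoginRiskGate {

        enum Decision { ALLOW, STEP_UP, DENY }

        interface RiskScoringClient {
            int score(String memberId) throws Exception;
        }

        private final boolean failOpen;        // per-client configuration
        private final RiskScoringClient rsa;   // wrapper around the RSA scoring call

        LoginRiskGate(boolean failOpen, RiskScoringClient rsa) {
            this.failOpen = failOpen;
            this.rsa = rsa;
        }

        Decision evaluate(String memberId) {
            try {
                // Hypothetical threshold: a high score normally triggers a 2SV prompt.
                int score = rsa.score(memberId);   // may time out while RSA is degraded
                return score >= 700 ? Decision.STEP_UP : Decision.ALLOW;
            } catch (Exception rsaUnavailable) {
                // Fail-open: the member is allowed in, but no step-up prompt occurs.
                // Fail-closed: the member cannot log in to online banking at all.
                return failOpen ? Decision.ALLOW : Decision.DENY;
            }
        }
    }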

Known point of failure: The RSA database receives weekly index maintenance to optimize database performance. This job runs on Saturdays starting at 1:30 a.m. PT (4:30 a.m. ET), typically takes 4 to 5 hours to complete, and does not interrupt production. On Saturday, January 21st, the job did not finish until 7:55 a.m. PT (10:55 a.m. ET), taking 6 hours and 24 minutes to complete. In reviewing the incident, our database administration team determined that our indexing job collided with another maintenance job scheduled to run between 7:05 and 7:55 a.m. PT (10:05 and 10:55 a.m. ET). The second job (an update of database statistics) blocked records from being written to the database, preventing the scoring services from working. The indexing job has been slowly increasing in duration over time as the database has grown. Data is purged after 13 months, but with the increase in 2SV adoption and the increased volume of brute force attacks (which create records), the table has grown large enough that an optimization review needs to be completed with our vendor.

On February 1st and 7th, the point of failure has been attributed to an application threading problem caused by a Drools bug. Drools provides our decision engine processing rules. Our current version has a known bug that can cause database connection pools to become locked (connections are not closed). When connections to the database reach 100% of the pool, the service can no longer perform service requests. To help mitigate this problem, Central 1 has added 4 additional production RSA (MDAuth) servers, which will reduce the chance of the threading issue recurring. Our long-term solution is an RSA and Drools version upgrade (CHG131855 | SVM-2612).
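
For illustration only, the sketch below shows the general connection-leak pattern, how a connection that is never returned to a fixed-size pool eventually exhausts it, and the try-with-resources pattern that prevents the leak. The table name, query, and class name are hypothetical; this is not the specific Drools defect or RSA's code:

    // Hypothetical sketch: connection-pool exhaustion through unclosed connections.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class ScoringDao {

        private final DataSource pool;   // e.g. a fixed-size connection pool

        public ScoringDao(DataSource pool) {
            this.pool = pool;
        }

        // Leaky pattern: if executeQuery() throws, close() is never reached and the
        // connection is never returned to the pool. Under sustained load the pool
        // reaches 100% utilisation and every later request blocks or fails.
        public int scoreLeaky(String deviceId) throws SQLException {
            Connection c = pool.getConnection();
            PreparedStatement ps =
                c.prepareStatement("SELECT score FROM device_risk WHERE device_id = ?");
            ps.setString(1, deviceId);
            ResultSet rs = ps.executeQuery();
            int score = rs.next() ? rs.getInt(1) : 0;
            rs.close(); ps.close(); c.close();   // skipped entirely on an exception
            return score;
        }

        // Safe pattern: try-with-resources always returns the connection,
        // even when the query fails.
        public int scoreSafe(String deviceId) throws SQLException {
            try (Connection c = pool.getConnection();
                 PreparedStatement ps =
                     c.prepareStatement("SELECT score FROM device_risk WHERE device_id = ?")) {
                ps.setString(1, deviceId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getInt(1) : 0;
                }
            }
        }
    }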

Central 1 is also implementing Dynatrace in QA and on the new production servers to help improve our monitoring and proactive triage of errors.

The underlying trigger for this point of failure (the Drools bug) is still unknown. Recent increases in RSA traffic due to continuing implementations, new policies, and new cases are contributors, but other factors which may have pushed the service past a daily traffic threshold, causing the Drools bug to surface, are:

  • Large-scale brute force attacks creating millions of new records in our RSA database.
  • Onboarding large new clients onto RSA via our ongoing 2SV rollout, as well as new RSA policies implemented by clients, has also increased the overall record count.
  • We are also looking at our current site monitoring tool, Site24x7, to implement an aggregator cookie, as its logins every 5 minutes add new device records. (A sketch of cookie reuse in a synthetic check follows this list.)
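
As an illustration of the aggregator-cookie idea, the sketch below shows a synthetic login check that keeps one cookie jar across checks, so the monitored site sees the same device identity rather than creating a new device record on every run. The URL, credentials, and class name are placeholders, not Site24x7's actual configuration:

    // Hypothetical sketch: a synthetic login probe that reuses cookies between checks.
    import java.net.CookieManager;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SyntheticLoginCheck {

        // One shared client: its CookieManager keeps the device/session cookies
        // between checks for the lifetime of the monitoring process.
        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        public static void main(String[] args) throws Exception {
            HttpRequest login = HttpRequest.newBuilder()
                    .uri(URI.create("https://banking.example.com/oauth/login"))  // placeholder URL
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("user=monitor&pass=secret"))
                    .build();

            HttpResponse<String> response =
                    CLIENT.send(login, HttpResponse.BodyHandlers.ofString());
            System.out.println("Login check status: " + response.statusCode());
        }
    }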

Central 1 completed its most recent health check of our RSA services with a 3rd-party vendor (Saviium) in 2021. The health check reviews the overall RSA service, but we believe introducing so many changes at once between reviews contributed to the degradation of service. Central 1 is working on several initiatives, highlighted in our actions below, to mitigate further impacts while we upgrade our services and build a strong roadmap for service stability going forward.

 

Actions:

PRB011044 - RSA service degradation root cause analysis

Assigned to: Product Management

Due Date: Closed

 RITM327058 – Build RSA service/DB monitoring/alerting

Assigned to: Platform

Due Date: Closed

 RITM329455 – RSA Improvements Roadmap

Assigned to: Product/Bart

Due Date: End of April 2023

PRB011075 – Ongoing RSA Performance Analysis by Product Management

Assigned to: Bart Venlet

Due Date: End of April 2023

 PRB011076 – Ongoing RSA Performance Analysis by Platform

Assigned to: Quintin Paulson

Due Date: End of April 2023

 

At our company, we take the quality of our service delivery very seriously. We understand that when our services are not working as expected, it can have a significant impact on our customers and their businesses. We believe that the best way to address any issues that arise is to be transparent about them and work diligently to improve our processes and systems. Our upcoming RSA upgrade and improvement/stability changes, along with improved monitoring and reviews, will mitigate the chances of such incidents occurring again.

If you have any questions about this postmortem, please reach out to me directly.

 

Jason R Seale

Director of Client Support Services

jseale@central1.com | 778.558.5627

Posted Mar 09, 2023 - 16:13 PST

Resolved
Performance has remained stable. A post mortem will be provided in the next few weeks.
Posted Feb 07, 2023 - 16:45 PST
Monitoring
OAuth service has been restored. We will continue to monitor system stability throughout the day.
Posted Feb 07, 2023 - 14:32 PST
Update
C1 technical teams are fully engaged and working to restore service. We will provide our next update by 3:00 pm PT (6:00 pm ET).
Posted Feb 07, 2023 - 14:24 PST
Investigating
Central 1 is aware that OAuth Login is currently down. Members will see a "HOST Timed Out" error. Please note that biometric login is unaffected on mobile app.

Technical teams are investigating with high priority.

We will provide an update by 2:30 pm PT (5:30 pm ET).

Central 1 - Support@central1.com - DigitalBanking_Support@central1.com - 1.888.889.7878
Posted Feb 07, 2023 - 13:34 PST
This incident affected: Incident Alerting.