Alert! -INC151943 - Interac E-transfers error 910 and 914
Incident Report for Central 1
Postmortem

Postmortem: e-Transfer service degradation | INC151943 P2

Summary:
On Wednesday, January 4, e-Transfer services experienced significant degradation from shortly after 6 p.m. PT (9 p.m. ET) until 6:40 p.m. PT (9:40 p.m. ET). During this time, up to 70% of online customers would have received generic “Try again later” error messages when sending or receiving e-Transfers, due to latency within the e-Transfer system. A restart of the application was completed to fully recover service.

Postmortem:
On Wednesday, January 4, the first alert from Central 1’s monitoring was sent at 5:30 p.m. PT (8:30 p.m. ET). This alert provided steps to validate the 910/914 errors. Although there were 910/914 errors before 5:30 p.m. PT (8:30 p.m. ET), they accounted for less than 1% of transactions. During the investigation, the impact of latency in the e-Transfer systems grew to approximately 5% of transactions, and additional support teams were engaged to investigate. At 6:05 p.m. PT (9:05 p.m. ET), an incident manager was engaged and a P2 incident was opened as the downstream systems became impacted by the latency and service degradation went from <5% to ~70%.

A client notification was posted to the Central 1 Status Page at 6:28 p.m. PT (9:28 p.m. ET). At 6:40 p.m. PT (9:40 p.m. ET), a decision was made to restart the application identified as the cause of the downstream impacts, and the restart resolved the incident. The restart did not stop e-Transfer service and allowed services to recover, avoiding a complete outage. After system stability was observed, our resolution notice was posted to the Central 1 Status Page at 6:54 p.m. PT (9:54 p.m. ET).

Known point of failure:
Central 1 is still investigating (PRB011030) the point of failure to determine the root cause. At this time, our review of the major systems within our e-Transfer architecture shows that no changes were implemented. There were no spikes in performance, no saturation of services, no increase in memory, and no other signs of a point of failure.

A deep analysis was completed on all participating systems used within the e-Transfer infrastructure, including network, servers, applications, and databases. At this time, Central 1 can identify a point of failure within our EMT application, a component of the e-Transfer system that became latent due to a high CPU load. A review of the CPU load on our EMT application found that it had been growing steadily over a period of 60 days, although never above any known critical threshold (the maximum reached was 40% on the day of the e-Transfer incident, before spiking). This led the investigation team to review a threading problem in which not all threads were being released by the application. Our investigation determined that a monthly restart of this application will prevent any threading or CPU problems while we continue our investigation (see “Actions” below).

New logging has been added to assist with our continuing triage.

Impact Assessment:
Affected Service(s): e-Transfers send and receive
Affected FIs: All e-Transfer (ISO8583) clients
Affected End Members: ~8000
Incident opened at 2023-01-04 18:07 PT and resolved 2023-01-04 18:45 PT

Actions:
PRB011030 – Ongoing e-Transfer Outage Investigation
• Due by: End of Q1 2023
• Continue to review threading problem for permanent fix.
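
A threading problem of the kind under review, where threads are started but not released, can often be spotted by sampling the live thread count over time. The sketch below is an illustration under stated assumptions: the report does not say which runtime the EMT application uses, so Python's `threading` module stands in for it, and `looks_like_thread_leak` is a hypothetical helper, not part of any Central 1 system.

```python
import threading


def looks_like_thread_leak(counts, min_samples=5):
    """Flag a series of thread-count samples that never drops and ends higher.

    A healthy pool rises and falls with load; a count that only ever
    grows across many samples suggests threads are not being released.
    """
    if len(counts) < min_samples:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]


# Demonstration: each simulated request starts a worker that is never released.
release = threading.Event()
workers = []
counts = []
for _ in range(5):
    worker = threading.Thread(target=release.wait, daemon=True)
    worker.start()
    workers.append(worker)
    counts.append(threading.active_count())

leak_suspected = looks_like_thread_leak(counts)
print("Leak suspected:", leak_suspected)

# Release and join the demonstration threads so the script exits cleanly.
release.set()
for worker in workers:
    worker.join()
```

In practice the samples would be taken minutes or hours apart and logged, so that the growth can be compared against restarts and traffic patterns.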

RITM325404 – e-Transfer system monitoring
• Due by: End of Q1 2023
• Implement additional monitoring to detect, and provide greater advance notice of, a recurrence of this incident.
• Review current monitoring thresholds of error limits.
• Include additional instructions in alerting for quicker triage.
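
As an illustration of what a review of error-limit thresholds might consider, the sketch below tracks the error rate over a sliding window of recent transactions and maps it to an alert level. It is a sketch under assumptions, not the planned implementation: the class name, the window size, and the 1% and 5% limits are hypothetical, with the limits chosen only because those figures appear in the timeline above.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the failure percentage over the most recent transactions."""

    def __init__(self, window_size=1000, warn_pct=1.0, critical_pct=5.0):
        # deque with maxlen discards the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.warn_pct = warn_pct
        self.critical_pct = critical_pct

    def record(self, failed):
        """Record one transaction outcome (True means it failed)."""
        self.outcomes.append(bool(failed))

    def error_pct(self):
        """Return the failure percentage within the current window."""
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def level(self):
        """Map the current error percentage to an alert level."""
        pct = self.error_pct()
        if pct >= self.critical_pct:
            return "critical"
        if pct >= self.warn_pct:
            return "warning"
        return "ok"


# Usage with made-up traffic: 97 successes and 3 failures in a 100-item window.
monitor = ErrorRateMonitor(window_size=100)
for _ in range(97):
    monitor.record(False)
for _ in range(3):
    monitor.record(True)
first_level = monitor.level()

# Seven more failures push out the seven oldest successes.
for _ in range(7):
    monitor.record(True)
second_level = monitor.level()
print(first_level, second_level)
```

A window based on recent transactions responds to a rising error rate within minutes, whereas a daily average would dilute a short, sharp degradation like this one.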

RITM325405 - Risk Registry Update
• Due by: End of February
• For new features: stress test environments against projected load, document limitations, and implement monitoring to detect when those limitations are being approached.

RITM325406 - Improvements to the e-Transfer system
• Due by: End of Q1 2023
• Improve application resiliency in the event of component failure.

RITM325407 – Improvements to ActiveMQ for managing e-Transfer messages
• Due by: End of Q1 2023
• Improve EMT servers and ActiveMQ server load management.


Central 1 is continuing to improve our application resiliency on all services and products through ongoing evaluation and improvement initiatives. We are also prioritizing improvements to our incident response so that, in case of service disruption, we can respond faster, communicate with improved clarity and urgency, and reduce system downtime.
Please reach out to me if you require any further information concerning this incident.

Jason Seale
Director, Client Support Services
jseale@central1.com | 778.558.5627

Posted Jan 20, 2023 - 09:08 PST

Resolved
Please note services have been restored as of 6:45 pm PT (9:45 pm ET). An emergency EMT app restart was performed and that resolved the incident. Please contact Support if you experience any further issues.
Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Jan 04, 2023 - 18:54 PST
Investigating
Central 1 is observing Interac timeouts and errors 914 and 910 that appear when users send and receive e-Transfers, starting at approximately 5:00 p.m. PT (8:00 p.m. ET) today.
Central 1 is actively investigating and will provide an update on or before 7:30 p.m. PT (10:30 p.m. ET).


Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Jan 04, 2023 - 18:28 PST
This incident affected: Digital Banking Services and Incident Alerting.