Alert! - INC146254 - Interac e-Transfer Outage
Incident Report for Central 1
Postmortem

Postmortem: e-Transfer Outage on July 14th | INC146221 - P1

Summary:
On Wednesday, July 13th, Central 1 released an update to the ActiveMQ servers used for queueing e-Transfers. The update caused a memory problem that resulted in outages on July 13th and 14th. The change was rolled back at 12:15 a.m. PT on July 15th, restoring service stability.

Postmortem:
On Wednesday, July 13th, Central 1 deployed changes to digital banking (CHG125476) and the Real Time Engine (RTE, CHG125440) to support an ActiveMQ upgrade. After the change was completed, memory consumption increased (PRB010947), causing several e-Transfer outages (MD, Forge, API, EEA) within the next 24 hours. Each outage was resolved by restarting services (ActiveMQ and PSA servers).

  1. On July 13th between 5:23 and 6:02 p.m. PT (8:23 to 9:02 p.m. ET) all e-Transfer services (EEA, Forge 2.0, MemberDirect and API) were unavailable for 39 minutes (INC146221).
  2. On July 14th between 11:10 and 11:55 a.m. PT (2:10 to 2:55 p.m. ET) there was another service outage lasting 45 minutes (INC146254).
  3. Troubleshooting throughout July 14th failed to determine the root cause, so it was agreed to roll back CHG125440. The rollback was scheduled for 12:15 a.m. PT (3:15 a.m. ET) on July 15, 2022. At 10:50 p.m. PT on July 14th (1:50 a.m. ET on July 15th), the issue surfaced again and another manual restart was completed, resulting in a brief 3-minute outage (INC146348).

After the rollback at 12:15 a.m. PT (3:15 a.m. ET) services were closely monitored throughout the 15th, with no spike in memory consumption or resources observed.

The point of failure was the ActiveMQ configuration, which caused memory to fill to the point of server failure. A change was made to separate batch processing from online e-Transfer processing in ActiveMQ (CHG126835), which has reduced memory consumption to acceptable levels.

Impact Assessment:
Affected Service(s): e-Transfers (V3.4 and lower)

Affected FIs: All e-Transfer customers

Affected End Members: An estimated 4,000 to 6,000 e-Transfers were impacted

Impact windows: A total of 1 hour and 27 minutes (39, 45 and 3 minutes across the three outages)

Central 1 Actions:
PRB010947 – ActiveMQ memory consumption root cause analysis – PS Software
Due Date: November 2022

RITM319782 – Improve QA performance testing for ActiveMQ changes – JR
Due Date: December 2023
• Find ways to improve our performance (capacity) testing, including crash testing and testing the limits of the system.

RITM319783 – Add the JMX Heap Memory monitoring to ongoing 24x7 alerting – Andrew
Due Date: December 2023
• An initial setup produced many false alarms; the thresholds need to be tuned
• Start by just graphing the buckets
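
One common way to cut down on false alarms from heap monitoring is to alert only when usage stays above a threshold for several consecutive samples, rather than on a single spike. The Java sketch below illustrates that idea only; it is not Central 1's monitoring configuration. The class name HeapWatch, the 80% threshold and the 3-sample window are hypothetical values chosen for illustration. It reads the heap of the JVM it runs in through the standard platform MemoryMXBean; watching a remote ActiveMQ broker would need a remote JMX connection, which is not shown.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/**
 * Hypothetical sketch: fire an alert only after heap usage has stayed
 * at or above a threshold for a number of consecutive samples.
 */
public class HeapWatch {
    private final double threshold;    // fraction of heap, e.g. 0.80 (assumed value)
    private final int requiredSamples; // consecutive breaches before alerting (assumed value)
    private int consecutive = 0;

    public HeapWatch(double threshold, int requiredSamples) {
        this.threshold = threshold;
        this.requiredSamples = requiredSamples;
    }

    /** Records one sample and returns true when an alert should fire. */
    public boolean record(double usedFraction) {
        if (usedFraction >= threshold) {
            consecutive++;
        } else {
            consecutive = 0; // a single healthy sample resets the streak
        }
        return consecutive >= requiredSamples;
    }

    /** Heap usage of the local JVM as a fraction of its maximum heap. */
    public static double currentHeapFraction() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        long limit = max > 0 ? max : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) throws InterruptedException {
        HeapWatch watch = new HeapWatch(0.80, 3);
        for (int i = 0; i < 5; i++) {
            double fraction = currentHeapFraction();
            boolean alert = watch.record(fraction);
            // Printing each sample is also enough to start graphing usage.
            System.out.printf("heap=%.1f%% alert=%b%n", fraction * 100, alert);
            Thread.sleep(1000);
        }
    }
}
```

Requiring a streak of breaches trades a short detection delay for fewer pages triggered by normal garbage-collection sawtooth patterns; the right threshold and window would have to come from graphing real broker data first.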

RITM319784 – Improve our release strategy to not complete changes in tandem – JR
Due Date: December 2023
• Decouple the changes to reduce complexity and dependencies in releases.

RITM319785 – Improve monitoring of the e-Transfers service – Andrew
Due Date: December 2023
• Add the JMX Heap Memory monitoring to ongoing 24x7 alerting
• Add new/lower threshold monitoring for memory

RITM319787 – ActiveMQ Improvements – Andrew/Daryl
Due Date: December 2023
• Trace Logging - Move to dedicated consumer/MQ
• Consumer Capacity Management
• Tune the Heap/GC (20G heap is very big)
• Move towards managed ActiveMQ on AWS/Azure

I apologize for the impact this incident had on your members' e-Transfer service. Part of Central 1’s Operational Excellence commitment is to improve the post-production testing of our releases. We know the actions we are taking from this postmortem will help to immediately mitigate incident impacts from server releases such as the ActiveMQ upgrade.

Central 1 completes a significant amount of testing for all of our Digital Banking core releases to help ensure we deploy bug-free code and prevent unintended incidents. We will continue to strive for bug-free releases. If you have any questions about this incident, please don’t hesitate to reach out to me.

Jason Seale
Director, Client Support Services

Posted Nov 04, 2022 - 21:14 PDT

Resolved
The system has been stable for the past few hours. We did see a few timeout errors; however, they were unrelated to the original incident. The outage was from 11:10 to 11:55 a.m. PT (2:10 to 2:55 p.m. ET).
Posted Jul 14, 2022 - 14:48 PDT
Monitoring
Service has been restored. We are monitoring.
Posted Jul 14, 2022 - 12:07 PDT
Investigating
Please be advised that Central 1 is experiencing an outage with e-Transfers from 11:16 am PT (2:16 pm ET). The outage affects MemberDirect, Forge, and API clients. Technical teams are engaged and are working to resolve the incident.

We will provide the next update at or before 12:45pm PT (3:45pm ET).

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Jul 14, 2022 - 11:46 PDT
This incident affected: Digital Banking Services and Incident Alerting.