Postmortem: e-Transfer Outage on July 14th | INC146221 - P1
Summary:
On Wednesday, July 13th, Central 1 released an update to our ActiveMQ servers used for queueing e-Transfers. The update caused a memory problem that resulted in two outages, on July 13th and 14th. The change was rolled back at 12:15 a.m. PT on July 15th, restoring service stability.
Postmortem:
On Wednesday, July 13th, Central 1 deployed changes to digital banking (CHG125476) and the Real Time Engine (RTE, CHG125440) to support an ActiveMQ upgrade. After the change was completed, memory consumption increased (PRB010947), causing several e-Transfer outages (MD, Forge, API, EEA) within the next 24 hours. Each outage was resolved by restarting services (ActiveMQ and PSA servers).
After the rollback at 12:15 a.m. PT (3:15 a.m. ET) on July 15th, services were closely monitored throughout the 15th, and no spikes in memory consumption or resource usage were observed.
The point of failure was the ActiveMQ configuration, which allowed memory to fill to the point of server failure. A change was then made to separate batch processing from online e-Transfers in ActiveMQ (CHG126835), which has reduced memory consumption to acceptable levels.
Impact Assessment:
Affected Service(s): e-Transfers (V3.4 and lower)
Affected FIs: All e-Transfer customers
Affected End Members: An estimated 4,000 to 6,000 e-Transfers were impacted
Impact windows: A total of 1 hour and 7 minutes
Central 1 Actions:
PRB010947 – ActiveMQ memory consumption root cause analysis – PS Software
Due Date: November 2022
RITM319782 – Improve QA performance testing for ActiveMQ changes – JR
Due Date: December 2023
• Improve our performance (capacity) testing, including crash testing and testing the limits of the system.
RITM319783 – Add the JMX Heap Memory monitoring to ongoing 24x7 alerting – Andrew
Due Date: December 2023
• An initial attempt to set this up generated many false alarms; the alerts need to be tuned
• Start by just graphing the buckets
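As an illustration of the "graph first, alert later" approach, the following is a minimal sketch rather than the monitoring Central 1 actually runs. It assumes "buckets" means ranges of heap utilization (here, 10% steps), and it reads the heap of the local JVM through the standard platform MemoryMXBean; monitoring a broker would normally go through a remote JMX connection, which is omitted. The class name HeapBucketSampler and its methods are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: counts heap-usage samples per 10% utilization bucket
// so the distribution can be graphed before any alert thresholds are chosen.
class HeapBucketSampler {
    private final int[] buckets = new int[10];

    // Maps a used/max ratio to a bucket index 0..9 (0 = 0-10%, 9 = 90-100%).
    static int bucketFor(double ratio) {
        if (ratio < 0.0) {
            ratio = 0.0;
        }
        int idx = (int) (ratio * 10);
        return Math.min(idx, 9);
    }

    // Records one sample in its bucket.
    void record(double ratio) {
        buckets[bucketFor(ratio)]++;
    }

    // Returns how many samples have landed in the given bucket.
    int count(int bucket) {
        return buckets[bucket];
    }

    // Reads the current heap utilization of this JVM (assumption: local JVM only).
    static double currentHeapRatio() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        long max = heap.getMax();
        if (max <= 0) {
            // getMax() returns -1 when the maximum is undefined; fall back to committed.
            max = heap.getCommitted();
        }
        return (double) heap.getUsed() / max;
    }
}
```

Calling `record(HeapBucketSampler.currentHeapRatio())` on a fixed schedule and plotting the per-bucket counts would show where heap usage normally sits, which is the information needed to pick thresholds that do not fire constantly.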
RITM319784 – Improve our release strategy to not complete changes in tandem – JR
Due Date: December 2023
• Decoupled the changes for less complexity and fewer dependencies in releases.
RITM319785 – Improve monitoring of the e-Transfers service – Andrew
Due Date: December 2023
• Add the JMX Heap Memory monitoring to ongoing 24x7 alerting
• Add new/lower threshold monitoring for memory
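One possible shape for the new/lower threshold is a two-tier policy that only raises an alert after several consecutive samples exceed a limit, which also addresses the false alarms noted earlier. The sketch below is illustrative only: the class name HeapAlertPolicy, the threshold values, and the consecutive-sample count are all assumptions, not settings taken from this incident.

```java
// Hypothetical two-tier heap alert policy. Thresholds and the number of
// consecutive samples required are illustrative assumptions.
class HeapAlertPolicy {
    enum Level { OK, WARN, CRITICAL }

    private final double warnRatio;
    private final double criticalRatio;
    private final int requiredConsecutive;
    private int warnStreak = 0;
    private int criticalStreak = 0;

    HeapAlertPolicy(double warnRatio, double criticalRatio, int requiredConsecutive) {
        this.warnRatio = warnRatio;
        this.criticalRatio = criticalRatio;
        this.requiredConsecutive = requiredConsecutive;
    }

    // Feeds one heap-utilization sample (used/max) and returns the alert level.
    // A level is reported only after enough consecutive samples exceed its
    // threshold; a single sample below the threshold resets that streak.
    Level observe(double ratio) {
        warnStreak = ratio >= warnRatio ? warnStreak + 1 : 0;
        criticalStreak = ratio >= criticalRatio ? criticalStreak + 1 : 0;
        if (criticalStreak >= requiredConsecutive) {
            return Level.CRITICAL;
        }
        if (warnStreak >= requiredConsecutive) {
            return Level.WARN;
        }
        return Level.OK;
    }
}
```

For example, `new HeapAlertPolicy(0.70, 0.90, 3)` would warn after three samples in a row at or above 70% heap usage and escalate after three in a row at or above 90%, while a brief spike that clears within one or two samples raises nothing.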
RITM319787 – ActiveMQ Improvements – Andrew/Daryl
Due Date: December 2023
• Trace Logging - Move to dedicated consumer/MQ
• Consumer Capacity Management
• Tune the Heap/GC (20G heap is very big)
• Move towards managed ActiveMQ on AWS/Azure
I apologize for the impact this incident had on your members’ e-Transfer service. Part of Central 1’s Operational Excellence commitment is to improve our post-production testing of our releases. We are confident that the actions we are taking from this postmortem will help to immediately mitigate the impact of incidents arising from server releases such as the ActiveMQ update.
Central 1 completes a significant amount of testing for all of our Digital Banking core releases to help ensure we deploy bug-free code and prevent unintended incidents. We will continue to strive for bug-free releases. If you have any questions about this incident, please don’t hesitate to reach out to me.
Jason Seale
Director, Client Support Services