Postmortem: Incident – INC157454 – CAS Production Day End Job failed - High (P2)
Overview
On Thursday, April 20, 2023, at 4 a.m. PT, the new banking system, CAS, sent a PagerDuty alert advising that the CAS production end-of-day (EOD) job had failed. The EOD job was run twice in error, causing the transactions on the banking reports and files, along with CAS online, to display April 21st instead of April 19th.
Due to the date error, all the banking reports were removed from the FTP folders, and the banking system was taken offline while the database was restored to the 4 a.m. PT (7 a.m. ET) snapshot. Once the restore was complete, the EOD job was run, which set the database to the correct date. Once the date was confirmed in CAS, the reports, PREC, and OREC files were rerun and placed in the credit unions' FTP folders.
While CAS was offline for the snapshot restore, credit unions could not access CBS Online. Outgoing wires could not post to CAS (INC157479), which delayed wires from being sent to other financial institutions (FIs).
CAS was offline for 1 hour and 31 minutes, from 5:13 a.m. to 6:44 a.m. PT (8:13 a.m. to 9:44 a.m. ET).
The root cause of this incident was that a CAS end-of-day step failed and did not retry because of a missing exception lock. During recovery, the team decided to restart the job normally, expecting it to resume from the last failed step. Instead, the job started from the very beginning, which caused the day end to run again and move to the following day, resulting in the incorrect system date.
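To illustrate the failure mode described in the root cause, the following is a minimal sketch of an end-of-day runner that retries a failed step, resumes from the last completed step on restart, and refuses to run again once the day has been closed. It is not the CAS implementation. All names (JobState, run_eod, the step names, max_attempts) are hypothetical, and the state is held in memory here, whereas a real job would need to persist it.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, Dict, List, Tuple


@dataclass
class JobState:
    """Hypothetical job state; a real system would persist this durably."""
    business_date: date
    # Names of the steps already completed, keyed by the date being closed.
    completed_steps: Dict[date, List[str]] = field(default_factory=dict)


Step = Tuple[str, Callable[[JobState], None]]


def run_eod(state: JobState, run_for: date, steps: List[Step],
            max_attempts: int = 3) -> None:
    """Run the end-of-day steps for run_for, resuming after a failure."""
    # Guard: if the system date has already moved on, the day was closed
    # by an earlier run, so a restart must not roll the date a second time.
    if state.business_date != run_for:
        raise RuntimeError(
            f"EOD for {run_for} refused: system date is {state.business_date}"
        )

    done = state.completed_steps.setdefault(run_for, [])
    for name, step in steps:
        if name in done:
            continue  # Resume: skip steps that finished before the failure.
        for attempt in range(1, max_attempts + 1):
            try:
                step(state)
                break
            except Exception:
                # Retry transient failures; give up after the last attempt.
                if attempt == max_attempts:
                    raise
        done.append(name)  # Checkpoint the step only after it succeeds.

    # Advance the system date exactly once, after every step has succeeded.
    state.business_date = run_for + timedelta(days=1)
```

With this design, a restart after a mid-job failure skips the steps already recorded as complete, and a restart after a successful run raises an error instead of advancing the date again.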
Actions
Pending
RITM333831 – (CASM-3674) Application enhancements to error handling on the end-of-day job – Q2
Complete
PRB011107 – What caused the EOD job to fail?
A CAS end-of-day step failed and did not retry because of a missing exception lock. During recovery, the team decided to restart the job normally, expecting it to resume from the last failed step. Instead, the job started from the very beginning, which caused the day end to run again and move to the following day.
CHG135404 – Deployed hotfix/1.17.11 to production on Thu, Apr 20
· Fixed the day-end locking issue.
· Fixed the retry not working when a job raises an exception during the update of Accounts_Net_Totals.
RITM334278 – Update the Standard Operating Procedures to clearly indicate how to rerun the EOD job