Alert! - INC146386 – Several Central 1 Services Unavailable
Incident Report for Central 1
Postmortem

Postmortem: INC146386 – Central 1 Digital Banking, Payments and Treasury Service Outage

Summary:

On Friday, July 15, Central 1 experienced a data center outage from approximately 7:19 to 8:30 a.m. PT (10:19 to 11:30 a.m. ET). The incident significantly disrupted applications for all of Central 1’s clients, including Retail and Small Business Banking, MemberDirect Business Banking, and e-Transfers. Central 1’s redundant service connections restored most services within the first hour, and the remaining services and customers recovered by 9:30 a.m. PT (12:30 p.m. ET), with timing depending on their configurations and services.

Impact Assessment:

Affected Service(s): C1/IT services, Payments, Treasury and Digital Banking

Affected FIs: All C1 clients

Opened: 2022-07-15 07:35 PT; resolved: 2022-07-15 09:30 PT
• The Retail/Small Business outage duration differed for each credit union, as impact depended on geolocation, use of OAuth services and caching. All service impact began at 7:19 a.m. PT (10:19 a.m. ET). Services began recovering at 8 a.m. PT (11 a.m. ET) after webservices was failed over. MDB was sporadically unavailable until the failover was completed at 9:20 a.m. PT (12:20 p.m. ET).
• The Central 1 e-Transfers outage lasted approximately 1 hour, from 7:30 to 8:30 a.m. PT (10:30 to 11:30 a.m. ET).
• PaymentStream Direct (PSD) services, including PS-AFT and PS-Wires, were unavailable until 9:20 a.m. PT (12:20 p.m. ET).
o NOTE: Services such as e-Transfers, Wires and PS-AFT are queued; incoming transfers were held in the queue and processed successfully after services recovered.
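
The queuing behaviour described in the note can be pictured with a minimal sketch. It is illustrative only: the TransferQueue class, its method names and the sample transfer IDs are hypothetical and do not describe how Central 1's payment systems are actually built.

```python
from collections import deque


class TransferQueue:
    """Hypothetical store-and-forward queue; not Central 1's implementation."""

    def __init__(self):
        self.available = True   # whether the downstream processor is reachable
        self.pending = deque()  # transfers held while the processor is down
        self.processed = []     # transfers handed to the processor, in order

    def submit(self, transfer_id):
        # Always enqueue first, then process only if the service is available.
        self.pending.append(transfer_id)
        if self.available:
            self.drain()

    def drain(self):
        # Process queued transfers in arrival order (first in, first out).
        while self.pending:
            self.processed.append(self.pending.popleft())

    def recover(self):
        # Called when the service comes back; the backlog is then processed.
        self.available = True
        self.drain()


queue = TransferQueue()
queue.submit("T1")          # processed immediately
queue.available = False     # outage begins
queue.submit("T2")          # held in the queue
queue.submit("T3")          # held in the queue
held_during_outage = list(queue.pending)
queue.recover()             # outage ends, backlog is processed in order
```

In this sketch, transfers submitted during the outage are delayed but not lost, which matches the outcome reported for queued services.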

Terminology:

F5 – F5 Networks is a company that produces load balancers (BIG-IP). A load balancer is a device that distributes network or application traffic across a cluster of servers. Load balancing improves responsiveness and increases availability of applications.
SSO – Single sign-on (SSO) is an authentication method that enables users to securely authenticate with multiple applications and websites by using just one set of credentials.
Site24x7 – Site24x7 is a cloud-based website and server monitoring platform that helps monitor websites, servers, clouds, networks and applications.
PSD – PaymentStream Direct (PSD) is a user interface that provides access to Central 1’s payments products, MemberDirect system support features, and related administration functions via a single sign-on to the secure site. The options you see on PaymentStream Direct are determined by the functions that your financial institution implements, and your personal access rights. Popular services are PS-Wires and PS-AFT (Automated Funds Transfer, a debit or credit transaction).
UCP – The Universal Connectivity Proxy (UCP) provides XML/HTTP traffic routing. It is typically used for Central 1 third-party applications and services.
VIP – A virtual IP address (VIP or VIPA) is an IP address that does not correspond to a physical network interface. VIP addresses are also used for connection redundancy by providing alternative fail-over options for one machine.
Data center – A data center is a facility that centralizes an organization's shared IT operations and equipment for the purposes of storing, processing, and disseminating data and applications. Because they house an organization's most critical and proprietary assets, data centers are vital to the continuity of daily operations.
DNS – The Domain Name System (DNS) turns domain names (for example, www.central1.com) into IP addresses, which browsers use to load internet pages. Every device connected to the internet has its own IP address, which is used by other devices to locate the device.
Firewall – A firewall is a network security device that monitors and filters incoming and outgoing network traffic based on an organization's previously established security policies.

Postmortem:

On Friday, July 15, starting at 7:19 a.m. PT (10:19 a.m. ET), Central 1 experienced an outage in our Vancouver Data Center (VAHC). Because the F5 load balancer did not automatically fail active/active services over to our Toronto Data Center (TOHC), the outage significantly disrupted applications for all of Central 1’s clients, including Retail and Small Business Banking, MemberDirect Business Banking, and e-Transfers. Central 1’s services began moving to their redundant connections, and IT coordinated a manual failover of the remaining services, which was completed by 9:30 a.m. PT (12:30 p.m. ET). The root cause was not understood until after Central 1 had resolved the incident.

Root cause analysis:

The root cause was identified with the assistance of the F5 technical support team. Working with the Central 1 network team, F5 found a misconfiguration that was pushed live at 7:19 a.m. PT (10:19 a.m. ET): a Central 1 resource inadvertently applied a global configuration change (CHG117457) to our F5 load balancer, which stopped communication to our application services in VAHC. The change affected the health checks the F5 performs on application servers to monitor their availability. As a result, many servers appeared "unhealthy", the F5 appliance suspended traffic to them, and multiple services were affected. Because this was a misconfiguration and not an actual F5 device failure, the system did not automatically fail over to C1's High Availability standby device at the same Data Centre.
To resolve the incident, Central 1 teams manually removed the F5 monitors, application by application, to disable the F5’s health checks. This was done one application at a time in order of criticality: client-facing applications first and C1 internal-facing applications last.
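
The failure mode and the workaround described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not F5 configuration or F5's API: the Pool class, the monitor functions and the server names are hypothetical. It also assumes that a pool with no monitor treats its members as available and that device-level failover reacts only to a device failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Pool:
    """Hypothetical pool of application servers behind a load balancer."""
    name: str
    members: Dict[str, bool]  # server name -> True if the server is really up
    monitor: Optional[Callable[[bool], bool]] = None

    def available_members(self):
        # Assumption: with no monitor attached, every member is treated
        # as available and continues to receive traffic.
        if self.monitor is None:
            return sorted(self.members)
        return sorted(s for s, up in self.members.items() if self.monitor(up))


def working_monitor(server_is_up):
    # A correct health check reports the real state of the server.
    return server_is_up


def misconfigured_monitor(server_is_up):
    # Assumed failure mode: the check itself is wrong, so it reports
    # "unhealthy" even though the server is up.
    return False


def device_failover_needed(device_is_up):
    # Assumption: failover to the standby device is triggered by a device
    # failure only, not by pool members that appear unhealthy.
    return not device_is_up


pool = Pool("webservices", {"app1": True, "app2": True}, working_monitor)
before = pool.available_members()

pool.monitor = misconfigured_monitor           # the inadvertent change
during = pool.available_members()
failover = device_failover_needed(device_is_up=True)

pool.monitor = None                            # the manual workaround
after = pool.available_members()
```

In this model the servers never go down. Only the check changes, which is why traffic stops while the device itself stays healthy and no automatic failover occurs.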
The high availability standby configuration in our F5 at the VAHC data centre was restored on Tuesday, July 19, at 1 a.m. PT (4 a.m. ET), after services had remained stable over the weekend. Additionally, a temporary change freeze was in place from the incident date until July 19. To further protect stability, additional change restrictions are being contemplated for the month of August, during peak staff absences.

Actions:

RITM312512 – Review definition and qualifications for “routine changes” – Change Manager
RITM312513 – Review mandatory fields for all changes – Change Manager
RITM312515 – Network F5 configuration setting improvements – Network
RITM312744 – Improve data center service redundancy – Business Continuity
RITM312516 – Update the webservices VIP configuration to be globally balanced – Platform
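
As a rough illustration of what "globally balanced" could mean for RITM312516, the sketch below picks a data center based on how many pool members are available at each site. The site names VAHC and TOHC come from this report. The pick_site function, its parameters and the member counts are hypothetical and do not represent the planned configuration.

```python
def pick_site(sites, preferred):
    """Return a site that has at least one available pool member.

    `sites` maps a site name to its number of available pool members.
    The preferred site is used when it can serve traffic; otherwise the
    first other site with available members is chosen.
    """
    if sites.get(preferred, 0) > 0:
        return preferred
    for name, available in sites.items():
        if name != preferred and available > 0:
            return name
    return None  # no site can serve traffic


normal = pick_site({"VAHC": 2, "TOHC": 2}, preferred="VAHC")
incident = pick_site({"VAHC": 0, "TOHC": 2}, preferred="VAHC")
total_outage = pick_site({"VAHC": 0, "TOHC": 0}, preferred="VAHC")
```

Under these assumptions, a site whose members all appear unavailable is bypassed automatically, without waiting for a manual failover.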

We recognize the significance of this service disruption and are committed to serving you better. We have taken the above actions to help prevent issues such as this from happening in the future. If you have any questions about the content of this postmortem, please contact me directly.

Sincerely,

Jason Seale, PMP
Director, Client Support Services

Posted Jul 29, 2022 - 10:48 PDT

Resolved
All Central 1 production and QA services across payment, treasury, and digital banking continue to remain stable.

Central 1 has identified the root cause and will be working diligently to prepare a comprehensive postmortem in the coming weeks to demonstrate the lessons learned and new procedures and tools we can implement to mitigate these occurrences.

Some of our services will remain in their current disaster recovery state to avoid any additional changes/risk to our infrastructure.

We will move back to our primary configuration on Tuesday, July 19, at 1 a.m. PT (4 a.m. ET).

Also, all changes are currently frozen at Central 1 until a review can be completed on Monday.

We understand the impact these outages have on your business and your members and will provide our update soon.
Posted Jul 15, 2022 - 14:17 PDT
Update
All payment, treasury, and digital banking services continue to remain stable.

Central 1 is working with our load balancing vendor on root cause and to ensure production stability. Central 1 teams will continue to monitor and investigate throughout the day.

We will provide another update in 2 hours.
Posted Jul 15, 2022 - 12:00 PDT
Monitoring
Central 1 has successfully recovered all payment, treasury and digital banking services.

We are currently working to recover our QA/Test environments.

A ticket has been opened with our load balancing vendor and the Central 1 team will continue our investigation throughout the day.

We will provide another update in 60 minutes.
Posted Jul 15, 2022 - 11:01 PDT
Update
Central 1 has successfully recovered most payments, treasury and digital banking services. We are currently working to recover our remaining services, which include Deposit Anywhere and Branch Capture.

e-Transfers and Digital Banking 2SV login have been working since 9:30 a.m. PT.

We will provide another update in 30 minutes.
Posted Jul 15, 2022 - 10:33 PDT
Identified
Central 1 has identified a problem with our service load balancer. We have begun failing services over to our Eastern data centers and are seeing recovery for a significant portion of Retail and Small Business Banking, as well as real time payments.

PaymentStream Direct, CBS, and Treasury Connect are now accessible.
Posted Jul 15, 2022 - 09:45 PDT
Update
Technical teams continue to triage and will share another update within 30 minutes.
Posted Jul 15, 2022 - 09:03 PDT
Update
We are experiencing an infrastructure outage which is affecting the availability of many core services including Online Banking (both Forge and MemberDirect), PaymentStream Direct, User Management, Origination Solutions, Treasury Connect, and CBS Online.

We are continuing to triage and will share another update within 30 minutes.
Posted Jul 15, 2022 - 08:34 PDT
Investigating
Central 1 is aware that several services are unavailable affecting Online Banking, eTransfers, PaymentStream Direct... Technical teams are investigating with a high priority.

We will provide an update by 8:30 am PT (11:30 am ET).


Central 1 - Support@central1.com - DigitalBanking_Support@central1.com - 1.888.889.7878
Posted Jul 15, 2022 - 07:56 PDT
This incident affected: Treasury Services, Digital Banking Services, and Incident Alerting.