On Friday, July 15, Central 1 experienced a data center outage from approximately 7:19 to 8:30 a.m. PT (10:19 to 11:30 a.m. ET). The incident caused a significant disruption to applications for all of Central 1’s clients, including Retail and Small Business Banking, MemberDirect Business Banking, and e-Transfers. Central 1’s redundant service connections restored most services within the first hour; the remaining services and customers recovered by 9:30 a.m. PT (12:30 p.m. ET), with timing varying by configuration and service.
Affected Service(s): C1/IT services, Payments, Treasury and Digital Banking
Affected FIs: All C1 clients
Opened: 2022-07-15 07:35 PT; resolved: 2022-07-15 09:30 PT
• The Retail/Small Business outage duration differed for each credit union, as impact depended on your geolocation, use of OAuth services, and caching. All service impact began at 7:19 a.m. PT (10:19 a.m. ET). Services began recovering at 8 a.m. PT (11 a.m. ET) after webservices was failed over. MDB was sporadically unavailable until the failover was completed at 9:20 a.m. PT (12:20 p.m. ET).
• The Central 1 e-Transfers outage lasted approximately 1 hour, from 7:30 to 8:30 a.m. PT (10:30 to 11:30 a.m. ET).
• PaymentStream Direct (PSD) services, including PS-AFT and PS-Wires, were unavailable until 9:20 a.m. PT (12:20 p.m. ET).
o NOTE: Services such as e-Transfers, Wires, and PS-AFT use queuing; incoming transfers were queued during the outage and successfully processed after services recovered.
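The queuing behaviour described in the note above can be illustrated with a minimal sketch. It assumes a simple in-memory first-in, first-out queue; the names `TransferQueue`, `submit`, and `drain` are hypothetical and are not taken from Central 1's systems.

```python
from collections import deque


class TransferQueue:
    """Hypothetical store-and-forward queue: holds transfers while the
    downstream service is unavailable and replays them in arrival order."""

    def __init__(self):
        self.pending = deque()    # transfers waiting for the service
        self.processed = []       # transfers delivered downstream
        self.service_up = True

    def submit(self, transfer_id):
        # Process immediately when the service is up; otherwise queue.
        if self.service_up:
            self.processed.append(transfer_id)
        else:
            self.pending.append(transfer_id)

    def drain(self):
        # Called once the service recovers: replay everything in FIFO order.
        self.service_up = True
        while self.pending:
            self.processed.append(self.pending.popleft())


queue = TransferQueue()
queue.submit("T1")           # processed immediately
queue.service_up = False     # outage begins (illustrative)
queue.submit("T2")           # queued
queue.submit("T3")           # queued
queue.drain()                # service recovers, backlog is replayed
print(queue.processed)
```

In this sketch no transfer is lost during the simulated outage; the queued items are simply delayed until `drain` runs.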
F5 – F5 Networks is the company that produces BIG-IP load balancers. A load balancer is a device that distributes network or application traffic across a cluster of servers. Load balancing improves responsiveness and increases availability of applications.
SSO – Single sign-on (SSO) is an authentication method that enables users to securely authenticate with multiple applications and websites by using just one set of credentials.
Site24x7 – Site24x7 is a cloud-based website and server monitoring platform that helps monitor websites, servers, clouds, networks and applications.
PSD – PaymentStream Direct (PSD) is a user interface that provides access to Central 1’s Payments products, MemberDirect system support features, and related administration functions via a single sign-on to the secure site. The options you see on PaymentStream Direct are determined by the functions that your financial institution implements and by your personal access rights. Popular services are PS-Wires and PS-AFT (Automated Funds Transfer, processed as a debit or credit transaction).
UCP – Universal Connectivity Proxy (UCP) provides XML/HTTP traffic routing. It is typically used for Central 1 third-party applications and services.
VIP – A virtual IP address (VIP or VIPA) is an IP address that does not correspond to a physical network interface. VIP addresses are also used for connection redundancy by providing alternative failover options for one machine.
Data center – A data center is a facility that centralizes an organization's shared IT operations and equipment for the purposes of storing, processing, and disseminating data and applications. Because they house an organization's most critical and proprietary assets, data centers are vital to the continuity of daily operations.
DNS – The Domain Name System (DNS) turns domain names (for example, www.central1.com) into IP addresses, which browsers use to load internet pages. Every device connected to the internet has its own IP address, which is used by other devices to locate the device.
Firewall – A firewall is a network security device that monitors and filters incoming and outgoing network traffic based on an organization's previously established security policies.
On Friday, July 15, starting at 7:19 a.m. PT (10:19 a.m. ET), Central 1 experienced a data center outage in our Vancouver Data Center (VAHC). Because the F5 load balancer did not automatically fail active/active service over to our Toronto Data Center (TOHC), the outage caused a significant disruption to applications for all of Central 1’s clients, including Retail and Small Business Banking, MemberDirect Business Banking, and e-Transfers. Central 1’s services began moving to their redundant connections, and the IT team coordinated a manual failover of the remaining services, which was completed by 9:30 a.m. PT (12:30 p.m. ET). The root cause was not understood until after Central 1 had resolved the incident.
The root cause was identified with the assistance of the F5 technical support team. Working with the Central 1 network team, F5 found a misconfiguration that was pushed live at 7:19 a.m. PT (10:19 a.m. ET): a Central 1 resource inadvertently applied a global configuration change (CHG117457) to our F5 load balancer, which stopped communication to our application services in VAHC. The change affected the health check that the F5 performs on application servers to monitor their availability. As a result, many servers appeared "unhealthy", the F5 appliance suspended traffic to them, and multiple services were affected. Because this was a misconfiguration and not an actual F5 device failure, the system did not automatically fail over to C1's High Availability standby device at the same Data Centre.
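The failure mode described above can be sketched in a few lines. This is an illustrative model only, not F5 configuration or Central 1's setup: the `Server` type, the function names, and the use of a wrong monitor port as the stand-in misconfiguration are all assumptions, since the report does not detail the actual change.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    listen_port: int   # port where the application actually answers


def health_check(server, monitor_port):
    # In this model a monitor passes only if it probes the right port.
    return server.listen_port == monitor_port


def available_members(pool, monitor_port):
    # The load balancer sends traffic only to members passing the monitor.
    return [s.name for s in pool if health_check(s, monitor_port)]


def device_failover_needed(device_alive):
    # Device-level high availability reacts to the appliance itself
    # failing, not to the health of the servers behind it.
    return not device_alive


pool = [Server("app1", 443), Server("app2", 443)]

before = available_members(pool, monitor_port=443)    # correct monitor
after = available_members(pool, monitor_port=8443)    # misconfigured monitor

print(before)                         # both servers receive traffic
print(after)                          # no servers receive traffic
print(device_failover_needed(True))   # appliance is up, so no failover
```

The servers never stop working in this sketch; only the monitor's view of them changes, which is why traffic stops while the standby device stays idle.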
To resolve the incident, Central 1 teams manually removed the F5 monitors, application by application, to disable the F5’s health checks. This was done one application at a time, prioritizing client-facing applications in order of criticality, with C1 internal-facing applications completed last.
The high availability standby configuration in our F5 at our VAHC data centre was restored on Tuesday, July 19, at 1 a.m. PT (4 a.m. ET), after a weekend of stable operation. Additionally, a temporary change freeze was in place from the incident date until July 19. To support continued stability, additional change restrictions are being considered for the month of August, during peak staff absences.
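The ordering used during recovery can be expressed as a simple sort. The application names and criticality ranks below are hypothetical placeholders, and the sketch shows only the prioritization logic, not the actual remediation tooling.

```python
# Hypothetical applications with illustrative criticality ranks (1 = highest).
applications = [
    {"name": "internal-tool", "client_facing": False, "criticality": 1},
    {"name": "online-banking", "client_facing": True, "criticality": 1},
    {"name": "payments-portal", "client_facing": True, "criticality": 2},
]


def remediation_order(apps):
    # Client-facing applications first (False sorts before True),
    # then by criticality rank within each group.
    return sorted(apps, key=lambda a: (not a["client_facing"], a["criticality"]))


order = [a["name"] for a in remediation_order(applications)]
print(order)
```

With these sample values, the internal tool is handled last even though its criticality rank is high, matching the client-facing-first rule.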
RITM312512 – Review definition and qualifications for “routine changes” – Change Manager
RITM312513 – Review mandatory fields for all changes – Change Manager
RITM312515 – Network F5 configuration setting improvements – Network
RITM312744 – Improve data center service redundancy – Business Continuity
RITM312516 – Update configuration to webservices VIP to be globally balanced – Platform
We recognize the significance of this service disruption and are committed to serving you better. We have taken the above actions to help prevent issues such as this from happening in the future. If you have any questions about the content of this postmortem, please contact me directly.
Sincerely,
Jason Seale, PMP
Director, Client Support Services