Alert! - INC150645 Forge Public Website OpenText outage
Incident Report for Central 1
Postmortem

Postmortem: INC150645 OpenText - Public Website outage – P2

Summary:
On Thursday, November 17, 2022 at approximately 10:55 a.m. PT (1:55 p.m. ET), Central 1 clients that use the OpenText Public Website (PWS) service started to experience high latency when trying to reach their public website. The latency persisted throughout the day with several intermittent outages. Services fully recovered for all clients at 7:15 p.m. PT (10:15 p.m. ET). This incident did not directly impact Online Banking, as the websites are independently hosted; however, many customers navigate to the Online Banking login portlet via the PWS pages, which led to a reduction in desktop banking during this incident.

Postmortem:
On Thursday, November 17, 2022 at approximately 10:55 a.m. PT (1:55 p.m. ET), Central 1 clients using the OpenText Public Website (PWS) service started to experience high latency on the OpenText platform. The latency was restricted to the initial launch of the site in the customer's session. In most (~85%) cases the site would render after ~20 seconds of latency and the remainder of the session was unaffected. In the other ~15% of cases, the PWS would not render for the customer and the initial loading of the site would time out.

By 11:15 a.m. PT (2:15 p.m. ET) our site 24x7 monitoring tool detected that sporadic sites were failing, triggering our monitoring alerts. Digital Banking Support started to receive a spike in phone calls reporting that customers could not access their credit union's website. In response, a priority 2 incident was raised and Central 1 product and platform teams moved to escalate the incident with OpenText, the vendor that manages our Forge PWS platform.

This incident did not impact Online Banking or the Mobile App. Both were still available, with an approximate 20% reduction in desktop login traffic to Online Banking. Note that customers would have needed to have the Online Banking login page bookmarked to avoid the public website latency/outage. The latency for sites was very intermittent throughout the incident, suggesting that the root cause was either a volume problem (increased load) or cycling of pods (reducing available capacity), either of which would increase latency.

At 12:30 p.m. PT (3:30 p.m. ET), Central 1 called an escalation meeting with all OpenText resources to review their triage and assist with this incident. OpenText is a managed service, so we are heavily reliant on their triage, analysis and decision making.

The OpenText team could not locate any point of systemic failure and recycled some websites with no improvement to the latency/outages. One suspected point of failure was a bad file pushed live by a client (the conditions that would make it a ‘bad file’ are unknown), so between 1 and 5 p.m. PT (4 and 8 p.m. ET), all website changes pushed live that morning were reviewed, one at a time, and the suspect files were reverted.

A decision was made at 5:40 p.m. PT (8:40 p.m. ET) to take down ALL sites. If the problem was resources that could not break out of a stuck cycle to recover services, then only removing all load would help. Central 1 took down all Forge PWS sites and began bringing them back up slowly under close inspection. All websites recovered with no latency by 7:15 p.m. PT (10:15 p.m. ET).
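
For illustration only, a staged bring-up of this kind can be sketched as follows. This is a minimal sketch, not the procedure Central 1 or OpenText actually ran: the function name `staged_bring_up`, the `start_site` and `is_healthy` callbacks, and the batch size are all hypothetical.

```python
from typing import Callable, Iterable, List


def staged_bring_up(
    sites: Iterable[str],
    start_site: Callable[[str], None],   # hypothetical: starts one site
    is_healthy: Callable[[str], bool],   # hypothetical: health check for one site
    batch_size: int = 2,                 # assumed value, not from the incident
) -> List[str]:
    """Start sites in small batches and stop at the first unhealthy batch.

    Returns the sites that were started and confirmed healthy.
    """
    site_list = list(sites)
    confirmed: List[str] = []
    for i in range(0, len(site_list), batch_size):
        batch = site_list[i:i + batch_size]
        for site in batch:
            start_site(site)
        # Inspect the whole batch before adding any more load.
        if not all(is_healthy(site) for site in batch):
            break
        confirmed.extend(batch)
    return confirmed
```

Stopping at the first unhealthy batch keeps the remaining load off the platform while the failing sites are inspected.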

The investigation teams reconvened the next morning at 9:30 a.m. PT (12:30 p.m. ET) to review Friday morning stability. On Friday morning all PWS performance statistics were green, and services remained stable.

Current point of failure: The C1 and OpenText teams believe that the live pods may have transitioned into a bad state due to too much load (either external or publishing across all clients). When a pod becomes unhealthy it automatically restarts. As this process persisted, the entire cluster moved into a state where the continual restarting of pods put too much load on the other available pods, and the system was stuck in an insufficient-capacity cycle without the ability to fully recover. OpenText is also investigating other possible root causes including threading, as well as the potential for a bad workflow going live. Please see “Actions” below for the pending “OpenText” analysis.
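
The suspected restart cycle can be illustrated with a toy model. This is a hedged sketch under simplifying assumptions, not a model of the actual OpenText platform: the function `simulate`, the rule that one additional pod fails per tick while healthy pods are overloaded, and every number used with it are hypothetical and not taken from the incident.

```python
from typing import List


def simulate(
    total_pods: int,
    pod_capacity: float,   # assumed max load one pod can serve per tick
    load: float,           # assumed total load, shared evenly by healthy pods
    restart_ticks: int,    # assumed ticks a restarting pod is unavailable
    ticks: int,
    initial_failed: int = 1,
) -> List[int]:
    """Return the number of healthy pods at each tick of a toy restart model."""
    # Remaining restart time for each pod that is currently restarting.
    restarting = [restart_ticks] * initial_failed
    history: List[int] = []
    for _ in range(ticks):
        # Pods whose restart has finished rejoin the healthy pool.
        restarting = [t - 1 for t in restarting if t - 1 > 0]
        healthy = total_pods - len(restarting)
        history.append(healthy)
        # Simplifying assumption: while the healthy pods are overloaded,
        # one more pod becomes unhealthy and restarts each tick.
        if healthy > 0 and load / healthy > pod_capacity:
            restarting.append(restart_ticks)
    return history
```

In this toy model, `simulate(10, 100, 950, 5, 20)` never returns to ten healthy pods, because each restart pushes the per-pod load on the survivors over capacity. `simulate(10, 100, 850, 5, 20)` recovers on its own, and so does any run with `load=0`, which matches the reasoning that only removing all load would break the cycle.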

Impact Assessment:
Affected Service(s): Public websites

Affected FIs: All Forge Clients

Affected End Customers: Unknown

Impact windows: 10:55 a.m. to 7:15 p.m. PT (1:55 to 10:15 p.m. ET)

Central 1 Actions:

Product and Vendor Management to coordinate with OpenText
PRB011012 – Central 1 ongoing investigation into PWS outage
PRB011013 - OpenText ongoing investigation into PWS outage
Due Date: By end of 2022

RITM321765 – Review C1’s OT Architecture
Due Date: End of January 2023
• Review the current tenant instances
• Complete a code review on multitenancy architecture

RITM321768 – Review and update products Threat Risk Analysis and Third-Party Risk Analysis reports
Due Date: By end of 2022

RITM321769 – Vendor Management improvements for our OpenText support model
Due Date: End of January 2023
• Confirm C1’s access to hourly OpenText logs
• Review master agreements for support improvements

RITM321772 – Ongoing new performance testing
Due Date: End of January 2023
• Review pod and thread count thresholds under performance strain.
• Attempt to clear the cache while there is load
• Review Queue and drip protects

We understand the severe impact that such a long outage of your public website has on your reputation and on your members' ability to bank online reliably and view your site's content.

We are working with our vendor on a full review of service support, architecture and service resiliency planning. Central 1 is reviewing how we can improve this relationship and support model, and we expect a quick turnaround on our deliverables.

If you have any questions, please do not hesitate to reach out to me.

Jason Seale, PMP
Director, Client Support Services

Central 1 Client Support Services
1441 Creekside Drive, Vancouver, BC, Canada V6J 4S7
T 1 800 661 6813 ext. 5185 C 778 558 5627 Support 888 889 7878
jseale@central1.com www.central1.com

Posted Dec 01, 2022 - 16:14 PST

Resolved
The Forge 2 Public Website has remained stable throughout the day. We have already passed peak load for today and all systems are healthy.
C1 will continue to work with OpenText to find root cause and will provide a postmortem in the coming weeks.
Posted Nov 18, 2022 - 14:14 PST
Monitoring
All Forge Public sites have been restored and the system is currently stable. We will continue to monitor.
Posted Nov 17, 2022 - 20:06 PST
Update
The process of the full site recycle is successfully recovering service. Central 1 and OpenText will continue our investigation while monitoring stability.
Posted Nov 17, 2022 - 19:05 PST
Update
In an exercise to find the point of failure with the Forge 2.0 Public Websites, Central 1 is working with OpenText to completely take down, recycle and restart each individual public website.

As we perform this exercise your members may notice a brief outage in public website access instead of the degraded performance. Online banking access and Mobile App will continue to remain available throughout this triage.

An update will be provided on or before 7:00pm P.T. (10:00pm E.T.)

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 18:12 PST
Update
Central 1 continues to support our vendor's investigation. Additional logging and debugging have been implemented to assist with the triage.

An update will be provided on or before 6:00pm P.T. (9:00pm E.T.)

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 17:09 PST
Update
We are continuing to investigate this issue.
OpenText and Central 1 are working together on finding a solution.

An update will be provided on or before 5:00pm P.T. (8:00pm E.T.)

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 16:00 PST
Update
Investigation continues, as this incident is being treated with highest priority. OpenText and Central 1 are working together on finding a solution.

An update will be provided on or before 4:00pm P.T. (7:00pm E.T.)

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 14:56 PST
Update
OpenText is continuing to work to recover Forge 2.0 Public Website stability. Your public website has been latent or unavailable since 11 a.m. PT (2 p.m. ET).

Online banking is not affected, and users/members who have the login page bookmarked will be able to log in and perform transactions. Mobile App has not experienced any degradation and continues to remain available for members.

Central 1 will continue to escalate with our vendor and provide our next update by 3 p.m. PT (6 p.m. ET), or sooner if service stability recovers.

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 13:04 PST
Investigating
Please note that we are currently experiencing an OpenText Public Website outage. Users will experience extreme slowness when visiting different domains, or a gateway time-out error (502). OpenText is actively investigating and an update will be provided on or before 1:00pm P.T. (4:00pm E.T.)

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 17, 2022 - 11:27 PST
This incident affected: Incident Alerting.