Background

The Talkeetna cluster is an HP-UX cluster of servers that hosts Banner Job Submission services and Banner print queues, and runs most Application Manager jobs. Users often connect to the hosts with SSH, SCP, or SFTP clients to place or retrieve files used by Banner and Application Manager. The functionality of these servers is currently being migrated to the Ellucian Cloud as part of the Banner Cloud migration project, with a target go-live of October 2022.

Breakdown of the Problem

On May 23, 2022, at around 8:30 AM, technicians became aware of login issues reported by end users trying to access prod.alaska.edu. At 9 AM, they determined that the cluster node that provides authentication services for end-user SSH and SFTP access had stopped responding to authentication requests. Technicians restored authentication services at around 12:30 PM, after the cluster restarted.

Target State / Goal

Service should be available 24/7 except for published maintenance windows.

Root Cause Analysis

At around 12 AM, Tazlina, one of the cluster nodes, lost a disk that was part of its root volume group, vg00. Because this volume group was mirrored to another disk, no data was lost. However, the logical volumes in the volume group were unmounted from the file system. As a result, the Kerberos service that handles authentication for the cluster, along with many other services on Tazlina, failed because the data they needed resided in paths that were no longer accessible. The volumes that went offline due to the failed disk included the root path /, /var, /opt, /stand, /home, /usr, /usr/users, /var/adm/crash, and /hyperion.

Most of the cluster packages and processes, including the job submission and Application Manager agents, were running on Tetlin, another cluster node. As a result, this failure did not impact monitored services and went undetected until end users reported problems authenticating. At 11:56 AM, Tetlin attempted a Transfer of Control (TOC) to Tazlina. Tazlina was unable to take control because of the missing volumes, which triggered a restart of both Tetlin and Tazlina in order to preserve data integrity. Upon reboot, Tazlina was able to remount the logical volumes from the mirrored physical volume of vg00. Cluster services were started on Tetlin as a single-node cluster, and the Kerberos service was restarted on Tazlina to enable end-user authentication for the cluster.
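
One way to narrow the detection gap described above is a check that alerts when an expected file system is no longer mounted. The following is a minimal sketch in Python, not the monitoring actually used on the cluster. The mount list is copied from the volumes named in this analysis. The function names, the injectable is_mount parameter, and the exit-code convention (0 for OK, 2 for critical, as used by Nagios-style plugins) are assumptions made for illustration, as is the availability of a Python interpreter to whatever runs the check.

```python
import os

# Mount points named in the root cause analysis as having gone offline.
REQUIRED_MOUNTS = [
    "/", "/var", "/opt", "/stand", "/home",
    "/usr", "/usr/users", "/var/adm/crash", "/hyperion",
]


def missing_mounts(paths, is_mount=os.path.ismount):
    """Return the paths in `paths` that are not currently mount points.

    `is_mount` is injectable (hypothetical parameter) so the logic can be
    exercised without touching the real file system.
    """
    return [p for p in paths if not is_mount(p)]


def check_status(paths, is_mount=os.path.ismount):
    """Return (exit_code, message): 0 if everything is mounted, 2 otherwise."""
    missing = missing_mounts(paths, is_mount)
    if missing:
        return 2, "CRITICAL: not mounted: " + ", ".join(missing)
    return 0, "OK: all required file systems are mounted"
```

A scheduler or monitoring agent could call check_status(REQUIRED_MOUNTS) and alert on a nonzero code. A node that has lost its root file system may be unable to run any local script at all, so a check like this is most useful when a missed report is also treated as a failure.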

Develop Countermeasures

Identify the failed disk
Reconfigure the logical volumes to remove the mirror requirement on the failed disk
Remove the failed disk from the volume group
Source a replacement drive
Reconfigure the volume group with the new disk and reconfigure mirroring for the logical volumes
Investigate how to move the Kerberos service to another cluster node in the case of failure
Investigate configuring a shadow or failover for Kerberos
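
As a starting point for the failover investigation, the sketch below shows the kind of reachability probe a failover mechanism could build on: it walks an ordered list of candidate endpoints and returns the first one that accepts a TCP connection. This is an illustration under stated assumptions, not the planned fix. The function name and the endpoint list are hypothetical, and the use of port 88 in the usage note reflects the standard Kerberos port rather than anything confirmed about this cluster's configuration.

```python
import socket


def first_reachable(endpoints, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection, else None.

    `endpoints` is an ordered list of (host, port) tuples, primary first.
    A successful connect only shows that something is listening; it does
    not prove that the service behind it can authenticate users.
    """
    for host, port in endpoints:
        try:
            # The connection is closed again as soon as the probe succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Refused, timed out, or unresolvable: try the next candidate.
            continue
    return None
```

For example, first_reachable([("primary.example", 88), ("standby.example", 88)]) would prefer the primary and fall back to the standby; both host names are placeholders. Because the authentication service in this incident failed when its data paths went offline, a real health check would likely need to exercise an actual authentication request rather than rely on a port probe alone.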

Implementation of Countermeasures

May 23, 2022: 10:50 AM: The failed disk was identified as the top-right disk in the chassis.

May 23, 2022: 3:30 PM: Technicians searched the data center but were unable to locate a replacement disk. They are currently working to identify support or purchase options.

May 23, 2022: 5:30 PM: Technicians finished removing the failed disk from all of the logical volumes and from the volume group.

Follow Up / Review

June 3, 2022: Replace the failed disk and reconfigure mirrors for the logical volumes.

June 10, 2022: Implement a fix for a more resilient Kerberos service.

Posted May 24, 2022 - 17:06 AKDT

Resolved

This incident has been resolved.

Posted May 24, 2022 - 12:17 AKDT

Monitoring

A fix has been implemented and we are currently monitoring the incident.

Posted May 23, 2022 - 13:03 AKDT

Update

We are still working on a fix for this issue but have a workaround for those who need SSH access. Please contact us so we can start the process of getting you the workaround.

Posted May 23, 2022 - 11:17 AKDT

Identified

We have identified the issue preventing the connections and are working on it. We do not currently have an ETA but will provide updates here.

Posted May 23, 2022 - 09:41 AKDT

Investigating

SSH/SFTP are currently not connecting to prod.alaska.edu, lrgp.alaska.edu, prep.alaska.edu and test.alaska.edu. We are currently investigating the incident.

Posted May 23, 2022 - 09:31 AKDT

This incident affected: Banner (Banner Admin Modules).