Banner Prod SSH/SFTP not connecting
Incident Report for OIT Services
Postmortem

Background

The Talkeetna cluster is an HP-UX server cluster that hosts Banner Job Submission services and Banner print queues and runs most Application Manager jobs. Users often connect to the hosts with SSH, SCP, or SFTP clients to place or retrieve files used by Banner and Application Manager. The functionality of these servers is currently being migrated to the Ellucian Cloud as part of the Banner Cloud migration project, with a target go-live of October 2022.

Break Down of the Problem

At around 8:30 AM on May 23, 2022, technicians became aware that end users were reporting login issues when trying to access prod.alaska.edu. At 9:00 AM, technicians identified that the cluster node that provides authentication services for end-user SSH and SFTP access had stopped responding to authentication requests. Technicians restored authentication services at around 12:30 PM, after the cluster restarted.

Target State / Goal 

Service should be available 24/7 except for published maintenance windows.

Root Cause Analysis 

At around 12:00 AM, Tazlina, one of the cluster nodes, lost a disk that was part of its root volume group, vg00. The volume group was mirrored to another disk, so no data was lost. However, the logical volumes in the volume group were unmounted from the file system. Because the data they needed resided in paths that were no longer accessible, the Kerberos service that handles authentication for the cluster, along with many other services on Tazlina, failed. The volumes that went offline because of the failed disk included the root path /, /var, /opt, /stand, /home, /usr, /usr/users, /var/adm/crash, and /hyperion.

Most of the cluster packages and processes, including the Job Submission and Application Manager agents, were running on Tetlin, another cluster node. As a result, the failure did not affect monitored services and went undetected until end users reported authentication issues. At 11:56 AM, Tetlin attempted a Transfer of Control (TOC) to Tazlina. Tazlina was unable to take control because of the missing volumes, which triggered a restart of both Tetlin and Tazlina to preserve data integrity. Upon reboot, Tazlina was able to remount the logical volumes from the mirrored physical volume of vg00. Cluster services were started on Tetlin as a single-node cluster, and the Kerberos service was restarted on Tazlina to enable end-user authentication for the cluster.
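The detection gap described above (a node losing file systems while the monitored packages keep running on another node) is the kind of condition a per-node path check could surface. The following is a minimal sketch only, not the monitoring OIT runs. The function name `find_inaccessible_paths` is hypothetical, the path list is copied from this report, and treating a path as healthy when it can be listed is an assumption about what "accessible" should mean.

```python
import os

# Mount points named in this report as having gone offline on Tazlina.
EXPECTED_PATHS = [
    "/", "/var", "/opt", "/stand", "/home", "/usr",
    "/usr/users", "/var/adm/crash", "/hyperion",
]


def find_inaccessible_paths(paths):
    """Return the paths that are not directories or cannot be listed.

    Hypothetical helper: a path counts as accessible only if it is a
    directory and its contents can be read by the current user.
    """
    failed = []
    for path in paths:
        try:
            if not os.path.isdir(path):
                failed.append(path)
                continue
            # Listing forces a real read of the directory, not just a stat.
            os.listdir(path)
        except OSError:
            # Includes I/O errors and permission errors.
            failed.append(path)
    return failed


if __name__ == "__main__":
    missing = find_inaccessible_paths(EXPECTED_PATHS)
    if missing:
        print("INACCESSIBLE: " + ", ".join(missing))
    else:
        print("OK: all expected paths are accessible")
```

Because the root path itself was among the affected volumes, a check like this may not be able to run locally during such a failure. Treating a missing report from the node as an alert condition would be one way to cover that case.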

Develop Countermeasures 

  1. Identify the failed disk
  2. Reconfigure the logical volumes to remove the mirror requirement on the failed disk
  3. Remove the failed disk from the volume group
  4. Source a replacement drive
  5. Add the new disk to the volume group and reconfigure mirroring for the logical volumes
  6. Investigate how to move the Kerberos service to another cluster node in case of failure
  7. Investigate configuring a shadow or failover for Kerberos
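For the last two items, one piece of any failover arrangement is deciding which authentication endpoint is currently usable. The sketch below illustrates that idea under stated assumptions: it is not the fix OIT implemented, the function name `first_reachable` is hypothetical, and it assumes the Kerberos service accepts TCP connections (88 is the standard Kerberos port). A successful TCP connection shows only that something is listening, not that authentication itself works.

```python
import socket


def first_reachable(endpoints, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection.

    Endpoints are tried in the order given, so the preferred server
    should be listed first. Returns None if none can be reached.
    """
    for host, port in endpoints:
        try:
            # The connection is opened and closed immediately; only
            # reachability is tested, not the Kerberos protocol.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Refused, timed out, or unresolvable: try the next one.
            continue
    return None
```

A caller would pass an ordered list such as a primary server followed by a shadow server, each paired with port 88, and alert when the result is `None` or is not the primary.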

Implementation of Countermeasures

May 23, 2022: 10:50 AM: The failed disk was identified as the top right disk on the chassis.

May 23, 2022: 3:30 PM: Technicians searched the data center but were unable to locate a replacement disk. They are now working to identify support or purchase options.

May 23, 2022: 5:30 PM: Technicians finished removing the failed disk from all of the logical volumes and from the volume group.

Follow Up / Review

June 3, 2022: Replace failed disk and reconfigure mirrors for logical volumes.

June 10, 2022: Implement a fix to make the Kerberos service more resilient.

Posted May 24, 2022 - 17:06 AKDT

Resolved
This incident has been resolved.
Posted May 24, 2022 - 12:17 AKDT
Monitoring
A fix has been implemented and we are currently monitoring the incident.
Posted May 23, 2022 - 13:03 AKDT
Update
We are still working on a fix for this issue but have a workaround for those who need SSH access. Please contact us so we can start the process of getting you the workaround.
Posted May 23, 2022 - 11:17 AKDT
Identified
We have identified the issue preventing the connections and are working on it. We do not currently have an ETA but will provide updates here.
Posted May 23, 2022 - 09:41 AKDT
Investigating
SSH/SFTP are currently not connecting to prod.alaska.edu, lrgp.alaska.edu, prep.alaska.edu and test.alaska.edu. We are currently investigating the incident.
Posted May 23, 2022 - 09:31 AKDT
This incident affected: Banner (Banner Admin Modules).