TOAD Connection Issues
Incident Report for OIT Services
Postmortem

Background

The OIDP service is a load-balanced OpenLDAP implementation that serves as an Oracle Internet Directory service. Oracle clients can be pointed at this service to resolve the connection information for databases. This simplifies database migrations, because clients do not need to be individually updated when a database's location changes.

Break Down of the Problem

On Dec 7th at 05:54:50 AM, the OIDP service became unavailable. Customers using this service with TOAD or other Oracle clients were no longer able to connect to databases. As a workaround, users could manually specify each database connection if it was known. However, this requires changing settings and files, so most users chose to wait until the service was restored at 12:52 PM.

Target State / Goal 

This service should be available to customers 24/7 except during pre-defined maintenance windows.

Root Cause Analysis 

On Dec 7th at 05:54:50 AM, a network disruption caused the default node health monitor in the load balancer to be disconnected and enter a failed state. As a result, the nodes were marked as offline and the load balancer stopped routing traffic to the servers. When the network disruption was resolved, the node monitor did not reset automatically, so the nodes continued to be marked as offline despite passing other health checks. This behavior may be due to a known issue.
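The failure mode described above can be illustrated with a small toy model. This is only a sketch of the observed behavior, not the load balancer's actual implementation; the class name `LatchingMonitor` and its methods are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LatchingMonitor:
    """Toy model (hypothetical) of a health monitor that stays failed
    until someone manually resets it."""

    failed: bool = False

    def probe(self, reachable: bool) -> None:
        # A missed probe latches the monitor into the failed state.
        if not reachable:
            self.failed = True
        # A later successful probe does NOT clear the latch in this model,
        # mirroring the behavior described in the report.

    def node_online(self) -> bool:
        # The node is treated as online only while the monitor is not failed.
        return not self.failed

    def reset(self) -> None:
        # Stands in for manually disabling and re-enabling the monitor.
        self.failed = False


monitor = LatchingMonitor()
monitor.probe(reachable=False)          # network disruption
offline_during_disruption = not monitor.node_online()
monitor.probe(reachable=True)           # network recovers
still_offline_after_recovery = not monitor.node_online()
monitor.reset()                         # manual disable / re-enable
online_after_reset = monitor.node_online()
```

In this model, only the manual reset brings the node back online, which matches the fix that eventually restored service.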

Monitoring triggered a PagerDuty incident at 6:07 AM reporting that the service was down, but the incident was marked as low-urgency and so didn’t escalate until 8:00 AM. Technicians began troubleshooting at around 10:14 AM, but the urgency was not properly understood at first, so the focus was on identifying why the load balancer did not mark the nodes as healthy again rather than on restoring service. At 12:52 PM, after calls to the service desk about impacted TOAD users were relayed, the issue was resolved by manually disabling the nodes’ health monitor and re-enabling it. This caused the nodes to be marked as online again, and service was restored.
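To make the delays in this timeline explicit, the intervals can be computed directly from the times reported above. The sketch below assumes the date is Dec 7, 2022 (taken from the posting dates on this report) and treats the approximate 10:14 AM troubleshooting start as exact; the variable names are illustrative only.

```python
from datetime import datetime

# Times as reported in this postmortem; the date is an assumption
# based on the report's posting dates.
outage_start = datetime(2022, 12, 7, 5, 54, 50)
alert_fired = datetime(2022, 12, 7, 6, 7)
escalated = datetime(2022, 12, 7, 8, 0)
troubleshooting_began = datetime(2022, 12, 7, 10, 14)  # approximate
service_restored = datetime(2022, 12, 7, 12, 52)

intervals = {
    "outage start to alert": alert_fired - outage_start,
    "alert to escalation": escalated - alert_fired,
    "escalation to troubleshooting": troubleshooting_began - escalated,
    "troubleshooting to restoration": service_restored - troubleshooting_began,
    "total outage": service_restored - outage_start,
}

for label, delta in intervals.items():
    print(f"{label}: {delta}")
```

Under these assumptions, the alert fired about 12 minutes after the outage began, while most of the roughly 7-hour outage was spent waiting for escalation and for troubleshooting to begin.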

Develop Countermeasures 

Although monitoring correctly identified the issue and alerted on it, the response was slow due to a lack of technicians comfortable with the F5 and a misunderstanding of the impact. Team members will cross-train so that everyone understands how to address this issue in the future and can respond in a more timely manner.

Implementation of Countermeasures

December 8, 2022 - Discuss the urgency with team members and cross-train on how to resolve the issue.

Posted Dec 14, 2022 - 10:27 AKST

Resolved
This issue has been resolved, and TOAD connections should be available now. Please contact the OIT Service Desk if you continue to experience any issues with TOAD.
Posted Dec 07, 2022 - 13:56 AKST
Identified
OIT has identified an issue with our TOAD service and we are working on a resolution. Some users may experience TOAD connection errors until this is resolved. Thank you for your patience.
Posted Dec 07, 2022 - 12:34 AKST