Root Cause Analysis: 2/15/24
|
|
Issue |
Degraded Harmony Response Agent processing |
Impact |
Critical |
Services Impacted |
Cloud Login Rest API Operations APIs |
Location |
NA (North America Cloud) |
Problem Description |
Slow Login: Users experience delays during login to the application. Agent Processing: Operations stuck in submitted or pending |
Root Cause |
Degraded Connection: Underlying node types were causing slow network performance. |
Backlog of jobs: A backlog of tasks has accumulated due to a previous issue, resulting in a backlog in job assignments. |
| Impact | User Experience: Slow to no login Application Performance: The overall application responsiveness Agent Processing: Operation processing delay |
| Resolution Steps | Scale additional compute resources Trace connectivity path Database Health Check: Investigated the data layer for resource bottlenecks, query performance, and server health. Network Troubleshooting: Verified network connectivity Service: Observed issue was intermittent but processing slow Monitoring: Reviewed monitors and breakpoints Logs: Increase resources at the messaging layer to process the job backlog. As part of this process, the queue was cycled. Increase resources at the messaging layer to process the job backlog. As part of this process, the queue was cycled. |
| Action | Tactical Remove faulty directory provider Add monitors for data layer to detect token issues Add monitors for operations queue Review data query response times |
Strategic Profile sequence that lead to performance degradation Add circuit breakers to prevent cascading effect Add cordone areas to help with faster resolution |