Harness Platform UI was intermittently inaccessible during the incident.
Root Cause
One of the microservices in the Harness cluster had an issue that led to a spike in memory utilization of the cache store. Other services that depend on the same store were impacted, making the UI inaccessible.
Timeline
Time - Event
8th May 4:10pm PST - The Pipeline Service team started validating a change by running a script.
8th May 4:27pm PST - An internal call was started and we began investigating the failure.
8th May 4:30pm PST - Identified a Redis memory spike that began roughly 10 minutes earlier.
8th May 4:33pm PST - The PL team helped delete cache key entries for schemaDetailsCache and partialSchemaCache.
8th May 4:35pm PST - The OPS team increased the Redis memory store to a higher value to accommodate the spike.
8th May 4:40pm PST - Functionality returned to normal.
Action items
Fix the bug in the API that caused the spike in the cache memory store.
Gracefully handle such memory spikes and resource exhaustion.
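One common way to realize the second action item is to bound the cache on the client side so a buggy caller cannot grow it without limit. The sketch below is illustrative only and is not Harness code; the class name and size limit are assumptions, and it shows a minimal LRU eviction policy using only the Python standard library.

```python
from collections import OrderedDict


class BoundedCache:
    """Minimal LRU cache sketch: once the entry cap is reached, the
    least-recently-used entry is evicted instead of letting the cache
    grow unboundedly and spike the memory of the backing store.
    Hypothetical example; not the actual Harness implementation."""

    def __init__(self, max_entries=1000):
        self._store = OrderedDict()
        self._max_entries = max_entries

    def put(self, key, value):
        if key in self._store:
            # Refresh recency for an existing key.
            self._store.move_to_end(key)
        self._store[key] = value
        # Evict oldest entries once the cap is exceeded.
        while len(self._store) > self._max_entries:
            self._store.popitem(last=False)

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]
```

A server-side safeguard with the same spirit is configuring an eviction policy on the cache store itself (for Redis, `maxmemory` plus a `maxmemory-policy` such as `allkeys-lru`), so a spike degrades to cache misses rather than an outage.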
Posted May 10, 2023 - 11:04 PDT
Resolved
We can confirm normal operation. Get Ship Done! We will continue to monitor to ensure stability. Additionally, we are investigating the root cause and will publish a postmortem soon.
Posted May 08, 2023 - 16:47 PDT
Monitoring
We identified a caching failure when accessing app.harness.io. The underlying service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues.