Error loading app.harness.io - inaccessible
Incident Report for Harness
Postmortem

Impact

The Harness Platform UI was intermittently inaccessible for the duration of the incident.

Root Cause

One of the microservices in the Harness cluster of services had an issue that led to a spike in memory utilization of the shared cache store (Redis). Other services that depend on the same store were impacted, which made the UI inaccessible.

Timeline

Time                 Event
8th May 4:10pm PST   Pipeline Service team started validating a change by running a script.
8th May 4:27pm PST   An internal call was started and the team began investigating the failure.
8th May 4:30pm PST   Identified a Redis memory spike that had begun about 10 minutes earlier.
8th May 4:33pm PST   PL team helped delete cache key entries for schemaDetailsCache and partialSchemaCache.
8th May 4:35pm PST   OPS team helped raise the Redis cache memory limit to a higher value to accommodate the spike.
8th May 4:40pm PST   Functionality returned to normal.
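
The cache cleanup step in the timeline can be illustrated with a minimal sketch. It is not the script the team ran. It assumes the cache entries are stored under keys that begin with the cache name (the key layout is not stated in this report), and it assumes a client object exposing `scan_iter` and `delete`, as redis-py clients do. The `InMemoryClient` class and the `delete_by_prefix` function are hypothetical names used only to keep the example runnable without a Redis server.

```python
import fnmatch
from typing import Dict, Iterable, Iterator


def delete_by_prefix(client, prefixes: Iterable[str], batch_size: int = 500) -> int:
    """Delete every key starting with one of the prefixes, in small batches.

    Iterating with SCAN (scan_iter) instead of KEYS avoids blocking the
    store with one large command while it is already under memory pressure.
    """
    deleted = 0
    for prefix in prefixes:
        batch = []
        for key in client.scan_iter(match=f"{prefix}*", count=batch_size):
            batch.append(key)
            if len(batch) >= batch_size:
                deleted += client.delete(*batch)
                batch = []
        if batch:  # flush the final partial batch
            deleted += client.delete(*batch)
    return deleted


class InMemoryClient:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, data: Dict[str, object]) -> None:
        self.data = dict(data)

    def scan_iter(self, match: str = "*", count: int = 10) -> Iterator[str]:
        # Snapshot the matching keys so deleting while iterating is safe.
        return iter([k for k in list(self.data) if fnmatch.fnmatchcase(k, match)])

    def delete(self, *names: str) -> int:
        removed = 0
        for name in names:
            if name in self.data:
                del self.data[name]
                removed += 1
        return removed


if __name__ == "__main__":
    client = InMemoryClient({
        "schemaDetailsCache:a": 1,   # assumed key layout: "<cacheName>:<id>"
        "partialSchemaCache:b": 2,
        "otherCache:c": 3,
    })
    removed = delete_by_prefix(client, ["schemaDetailsCache", "partialSchemaCache"])
    print(f"removed {removed} keys, remaining: {sorted(client.data)}")
```

Batching keeps each delete command small, which matters when the store is close to its memory limit and other services are still reading from it.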

Action items

  • Fix the bug in the API that caused the memory spike in the cache store.
  • Gracefully handle such memory spikes/exhaustion of resources.
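
One possible shape for the second action item is a read path that treats the cache as optional, so a failing or full cache store slows requests down instead of making them fail. The sketch below is an illustration under assumptions, not the fix Harness implemented: `get_with_fallback`, `cache_get`, `cache_set`, and `load` are hypothetical names, and the cache is represented by two plain callables so the example runs without a Redis server.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("cache_fallback")


def get_with_fallback(
    cache_get: Callable[[str], Optional[T]],
    cache_set: Callable[[str, T], object],
    key: str,
    load: Callable[[str], T],
) -> T:
    """Return the value for key, using the cache only on a best-effort basis.

    Any cache error (for example an out-of-memory rejection on write or a
    connection failure) is logged and ignored; the value is then served
    from the source of truth via load().
    """
    try:
        cached = cache_get(key)
        if cached is not None:
            return cached
    except Exception as exc:  # broad on purpose: the cache must never be fatal
        log.warning("cache read failed for %s: %s", key, exc)

    value = load(key)

    try:
        cache_set(key, value)
    except Exception as exc:
        log.warning("cache write failed for %s: %s", key, exc)
    return value


if __name__ == "__main__":
    def failing_set(key: str, value: str) -> None:
        # Simulates a cache store that rejects writes because it is full.
        raise MemoryError("cache store is out of memory")

    result = get_with_fallback({}.get, failing_set, "schema:1", lambda k: f"loaded {k}")
    print(result)
```

The trade-off is that every cache failure turns into a call to the backing service, so in practice this pattern is usually paired with a rate limit or circuit breaker to avoid overloading that service during a long cache outage.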
Posted May 10, 2023 - 11:04 PDT

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor to ensure stability. Additionally, we are investigating the root cause and will publish a postmortem soon.
Posted May 08, 2023 - 16:47 PDT
Monitoring
We identified a caching failure when accessing app.harness.io. The service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues.
Posted May 08, 2023 - 16:40 PDT
Investigating
We are currently investigating the issue.
Posted May 08, 2023 - 16:36 PDT
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS).