Harness Platform UI was intermittently inaccessible during the incident.
Root Cause
One of the microservices in the Harness cluster had an issue that led to a spike in memory utilization of the cache store. Other services that depend on the same store were impacted, making the UI inaccessible.
Timeline
Time - Event
8th May 4:10pm PST - The Pipeline Service team started validating a change by running a script.
8th May 4:27pm PST - An internal call was started and we began investigating the failure.
8th May 4:30pm PST - Identified a Redis memory spike that began roughly 10 minutes earlier.
8th May 4:33pm PST - The PL team helped delete cache key entries for schemaDetailsCache and partialSchemaCache.
8th May 4:35pm PST - The OPS team increased the Redis memory store to a higher value to accommodate the spike.
8th May 4:40pm PST - Functionality returned to normal.
Action items
Fix the bug in the API that caused the spike in the cache memory store.
Gracefully handle such memory spikes and resource exhaustion.
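One common way to realize the second action item is to bound the cache on the client side so a buggy caller cannot grow it without limit. The sketch below is illustrative only and is not Harness code; the class name and size limit are assumptions, and it shows a minimal LRU eviction policy using only the Python standard library.

```python
from collections import OrderedDict


class BoundedCache:
    """Minimal LRU cache sketch: once the entry cap is reached, the
    least-recently-used entry is evicted instead of letting the cache
    grow unboundedly and spike the memory of the backing store.
    Hypothetical example; not the actual Harness implementation."""

    def __init__(self, max_entries=1000):
        self._store = OrderedDict()
        self._max_entries = max_entries

    def put(self, key, value):
        if key in self._store:
            # Refresh recency for an existing key.
            self._store.move_to_end(key)
        self._store[key] = value
        # Evict oldest entries once the cap is exceeded.
        while len(self._store) > self._max_entries:
            self._store.popitem(last=False)

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]
```

A server-side safeguard with the same spirit is configuring an eviction policy on the cache store itself (for Redis, `maxmemory` plus a `maxmemory-policy` such as `allkeys-lru`), so a spike degrades to cache misses rather than an outage.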
Posted May 10, 2023 - 11:04 PDT
Resolved
We can confirm normal operation. Get Ship Done! We will continue to monitor to ensure stability. Additionally, we are investigating the root cause and will publish a postmortem soon.
Posted May 08, 2023 - 16:47 PDT
Monitoring
We identified a caching failure when accessing app.harness.io. The underlying service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues.