We want to share details about the slowness and errors observed in our Prod2 cluster, which impacted our customers on 19th Jan 2023 between 3:10 AM and 4:49 AM PST.
Impact:
The Harness service experienced intermittent slowness and errors in Next-Gen. CI pipelines containing a Run step were failing.
Specifically, CI pipelines with a Run step inside a StepGroup failed when that step executed. StepGroup runs in Pipeline-Service, because the CI service does not register the StepGroup step (if it were registered there, steps inside it would always go to the CI service). Both CI and STO claim that they can execute the Run step, so for steps inside a StepGroup only, we check with all services whether they can run the step. As a result, the Run step ran inside the STO service, which caused the failure.
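The routing ambiguity described above can be illustrated with a minimal sketch. This is not the Pipeline-Service implementation: the class `StepRoutingSketch`, its methods, and the service and step-type strings are hypothetical names chosen for illustration. It only shows that when every service is asked whether it can run a step, a step type claimed by both CI and STO has more than one candidate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the actual Pipeline-Service code.
public class StepRoutingSketch {
    // Assumed registry: service name -> step types the service claims it can execute.
    private final Map<String, Set<String>> registry = new LinkedHashMap<>();

    public void register(String service, Set<String> stepTypes) {
        registry.put(service, stepTypes);
    }

    // Asks every registered service whether it claims the given step type.
    public List<String> claimants(String stepType) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : registry.entrySet()) {
            if (entry.getValue().contains(stepType)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StepRoutingSketch router = new StepRoutingSketch();
        router.register("ci", Set.of("Run"));
        router.register("sto", Set.of("Run", "Security"));

        // Both services claim "Run", so the target service is ambiguous.
        System.out.println(router.claimants("Run"));
    }
}
```

With more than one claimant, which service ends up executing the step depends on how the candidates are chosen, which matches the failure mode where the Run step landed in STO instead of CI.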
Incident timeline:
3:10 AM - The CI team identified the issue for the customer.
3:20 AM - Pipeline-Service was rolled back to the previous version, 1.19.2.
3:30 AM - Verified that executions were passing again.
3:45 AM - Found that pipelines created (but not executed) during the new deployment were, after the rollback, causing the pipeline list page not to load.
4:08 AM - Updated the affected customer pipeline entity records to handle a non-backward-compatible change in PipelineEntity, where a field had been changed from a primitive to a Java wrapper class.
4:49 AM - The status page was marked as Resolved.
Remediation:
We rolled back the recent deployment as an immediate remedy, which resolved the execution failures. After the rollback, we found that pipelines created (but not executed) during the new deployment caused the pipeline list page not to load. To remediate this, we ran a DB update on the affected pipelines to handle the non-backward-compatible change, which resolved the issue.
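To show why a primitive-to-wrapper change is not backward compatible, here is a minimal sketch under stated assumptions. It assumes that records written by the newer version could leave the field unset (null), and that the rolled-back code read it into a primitive; the report does not confirm the exact mechanism. The class `WrapperCompatSketch`, the field name `exampleFlag`, and the use of a `Map` as a stand-in for a stored record are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for a stored pipeline record.
public class WrapperCompatSketch {
    // Older reader: expects a primitive, which cannot represent "no value".
    public static boolean readAsPrimitive(Map<String, Object> record, String field) {
        Boolean stored = (Boolean) record.get(field);
        return stored; // auto-unboxing throws NullPointerException when stored is null
    }

    // Data fix in the spirit of the DB update: backfill a default where the value is missing.
    public static void backfill(Map<String, Object> record, String field, boolean defaultValue) {
        record.putIfAbsent(field, defaultValue);
    }

    public static void main(String[] args) {
        // Record as the newer version might have written it: field left unset.
        Map<String, Object> record = new HashMap<>();

        try {
            readAsPrimitive(record, "exampleFlag");
        } catch (NullPointerException e) {
            System.out.println("older reader failed on a missing value");
        }

        backfill(record, "exampleFlag", false);
        System.out.println("after backfill: " + readAsPrimitive(record, "exampleFlag"));
    }
}
```

The backfill makes the stored data valid for both versions, which is why a data update can unblock the older code without a further deployment.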
Action items: