Harness CI builds are failing intermittently
Incident Report for Harness
Postmortem

We want to share details about the slowness and errors observed in our Prod2 cluster that impacted our customers on 19 January 2023 between 3:10 AM and 4:49 AM PST.

Impact:

Harness services experienced intermittent slowness and errors in Next-Gen. CI pipelines containing a Run step were failing.

Root Cause:

CI pipelines that had a Run step inside a StepGroup failed when executing that step. The StepGroup itself runs in Pipeline-Service, because the CI service does not register the StepGroup step (if it did, steps inside the StepGroup would always be routed to the CI service). For steps inside a StepGroup, and only for those, we check with all services whether they can run the step, and both CI and STO claim that they can execute the Run step. When the Run step was executed in the STO service, it failed.
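The routing ambiguity can be illustrated with a minimal sketch. This is not the Harness implementation: the class StepRoutingSketch, its methods, and the service and step names used as keys are all hypothetical, chosen only to show how a step type claimed by two services can end up on either of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the actual Pipeline-Service code.
public class StepRoutingSketch {
    // Assumed registry shape: service name -> step types it claims it can execute.
    private final Map<String, Set<String>> registry = new LinkedHashMap<>();

    public void register(String service, Set<String> stepTypes) {
        registry.put(service, stepTypes);
    }

    /** Returns every registered service that claims the given step type. */
    public List<String> claimants(String stepType) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : registry.entrySet()) {
            if (entry.getValue().contains(stepType)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StepRoutingSketch router = new StepRoutingSketch();
        router.register("pipeline-service", Set.of("StepGroup"));
        router.register("ci", Set.of("Run"));
        router.register("sto", Set.of("Run"));

        // More than one claimant means the step may be executed by a service
        // that was not the intended one.
        List<String> services = router.claimants("Run");
        if (services.size() > 1) {
            System.out.println("Ambiguous routing for Run: " + services);
        }
    }
}
```

In this sketch, a check that fails or warns when a step type has more than one claimant would surface the ambiguity before a step is dispatched.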

Incident timeline:

3:10 AM - The CI team identified the issue affecting the customer.

3:20 AM - Pipeline-Service was rolled back to the previous version, 1.19.2.

3:30 AM - Verified that executions were passing again.

3:45 AM - Found that, after the rollback, pipelines that had been created (but not executed) during the new deployment were causing the pipeline list page not to load.

4:08 AM - Updated customer pipeline entity records to handle a non-backward-compatible change in PipelineEntity, in which a field was changed from a primitive to a Java wrapper class.

4:49 AM - The status page was marked as Resolved.

Remediation:

We rolled back the recent deployment as an immediate remedy, which resolved the execution failures. After the rollback, we found that pipelines created (but not executed) during the new deployment caused the pipeline list page not to load. To remediate this, we ran a DB update on the affected pipelines to handle the non-backward-compatible change, which resolved the issue.
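The report does not name the affected field or the exact failure, so the following is only a sketch of one plausible mechanism, assuming that records written by the newer release could hold a null in the wrapper-typed field. The class names, the field name flag, and both mapping methods are hypothetical.

```java
// Hypothetical illustration only; field and class names are assumptions.
public class WrapperRollbackSketch {
    /** Shape written by the newer release: the field is a wrapper and may be null. */
    static class NewPipelineRecord {
        Boolean flag; // null when the value was never set
    }

    /** Shape expected by the rolled-back release: the field is a primitive. */
    static class OldPipelineEntity {
        boolean flag;
    }

    /** Strict mapping: auto-unboxing a null Boolean throws NullPointerException. */
    static OldPipelineEntity readStrict(NewPipelineRecord record) {
        OldPipelineEntity entity = new OldPipelineEntity();
        entity.flag = record.flag; // fails when record.flag is null
        return entity;
    }

    /** Tolerant mapping: a missing value falls back to the primitive default. */
    static OldPipelineEntity readTolerant(NewPipelineRecord record) {
        OldPipelineEntity entity = new OldPipelineEntity();
        entity.flag = record.flag != null && record.flag;
        return entity;
    }

    public static void main(String[] args) {
        NewPipelineRecord record = new NewPipelineRecord(); // flag left null
        try {
            readStrict(record);
        } catch (NullPointerException e) {
            System.out.println("Strict read failed on a null wrapper value");
        }
        System.out.println("Tolerant read: " + readTolerant(record).flag);
    }
}
```

Under that assumption, either backfilling a default into the stored records (comparable in spirit to the DB update described above) or reading the field tolerantly lets both the old and the new version load the same record.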

Action items:

  • Short Term Plan
    1. Make the STO service handle the new changes in pipeline-SDK, and make the PipelineEntity changes backward compatible, as a hotfix in pipeline-service version 1.20.
    2. Internally validate if these CI steps should register with STO.
    3. Notify all stakeholders to update their SDK whenever an SDK change has to be adopted by each service. Wait for at least two releases (of every service) before removing a backward-compatible change.
  • Long Term Plan
    1. Include rollback testing in the Pre-Prod environment as part of normal sanity testing.
    2. We will explore moving Pipeline-SDK to a versioned JAR so that services are not broken by non-backward-compatible changes in the SDK, and each service can adopt the SDK version according to its release cycle.
    3. Whenever we stop supporting an SDK version, we will inform all teams, and that JAR version will be removed from our JFrog repository.
Posted Jan 23, 2023 - 00:07 PST

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted Jan 19, 2023 - 04:49 PST
Monitoring
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
Posted Jan 19, 2023 - 04:22 PST
Identified
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
Posted Jan 19, 2023 - 04:00 PST
Investigating
The Harness CI builds are failing intermittently. We are working to identify the cause and restore normal operations as soon as possible.
Posted Jan 19, 2023 - 03:10 PST
This incident affected: Prod 2 (Continuous Integration Enterprise (CIE) - Cloud Builds).