Harness Service Unavailable
Incident Report for Harness
Postmortem

Harness production incident due to third-party vendor misconfiguration

We want to share the details of the incident in which pipelines could not advance after a step completed. This impacted deployments and builds in the Prod-1 and Prod-2 clusters between 8:28 AM and 9:48 AM PT on Nov 30, 2022. The affected modules were NextGen Continuous Delivery, Continuous Integration, Service Reliability Management, Feature Flags, and Security Testing Orchestration. Harness CurrentGen modules were not affected.

Root cause

The Harness pipeline service relies on a third-party in-memory database provider. A rollout of an incorrect configuration, caused by human error on the vendor's side, brought down the Harness pipeline service. The vendor had initiated a project to replace the self-signed server certificate with a certificate signed by GlobalSign across their fleet, and executed the first step on a set of non-TLS-enabled database clusters. The Harness clusters were mistakenly added to that batch, resulting in an outage because the client did not trust the new certificate.
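For illustration only, the sketch below shows the general failure mode in Python: a client whose trust store pins the original self-signed certificate rejects the TLS handshake once the server presents a certificate issued by a different CA. The hostname, port, and certificate file names are hypothetical placeholders, not Harness or vendor specifics.

```python
import socket
import ssl

# Hypothetical endpoint and pinned CA file, for illustration only.
DB_HOST = "example-db.vendor.internal"
DB_PORT = 6379
PINNED_CA = "pinned-self-signed-ca.pem"

# Trust only the original self-signed certificate the client was configured with.
ctx = ssl.create_default_context(cafile=PINNED_CA)

try:
    with socket.create_connection((DB_HOST, DB_PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=DB_HOST) as tls:
            print("TLS established:", tls.version())
except ssl.SSLCertVerificationError as err:
    # This is the failure mode described above: the server now presents a
    # certificate chain from a CA the client does not trust.
    print("Certificate verification failed:", err)
```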

Remediation

The vendor reverted the incorrect configuration change by rolling back the server certificate across the Harness clusters.

Timeline

  • 8:28 AM — The first alert fired, and we triggered PagerDuty.
  • 8:38 AM — Status page updated.
  • 8:46 AM — We identified the issue was related to a third-party in-memory database, and we opened a ticket with the vendor.
  • 8:47 AM — While we were waiting on the vendor, Harness engineering tried different config changes and debugging to see whether we could address the issue.
  • 9:12 AM — Harness-side config changes failed to solve the problem.
  • 9:24 AM — The vendor joined the troubleshooting call.
  • 9:37 AM — The vendor reverted their incorrect changes, and Harness services started to recover.
  • 9:48 AM — Sanity checks passed. Issue resolved.

Action Items

  • Harness is working with the third-party vendor to improve their support SLA times.
  • Harness is re-evaluating its architecture to reduce its dependence on this third-party provider. This effort is already underway.
Posted Dec 05, 2022 - 14:26 PST

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted Nov 30, 2022 - 10:58 PST
Monitoring
Service issues have been addressed, and normal operations have been resumed. We are monitoring the service to ensure normal performance continues. We will publish an RCA for this incident as soon as we can.
Posted Nov 30, 2022 - 09:48 PST
Update
We do not have a new update yet, but we are working hard to fix the issue we identified as the root cause.
Posted Nov 30, 2022 - 09:24 PST
Identified
We have identified a potential issue causing the service access problem and are working hard to address it. Please continue to monitor this page for updates.
Posted Nov 30, 2022 - 08:46 PST
Investigating
The Harness pipelines are not progressing. We are working to identify the root cause and restore the service as soon as possible.
Posted Nov 30, 2022 - 08:38 PST
This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Cloud Builds, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)) and Prod 1 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Cloud Builds, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)).