Issue with Harness login
Incident Report for Harness
Postmortem

Event Description:

Users were not able to sign in to Harness on two of our three production clusters. Users who were already logged in were not affected.

Incident Timelines:

[02/07/2023, Time in PST]

06:15 PM: Harness detected the issue.

06:17 PM: Incident process started and engineering started investigating.

07:11 PM: Incident resolved via rollback of the auth UI service.

Findings/Root Cause:

The Harness application has an independent microservice for user sign-in and sign-up. The sign-in page stopped loading because of an incompatible JS library. Our CI pipeline had a bug that allowed a tag to be reused for new releases. As a result, a pre-release Docker image was published with a tag that was already in use in production. As an interim measure, we have restricted who can execute this specific pipeline until the action items below are completed.

Finding the root cause took some time because the trigger was a cluster node restart, which caused a new image version to be used, so the failure did not correlate with any recent deployment.
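
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Harness's actual infrastructure: `ToyRegistry`, `ToyNode`, and the tag `auth-ui:1.2.3` are hypothetical names, and the registry is an in-memory dictionary. It shows how overwriting a tag changes nothing on a running node until a restart re-resolves the tag.

```python
import hashlib


class ToyRegistry:
    """In-memory stand-in for an image registry with mutable tags (hypothetical)."""

    def __init__(self):
        self._tags = {}  # tag -> content digest

    def push(self, tag: str, content: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        # No guard here: pushing an existing tag silently repoints it.
        self._tags[tag] = digest
        return digest

    def resolve(self, tag: str) -> str:
        return self._tags[tag]


class ToyNode:
    """A node that resolves its tag only when it starts or restarts (hypothetical)."""

    def __init__(self, registry: ToyRegistry, tag: str):
        self.registry = registry
        self.tag = tag
        self.running_digest = registry.resolve(tag)

    def restart(self) -> None:
        # A restart re-resolves the tag, with no deployment involved.
        self.running_digest = self.registry.resolve(self.tag)


registry = ToyRegistry()
release = registry.push("auth-ui:1.2.3", b"release build")
node = ToyNode(registry, "auth-ui:1.2.3")

# A pre-release build is published under the tag already used in production.
prerelease = registry.push("auth-ui:1.2.3", b"pre-release build")

unchanged_before_restart = node.running_digest == release
node.restart()
changed_after_restart = node.running_digest == prerelease
```

In this model the node keeps serving the release build until it restarts, which is why the change in behavior would not line up with any deployment event.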

Learnings and Further Action Items:

We reviewed all our CI pipelines and have taken up the following action items to ensure this issue does not recur.

  1. Standardize CI pipelines never to allow the reuse of release tags.
  2. Use Docker image SHA digests instead of Docker tags for production services.
  3. Create a separate production docker repository from the pre-release docker repository.
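
The three action items above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Harness pipeline: `ImmutableTagRegistry`, `TagAlreadyExists`, `pinned_reference`, and the repository names `prod/auth-ui` and `prerelease/auth-ui` are hypothetical. The only real convention used is the `repository@sha256:<hex>` form that Docker accepts for referencing an image by digest.

```python
import hashlib


class TagAlreadyExists(Exception):
    """Raised when a push would overwrite an existing release tag."""


class ImmutableTagRegistry:
    """In-memory registry model that refuses tag reuse (hypothetical)."""

    def __init__(self):
        self._tags = {}  # (repository, tag) -> digest

    def push(self, repository: str, tag: str, content: bytes) -> str:
        key = (repository, tag)
        if key in self._tags:
            # Action item 1: never allow a release tag to be reused.
            raise TagAlreadyExists(f"{repository}:{tag} is already published")
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        self._tags[key] = digest
        return digest


def pinned_reference(repository: str, digest: str) -> str:
    """Action item 2: reference the image by digest, not by tag."""
    return f"{repository}@{digest}"


registry = ImmutableTagRegistry()
prod_digest = registry.push("prod/auth-ui", "1.2.3", b"release build")

# Action item 3: pre-release images live in a separate repository,
# so the same tag name there cannot collide with production.
registry.push("prerelease/auth-ui", "1.2.3", b"pre-release build")

try:
    registry.push("prod/auth-ui", "1.2.3", b"pre-release build")
    reuse_rejected = False
except TagAlreadyExists:
    reuse_rejected = True

production_image = pinned_reference("prod/auth-ui", prod_digest)
```

Because a digest is derived from the image content, a production service pinned to `production_image` would keep resolving to the same build even if a tag were somehow repointed.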
Posted Feb 08, 2023 - 16:15 PST

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted Feb 07, 2023 - 19:37 PST
Monitoring
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
Posted Feb 07, 2023 - 19:11 PST
Update
We have confirmed that this issue is with the login experience. Users currently logged in to Harness can continue to work without problems and should avoid logging out so that they are not impacted.
Posted Feb 07, 2023 - 18:43 PST
Investigating
The Harness service is experiencing an issue with the login page at app.harness.io. We are working to identify the cause and restore normal operations as soon as possible.
Posted Feb 07, 2023 - 18:23 PST
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Cloud Builds, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)) and Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Cloud Builds, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)).