Users were not able to sign in to Harness for two of our three production clusters. Already logged in users were not affected.
[02/07/2023, Time in PST]
06:15 PM: Harness detected the issue.
06:17 PM: Incident process started and engineering started investigating.
07:11 PM: Incident resolved via rollback of the auth ui service.
Harness application has an independent microservice for user sign-in & sign-up functionality. The sign-in page stopped loading due to an incompatible JS library. Our CI pipeline had a bug that allowed reusing a tag for new releases. As a result, a pre-release docker image was published with the tag already in use in production. We have already taken steps to put guard rails on who can execute this specific pipeline till we have the following completed.
It took some time to find the root cause since the trigger point was a cluster node restart which caused a new image version to be used and did not correlate to any recent deployment.
We reviewed all our CI pipelines and have taken up the following action items to ensure this issue does not recur.