Degraded performance

Incident Report for Transifex

Postmortem

On Thursday, October 20th, 2022, at 17:20:30 (UTC+03:00), our users began experiencing difficulties using our services. No data was lost during this incident.

The incident began while we were running routine maintenance updates on our systems. During this process, our CI tool misbehaved, and our internal components started experiencing a partial outage. The issue escalated before functionality could be restored.

During this incident, the following systems were affected:

Transifex Application
Notifications
API/CLI
Transifex Live
Website

Detection was almost immediate (~5 minutes). Once the incident was confirmed, we triggered our major incident management process and formed a cross-functional incident management team.

The incident had two phases: partial unavailability and complete unavailability.

Partial unavailability - From 20/10/2022, 17:20:30 (UTC+03:00) to 20/10/2022, 19:00:08 (UTC+03:00)

Our website and all of our components were up and running, with one main issue: file downloads and uploads were unavailable, so our users could not complete these actions.

Complete unavailability - From 20/10/2022, 19:00:08 (UTC+03:00) to 20/10/2022, 20:00:00 (UTC+03:00)

To address the issue, our DevOps team had to proceed with recovery. This required making our services completely unavailable so that we could replace the resources affected by the incident.
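For readers who want the length of each phase, the timestamps above can be turned into durations. The following is a minimal sketch that uses only the times stated in this report; the names `TZ`, `at`, `phases`, and `durations` are illustrative and not part of any Transifex tooling.

```python
from datetime import datetime, timedelta, timezone

# All timestamps in this report are given in UTC+03:00.
TZ = timezone(timedelta(hours=3))


def at(hour, minute, second):
    """Build a timezone-aware timestamp on the incident date (20/10/2022)."""
    return datetime(2022, 10, 20, hour, minute, second, tzinfo=TZ)


# Phase boundaries exactly as stated in the report.
phases = {
    "partial": (at(17, 20, 30), at(19, 0, 8)),
    "complete": (at(19, 0, 8), at(20, 0, 0)),
}

# Duration of each phase, plus the overall total.
durations = {name: end - start for name, (start, end) in phases.items()}
total = sum(durations.values(), timedelta())

for name, length in durations.items():
    print(f"{name} unavailability: {length}")
print(f"total: {total}")
```

Under these assumptions, the partial phase lasted 1 hour 39 minutes 38 seconds, the complete phase 59 minutes 52 seconds, and the two phases together 2 hours 39 minutes 30 seconds.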

After resolving the incident, we monitored the situation closely for the next 4 hours and confirmed that everything was back to normal.

We have taken a number of immediate actions and are committed to making changes to avoid this situation in the future. Here are specific areas where we have made or will make significant changes:

We thoroughly reviewed our maintenance update process and identified parts to improve. We will update our training and tooling, and continuously improve our standard operating procedure.
We set up safeguards to protect and isolate our core infrastructure.
We have already started working on a failover infrastructure implementation.

We understand that our product is mission-critical to your business, and we don’t take that responsibility lightly. To our customers and our partners, we thank you for your continued trust and partnership. We hope the details and actions outlined here show our commitment to ensuring that Transifex continues to provide a cloud platform with scalable infrastructure and a steady cadence of enhancements.

Posted Oct 24, 2022 - 10:52 UTC

Resolved

Everything is back to normal.
Thank you for your patience. We apologize for the inconvenience caused.

Posted Oct 20, 2022 - 19:18 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 20, 2022 - 19:04 UTC

Investigating

You may experience difficulties accessing Transifex.
We are currently investigating the issue. Thank you for your patience and understanding!

Posted Oct 20, 2022 - 14:50 UTC

This incident affected: Transifex App and Transifex API.