Multiple issues affecting some projects

Incident Report for Platform.sh

Postmortem

Overview

On March 2nd, 2023, at 20:23 UTC, an incident was declared for the us-2 region. On-call engineers began receiving alerts that multiple sites were down and immediately began an investigation.

What Happened

A maintenance operation applied to the storage layer unintentionally degraded performance for most of the grid hosts in the region. The maintenance was intended to improve the storage layer by switching to an upgraded backend, but the migration to the new system caused elevated CPU consumption, which led to a major performance impact on any environment running on an affected host.

Resolution

To mitigate the impact of the incident, additional storage layer resources were provisioned and deployed to the region. The migration was able to continue as planned, and grid hosts began to recover as expected.

Impact

We have paused the same maintenance across all regions while we assess how to move forward safely.

Posted Mar 09, 2023 - 12:50 UTC

Resolved

This incident has been resolved.

We have released a fix for the storage backend, and the US-2 region should now function as expected.
Please create a support ticket if you need further assistance.

Posted Mar 03, 2023 - 03:44 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 03, 2023 - 00:03 UTC

Identified

We have identified the cause as related to our storage backend. We are issuing a fix now and will provide updates soon.

Posted Mar 02, 2023 - 21:02 UTC

Investigating

We are aware of some projects within the US-2 region that are experiencing a variety of issues. We are currently investigating and will provide updates when we have more information.

Posted Mar 02, 2023 - 20:23 UTC

This incident affected: USA-2 (East 2) (us-2.platform.sh).