Functions are failing to start in the US-West PoP
Incident Report for PubNub
Postmortem

Problem Description, Impact, and Resolution

This incident was a recurrence of the Functions incident from January 26, whose root cause had been misidentified at the time. Starting at around 01:37 UTC on 2021-01-28, some Functions failed to start in our US-West PoP, and published messages failed to trigger those Functions. Messages to Functions that were already running were processed correctly. The incident was triggered when a database used to register Functions reached a size at which its performance degraded unexpectedly, causing cascading effects in the systems used to trigger Functions.

To mitigate the impact, we routed traffic around the US-West PoP. By 03:10 UTC all Functions were being triggered again, at which point the incident was fully resolved.

Mitigation Steps and Recommended Future Preventative Measures

To prevent a similar issue in the future, we are proactively managing the size of the databases that could reach the size threshold uncovered by this incident. We have also added items to our backlog to alter the dependencies on the existing data storage approach for registering Functions.

Posted Feb 22, 2021 - 16:59 UTC

Resolved
This incident has been resolved.
Posted Jan 28, 2021 - 05:13 UTC
Monitoring
Function executions have been restored since 03:10 UTC and remain operational. We will publish a separate post on this site about the outstanding Portal UI issue with the displayed Function running state. We're genuinely sorry for the disruption today. We'll be back with a summary of this issue.
Posted Jan 28, 2021 - 04:51 UTC
Update
Functions now start and execute in all regions. The Portal will still show them as Pending, but publishes in all regions will trigger the Function. The issue only affected the starting of Functions. All running Functions were unaffected.
Posted Jan 28, 2021 - 03:10 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 28, 2021 - 02:49 UTC
Identified
Customers in our US-West PoP are experiencing an issue where Functions are failing to execute.
Posted Jan 28, 2021 - 02:00 UTC
This incident affected: Functions (Functions Service) and Points of Presence (North America Points of Presence).