Here are some more details on the incident.
At 12:29 UTC, some instances of our internal service responsible for processing push campaigns began to fail.
By around 13:00 UTC, all instances were stuck processing push campaigns.
At that point our monitoring system alerted us that the service was no longer making progress, and we opened this incident at 13:19 UTC.
Soon after, we found that one of the caching database clusters used by this service was misbehaving because one of its nodes was down; we decided to restart that node.
At 13:29 UTC the node came back online, the database cluster recovered, and the service resumed making progress.
At 13:54 UTC the service had caught up, and push campaigns were once again processed without delay.
While the service was in a degraded state, push notifications could be delayed by 15 to 45 minutes.
After we restored the caching database cluster, this delay progressively decreased.