Yesterday, one of our regions experienced high latency and an elevated error rate for about an hour. Database connection usage in that region had been running high during peak times, and yesterday it reached a tipping point: connections were exhausted and held by our application. We have connection pools and proxies to prevent this situation, but an edge case in our timeouts caused a lockup.
We were able to free up the stuck connection resources, but because they had been stuck for some time, considerable back pressure had built up and a backlog of messages was waiting to be processed. Working through that backlog took another hour before performance returned to acceptable levels.
Learnings:
Actions:
We are sorry for any inconvenience this incident has caused. Please know that we’re working hard every day to provide a performant solution your business can rely on.
Thanks
The SRE Team @ Gorgias