Cluster Wide Outage
Incident Report for 2600hz
Postmortem

On Thursday, 2nd Feb, we encountered an issue on ZSwitch that caused call failures and problems accessing the UI. We were first alerted by our monitoring system, which reported errors on the FreeSWITCH (FS) nodes in EWR. As a precaution we paused FS on the alerting nodes, but the issue immediately spread to the rest of the FS servers; at this point we involved the engineering team and convened an all-hands "911" call.

Our engineering, operations and support teams investigated further and found that BigCouch appeared to be in a bad state: when the applications pulled information from the database, they were intermittently hitting timeouts or seeing very long response times. This pointed to the databases being overloaded. All of the DB nodes were checked, all compactions were stopped, and a gradual restart of all the DB nodes was carried out.

After the databases were back up and healthy, we found that the following combination of tasks caused the issue. (We would like to stress that any one of these tasks, or even two of them together, would not normally cause problems; this was a very rare edge case.) -

  1. The "DB of DB's" (A file called dbs.couch which contains all the locations of all shards within the database) was much larger than it should be. This would cause DB tasks not to be completed as quickly and left BigCouch in a more fragile state.) To shrink this a separate compaction task needs to be carried out. Separate to the usual compaction carried out on a regular basis.
  2. A new feature in the latest version, added to support billing for ephemeral tokens, used an inefficient implementation of a monthly roll-up, which put heavy load on the already struggling databases.
  3. Two compaction tasks were running on separate nodes across the cluster to address disk space alerts. (This is a normal, routine task, but combined with the two issues above it placed heavier load on the databases.)
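
For reference, this kind of dedicated compaction can be triggered over the database's HTTP API. The sketch below is illustrative only: it assumes a BigCouch node-local ("backdoor") interface on port 5986, a shard-map database named dbs, and placeholder admin credentials; the exact port and credentials differ per deployment.

    # Illustrative sketch: trigger a compaction of the node-local "dbs" database
    # (the shard map stored in dbs.couch) on a single BigCouch node.
    # Assumptions: default node-local port 5986 and "admin:secret" credentials.
    import base64
    import json
    import urllib.request

    NODE_URL = "http://localhost:5986"                 # node-local interface (assumption)
    AUTH = base64.b64encode(b"admin:secret").decode()  # placeholder credentials

    def compact_shard_map():
        req = urllib.request.Request(
            f"{NODE_URL}/dbs/_compact",
            data=b"",                                  # POST with an empty body
            method="POST",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Basic {AUTH}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            print(json.loads(resp.read()))             # expect {"ok": true}

    if __name__ == "__main__":
        compact_shard_map()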

To fully resolve the issue, the DB of DBs was subsequently compacted on all of the database nodes, followed by a restart of BigCouch.
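
Before restarting a node it is worth confirming that no compaction is still in flight. The sketch below polls a CouchDB-style _active_tasks endpoint for running database compactions; the host, port, credentials and polling interval are placeholder assumptions, not our production values.

    # Illustrative sketch: wait until no database compactions are running on a
    # node before restarting it. Assumes a CouchDB-style _active_tasks endpoint
    # on port 5984 and placeholder "admin:secret" credentials.
    import base64
    import json
    import time
    import urllib.request

    NODE_URL = "http://localhost:5984"                 # clustered interface (assumption)
    AUTH = base64.b64encode(b"admin:secret").decode()  # placeholder credentials

    def active_compactions():
        req = urllib.request.Request(
            f"{NODE_URL}/_active_tasks",
            headers={"Authorization": f"Basic {AUTH}"},
        )
        with urllib.request.urlopen(req) as resp:
            tasks = json.loads(resp.read())
        return [t for t in tasks if t.get("type") == "database_compaction"]

    while True:
        running = active_compactions()
        if not running:
            print("No compactions running; safe to restart this node.")
            break
        for task in running:
            print(f"{task.get('database')}: {task.get('progress')}% complete")
        time.sleep(30)                                 # poll every 30 seconds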

To prevent this issue from recurring, we are taking the following steps in the short term -

  1. A full audit of our DB compaction procedure to make sure we are compacting everything that needs it and that BigCouch cannot reach the same fragile state again (an illustrative audit sketch follows this list).
  2. Our engineering team is fixing the ephemeral token report generator so that it puts less load on the databases.
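
As an illustration of what the audit in step 1 can look for, the sketch below walks every database and flags any file where only a small share of the on-disk size is live data, i.e. likely candidates that compaction has been missing. The endpoints are standard CouchDB/BigCouch ones, but the host, credentials and 70% threshold are assumptions for illustration rather than our actual tooling.

    # Illustrative sketch of a compaction audit: list every database and flag
    # those where live data is a small fraction of the file on disk.
    # Host, credentials, and the 70% threshold are assumptions.
    import base64
    import json
    import urllib.parse
    import urllib.request

    NODE_URL = "http://localhost:5984"                 # clustered interface (assumption)
    AUTH = base64.b64encode(b"admin:secret").decode()  # placeholder credentials

    def get_json(path):
        req = urllib.request.Request(
            f"{NODE_URL}{path}",
            headers={"Authorization": f"Basic {AUTH}"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    for db in get_json("/_all_dbs"):
        info = get_json("/" + urllib.parse.quote(db, safe=""))
        file_size = info.get("disk_size") or info.get("sizes", {}).get("file", 0)
        live_size = info.get("data_size") or info.get("sizes", {}).get("active", 0)
        if file_size and live_size / file_size < 0.7:  # audit threshold (assumption)
            print(f"{db}: only {live_size / file_size:.0%} of the file is live data")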

In the long term, we plan to upgrade all servers to CouchDB 3, which handles compaction automatically and far more efficiently.
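
For context, CouchDB 3 drives compaction through its built-in "smoosh" daemon, which is tuned via configuration channels rather than externally scheduled jobs. The snippet below is a hedged illustration of that style of configuration; the values are examples, not our planned production settings.

    [smoosh]
    db_channels = upgrade_dbs,ratio_dbs,slack_dbs
    view_channels = upgrade_views,ratio_views,slack_views

    [smoosh.ratio_dbs]
    priority = ratio
    min_priority = 2.0   ; compact once the file is roughly twice the live data size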

We again apologise for the interruption to service; please be assured we are working hard to improve, and we strive to provide the best telecommunications platform available. As mentioned above, this was a very rare edge case: a perfect storm of DB tasks that resulted in instability, and we are confident that any two of the tasks outlined would not have caused an issue on their own. If you have any further questions, please send a ticket to support. We'll be more than happy to address any concerns and go into further detail about the steps we're taking to improve.

Posted Feb 07, 2023 - 10:13 PST

Resolved
We believe the issue is now resolved. We are continuing to monitor services.
Posted Feb 02, 2023 - 15:48 PST
Identified
We are having success improving DB response times, though we are still seeing a small percentage of intermittent failures.
Posted Feb 02, 2023 - 14:53 PST
Update
We have identified this as a database issue. We are still investigating the root cause.
Posted Feb 02, 2023 - 13:29 PST
Update
We are continuing to investigate this issue.
Posted Feb 02, 2023 - 12:30 PST
Investigating
We have received and verified reports of call failures occurring in all zones on ZSwitch.
Posted Feb 02, 2023 - 12:30 PST
This incident affected: Telephony Services and Management Portal.