On Thursday 2nd Feb we encountered an issue on Zswitch that caused call failures and problems accessing the UI. We were first alerted by our monitoring system, which reported errors on FS within EWR. As a precautionary measure we decided to pause FS on the alerting nodes. The issue instantly spread to the rest of the FS servers; at this point we involved the engineering team and jumped on an all-hands "911" call.
Our engineering, operations and support teams investigated further and found that BigCouch appeared to be in a bad state. (When the apps tried to pull information from the DB, they were occasionally getting timeouts or, at best, seeing very long response times.) This pointed to the databases being overloaded. All of the DB nodes were checked, all compactions were stopped, and a slow restart of all the DB nodes was carried out.
After the DBs were back up and working, we found that the following combination of tasks caused the issues. (We would like to stress that any one, or even two, of these tasks wouldn't usually cause issues; this is very much an edge-case scenario.) -
To fully resolve the issue, the DB of DBs was subsequently compacted on all of the databases, followed by a restart of BigCouch.
To prevent this issue from recurring we are taking the following steps in the short term -
In the long term, we plan to upgrade all servers to CouchDB 3, which will compact automatically in a much more efficient manner.
We again apologise for the interruption to service; please be assured we are working hard to improve our service. We strive to provide the best telecommunications platform available. As mentioned previously, this was very much an edge-case scenario, a perfect storm of DB tasks which resulted in instability; we are confident that any two of the tasks outlined wouldn't have caused an issue on their own. If anyone has any further questions, please do send a ticket in to support. We'll be more than happy to address any concerns and go into further detail on the steps we're taking to improve.