Dear users,
Our prediction API experienced an outage on 22 December 2022, between 12:30 and 16:30 UTC. During this incident, file processing and uploads to the platform were adversely affected. We would like to share more details about what happened.
As part of our ongoing efforts to improve the platform's reliability, our teams were working on splitting a service out into a separate microservice. During this work, we made changes to the network configuration in our VPC. Although we reverted the change within 10 minutes of noticing the downtime, our inference services running in the same VPC remained affected for nearly two hours.
After reverting the change, we tried to bring a backup cluster online, a time-consuming process that can take up to 90 minutes. While debugging in parallel, our team found that an infrastructure resource needed a restart in addition to the configuration rollback. According to our discussions with our infrastructure providers, the restart should not have been required, and we are in touch with them to understand the reason for this discrepancy.
For the next two hours, our system was under heavy load: latency increased as we processed the backlog of files that had accumulated in our queues during the downtime while also handling live traffic. We nearly tripled our compute resources to cope with this. All files sent to the async API have now been successfully processed.
We apologize for the impact of this outage. We understand how critical our services are to your business. We are taking the following steps to minimize the chance of such incidents recurring:
Regards,
Nanonets team