Prediction API downtime
Incident Report for Nanonets
Postmortem

Dear users,

Our prediction API experienced an outage on 22nd December 2022, between 12:30 UTC and 16:30 UTC. During this incident, file processing and uploads to the platform were adversely affected. We would like to share more details about the incident.

As part of our ongoing efforts to improve our platform's reliability, our teams were working on splitting out a microservice. During this activity, we made changes to the network configuration in our VPC. Although we reverted the change within 10 minutes of noticing the downtime, our inference services running in the same VPC remained affected for nearly 2 hours.

After reverting the change, we tried to bring a backup cluster online, a time-consuming process requiring up to 90 minutes. While debugging in parallel, our team found that an infrastructure resource needed a restart in addition to the configuration rollback. According to our discussions with our infrastructure providers, the restart should not have been required. We're in touch with them to understand this discrepancy.

For the next 2 hours, our system was under heavy load and latency suffered as it worked through the files that had accumulated in our queues during the downtime while also handling live traffic. We nearly tripled our compute resources to handle this. All files sent to the async API have now been successfully processed.

We want to apologize for the impact of this outage. We understand how critical our services are to your business. We are taking the following steps to minimize the possibility of such incidents:

  • Further isolating our services so that issues in one service cannot impact another
  • Improving our queue processing systems to handle the backlogs created by downtime events, so that response times recover immediately after the system is brought back online

Regards,

Nanonets team

Posted Dec 22, 2022 - 20:07 UTC

Resolved
This incident has been resolved.
Posted Dec 22, 2022 - 18:08 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 22, 2022 - 18:08 UTC
Update
We're processing files now. Our team is working on clearing the backlog on the async queue.
Posted Dec 22, 2022 - 16:26 UTC
Monitoring
We have implemented a fix for the prediction failures. However, for customers on async mode, it will take longer to process files. We are monitoring the situation and scaling our systems accordingly.
Posted Dec 22, 2022 - 14:31 UTC
Investigating
We're currently experiencing networking issues with our infra provider, which are affecting our prediction API services.

We're looking into this on priority and will post updates here as we get them.
Posted Dec 22, 2022 - 14:19 UTC
This incident affected: API.