Stream API degraded availability
Incident Report for Productsup
Postmortem

SUMMARY

We experienced issues with the Kafka cluster that backs our Stream API on Wednesday, 29 March 2023, between 12:40 CEST and 15:40 CEST.

From 12:40 CEST until 13:40 CEST, we had a partial outage, during which only about 50% of all requests got through. We started investigating the instability of the cluster and tried several remedial actions to see if we could get the cluster running at full capacity again.

However, we noticed that a single node would become unavailable again whenever it was restarted. Between 13:40 CEST and 15:40 CEST, we rejected all requests at our load balancer so we could take the cluster offline and inspect each node individually. During this period, we had a complete outage of the service.

We restarted each node with an increased memory configuration and let it repair itself. At 15:40 CEST, all nodes were stable again, and we were able to re-enable requests to our load balancer. From this moment on, the system was back up and running. A small backlog had built up during the downtime, but it was handled within 20 minutes and did not severely impact the functionality of the Stream API.

The main causes of the instability were an increased load on our Stream API combined with nodes that were not powerful enough to carry the cluster's full load.

To prevent this from happening again, we immediately enabled strict rate limiting for clients that were overusing our API. Ever since, the Stream API has been running stably.

No data was lost on our end. All requests that returned a successful status code (in the 2xx range) were received and processed. All requests that returned a 500 status code should be resent if needed.
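
For clients that want to resend failed requests automatically, a minimal retry sketch is shown below. The endpoint URL, payload shape, and function names are placeholders and not part of our official client libraries; only the status-code handling follows the guidance above.

    import time

    import requests

    # Placeholder endpoint; replace with your actual Stream API URL.
    STREAM_ENDPOINT = "https://example.com/stream/your-stream-id"


    def send_with_retry(payload, max_attempts=5):
        """POST a payload and retry only on 5xx responses, with exponential backoff."""
        response = None
        for attempt in range(max_attempts):
            response = requests.post(STREAM_ENDPOINT, json=payload, timeout=10)
            if response.status_code < 500:
                # 2xx responses were received and processed on our end;
                # 4xx responses will not succeed on retry, so stop here.
                return response
            # Back off before retrying: 1s, 2s, 4s, ...
            time.sleep(2 ** attempt)
        return response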

REMEDIAL ACTIONS PLAN & NEXT STEPS

Next week, we will enable strict rate limiting for all Stream API users. We will allow clients to make 30 requests per second per Stream. Until now, this was only a recommendation, and the hard limit was higher.
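
As an illustration of how a client could stay under that limit, here is a minimal client-side throttle sketch. The class and method names are our own placeholders and not part of the Productsup API; it simply spaces requests for a single stream at least 1/30 of a second apart.

    import time


    class RequestThrottle:
        """Client-side throttle that keeps requests for one stream under a per-second limit."""

        def __init__(self, max_per_second=30):
            self.min_interval = 1.0 / max_per_second
            self.last_request = 0.0

        def wait(self):
            """Block just long enough so calls are spaced at least min_interval apart."""
            now = time.monotonic()
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                time.sleep(remaining)
            self.last_request = time.monotonic()


    # Usage: call throttle.wait() before each request to a given stream.
    throttle = RequestThrottle(max_per_second=30)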

Additionally, we plan to extend our cluster next week. The servers have already been ordered and prepared.

We have also planned to review our maintenance documentation and our stability recommendations guide, and to make changes where needed.

Productsup is committed to continually and quickly improving our technology and operational processes to prevent outages of our Stream API. Again, we appreciate your patience and apologize for the impact on you, your users, and your organization. We thank you for your business and continued support.

Posted Mar 30, 2023 - 17:53 CEST

Resolved
Last night we made some configuration changes to provide more stability to the cluster. We have not seen peaks of errors or failed connections over the past 12 hours.

We will follow up with a post-mortem later today.
Posted Mar 30, 2023 - 07:43 CEST
Monitoring
The cluster is online again. We do anticipate a queue build-up for our clients, so there could be slight performance degradation over the coming minutes. We do not expect any failing connections to our system.

We continue to monitor the behavior of the cluster and our API. Tomorrow we will follow up with a post-mortem and action plan for improvements.
Posted Mar 29, 2023 - 15:38 CEST
Identified
We have traced the issue to a single node. After its recovery, we modified the configuration to be more stable. These updates have also been applied to the other nodes. We are running the final checks before the cluster can go back online.
Posted Mar 29, 2023 - 15:20 CEST
Update
One of the nodes is in recovery; we are waiting until this is done before enabling the cluster again.
Posted Mar 29, 2023 - 14:39 CEST
Update
We are restarting nodes in our cluster and inspecting the reported errors.

In the meantime, all requests to our load balancer are blocked.
Posted Mar 29, 2023 - 13:57 CEST
Update
We will perform a complete cluster restart, expecting a brief full outage of the system.

We will stop traffic to our load balancer, temporarily rejecting all requests.
Posted Mar 29, 2023 - 13:40 CEST
Investigating
We're experiencing issues with our Stream API cluster and are seeing an increased number of failed connections.

Sorry for the inconvenience.
Posted Mar 29, 2023 - 13:07 CEST
This incident affected: Stream API.