SUMMARY
We experienced issues with our Kafka cluster used for our Stream API on Wednesday, the 29th of March, between 12:40 CEST and 15:40 CEST.
From 12:40 CEST until 13:40 CEST, we had a partial outage: only about 50% of all requests got through. We began investigating the cluster's instability and tried several remedial actions to see whether we could get the cluster running at full capacity again.
However, we noticed that a single node would become unavailable again whenever it was restarted. Between 13:40 CEST and 15:40 CEST, we rejected all requests at our load balancer so we could take the cluster offline and inspect each node individually. During this period, the service was completely unavailable.
We restarted each node with an increased memory configuration and let it repair itself. At 15:40 CEST, all nodes were stable again, and we re-enabled requests to our load balancer. From that moment on, the system was back up and running. A small backlog had built up during the downtime, but it was processed within 20 minutes and did not severely impact the functionality of the Stream API.
The main causes of the instability were increased load on our Stream API combined with nodes that were not powerful enough to carry the cluster's full load.
To prevent this from happening again, we immediately enabled strict rate limiting for clients that were overusing our API. Since then, the Stream API has been running stably.
No data was lost on our end. All requests that returned a successful status code (in the 2xx range) were received and processed. Any request that returned a 500 status code should be resent if needed.
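The resend guidance above can be sketched as a small client-side retry helper. This is an illustrative sketch only; the function name and parameters are our own, not part of the Productsup API, and `send_request` stands in for whatever call your client makes to the Stream API:

```python
import time

def send_with_retry(send_request, max_attempts=3, backoff_seconds=1.0):
    """Resend a request whose response was a 500, with exponential backoff.

    `send_request` is any callable that performs the HTTP request and
    returns its status code. 2xx responses were received and processed,
    so they are never retried.
    """
    for attempt in range(max_attempts):
        status = send_request()
        if 200 <= status < 300:
            return status  # received and processed; nothing to resend
        if status != 500:
            return status  # other statuses are outside this advisory
        if attempt < max_attempts - 1:
            # back off before resending: 1s, 2s, 4s, ...
            time.sleep(backoff_seconds * (2 ** attempt))
    return status
```

Clients that batch requests can wrap each call in this helper so that only the 500 responses from the outage window are replayed.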
REMEDIAL ACTIONS PLAN & NEXT STEPS
Next week, we will enable strict rate limiting for all Stream API users, allowing clients to make 30 requests per second per Stream. Until now, this was only a recommendation, and the hard limit was higher.
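To stay under the new limit, clients can throttle themselves before the server rejects them. Below is a minimal client-side token-bucket sketch; the class name and interface are hypothetical, and the only assumption taken from this notice is the 30 requests/second/Stream ceiling:

```python
import time

class StreamRateLimiter:
    """Client-side token bucket that keeps one Stream under N requests/second."""

    def __init__(self, max_per_second=30):
        self.capacity = float(max_per_second)  # burst size
        self.rate = float(max_per_second)      # refill rate, tokens per second
        self.tokens = self.capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a request may be sent, then consume one token."""
        while True:
            now = time.monotonic()
            # refill tokens for the time elapsed since the last check
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # not enough tokens yet: wait until one is available
            time.sleep((1 - self.tokens) / self.rate)
```

Call `limiter.acquire()` immediately before each request for a given Stream; with one limiter instance per Stream, the client never exceeds the per-Stream limit.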
Additionally, we plan to extend our cluster next week. The servers have already been ordered and prepared.
We have also planned to review our maintenance documentation and our stability recommendations guide, and to make changes where needed.
Productsup is committed to continually and quickly improving our technology and operational processes to prevent outages of our Stream API. Again, we appreciate your patience and apologize for the impact on you, your users, and your organization. We thank you for your business and continued support.