Degraded Experience on Several APIs
Incident Report for Xendit
Postmortem

What happened?

At 2023-09-12 6:00 AM WIB, we performed planned maintenance on one of our production clusters. No downtime was expected from this maintenance.

At 2023-09-12 6:35 AM WIB, we completed the maintenance but found that the deployments were not in a healthy state. We also identified degraded performance in our Payments, Payouts, Credit Card, Checkout UI, Subscription, and XenPlatform’s Transfer APIs, which resulted in customers receiving 500/503 HTTP error response codes.

At 2023-09-12 6:45 AM WIB, we initiated a rollback of the changes and began to see the issues partially resolve.

At 2023-09-12 7:14 AM WIB, we completed the recovery process and resumed processing new incoming requests.

Our investigation revealed that an unexpected edge case during the maintenance of one of our production infrastructure clusters caused this outage. The cluster was one of the last two clusters we aimed to upgrade. This edge case was not found during the testing phase or during the upgrades of the other production clusters.

What measures will we take to prevent this issue in the future?

  1. Improve our testing environment to cover more edge cases of production infrastructure configurations, so that we can mitigate them earlier, during the testing phase.
  2. Expedite the rollout of our enhanced deployment tool to ensure configuration consistency across the testing cluster and all production clusters.
  3. Increase the coverage of our automated checks to detect infrastructure configuration errors and unhealthy deployments before rolling out to production.

We understand that you count on our reliability for the smooth operation of your business. We sincerely regret any inconvenience caused to you and your customers. We are committed to doing better by applying what we learned from this event to continuously improve our services.

If you require any assistance or have further questions, please contact us at help@xendit.co or through live chat at https://www.xendit.co/.

Thank you for your trust in using Xendit to power your business.

Posted Sep 13, 2023 - 16:06 WIB

Resolved
This incident has been resolved. The affected APIs were the Payment APIs, Credit Card APIs, Payout APIs, and XenPlatform's Transfer API. It is safe for customers to retry affected requests.
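
For customers who want to automate retries of requests that failed with a 500/503 response, the following is a minimal Python sketch and not an official Xendit client. The function name `retry_request`, its parameters, and the `send` callable are hypothetical names chosen for illustration. `send` is assumed to perform one API request and return a `(status_code, body)` pair. Where an endpoint supports an idempotency key, `send` should reuse the same key on every attempt so that a retry cannot create a duplicate transaction.

```python
import random
import time
from typing import Callable, Tuple

# The two status codes reported during this incident.
RETRYABLE_STATUSES = {500, 503}


def retry_request(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical caller-supplied function that performs one
    request and returns (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        # Exponential backoff with full jitter: wait a random time between
        # zero and a ceiling that doubles each attempt, capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
        status, body = send()
    return status, body


if __name__ == "__main__":
    # Simulated responses stand in for a real API call.
    simulated = iter([(503, "unavailable"), (500, "error"), (200, "ok")])
    print(retry_request(lambda: next(simulated), sleep=lambda _: None))
```

The jittered backoff spreads retries out over time, which avoids many clients retrying at the same instant while a service is still recovering.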

All other APIs were not impacted by this incident and are operating normally.

We apologize for the inconvenience caused and will share further updates in a post-mortem.
Posted Sep 12, 2023 - 07:30 WIB
Monitoring
We fixed the root cause of the network issue and are monitoring traffic.
Posted Sep 12, 2023 - 07:26 WIB
Update
We are continuing to investigate this issue.
Posted Sep 12, 2023 - 07:15 WIB
Update
We are continuing to investigate this issue.
Posted Sep 12, 2023 - 07:13 WIB
Update
We are continuing to investigate this issue.
Posted Sep 12, 2023 - 07:07 WIB
Investigating
Dear Valued Customer,

We have noticed degraded performance in a few of our APIs. Several APIs, such as the Payment Request APIs, Payout APIs, and Credit Card APIs, are affected.
Impact: During the disruption, your API requests may receive a 500/503 HTTP response code.

We're currently investigating the issue and its full impact.
Posted Sep 12, 2023 - 06:55 WIB
This incident affected: API (Cards, Payouts, xenDisburse, Invoice, Payment APIs).