Package Repository Outage
Incident Report for CloudRepo
Resolved
Customer Impact: Access to our storage APIs (publishing/reading packages) was returning 500 errors for some partners.

Root Cause: Our servers exhausted their connections to the storage layer and our monitoring system did not alert us to this degraded state - a partner alerted us instead.

Resolution: After we were alerted of this issue we were able to restore functionality to all partners.

Duration: Approximately two hours

Future Mitigation: To prevent this from happening again, we will be implementing several changes:

1) Improve our monitoring to detect 500 errors as soon as they occur.
2) Increased the size of our cluster to give us more headroom in our connection pools.
3) Continue to investigate root cause and fix anything that may be holding on to connections.
Posted May 09, 2018 - 12:00 CDT