Incident summary
During an internal process that archives data, we noticed disk usage beginning to increase and decided to upgrade the volume proactively. Due to AWS's internal volume-optimization process, the upgrade degraded system performance, which later led to the incident. We promoted a replica database to primary, and service was restored at 11:45am PST.
9:30am PST - we started an internal process that archives data
10:30am PST - internal monitoring systems alerted on rapidly increasing disk usage
10:35am PST - the volume attached to the database servers was upgraded
This change resulted in degraded database performance: due to AWS's internal volume-optimization process, the upgraded volume ran slower than before, and the incident began at 10:42am PST.
Customers hosted on shared instances were unable to use the system from 10:42am PST to 11:45am PST.
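The kind of disk-usage alert that fired at 10:30am can be sketched as a simple threshold check. This is a minimal illustration in Python; the threshold, mount point, and function names are assumptions, not ShipHawk's actual monitoring configuration:

```python
import shutil

# Illustrative threshold; the real alerting rule is not described in this report.
ALERT_THRESHOLD_PCT = 80.0

def disk_usage_pct(path="/"):
    """Return the percentage of disk space used at `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def should_alert(path="/", threshold=ALERT_THRESHOLD_PCT):
    """True when disk usage at `path` has crossed the alert threshold."""
    return disk_usage_pct(path) >= threshold
```

A real monitoring system would also alert on the rate of increase, not just the absolute level, since a fast-growing volume (as in this incident) can cross from healthy to full between polling intervals.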
Affected services:
The incident was detected by the automated monitoring system and was also reported by multiple customers.
After receiving the alerts from the monitoring system, the engineering team briefed ShipHawk Customer Success on the level of impact. An incident notification was posted to https://status.shiphawk.com/
Three steps were performed to recover the service:
1. The degraded primary database node was disabled.
2. The replica was promoted to primary.
3. DNS records were updated so the old primary hostname pointed to the new primary node.
All times are in PST.
10/15/2021:
10:00am - an internal process that archives data started
10:30am - internal monitoring systems alerted on rapidly increasing disk usage
10:35am - the volume attached to the primary database node was upgraded
10:42am - the database performance degraded
10:43am - the monitoring system alerted on multiple errors and API unresponsiveness
10:50am - the engineering team began an investigation of the incident
11:20am - the root cause was understood and the team created an action plan
11:30am - primary node was disabled and the replica was promoted to a primary
11:40am - DNS records were updated to point the old primary node's hostname at the new primary node
11:45am - the service was fully restored
1:30pm - a new database replica was created and the sync process started
10/16/2021:
2:30pm - the new database replica sync process finished
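The recovery sequence in the timeline above (disable the degraded primary, promote the replica, repoint DNS) can be sketched as an ordered runbook. The operations are injected as callables here because the actual database and DNS tooling is not described in this report; in practice they would wrap the database's promotion command and the DNS provider's API:

```python
def run_failover(disable_primary, promote_replica, update_dns):
    """Execute the three recovery steps in order.

    Each argument is a callable supplied by the operator; all three
    operations are hypothetical stand-ins for real infrastructure tooling.
    """
    disable_primary()   # 11:30am - stop writes to the degraded primary
    promote_replica()   # 11:30am - promote the replica to primary
    update_dns()        # 11:40am - point the old hostname at the new primary

# Usage with stand-in operations that record execution order:
steps = []
run_failover(
    disable_primary=lambda: steps.append("disable"),
    promote_replica=lambda: steps.append("promote"),
    update_dns=lambda: steps.append("dns"),
)
# steps is now ["disable", "promote", "dns"]
```

Keeping the DNS repoint as the final step means clients only see the new primary once it is actually serving traffic.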
A difference between the configurations of the test and production systems allowed an inefficiency in the data-archiving process to go undetected before it ran in production.
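One way to reduce this class of gap is an automated configuration parity check between environments, run before a risky process is promoted to production. A minimal sketch follows; the setting names and values are invented for illustration:

```python
def config_diff(test_cfg, prod_cfg):
    """Return the settings that differ (or are missing) between two environments.

    Each differing key maps to a (test_value, prod_value) pair, with None
    standing in for a setting that is absent in one environment.
    """
    keys = set(test_cfg) | set(prod_cfg)
    return {
        k: (test_cfg.get(k), prod_cfg.get(k))
        for k in keys
        if test_cfg.get(k) != prod_cfg.get(k)
    }

# Hypothetical example: the archiving batch size differs between environments,
# so the archive process behaves very differently under production data volumes.
test = {"archive_batch_size": 100, "db_pool_size": 10}
prod = {"archive_batch_size": 100000, "db_pool_size": 10}
# config_diff(test, prod) -> {"archive_batch_size": (100, 100000)}
```

A non-empty diff would not block a deploy by itself, but surfacing it makes divergence a deliberate choice rather than a silent one.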