On 11/3/2022, users encountered difficulties accessing the website and other system functionality.
All times are in the PST time zone.
Date/Time | Activity |
---|---|
2022-11-02 22:00:00 | An Amazon Web Services system failure causes the primary database to be restarted. The DrChrono EHR platform becomes inaccessible. |
2022-11-02 22:15:00 | Amazon Web Services automatically restarts the affected server, but a metadata loss causes other servers to stop being able to serve production database traffic as well. |
2022-11-02 22:15:00 | An engineer on the DevOps team comes online after being notified of the issue by automated alerting systems and starts investigating. |
2022-11-02 22:25:00 | The root cause of the issue is identified and the engineer reroutes traffic to preserve system availability. The DrChrono EHR platform is now accessible, but running at reduced capacity due to the loss of extra servers. |
2022-11-02 23:30:00 | After the first attempts to recover the affected servers directly fail, the DevOps engineer initiates a backup and restore process for one of the affected servers to prepare capacity for the upcoming day's traffic. |
2022-11-03 01:30:00 | The rescue process does not succeed for the additional servers and it's determined that they will also need to be restored from backups. The process is initiated. |
2022-11-03 05:44:00 | A status page is created to notify customers about “General system slowness on the DrChrono platform”. |
2022-11-03 08:30:00 | One of the replica servers finalizes phase 1/3 of backup restore; the next steps are engaged immediately. |
2022-11-03 11:45:00 | One of the replica servers completes phase 3/3. Additional replica server restoration processes are initiated. |
2022-11-03 14:15:00 | The restoration process for the additional replica servers fails. |
2022-11-03 16:27:00 | The restoration of the additional replica servers is attempted again but also fails. |
2022-11-03 17:45:00 | A backup process from the primary DB is started. |
2022-11-03 20:55:00 | One additional replica server finishes process restoration. |
2022-11-03 21:15:00 | 3 more database replicas are spun up as a backup solution. |
2022-11-03 22:30:00 | An additional 3 database replicas are added. |
2022-11-04 00:05:00 | 6 new database replicas are installed and configured. |
2022-11-04 01:40:00 | The backup from the primary database is finalized. |
2022-11-04 01:45:00 | The primary backup snapshot with fast restore option is started. |
2022-11-04 02:10:00 | The standby server starts phase 1/3 of backup restore from the primary database. |
2022-11-04 04:00:00 | The restore process is completed. |
2022-11-04 07:00:00 | The status page is updated to resolved. |
2022-11-04 07:07:00 | One of the replica servers is removed. After reviewing the server's responses, the team finds it is not at full capacity and will need a complete restoration. |
2022-11-04 08:10:00 | The task and message count displays are disabled, increasing the response capacity of the third server and the primary. |
2022-11-04 08:42:00 | A status page is created to notify customers that “Task and message counts are temporarily unavailable”. |
2022-11-04 08:42:00 | A replica server finalizes phase 2/3 of backup restore; the final phase is engaged. |
2022-11-05 17:00:00 | The replica restoration process and the shutdown of other servers are completed. All DB servers are monitored over the following hours. |
2022-11-06 09:50:00 | The system is announced as operational and back at full capacity. |
2022-11-07 06:33:00 | The status page “Task and message counts are temporarily unavailable” is resolved. |
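
For readers who want the elapsed times between milestones, the arithmetic can be reproduced with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline, while the milestone names and the `elapsed` helper are labels made up for this example. The timeline states a single time zone, so naive datetimes are assumed to be sufficient.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline; the dictionary keys are
# hypothetical labels chosen for this example.
MILESTONES = {
    "outage_start": "2022-11-02 22:00:00",
    "traffic_rerouted": "2022-11-02 22:25:00",
    "first_status_page": "2022-11-03 05:44:00",
    "full_capacity_announced": "2022-11-06 09:50:00",
}


def elapsed(start_key, end_key):
    """Return the timedelta between two named milestones."""
    start = datetime.strptime(MILESTONES[start_key], FMT)
    end = datetime.strptime(MILESTONES[end_key], FMT)
    return end - start


if __name__ == "__main__":
    for key in ("traffic_rerouted", "first_status_page", "full_capacity_announced"):
        print(f"outage_start -> {key}: {elapsed('outage_start', key)}")
```

By this calculation, the platform was inaccessible for 25 minutes, the first status page followed 7 hours 44 minutes after the outage began, and full capacity was announced 3 days 11 hours 50 minutes after it began.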
Sentry alerts contain the information needed to identify the current issue.
A configuration change was made in the database to match the previous behavior.
Users experienced overall system slowness affecting major functionality, including errors when copying notes, locking notes, and saving information.
The Ops team has fixed the issue in the database table and will gradually roll out updates and connections to the database. Documenting the exact process for database configuration changes, along with mitigation and fallback procedures, will help prevent a similar situation from happening again.
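
The report does not describe how the gradual rollout is implemented. As one possible illustration, a staged ramp with a fallback could look like the sketch below. Everything here is an assumption made for the example: the ramp steps, the `next_share` and `pick_server` helpers, and the server labels are hypothetical and are not taken from DrChrono's tooling.

```python
import random

# Hypothetical ramp: fraction of connections sent to a newly restored server.
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]


def next_share(current, healthy):
    """Advance to the next ramp step if the server is healthy.

    If the health check failed, fall back to sending it no traffic.
    """
    if not healthy:
        return 0.0
    for step in RAMP_STEPS:
        if step > current:
            return step
    return RAMP_STEPS[-1]


def pick_server(share, rng=random):
    """Route one connection: `share` of traffic goes to the restored server."""
    return "restored_replica" if rng.random() < share else "existing_pool"


if __name__ == "__main__":
    share = 0.0
    # Assumed health-check results for successive ramp stages.
    for healthy in (True, True, False, True):
        share = next_share(share, healthy)
        print(f"healthy={healthy} -> share={share:.2f}")
```

The point of the design is that each increase in traffic is gated on a health check, and a failed check drops the share to zero, which serves as the documented fallback step.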