Site Slowness
Incident Report for DrChrono
Postmortem

Summary

On May 21st and May 24th, 2021, the EHR web servers experienced an increase in errors and queue times, impacting site and application performance.

Timeline (EDT, 24-hour clock)

Date/Time Activity
2021-05-21 13:22 Sporadic, nonspecific reports of general site slowness received
2021-05-21 13:41 Third-party CDN reports connectivity issues in some regions
2021-05-21 13:58 Slowness issues identified as likely caused by a queue backup
2021-05-21 14:12 Queue backup resolved
2021-05-21 14:15 Issue appears to be resolved; Customer Escalation unable to reproduce previous issues
2021-05-21 15:35 Error rates are at normal levels
2021-05-21 16:35 No new tickets or reports of site slowness
2021-05-21 16:36 Site issues resolved; status page updated
2021-05-24 11:10 First reports of increased ticket volume for site slowness.
2021-05-24 11:12 Ops begins investigation for possible issues.
2021-05-24 11:20 Queue times identified as being higher than normal.
2021-05-24 11:30 High load on the primary databases is identified; a running report/import is suspected of being the issue.
2021-05-24 11:45 Users are still experiencing site slowness; queue times are still high.
2021-05-24 11:47 Status message posted for “Site Slowness”.
2021-05-24 11:52 Import jobs have increased load on work queues but are processing. The increased load on the work queues is thought to be contributing to perceived slowness for some operations.
2021-05-24 12:10 Causes for increased load against the primary databases are identified and remediation begins.
2021-05-24 13:30 Remediation for load on the database is complete, but site slowness persists.
2021-05-24 13:35 A separate cause for an increase in errors for the web servers is identified.
2021-05-24 13:47 Status page updated to “Identified”.
2021-05-24 14:00 Issue causing the increase in errors is traced to a code commit from 2021-05-20.
2021-05-24 14:10 A hotfix to mitigate the issue is prepared.
2021-05-24 14:30 Hotfix deployment begins.
2021-05-24 15:15 Deployment is complete; web queue times are back at baseline, errors have ceased.
2021-05-24 15:18 Status page updated to “Operational”.
2021-05-24 16:07 Incident declared resolved; status page updated.

Contributing Factors

A code change made during a planned production deployment on May 20th, 2021 led to an increase in errors during peak traffic hours on May 24th, 2021. An increase in database load that occurred at the same time was incorrectly identified as a contributing factor, which lengthened the time to resolution.

Impact

Most users likely experienced increased site slowness and errors while using the EHR application.

Corrective Actions

Once the faulty code change was identified, a hotfix was created and deployed to our production environment to mitigate the errors it caused.

Posted Jun 28, 2021 - 06:20 PDT

Resolved
This incident has been resolved.
Posted May 21, 2021 - 13:36 PDT
Investigating
We are currently investigating reports of sitewide slowness and trouble saving data that appear to have begun at approximately 10:15 PDT. We will provide an update with additional information as soon as possible.
Posted May 21, 2021 - 11:30 PDT
This incident affected: drchrono.com, drchrono iPad EHR, and drchrono iPad Check-In Kiosk Application.