On May 21st and May 24th, 2021, the EHR web servers experienced an increase in errors and queue times, impacting site and application performance.
Date/Time | Activity |
---|---|
2021-05-21 13:22 | Sporadic, nonspecific reports of general site slowness received |
2021-05-21 13:41 | Third-party CDN reports connectivity issues in some regions |
2021-05-21 13:58 | Slowness issues identified as likely caused by a queue backup |
2021-05-21 14:12 | Queue backup resolved |
2021-05-21 14:15 | Issue appears to be resolved; Customer Escalation unable to reproduce previous issues |
2021-05-21 15:35 | Error rates are at normal levels |
2021-05-21 16:35 | No new tickets or reports of site slowness |
2021-05-21 16:36 | Site issues resolved; status page updated |
2021-05-24 11:10 | First reports of an increase in tickets for site slowness. |
2021-05-24 11:12 | Ops begins investigating possible causes. |
2021-05-24 11:20 | Queue times identified as being higher than normal. |
2021-05-24 11:30 | High load on the primary databases is identified; a running report/import is suspected of being the issue. |
2021-05-24 11:45 | Users are still experiencing site slowness; queue times are still high. |
2021-05-24 11:47 | Status message posted for “Site Slowness”. |
2021-05-24 11:52 | Import jobs have increased load on work queues but are processing. The increased load on the work queues is thought to be contributing to perceived slowness for some operations. |
2021-05-24 12:10 | Causes for increased load against the primary databases are identified and remediation begins. |
2021-05-24 13:30 | Remediation for load on the database is complete, but site slowness persists. |
2021-05-24 13:35 | A separate cause for an increase in errors for the web servers is identified. |
2021-05-24 13:47 | Status page updated to “Identified”. |
2021-05-24 14:00 | Issue causing the increase in errors is traced to a code commit from 2021-05-20. |
2021-05-24 14:10 | A hotfix to mitigate the issue is prepared. |
2021-05-24 14:30 | Hotfix deployment begins. |
2021-05-24 15:15 | Deployment is complete; web queue times are back at baseline, errors have ceased. |
2021-05-24 15:18 | Status page updated to “Operational”. |
2021-05-24 16:07 | Incident declared resolved, status page updated. |
A code change made during a planned production deployment on May 20th, 2021 led to an increase in errors during peak traffic hours on May 24th, 2021. A corresponding increase in database load was incorrectly identified as a contributing factor, leading to increased time to resolution.
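
To put the added time to resolution in rough numbers, the May 24th timeline entries can be turned into durations. The snippet below is only a sketch: the timestamps are copied from the timeline above, but the dictionary keys and the `minutes_between` helper are hypothetical names chosen for illustration, and treating 11:30 to 13:30 as the window spent on the database load is one reading of the timeline rather than a stated fact.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the May 24th timeline entries.
# The key names are illustrative labels, not terms from the report.
timeline = {
    "first_reports": "2021-05-24 11:10",
    "db_load_suspected": "2021-05-24 11:30",
    "db_remediation_done": "2021-05-24 13:30",
    "error_cause_identified": "2021-05-24 13:35",
    "hotfix_deployed": "2021-05-24 15:15",
    "resolved": "2021-05-24 16:07",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return whole minutes elapsed between two timeline entries."""
    start = datetime.strptime(timeline[start_key], FMT)
    end = datetime.strptime(timeline[end_key], FMT)
    return int((end - start).total_seconds() // 60)


# Assumed reading: time between suspecting and finishing work on database load.
db_window = minutes_between("db_load_suspected", "db_remediation_done")
# Time from identifying the web server error cause to the completed hotfix deployment.
fix_window = minutes_between("error_cause_identified", "hotfix_deployed")
# First reports to the incident being declared resolved.
total = minutes_between("first_reports", "resolved")

print(f"database window: {db_window} min")
print(f"identify-to-fix window: {fix_window} min")
print(f"total incident: {total} min")
```

Under these assumptions, the database work accounts for 120 of the 297 minutes between the first reports and resolution, while the path from identifying the error cause to a completed hotfix deployment took 100 minutes.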
Most users would have experienced slower page loads and an increase in errors while using the EHR application.
Once the cause was identified, a hotfix was created and deployed to our production environment to mitigate the errors introduced by the code change.