Ambra Incident

Incident Report for Ambra status

Postmortem

Due to increasing resource utilization, we increased the memory of our caching/queueing component during a planned maintenance window on February 18. During this maintenance we also changed the CPU class for increased consistency with our other systems, including our UAT environment. The modified system initially performed normally, but as platform traffic increased on February 19 we began experiencing increased operation latency, which led to degraded API performance. Modifying these components requires complete platform downtime to ensure consistent queue processing, so we first attempted to increase the number of front-end servers in order to reduce load on the backend systems. Performance improved temporarily but began degrading again as the platform reached its peak time of day. We then took an emergency platform outage in order to revert to the original CPU class, at which point performance returned to normal levels.

Posted Mar 06, 2024 - 12:33 EST

Resolved

The incident has been fully resolved and service is back to normal levels. Our team will be conducting a root cause analysis and sharing the results as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

Posted Feb 19, 2024 - 18:01 EST

Update

We believe that the incident is resolved. Users can log in. The user interface is back to normal performance. However, we have a services backlog that we are working through.

Our team will be conducting a root cause analysis and sharing the results as soon as possible. We will continue to monitor the situation to ensure there are no further issues.

Posted Feb 19, 2024 - 14:50 EST

Monitoring

The Ambra Emergency Maintenance is complete. Ambra is available and users can log in and navigate. We are continuing to investigate and will provide further updates as soon as possible.

Posted Feb 19, 2024 - 14:38 EST

Update

We are restarting the Ambra instance for emergency maintenance at 2:10 pm ET. Ambra will be down for approximately 30 minutes. More information will be posted as soon as it is available.

Posted Feb 19, 2024 - 14:04 EST

Update

Our engineering teams are focused on identifying the root cause of the incident and are dedicating all available resources to the investigation. We are working to resolve the issue and will provide updates as soon as we have more information.

Posted Feb 19, 2024 - 13:51 EST

Update

Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.

Posted Feb 19, 2024 - 13:00 EST

Investigating

Our engineering teams are focused on identifying the root cause of the incident and are dedicating all available resources to the investigation. We have also added 12 additional interactive nodes to lessen the impact. We are working to resolve the issue and will provide updates as soon as we have more information.

Posted Feb 19, 2024 - 12:32 EST

Update

The additional interactive nodes have been added. We are still investigating the root cause and next steps. Additional information will be provided as soon as it is available.

Posted Feb 19, 2024 - 11:45 EST

Identified

The issue has been isolated to the interactive service nodes. We are currently working to provision additional interactive service nodes to resolve the issue.

Posted Feb 19, 2024 - 11:07 EST

Update

Our engineering teams have not yet identified the root cause. We are continuing to investigate and will provide further updates as soon as possible.

Posted Feb 19, 2024 - 10:37 EST

Investigating

We have received reports of issues and slowness on the Ambra platform. Engineering teams are currently investigating. Additional information will be provided as soon as it is available.

Posted Feb 19, 2024 - 09:57 EST

This incident affected: Web Services, Image Processing, and Image Viewing.