High latency for API and dashboard

Incident Report for Onfleet

Postmortem

On 2022-03-23 at 12:08 PDT, Onfleet engineers noticed abnormal process-related alerting and a rapid increase in API response times. Several IP addresses were observed accessing the API at rates significantly exceeding their usual request volume. These spikes in access appeared to be a factor in the abnormal load patterns, along with other system parameters that were out of range. At 12:59 PDT, a change was inadvertently introduced that caused the dashboard to become inaccessible; this lasted 18 minutes before the change was reverted. Other changes were made to address the abnormal load. By 15:18 PDT, response times had returned to normal and metrics had stabilized. A set of changes was subsequently deployed to reduce the load on the affected instances and to return them to normal operating parameters.

Onfleet routinely manages access to its systems to control harmful or abnormal traffic patterns when necessary. During this incident, several IPs associated with what appeared to be abnormal traffic patterns were blocked. Onfleet does not seek to block non-malicious activity unless absolutely necessary, and apologizes for any impact that these actions may have caused. If you have questions about how to improve your integration and its access patterns, Onfleet can help.

Onfleet sincerely apologizes for the interruption during this incident. While the incident was addressed and the service was accessible during this period, there are several ways in which the response could have been improved. The Onfleet devops group is updating how it evaluates and responds to abnormal traffic patterns. Onfleet will introduce better tooling to reduce the chance of human error leading to operations with adverse effects. Monitoring will also be put in place to surface leading indicators such as abnormal load patterns.
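
As an illustration of the kind of leading indicator described above, the following is a minimal sketch of flagging IP addresses whose request volume in a time window far exceeds their usual volume. It is not Onfleet's actual tooling: the function name `flag_abnormal_ips`, the `ratio` and `min_requests` thresholds, and the sample counts (using documentation-only IP ranges) are all assumptions made for the example.

```python
from collections import Counter


def flag_abnormal_ips(current_counts, baseline_counts, ratio=5.0, min_requests=100):
    """Return (ip, current, baseline) tuples for IPs far above their usual volume.

    current_counts:  requests per IP in the window being examined.
    baseline_counts: typical requests per IP for a window of the same length.
    ratio:           hypothetical multiplier over baseline that counts as abnormal.
    min_requests:    hypothetical floor so that low-volume IPs are never flagged.
    """
    flagged = []
    for ip, count in current_counts.items():
        if count < min_requests:
            continue  # too little traffic to matter, whatever the ratio
        baseline = baseline_counts.get(ip, 0)
        # Treat a missing or zero baseline as 1 so new IPs can still be flagged.
        if count >= ratio * max(baseline, 1):
            flagged.append((ip, count, baseline))
    # Highest-volume offenders first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged


# Made-up sample data for illustration only.
current = Counter({"203.0.113.7": 4200, "198.51.100.4": 310, "192.0.2.15": 40})
baseline = {"203.0.113.7": 300, "198.51.100.4": 290, "192.0.2.15": 35}

for ip, count, usual in flag_abnormal_ips(current, baseline):
    print(f"{ip}: {count} requests in window (usual: {usual})")
```

A check like this only surfaces candidates for review. Whether a flagged address is actually blocked would still depend on the evaluation process described above.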

Posted Mar 29, 2022 - 14:57 PDT

Resolved

This incident has been resolved.

Posted Mar 23, 2022 - 16:29 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 23, 2022 - 15:31 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Mar 23, 2022 - 12:39 PDT

Investigating

We are currently investigating this issue.

Posted Mar 23, 2022 - 12:21 PDT

This incident affected: Dashboard and API.