Route optimization delays
Incident Report for Onfleet
Postmortem

Overview

From around July 6, 2021 to July 13, 2021, Route Optimization within Onfleet was slower than normal. Users were most affected between July 8, 2021 and July 12, 2021, experiencing times from submission to completion that were, on average, 3-5x slower than expected. Ultimately, there were two root causes for this incident, which delayed resolution.

Root cause 1: Upstream provider

Between around July 8, 2021 and July 9, 2021, our upstream provider of related geospatial services experienced latency problems related to infrastructure changes. These issues were resolved around early evening PDT on the 9th and were responsible for much of the slowness experienced during this period.

Root cause 2: Bug in code related to polling for completion

As noted above, however, users experienced slowness both before and after our provider's issues. Ultimately, this slowness was caused by a bug in how we check for the completion of route optimization submissions. From the perspective of our backend, we receive a request for route optimization, validate it, map it into one or more requests with various tuned parameters, and send a problem specification to a route optimization engine. We then poll for the completion of the related request(s), process the response, and apply the relevant mutations and operations as needed.
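
For illustration, the sketch below shows roughly what this submit-then-poll lifecycle looks like. It is a minimal sketch in Python; the function names (submit_problem, check_status, apply_result) and parameters are placeholders for illustration, not our actual internal API.

    import time

    # Placeholder stubs standing in for calls to the route optimization engine;
    # these are illustrative only, not our actual internal API.
    def submit_problem(spec):
        """Send a problem specification to the optimization engine."""
        return {"request_id": "req-123"}

    def check_status(request_id):
        """Ask the engine whether a submitted request has completed."""
        return {"state": "done", "solution": {"routes": []}}

    def apply_result(solution):
        """Apply the resulting assignments (the relevant mutations) to our data."""
        print("applying", solution)

    def optimize(spec, poll_interval=5.0, timeout=600.0):
        request_id = submit_problem(spec)["request_id"]
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = check_status(request_id)
            if status["state"] == "done":
                apply_result(status["solution"])
                return status["solution"]
            time.sleep(poll_interval)
        raise TimeoutError("optimization request did not complete in time")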

We poll in a distributed fashion from many instances of the relevant backend service. To ensure that we process each result only once, we have a protective barrier that prevents multiple instances from entering the critical path of this processing at the same time. In rare cases, multiple instances can attempt to enter this region simultaneously. A bug in this inner code caused an instance to stop polling when a specific timing of this sequence occurred. While this happened in only ~0.01% of cases, given the volume of requests we receive, it led to more than half of these instances not polling, which exacerbated the issue above.
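
To make the failure mode concrete, here is a minimal, hypothetical sketch in Python. A process-local lock stands in for the distributed protective barrier, and the function names are placeholders; the actual code and the exact bug were different, but the shape of the problem was similar: losing the race for the barrier at exactly the wrong moment meant the next poll was never scheduled.

    import threading

    barrier = threading.Lock()  # stands in for the distributed protective barrier

    def check_status(request_id):
        # Stub for a call to the optimization engine; illustrative only.
        return {"state": "pending"}

    def poll_once_buggy(request_id, schedule_next_poll):
        # Bug pattern: if another instance holds the barrier at just the wrong
        # moment, we return early and the next poll is never scheduled, so this
        # request silently stops being polled.
        if not barrier.acquire(blocking=False):
            return  # <-- no next poll scheduled
        try:
            if check_status(request_id)["state"] != "done":
                schedule_next_poll(request_id)
        finally:
            barrier.release()

    def poll_once_fixed(request_id, schedule_next_poll):
        # Fix pattern: decide whether to keep polling regardless of whether we
        # won the barrier, so losing the race can never halt polling.
        finished = False
        if barrier.acquire(blocking=False):
            try:
                finished = check_status(request_id)["state"] == "done"
            finally:
                barrier.release()
        if not finished:
            schedule_next_poll(request_id)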

We identified the above issue on July 13, 2021, validated it, and released the fix within ~2 hours.

Looking forward

Given the nature of the bug within our service, we have added more metrics and logging so that this kind of issue can be detected automatically and so that we can better monitor these interactions. We apologize to affected users. We have a significant amount of monitoring in place, but in this instance there was a gap that made the issue more difficult for us to detect. We will continue to treat monitoring, along with the availability and responsiveness of our product, as critical and essential.
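
As one example of the kind of check we mean, a simple liveness signal can catch a poller that has silently stopped. The sketch below is illustrative only and does not describe our actual monitoring stack: each completed poll cycle records a heartbeat, and a health check fails if no heartbeat has been seen for too long.

    import time

    # Illustrative liveness check, not our actual monitoring stack: record a
    # heartbeat whenever a poll cycle completes, and flag the pollers as
    # unhealthy if none has been seen for longer than expected.
    _last_poll_completed_at = time.monotonic()

    def record_poll_heartbeat():
        global _last_poll_completed_at
        _last_poll_completed_at = time.monotonic()

    def pollers_look_healthy(max_silence_seconds=120.0):
        return time.monotonic() - _last_poll_completed_at < max_silence_seconds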

Posted Jul 15, 2021 - 14:10 PDT

Resolved
All route optimization requests are now completing within reasonable durations. We will produce a more detailed report of what happened over the coming days.
Posted Jul 13, 2021 - 20:16 PDT
Investigating
We are working with our provider to understand why route optimization solutions are taking longer than usual to be computed.
Posted Jul 12, 2021 - 10:51 PDT
This incident affected: Route Optimization.