Overview:
At 07:11 PDT on Sep 29, geocoding requests to our upstream provider began to be rate limited after a customer testing their integration generated an excessive number of requests. Due to a bug in our code, these requests were allowed to continue for several hours before they triggered rate limiting. We deployed a fix at 09:05 PDT, and processing returned to normal.
What Happened:
Around 00:40 PDT, a customer began testing their API integration. Their account had an incorrect billing status, and a bug in the code that verifies billing authorization failed to fully reject the resulting activity, so their testing created invalid tasks. Each task creation attempt triggered a geocoding request to our upstream provider. After several hours of this sustained request pattern, the activity exceeded our internal rate limits for the upstream geocoding provider at 07:11 PDT. These limits had not been updated recently and were, unfortunately, considerably lower than the levels in our contract with this provider.
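To illustrate the kind of guard that failed, here is a minimal sketch of a billing-authorization check performed before task creation. All names here (Account, BillingStatus, authorize_task_creation) are hypothetical and do not reflect Onfleet's actual code.

```python
from dataclasses import dataclass
from enum import Enum

class BillingStatus(Enum):
    ACTIVE = "active"
    DELINQUENT = "delinquent"
    UNVERIFIED = "unverified"

@dataclass
class Account:
    id: str
    billing_status: BillingStatus

class BillingError(Exception):
    """Raised when an account is not authorized to create tasks."""

def authorize_task_creation(account: Account) -> None:
    # A bug of the kind described above amounts to a check like this not
    # covering every non-ACTIVE status, so unauthorized task creation
    # requests slip through, and each one triggers an upstream geocoding
    # call before any limit is hit.
    if account.billing_status is not BillingStatus.ACTIVE:
        raise BillingError(
            f"account {account.id} is not authorized to create tasks"
        )
```

The important property is that the check rejects every billing state except the explicitly authorized one, rather than enumerating known-bad states.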
Under normal circumstances, automatic monitors in our systems would have detected this customer's activity and allowed us to mitigate the issue before it became critical. However, due to a bug in our monitoring, these requests were not tracked correctly and the expected alerts never fired.
What we have implemented and will do in the future:
We have adjusted the external geocoding rate limits to match our current contracted capacity. We are now testing fixes for the underlying bugs, with deployment planned in the coming days. We are also reviewing all of our geocoding monitoring to confirm that alerts are set on the appropriate data and conditions. We apologize for this geocoding interruption and will continue to enhance our monitoring and service configuration to reduce the likelihood of edge cases such as this one occurring in the future.
As always, if you require further detail, please do not hesitate to email us at support@onfleet.com.