Experiencing high error rates in API/RCD/Search
Incident Report for Reflektion, Inc
Postmortem

Timeline (in PDT)

15:17 - Internal alerts fired for increased response times and error rates for search/recommendation/API services

15:17 - Incident created on Status Page to notify subscribers about an issue under investigation

15:20 - One of the underlying databases used by the online services was experiencing increased latency. Engineering decided to shift endpoints to a backup database to mitigate the issue temporarily

15:25 - An unusual burst of an atypical request was detected from a single IP address; the burst triggered a chain of slow queries to the database

15:30 - The IP address was blocked from accessing Sitecore API services. Engineering restored the database endpoints to their original state

15:35 - Response times and error rates subsided and all services returned to normal

15:35 - Status Page incident was marked resolved

Root Cause

The issue was a combination of two factors: a sudden burst of traffic from a single IP address, and an unusual request that triggered slow queries to the database backing the search/recommendation/API services

Resolution and Next Steps

  • Add defensive checks and sanitization to prevent triggering of slow queries to the database
  • Incorporate anomaly detection and block burst traffic to Sitecore services from a single IP address
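
As an illustration of the second item, here is a minimal sketch of per-IP burst detection using a sliding window. It is not the detection logic used in production; the class name `BurstDetector`, its parameters, and the thresholds are all hypothetical, and a real deployment would more likely sit at the load balancer or API gateway.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Hypothetical sketch: block an IP that exceeds max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()             # IPs blocked after a detected burst

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` and return whether to serve it."""
        if ip in self.blocked:
            return False
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Burst detected: block the IP until it is manually reviewed.
            self.blocked.add(ip)
            return False
        return True
```

Each request would call `allow(ip, time.monotonic())` before reaching the database-backed handler. Blocking outright mirrors the manual mitigation in the timeline; a gentler variant could throttle instead of block.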
Posted Mar 24, 2022 - 14:33 PDT

Resolved
This incident has been resolved.
Posted Mar 23, 2022 - 15:35 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 23, 2022 - 15:34 PDT
Investigating
We are currently experiencing high error rates in API/RCD/Search, starting at 2:58 PDT. We are looking into it.
Posted Mar 23, 2022 - 15:17 PDT
This incident affected: Production Recommendations, Production Search, and Production API.