At 14:53 UTC on July 18, 2021, we observed elevated latency and error rates in our History, Push Device Registration, and Channel Groups services. We brought new capacity online to relieve what appeared to be malfunctioning instances in our History service. The issue was resolved at 15:41 UTC, once the new instances were in service. Unfortunately, due to a process error while the malfunctioning instances were being taken out of service, the nodes were terminated before we could investigate them. Their state was lost, and the root cause of the malfunction remains unknown.
To prevent a similar issue in the future, we are updating our processes to ensure that malfunctioning nodes are taken offline in a way that preserves their state for analysis. The replacement systems have been operating normally for over two days, and the system is stable. Our team continues to monitor it closely.