Issue with API Monitoring Tests and API
Incident Report for BlazeMeter
Postmortem

Incident details and timeline:

06/30/2020, 11:29 AM PDT to 12:00 PM PDT => three internal incidents were triggered by our SaaS monitoring for a drop in API traffic and test runs.

Problem symptoms:

  1. Observed “max number of clients reached” exceptions and health-check failures in our internal Calculon service
  2. Observed queue spikes that resulted in tests backing up in the queue
  3. Observed drops in transactions per second (TPS) for APIs initially, and later the same drop system-wide

Root cause of the problem:

  1. The drop in API TPS, followed by the “max number of clients reached” exceptions and the Calculon microservice health-check failures, let us quickly narrow the issue down to one of our Redis DBs, “Redis-calculon”. Because this Redis DB maxed out its client connections twice during the window, we suspected that one or more Redis commands issued to this server during that time were blocked for an extended period, causing the client connection spikes. Possible causes included, but were not limited to: a) a foreign script triggered outside of the application context that runs blocking commands; b) IOPS or other kernel, network, or hardware limitations on that node; c) connection contention at the application clients themselves. A sketch of the kind of diagnostic check involved appears after this list.
  2. To relieve the connection contention and the client-limit pressure on the impacted Redis instance, and to bring the Calculon microservice back to a healthy state, we failed over this Redis server, updated the application clients, and released all in-flight connections so that they pointed to the newly promoted instance (the new master). A sketch of the failover and re-attach sequence also appears after this list.
  3. Within a few seconds of the failover, the Runscope Calculon microservice became healthy, TPS for the API and other services returned to normal, and test execution picked back up to its normal velocity.
  4. We attached the impacted Redis instance (the old master) as a replica of the new master and enabled debug logging to triage further. On this instance we found that a housekeeping script, which would normally be triggered only on replica instances, had been triggered accidentally on this master at around 11:15 AM PDT on 06/30. We normally run these housekeeping scripts during dedicated maintenance schedules: carefully chosen time windows with low traffic and low impact to the systems.
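
For illustration, here is a minimal sketch of the kind of check described in item 1, written against the redis-py client. The host name is a placeholder and these are not the exact commands or tooling used during the incident.

    # Hypothetical diagnostic sketch, not the production tooling.
    import redis

    r = redis.Redis(host="redis-calculon.example.internal", port=6379)  # placeholder host

    # Compare current client connections against the configured ceiling
    # that produces the "max number of clients reached" error.
    connected = int(r.info("clients")["connected_clients"])
    maxclients = int(r.config_get("maxclients")["maxclients"])
    print(f"clients: {connected}/{maxclients}")

    # Inspect recent slowlog entries; a long-running or blocking command here
    # would explain client connections piling up behind it.
    for entry in r.slowlog_get(10):
        print(entry["id"], entry["duration"], entry["command"])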
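
Similarly, a rough sketch of the failover and re-attach sequence from items 2 and 4, again with placeholder host names. In practice this was done with our standard operational tooling, and the application clients also had to be repointed at the new master, which is not shown here.

    # Hypothetical failover sketch, not the production procedure.
    import redis

    old_master = redis.Redis(host="redis-calculon.example.internal", port=6379)          # impacted instance (placeholder)
    new_master = redis.Redis(host="redis-calculon-replica.example.internal", port=6379)  # promoted replica (placeholder)

    # Promote the healthy replica to a standalone master.
    new_master.execute_command("REPLICAOF", "NO", "ONE")

    # Re-attach the impacted old master as a replica of the new master so it
    # resyncs, and keep it around for further triage (slowlog, debug logging).
    old_master.execute_command("REPLICAOF", "redis-calculon-replica.example.internal", "6379")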

What we are doing to avoid recurrences in the future:

  1. Automate the instance-details update process whenever DB hosts are added to or removed from the production cluster, especially for Redis, which routinely involves failovers and removals.
  2. Revisit and re-approve instance access controls with more stringent reviews and approvals.
  3. Run housekeeping/profiling jobs only via Rundeck, with proper access controls and approvals.
Posted Jul 07, 2020 - 08:19 PDT

Resolved
We have confirmed that things are back to normal in API Monitoring. We are closing this incident.
Posted Jun 30, 2020 - 13:48 PDT
Monitoring
API Monitoring should be functioning normally now. We are monitoring to see that it is stable.
Posted Jun 30, 2020 - 12:25 PDT
Update
We have identified the issue and will provide an update shortly.
Posted Jun 30, 2020 - 12:12 PDT
Investigating
We are aware of an issue with API Monitoring functionality, including running tests, scheduled tests, and access via the API. We are investigating and will provide an update as soon as possible.
Posted Jun 30, 2020 - 12:00 PDT
This incident affected: API Monitoring (Runscope).