Corruption in the underlying storage system, linked to a single row in the database table where scoring errors are saved, caused delays in the asynchronous scoring queue. Before the correction, queries against this table could take anywhere from a fraction of a second to several seconds to complete, and the impact was compounded by an unusually high number of session scoring errors submitted at one time.
Because this issue surfaced only when scoring errors, which are rare, were logged, it was not immediately detected. Initially, the number and size of RDS instances were increased to work through the scoring backlog, but the queue cleared before this process could be completed. This showed us that delays occurred only when errors were persisted, which in turn led to the discovery and deletion of the problem table row.
Additional measures have been put in place to monitor write speeds for errors as well as session data, and the issue has not resurfaced since.
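
For readers who want a concrete picture of write-speed monitoring, the following is a minimal sketch only, not a description of the measures actually deployed. The class name `WriteLatencyMonitor`, the labels, and the 0.5-second threshold are all hypothetical, chosen for illustration; the idea is simply to time each write and record any that exceed a threshold.

```python
import time
from contextlib import contextmanager

# Hypothetical threshold: writes slower than this are recorded as slow.
SLOW_WRITE_THRESHOLD_SECONDS = 0.5


class WriteLatencyMonitor:
    """Times labelled write operations and records the slow ones."""

    def __init__(self, threshold_seconds=SLOW_WRITE_THRESHOLD_SECONDS,
                 clock=time.perf_counter):
        self.threshold_seconds = threshold_seconds
        self.clock = clock          # injectable so the timer can be tested
        self.slow_writes = []       # list of (label, elapsed_seconds)

    @contextmanager
    def timed(self, label):
        start = self.clock()
        try:
            yield
        finally:
            # Measure even if the write raised, so failures are timed too.
            elapsed = self.clock() - start
            if elapsed > self.threshold_seconds:
                self.slow_writes.append((label, elapsed))
```

A caller would wrap each database write, for example `with monitor.timed("scoring_error"): ...`, and forward the contents of `slow_writes` to whatever alerting system is in use.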