Description: IT Visibility - NA - Data Processing was Delayed
Timeframe: May 26th, 1:51 AM to June 9th, 5:56 AM PDT
Incident Summary:
There have been multiple recent occurrences of the IT Visibility Dashboard not refreshing on schedule, impacting customers' ability to retrieve the most recent data. On Thursday, May 26th, at 1:51 AM PDT, technical staff identified a recurrence of the IT Visibility data currency issue. This incident impacted US customers, whose data may have been up to 20 hours behind.
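For illustration, data currency of this kind is typically measured by comparing the newest ingested record against the current time. The following is a minimal sketch of such a check; the `events` table, `ingested_at` column, and database file are assumptions, as the report does not describe the schema:

```python
from datetime import datetime, timezone
import sqlite3  # stand-in for the production database driver

def data_lag_hours(conn) -> float:
    """Return how many hours the newest ingested record trails the present."""
    (latest,) = conn.execute("SELECT MAX(ingested_at) FROM events").fetchone()
    latest_ts = datetime.fromisoformat(latest).replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - latest_ts).total_seconds() / 3600

conn = sqlite3.connect("itv.db")  # illustrative database
lag = data_lag_hours(conn)
if lag > 20:  # the worst lag reported in this incident
    print(f"ALERT: dashboard data is {lag:.1f} hours behind")
```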
To alleviate the issue, technical staff deployed multiple optimizations to production. On May 27th, at 2:49 AM PDT, after successful testing in a lower environment, database services were re-allocated and additional instances were added to enable faster data retrieval.
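The report does not state how the database services are hosted. If they run on Kubernetes, adding instances could look like the following sketch using the official Python client; the deployment and namespace names are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run in-cluster
apps = client.AppsV1Api()

# Scale a hypothetical read-replica deployment up to speed data retrieval.
apps.patch_namespaced_deployment_scale(
    name="itv-db-readers",      # illustrative deployment name
    namespace="it-visibility",  # illustrative namespace
    body={"spec": {"replicas": 4}},
)
```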
On May 31st, after additional monitoring over the weekend, technical staff observed that backlog processing was still behind. As a short-term measure, staff disabled non-critical workloads on the database and engaged SMEs from other areas to assist with a long-term solution.
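One common way to shed non-critical load quickly is a kill switch that gates lower-priority jobs. A minimal sketch, assuming a hypothetical `ITV_NON_CRITICAL_JOBS` environment variable; the report does not say how the workload was actually disabled:

```python
import os
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    critical: bool

    def run(self):
        print(f"running {self.name}")

# Hypothetical kill switch: skip non-critical jobs while the backlog drains.
NON_CRITICAL_ENABLED = os.getenv("ITV_NON_CRITICAL_JOBS", "on") == "on"

def run_jobs(jobs):
    for job in jobs:
        if job.critical or NON_CRITICAL_ENABLED:
            job.run()
        else:
            print(f"skipping non-critical job {job.name} during recovery")
```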
Meanwhile, technical staff continued to test multiple optimizations in the lower environments. After internal discussions and further investigation, technical staff manually removed obsolete data to free up space in the database.
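Removing a large volume of rows in a single statement can hold locks for the duration and stall the very workload being rescued, so such cleanups are usually batched. A sketch, reusing the assumed `events` table from above (`rowid` is SQLite-specific; a production database would key on its primary key):

```python
def purge_obsolete(conn, cutoff_iso: str, batch_size: int = 10_000) -> int:
    """Delete rows older than the cutoff in small batches to limit lock time."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN ("
            " SELECT rowid FROM events WHERE ingested_at < ? LIMIT ?)",
            (cutoff_iso, batch_size),
        )
        conn.commit()  # release locks between batches
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total
```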
After overnight monitoring, on June 1st, technical staff observed a significant improvement in data processing speed and subsequently automated the cleanup of redundant data in the database. They also deployed additional instances for the remaining database services.
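Automating that cleanup typically means putting the purge on a schedule. A minimal sketch reusing `purge_obsolete` from the previous example; the 90-day retention window is an assumption, not a figure from the report:

```python
import time
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumed retention window

def cleanup_loop(conn):
    while True:
        cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
        removed = purge_obsolete(conn, cutoff.isoformat())
        print(f"purged {removed} obsolete rows")
        time.sleep(24 * 3600)  # once per day; a cron job would work equally well
```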
Over the next few days, staff continued to analyze, isolate, and eliminate the remaining factors contributing to the performance degradation, and observed continued improvement in data processing speed.
On June 9th, at 5:56 AM PDT, following additional health checks and monitoring, this incident was declared resolved.
Root Cause:
Data flow was impaired by insufficient memory allocation in the environment.
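Memory exhaustion of this kind generally shows up as sustained high utilization before throughput visibly drops, so a watchdog on memory pressure can catch it earlier. A simple sketch using the third-party psutil package; the threshold is illustrative:

```python
import psutil

MEMORY_ALERT_PCT = 85.0  # illustrative alerting threshold

def check_memory():
    mem = psutil.virtual_memory()
    if mem.percent >= MEMORY_ALERT_PCT:
        print(f"ALERT: memory at {mem.percent:.0f}% "
              f"({mem.available / 2**30:.1f} GiB available)")

check_memory()
```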
Contributing Cause:
A large amount of obsolete data had accumulated in the database, requiring manual cleanup to free space for incoming requests.
Corrective Action:
• Database services were re-allocated, and additional instances were added to enable faster data retrieval
• Redundant data was removed from the database and additional optimizations were deployed to enhance data processing
• As a long-term fix, technical staff will continue to work on a roadmap for Q3 to transition to a more viable solution