The iAdvize livechat solution experienced a major instability that degraded the user experience during the night of May 8 to May 9, 2022, between 9:10 pm and 8:20 am CET.
During this time, contact elements on the Chat, Call and Video channels could not be displayed because iAdvize resources failed to load. Incoming conversations on these channels were largely cut off.
Conversations from social channels (Facebook, Twitter, WhatsApp, Apple Messages for Business, SMS) and mobile applications (via SDK), as well as the conversation panel and the iAdvize administration, were not affected by this incident.
This incident is the consequence of a succession of two events:
To summarize, the fewer servers were up, the harder it was to start a new one via our auto-scaler.
Our technical team carried out several successive actions to mitigate and correct this incident as soon as possible.
We increased the number of servers needed for the proper functioning of the iAdvize livechat solution. This mitigated the problem for a few minutes but was not enough: every newly started server was instantly killed.
We also tried changing the type of server requested from our hosting provider in order to get more powerful machines. This did not improve the situation.
The solution was found by temporarily turning off our auto-scaler tool and manually starting a large number of servers in parallel.
By doing this, we were able to handle the incoming load without the risk of the auto-scaler interfering with servers that were still starting.
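To illustrate why starting servers one by one failed while a large parallel start succeeded, here is a minimal sketch in Python. It is a toy model, not a description of the iAdvize platform: it assumes the incoming load is split evenly across running servers and that a server is killed as soon as its share exceeds its capacity. The function names and all figures are hypothetical.

```python
def survivors(total_load, capacity_per_server, servers_started):
    """Return how many servers stay up when they all take traffic together.

    Toy assumption: load is split evenly, and a server whose share exceeds
    its capacity is killed immediately.
    """
    if servers_started == 0:
        return 0
    share = total_load / servers_started
    return servers_started if share <= capacity_per_server else 0


def progressive_start(total_load, capacity_per_server, attempts):
    """Add servers one at a time, as a progressive auto-scaler would."""
    running = 0
    for _ in range(attempts):
        # Each newcomer joins whatever is still running; if the fleet is
        # too small, everything is overloaded and killed again.
        running = survivors(total_load, capacity_per_server, running + 1)
    return running


def parallel_start(total_load, capacity_per_server, batch_size):
    """Bring a whole batch online at once, as in the manual intervention."""
    return survivors(total_load, capacity_per_server, batch_size)


if __name__ == "__main__":
    # Hypothetical figures: 1000 units of load, 100 units per server.
    print(progressive_start(1000, 100, attempts=20))
    print(parallel_start(1000, 100, batch_size=12))
```

In this model, a progressive restart from an empty fleet never recovers, because each new server receives the full load alone. A batch large enough to absorb the load in one step stays up, which matches the behavior described above.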
Updating the configuration of our automatic platform scaling tool to make it more robust:
All these actions are either in progress or already live.
Regarding the two points identified in the first version of this post-mortem:
We have finally identified the root cause of the loss of these servers. It is still linked to our auto-scaler tool, which did detect an increase in load during the afternoon of May 8 but was unable to start new servers. Around 9:10 pm, the load reached a critical level and the platform's sizing was no longer adapted to the incoming traffic. Almost all the servers were then killed, and it was impossible to restart new ones progressively.
Next steps are: