Slowdowns on our platform affecting notifications and chats on your website

Incident Report for iAdvize (HA)

Postmortem

11th May update

The iAdvize livechat solution experienced major instability that degraded the user experience during the night of May 8th to May 9th, 2022, between 9:10 pm and 8:20 am CET.

During this time, the display of contact elements on the Chat, Call and Video channels was impossible due to loading errors on iAdvize resources. Incoming conversations on these channels were largely cut off.

Conversations from social channels (Facebook, Twitter, WhatsApp, Apple Messages for Business, SMS) and mobile applications (via SDK), as well as the conversation panel and the iAdvize administration, were not affected by this incident.

Reasons

This incident is the consequence of a succession of two events:

Within a few minutes, and for a reason still being identified, we lost 95% of the servers needed to load the iAdvize livechat.
Only the livechat servers were lost. The other services of the iAdvize platform were not affected by this event.
At the same time, our automatic platform scaling tool (the auto-scaler) tried to provide new servers but was unable to do so. Every newly started server was killed within minutes because the incoming load was too high for the requested server type.

To summarize, the fewer servers were up, the harder it was for our auto-scaler to start a new one.
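
The feedback loop summarized above can be illustrated with a toy model. This is only a sketch, not iAdvize's auto-scaler: the function `simulate`, the load and capacity figures, and the rule that an under-provisioned fleet is killed entirely are all simplifying assumptions made for illustration.

```python
def simulate(load, capacity, servers, step, ticks):
    """Return the number of live servers after each tick.

    Toy model (assumption): load is spread evenly across servers. While the
    fleet cannot absorb the load, the scaler starts `step` servers per tick;
    if the fleet is still too small afterwards, every server is overwhelmed
    and killed.
    """
    history = []
    for _ in range(ticks):
        if servers * capacity < load:
            servers += step              # under-provisioned: start new servers
            if servers * capacity < load:
                servers = 0              # still too few: all are killed
        history.append(servers)
    return history


# Hypothetical figures: 1000 units of load, 10 units per server,
# 5 servers left after a sudden loss.
incremental = simulate(load=1000, capacity=10, servers=5, step=5, ticks=6)
in_parallel = simulate(load=1000, capacity=10, servers=5, step=100, ticks=6)
print(incremental)  # the fleet is wiped out on every tick
print(in_parallel)  # the fleet stabilizes once enough servers start together
```

In this model, small increments never help because each partial fleet is killed before the next increment arrives, whereas a single large batch clears the threshold at once.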

Resolution

Our technical team carried out several successive actions to mitigate and correct this incident as soon as possible.

We increased the number of servers needed for the proper functioning of the iAdvize livechat solution. This mitigated the problem for a few minutes but was not enough: all newly started servers were killed almost instantly.

We also tried changing the type of server requested from our hosting provider in order to get more powerful machines. This did not improve the situation.

The solution was found by temporarily turning off our auto-scaler tool and manually starting a large number of servers in parallel.

By doing this, we were able to handle the incoming load without the risk of the auto-scaler acting on servers that were still starting.
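
As a rough illustration of this manual approach, the sketch below sizes a batch large enough to absorb the whole load and starts it concurrently. It rests on assumptions: `start_server` is a hypothetical stand-in for a real provider call, and the load, capacity, and headroom figures are invented for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def min_parallel_batch(load, capacity_per_server, headroom_pct=20):
    """Smallest batch that absorbs `load` with some spare headroom."""
    needed = load * (100 + headroom_pct) / (100 * capacity_per_server)
    return math.ceil(needed)


def start_server(index):
    """Hypothetical stand-in for a provider call that boots one server."""
    return f"server-{index}"


def start_batch(count):
    """Start `count` servers concurrently and return their names."""
    workers = max(1, min(count, 32))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(start_server, range(count)))


size = min_parallel_batch(load=1000, capacity_per_server=10)
batch = start_batch(size)
print(size, batch[:3])
```

The point of sizing the batch first is that no server in it receives traffic it cannot handle, which is what kept killing servers started a few at a time.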

Actions for the future

Extensive research into the cause of the loss of 95% of the live chat servers
Depending on the results, implementation of preventive actions to avoid these sudden losses
Update of the configuration of our automatic platform scaling tool to make it more robust:
- Increase of the incoming load capacity of the servers
- Increase of the minimum number of servers to be maintained
- 3 by 3 sizing of servers to be more resilient to incoming load in case of a sudden drop in the number of servers online
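
One way to picture these configuration changes is the sketch below. It is not the actual iAdvize configuration: the `ScalingPolicy` fields and all numeric values are hypothetical, and reading "3 by 3 sizing" as scaling in groups of three servers is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    capacity_per_server: int  # load units one server can absorb (hypothetical)
    min_servers: int          # floor kept running even when load is low
    step: int = 3             # servers are sized in groups of this many (assumed)


def target_servers(policy, load):
    """Servers to run for `load`: rounded up to a multiple of `step`,
    and never below the configured floor."""
    needed = -(-load // policy.capacity_per_server)    # ceiling division
    rounded = -(-needed // policy.step) * policy.step  # round up to a full group
    return max(policy.min_servers, rounded)


policy = ScalingPolicy(capacity_per_server=20, min_servers=30)
print(target_servers(policy, 1000))  # load-driven target
print(target_servers(policy, 100))   # floor applies at low load
```

A higher per-server capacity lowers `needed`, a higher floor leaves more servers alive to share the load after a sudden loss, and a larger step brings capacity back in bigger increments.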

All these actions are in progress or already deployed.

17th May update

Regarding these two points identified in the first version of this post-mortem:

Extensive research into the cause of the loss of 95% of the live chat servers
Depending on the results, implementation of preventive actions to avoid these sudden losses

We have now identified the root cause of the server loss. It is also linked to our auto-scaler tool. The tool did detect an increase in load during the afternoon of May 8, but was unable to start new servers. Around 9:10 pm, the load reached a critical level and the platform's sizing was no longer adapted to the incoming traffic. Almost all the servers were then killed, and it was impossible to restart new ones progressively.

Next steps are:

Fine-tune our current auto-scaling tool, in particular its ability to immediately provide more compute/memory power when the load requires it.
Explore the possibility of using another auto-scaling tool, and migrate to it if it proves more robust.

Posted May 11, 2022 - 09:46 CEST

Resolved

This incident has been resolved.
A post-mortem review will be published afterwards.

Posted May 09, 2022 - 16:34 CEST

Monitoring

The issue is resolved following our last actions.
We are now monitoring the solution.

Posted May 09, 2022 - 08:42 CEST

Update

The problem is mitigated. It is related to an insufficient number of running instances.
We are working to identify the root cause and develop a fix.

Posted May 09, 2022 - 08:27 CEST

Update

We are continuing to investigate this issue.

Posted May 09, 2022 - 01:21 CEST

Update

We are continuing to investigate this issue.

Posted May 09, 2022 - 00:57 CEST

Update

Our technical team has taken several actions.
They do not yet appear sufficient to restore normal activity.

Further investigations are needed.

Posted May 09, 2022 - 00:04 CEST

Investigating

Our technical team is investigating slowdowns on our platform. They may affect the display of notifications on your website.

Posted May 08, 2022 - 23:28 CEST

This incident affected: Onsite Channels (Chat, Call, Video) and Visitor’s interface (Engagement Notification, iAdvize Messenger).