Performance Degraded / Devices Disconnected
Incident Report for Teem
Postmortem

Earlier this month, we notified you of the unexpected technical challenges some customers experienced as a result of a major infrastructure upgrade for the Teem platform.

We take performance and security very seriously, which is why we initiated the update to align with SOC 2 Type 2 data security standards. We appreciate your patience and understanding while we worked to resolve the temporary outages.

Our first priority was to get you back up and running; our second is to give you some background on what happened.

Here’s what we determined through our root cause analysis and how we’re preventing a recurrence.

Issue: O365 Sign-in Error 500

Cause: During a routine deployment, one of the third-party software libraries that Teem SSO relies on was inadvertently upgraded to its latest major release, which included noticeable changes.

Remediation:

· Reverted the library and pinned it to the previous version in package management to avoid unintentional upgrades
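
As an illustration only (this is not Teem's actual tooling), an exact version pin can be enforced with a small pre-deploy check. The sketch below assumes dependencies are listed in a pip-style requirements file; the function name and the package names in the usage note are hypothetical.

```python
import re

# Matches "name==version" only; any other specifier (>=, ~=, none) counts as loose.
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\]]+==[A-Za-z0-9._!+]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not use an exact '==' pin."""
    loose = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose
```

For example, `unpinned_requirements(["example-sso-lib==1.4.2", "requests>=2.0"])` returns `["requests>=2.0"]`, which a deploy step could treat as a failure.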

Issue: Post-Upgrade Device Connectivity

Cause: Teem’s core service had an interruption starting on Jan. 9, 2021. During the interruption, when the EventBoard device made API calls to the service, the response status could be one of many error codes, including 401 Unauthorized. Although this status code was returned in error, the device executed its designed security protocols and logged out. During a logout, EventBoard deletes all API tokens, downloaded themes, settings, and calendar data. It then reverts to a not-signed-in state and provides a 6-digit pin code for reactivation. Some customers were stuck on an “Authenticating with Teem …” screen. In these cases, the core service was still interrupted after the logout and returned an error instead of a pin code. EventBoard then showed a message saying it could not communicate with Teem, and selecting “Retry” locked the app on that screen (a secondary symptom of the core issue).

Remediation:

· Deployed a hotfix allowing devices to automatically activate and log in at the pin code screen if they still exist in the Teem database and are connected to only one Teem customer instance

· Modified EventBoard app to increase fault tolerance on false 401s and to no longer get stuck at “Authenticating with Teem …” screen

· Modified core service (monolith) so it doesn’t return 401s incorrectly
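
One way a client can tolerate false 401s, sketched here under our own assumptions rather than as the actual EventBoard change, is to log out only after several consecutive 401 responses. The class name and the threshold of 3 are hypothetical.

```python
class SessionGuard:
    """Decide when a run of 401 responses should trigger a logout."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical value, not from the report
        self.consecutive_401s = 0

    def should_log_out(self, status_code):
        if status_code == 401:
            self.consecutive_401s += 1
        elif 200 <= status_code < 300:
            # A successful call proves the token still works: reset the count.
            self.consecutive_401s = 0
        # Other errors (for example 5xx) neither count toward nor reset the
        # run, since they point at the service rather than the credentials.
        return self.consecutive_401s >= self.threshold
```

With this policy, a single stray 401 during a service interruption leaves tokens, themes, settings, and calendar data in place; only a sustained run of 401s wipes the device.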

Issue: App.Teem.com Platform Outage

Cause: When deploying code on Jan. 14, 2021, an errant pip upgrade prevented servers from receiving the deployed code and caused services to stop, interrupting all aspects of Teem.

Remediation:

· Updated deploy script to pin pip version

· Cleared salt cache and confirmed correct deploy script on servers

· Changed canary process to better detect downed servers

· Ongoing: Updating underlying framework and all packages
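
To illustrate the kind of check a canary step can perform (a sketch under our own assumptions, not the actual Teem canary), the function below flags servers that are unreachable or that did not receive the deployed code. The server names, version strings, and the shape of the `reports` mapping are hypothetical.

```python
def canary_failures(expected_version, reports):
    """Return {server: reason} for servers that are down or on stale code.

    `reports` maps a server name to the version it reports, or to None
    when the server could not be reached.
    """
    failures = {}
    for server, version in sorted(reports.items()):
        if version is None:
            failures[server] = "down"
        elif version != expected_version:
            failures[server] = f"stale ({version})"
    return failures
```

A deploy could halt whenever the returned mapping is non-empty, for example when `canary_failures("2021.01.14", {"web-1": "2021.01.14", "web-2": None})` returns `{"web-2": "down"}`.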

We apologize for any inconvenience and are continually working toward providing a more reliable experience for you.

For additional information or to report an issue, please reach out to your Account Manager or visit help.teem.com to contact our Customer Support team.

Thank you

Posted Jan 29, 2021 - 15:17 MST

Resolved
Our Engineering team has identified a solution to the current incident. A separate communication has been sent to customers who we believe were impacted by this incident. Please contact our support team at help.teem.com if you are still experiencing issues and did not receive additional communication from us. We will be posting a Root Cause Analysis of this incident in the immediate future. At this time, we will close this incident as resolved.
Posted Jan 14, 2021 - 15:35 MST
Update
At this time, we have identified the subset of customers who are affected by this incident. We will continue to investigate the root cause of the incident and our Engineers are monitoring the system for any additional issues. Customers that we believe are affected will receive additional communication in the next two hours.
Posted Jan 14, 2021 - 12:07 MST
Update
Our engineers are continuing to investigate the current incident. The next update will be provided at 12pm MST.
Posted Jan 14, 2021 - 10:00 MST
Update
Our engineers are continuing to investigate the current incident. The next update will be provided at 10am MST.
Posted Jan 14, 2021 - 07:54 MST
Update
We continue to investigate the current incident and another update will be provided at 8am MST.
Posted Jan 14, 2021 - 06:01 MST
Update
Our engineers continue to investigate the current incident. Another update will be provided at 6am MST.
Posted Jan 14, 2021 - 04:02 MST
Update
We are continuing to investigate and will provide another update at 4am MST.
Posted Jan 14, 2021 - 02:06 MST
Update
We are continuing to investigate and will provide another update at 2am MST.
Posted Jan 14, 2021 - 00:31 MST
Update
We are continuing to investigate and will provide another update at 12am MST.
Posted Jan 13, 2021 - 22:02 MST
Update
We are continuing to investigate and will provide another update at 10pm MST.
Posted Jan 13, 2021 - 19:48 MST
Update
Our Engineering team is continuing their investigation to identify the root cause of this issue. We will provide another update at 7pm MST.
Posted Jan 13, 2021 - 16:58 MST
Investigating
At this time, our monitoring has indicated that performance has improved significantly. However, we are receiving reports of devices that are disconnected from Teem servers. Our reports indicate that this is affecting a subset of our customers. Given the impact to these customers, we are elevating this incident to a Severity 1. As a Severity 1 incident, we will provide updates every two hours. We ask affected customers to reach out to Teem support at help.teem.com if they have not done so previously.
Posted Jan 13, 2021 - 14:51 MST
Update
At this time, performance has improved and we will continue to monitor. We will be sending communication to customers known to be affected by this incident. If you are still experiencing issues, contact support at help.teem.com.
Posted Jan 12, 2021 - 18:44 MST
Update
At this time, we are continuing to monitor performance and stability. The next update will be at 7pm MST.
Posted Jan 12, 2021 - 15:04 MST
Update
Our Engineering team continues to make performance enhancements and we are seeing improvements with functionality. Engineering will continue to monitor system stability and performance. The next update will be at 3pm MST.
Posted Jan 12, 2021 - 11:01 MST
Monitoring
The engineering team has identified an area of reduced performance and is in the process of making configuration adjustments to remove it from the system. This should result in improvements for most customers. The system will continue to be monitored for any further degradation and our next update will be at 1100 Hours MST.
Posted Jan 12, 2021 - 07:02 MST
Update
We are continuing to investigate this issue.
Posted Jan 11, 2021 - 19:02 MST
Investigating
We have received reports of performance issues on Teem products and of devices becoming disconnected from the server. Our Engineers are investigating the issues. The next update will be provided at 0700 Hours MST 1/12/21.
Posted Jan 11, 2021 - 19:02 MST
This incident affected: Web Interface, Mobile Data, API, Google Apps Calendar, Exchange Sync, Mandrill US East, Mandrill US West, EventBoard, and LobbyConnect.