EventBoard and Customer Portal Service Disruption
Incident Report for Teem
Postmortem

Earlier this month, we notified you of the unexpected technical challenges some customers experienced as a result of a major infrastructure upgrade for the Teem platform.

We take performance and security very seriously, which is why we initiated the update to align with SOC 2 Type 2 data security standards. We appreciate your patience and understanding while we worked to resolve the temporary outages.

Our first priority was to get you back up and running; our second is to give you some background on what happened.

Here’s what we determined through our root cause analysis and how we’re preventing a recurrence.

Issue: O365 Sign-in Error 500

Cause: During a routine deployment, one of the third-party software libraries Teem SSO relies on was inadvertently upgraded to its latest major release, which included changes that affected sign-in.

Remediation:

· Reverted the library and pinned the previous version in package management to prevent unintentional upgrades
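As an illustration of the pinning step, a deploy check could flag any dependency that is not locked to an exact version. This is a minimal sketch, not Teem's actual tooling: the function name `unpinned_requirements` and the library name `example-sso-lib` are hypothetical, and the check assumes a pip-style requirements list where an exact pin is written as `name==version`.

```python
import re

# Matches simple "name==1.2.3" exact pins. Anything else (ranges, bare names,
# lines with environment markers) is treated as unpinned by this sketch.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[A-Za-z0-9_.\-+!]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    loose = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    # "example-sso-lib" is a placeholder, not the library named in this report.
    sample = ["example-sso-lib==2.4.1", "requests>=2.0", "flask"]
    print(unpinned_requirements(sample))
```

Running a check like this before each deployment would make an accidental major-version upgrade fail early instead of reaching production.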

Issue: Post-Upgrade Device Connectivity

Cause: Teem’s core service had an interruption starting on Jan. 9, 2021. During the interruption, API calls from EventBoard devices to the service could return one of several error codes, including 401 Unauthorized. Although this status code was returned in error, the device executed its designed security protocol and logged out. During a logout, EventBoard deletes all API tokens, downloaded themes, settings, and calendar data. It then reverts to a not-signed-in state and provides a 6-digit pin code for reactivation. While the core service was interrupted, however, it could return an error instead of a pin code after the logout. In these cases, EventBoard showed a message saying it could not communicate with Teem, and selecting “Retry” locked the app on an “Authenticating with Teem …” screen (a secondary symptom of the core issue).

Remediation:

· Deployed a hotfix allowing devices to automatically activate and log in at the pin code screen if they still exist in the Teem database and are connected to only one Teem customer instance

· Modified EventBoard app to increase fault tolerance on false 401s and to no longer get stuck at “Authenticating with Teem …” screen

· Modified core service (monolith) so it doesn’t return 401s incorrectly
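The fault-tolerance idea in the second item can be sketched as a client that logs out only after a run of consecutive 401 responses, rather than on the first one. This is an assumption-laden illustration, not the EventBoard implementation: the class name `SessionGuard` and the threshold of 3 are hypothetical.

```python
class SessionGuard:
    """Triggers a logout only after several consecutive 401 responses."""

    def __init__(self, max_consecutive_401s=3):
        # Hypothetical threshold; the real app's policy is not described here.
        self.max_consecutive_401s = max_consecutive_401s
        self.consecutive_401s = 0
        self.logged_out = False

    def record_response(self, status_code):
        """Record one API response status and return True once logged out."""
        if status_code == 401:
            self.consecutive_401s += 1
            if self.consecutive_401s >= self.max_consecutive_401s:
                self.logged_out = True
        else:
            # Any other status (including 5xx) resets the streak, so a mix of
            # errors during an outage does not wipe the device.
            self.consecutive_401s = 0
        return self.logged_out


if __name__ == "__main__":
    guard = SessionGuard()
    for status in (401, 500, 401, 401, 401):
        print(status, guard.record_response(status))
```

With a guard like this, a single erroneous 401 during a service interruption leaves tokens, themes, settings, and calendar data in place.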

Issue: App.Teem.com Platform Outage

Cause: During a code deployment on Jan. 14, 2021, an errant pip upgrade prevented servers from receiving the deployed code and caused services to stop, interrupting all aspects of Teem.

Remediation:

· Updated deploy script to pin pip version

· Cleared the Salt cache and confirmed the correct deploy script on servers

· Changed canary process to better detect downed servers

· Ongoing: Updating underlying framework and all packages
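A canary check of the kind mentioned above could look roughly like the following. This is a hedged sketch only: the function `canary_passes`, the host names, and the shape of the health data are all hypothetical, and the real canary process is not described in this report. It treats a server as failed if it is down or is not running the expected release.

```python
def canary_passes(health_by_host, expected_version, max_unhealthy=0):
    """Decide whether a rollout may continue.

    health_by_host maps a host name to (is_up, running_version).
    Returns (ok, failed_hosts).
    """
    failed = [
        host
        for host, (is_up, version) in sorted(health_by_host.items())
        # A host fails if its service is stopped or it did not get the new code.
        if not is_up or version != expected_version
    ]
    return len(failed) <= max_unhealthy, failed


if __name__ == "__main__":
    # Placeholder hosts and versions for illustration.
    report = {
        "web-1": (True, "2021.01.14"),
        "web-2": (False, "2021.01.14"),
        "web-3": (True, "2021.01.07"),
    }
    print(canary_passes(report, "2021.01.14"))
```

Checking both liveness and the running version covers the two symptoms named in the cause: stopped services and servers that never received the deployed code.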

We apologize for any inconvenience and are continually working toward providing a more reliable experience for you.

For additional information or to report an issue, please reach out to your Account Manager or visit help.teem.com to contact our Customer Support team.

Thank you

Posted Jan 29, 2021 - 15:16 MST

Resolved
Our engineers have identified the root cause of the incident and implemented a fix. We have verified that all systems are operational. We will continue to monitor the platform, but this incident is now marked resolved. A formal Root Cause Analysis will be posted in the near future.
Posted Jan 14, 2021 - 20:14 MST
Investigating
We have received reports that EventBoard and the Product Portal are not responsive. This is a Severity 1 incident. Our engineering team is aware, and an update will be posted by 9 PM MST.
Posted Jan 14, 2021 - 19:43 MST
This incident affected: Web Interface and EventBoard.