TaxJar App, API Outage
Incident Report for TaxJar
Postmortem

During this incident, TaxJar customers were not able to access the TaxJar App or use the TaxJar API. We know this was impactful, and we are truly sorry it happened. 

We have already implemented the following operational changes to ensure this type of failure does not happen again:

  • We moved to a blue-green deployment pattern, which allows us to better verify changes to production environments.
  • We are conducting a full audit of the vendor-provided managed services that lack an acceptable level of rollback capability.
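
As a rough illustration of the blue-green pattern mentioned above, the following is a minimal sketch rather than our actual tooling. The class name, the colour labels, and the `verify` callback are all hypothetical. The idea it shows is that a change is applied to the idle environment first, traffic only moves after verification passes, and a rollback is just a switch back to the untouched environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy model of a blue-green switch. All names here are hypothetical."""

    environments: Dict[str, str]  # colour -> version deployed in that environment
    live: str = "blue"            # colour currently receiving traffic

    def idle(self) -> str:
        """Return the colour that is NOT currently receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, verify: Callable[[str], bool]) -> bool:
        """Apply a change to the idle environment and switch only if it verifies."""
        target = self.idle()
        self.environments[target] = version
        if not verify(target):
            # Verification failed: the live environment was never touched.
            return False
        self.live = target
        return True

    def rollback(self) -> None:
        """Send traffic back to the previous environment."""
        self.live = self.idle()


router = BlueGreenRouter({"blue": "v1", "green": "v1"})
failed = router.deploy("v2", verify=lambda colour: False)
live_after_failure = router.live      # still "blue"
succeeded = router.deploy("v2", verify=lambda colour: True)
live_after_success = router.live      # now "green"
```

In this sketch a failed verification leaves production on the old environment, which is the property a non-reversible in-place upgrade does not provide.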

Incident Root Cause Analysis 

  • The incident started with a routine Kubernetes minor version upgrade using our vendor’s managed Kubernetes service.

    • This is a routine upgrade operation that we’ve completed 15 times in the past across 3 accounts and 2 regions. We perform this upgrade quarterly in order to keep pace with Kubernetes releases.
  • Immediately following completion of the upgrade of our production cluster, Kubernetes workers began reporting “Not Ready” status.

  • Within a few minutes, all nodes were in a “Not Ready” state, which caused our load balancers to mark all workloads as offline.

    • Kubernetes upgrades on our vendor’s managed Kubernetes service cannot be rolled back. Furthermore, new deployments and upgrades to the managed Kubernetes service can take 30-50 minutes to complete, which forced us to resolve the immediate issue rather than roll back.
  • The vendor’s support team was able to identify the issue:

    • Clusters starting with Kubernetes version 1.14 create a cluster security group when they are created.
    • This security group is designed to allow all traffic to flow freely between the control plane and managed node groups.
    • After the upgrade completed, the vendor found that the security group no longer had the rules required to allow this traffic to pass, even though those rules had always remained in place after prior minor version upgrades.
  • We manually added the missing rule, which restored connectivity to our managed Kubernetes cluster.

  • At this point our services started coming back online.

  • Several other security groups, managed with CloudFormation, had relied on this rule for connectivity between our K8s workloads and other services provided by the vendor (such as memory caches and databases). These groups were found to have been unexpectedly altered by the upgrade and also had to be repaired before all services could be restored.

  • We continue to work with the vendor to understand the root cause of the managed service’s failure to operate as documented.
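
The security group condition described above can be expressed as a simple post-upgrade check. The sketch below is illustrative only: the dictionary layout, the `"-1"` marker for "all protocols", and the function names are assumptions made for this example, not the vendor's API or the exact check we ran. It flags a cluster security group that no longer allows all traffic from itself, and any dependent group that no longer references the cluster group.

```python
from typing import Dict, List

ALL_TRAFFIC = "-1"  # assumed marker for "all protocols" in this sketch


def allows_all_from(group: Dict, source_group_id: str) -> bool:
    """True if the group has an ingress rule allowing all traffic from source_group_id."""
    for rule in group.get("ingress", []):
        if rule.get("protocol") == ALL_TRAFFIC and source_group_id in rule.get("source_groups", []):
            return True
    return False


def references(group: Dict, source_group_id: str) -> bool:
    """True if any ingress rule of the group names source_group_id as a source."""
    return any(
        source_group_id in rule.get("source_groups", [])
        for rule in group.get("ingress", [])
    )


def find_broken_groups(cluster_group: Dict, dependent_groups: List[Dict]) -> List[str]:
    """Return the ids of groups that are missing the expected rules."""
    cluster_id = cluster_group["id"]
    broken = []
    # The cluster group should allow all traffic between its own members.
    if not allows_all_from(cluster_group, cluster_id):
        broken.append(cluster_id)
    # Dependent groups (caches, databases) should still accept the cluster group.
    for group in dependent_groups:
        if not references(group, cluster_id):
            broken.append(group["id"])
    return broken


# Hypothetical state after an upgrade: the self-referencing rule is gone,
# and the cache group has lost its reference to the cluster group.
cluster = {"id": "sg-cluster", "ingress": []}
cache = {"id": "sg-cache", "ingress": []}
database = {
    "id": "sg-db",
    "ingress": [{"protocol": "tcp", "source_groups": ["sg-cluster"]}],
}
broken = find_broken_groups(cluster, [cache, database])
```

Running a check of this kind immediately after an upgrade would surface missing rules before nodes or downstream services begin failing health checks.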

Posted Jan 14, 2021 - 00:06 EST

Resolved
This incident has been resolved.
Posted Jan 11, 2021 - 15:00 EST
Update
We are continuing to monitor for any further issues.
Posted Jan 11, 2021 - 14:50 EST
Update
We are continuing to monitor for any further issues.
Posted Jan 11, 2021 - 14:34 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 11, 2021 - 14:33 EST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 11, 2021 - 14:30 EST
Update
We are continuing to investigate this issue.
Posted Jan 11, 2021 - 14:07 EST
Update
We are continuing to investigate this issue.
Posted Jan 11, 2021 - 13:36 EST
Investigating
We are currently investigating this issue.
Posted Jan 11, 2021 - 13:30 EST
This incident affected: TaxJar Reporting and TaxJar API (Tax Calculations API, Tax Rates API, Transaction Push API, Other API Services).