Creative object management and creative audit temporarily unavailable
Incident Report for Xandr
Postmortem

Incident Summary

From 17:13 UTC on Thursday, July 8th to 21:24 UTC on Friday, July 9th, and again from 19:38 to 23:13 UTC on Monday, July 19th, our data streaming service cluster in the NYM datacenter was down after a faulty configuration was introduced and caused it to crash. Efforts to reconnect to the cluster and remediate the issue were unsuccessful because of the volume of concurrent reconnection requests.

Scope of Impact

During the incident window, customers may have experienced some or all of the following:

  • Reporting Data Delays. During the incident window Deal metrics reporting, Seller Monitoring Workflow tables, Deals screen data, and Real-time reporting all experienced delays.
  • Budgeting Issues. During the incident window client buy-side objects may have experienced overspend or underspend.
  • Discovery Issues. During the incident window Line Items leveraging Discovery may have underdelivered.
  • Creative Audit Delays. During and after the incident window clients may have experienced creative audit delays.
  • Object Creation and Edit Errors. During the incident window clients may have been unable to create or edit objects (e.g., line item, placement) via the UI or API.
  • Batch Segment Service Delays. During the incident window clients may have experienced delays with the batch segment service.
  • Universal Pixel Errors. During the incident window Universal Pixel conversions and segments experienced data loss.

Timeline (UTC)

2021-07-08 17:13: Incident Started

2021-07-08 17:38: Incident Ticket Created

2021-07-08 19:32: Incident Ticket Escalated

2021-07-08 20:44: First attempt to execute a configuration change to the data streaming service

2021-07-08 21:02: Second attempt to execute a configuration change to the data streaming service

2021-07-08 22:36: Third attempt to execute a configuration change to the data streaming service

2021-07-08 23:16: Fourth attempt to execute a configuration change to the data streaming service

2021-07-09 03:26: Traffic filtered to mitigate surge of reconnection requests to data streaming service

2021-07-09 21:24: Incident Resolved: Whitelisting of all applications completed.

2021-07-19 19:38: Incident Re-opened

2021-07-19 19:57: Engineering recovers and brings data streaming service back online

2021-07-19 23:13: Incident Resolved: Streaming service running and serving all client requests.

Cause Analysis

The root cause of the incident was the failure of our data streaming service cluster in the NYM datacenter, which crashed after a faulty configuration was introduced. Efforts to reconnect to the cluster and remediate the issue were unsuccessful because of the volume of concurrent reconnection requests.

Resolution Steps

Our engineers resolved the issue with a whitelisting process that allowed only small pools of clients at a time to connect to the data streaming service cluster, which prevented a surge of concurrent reconnection requests.
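
As an illustration only, the staged whitelisting described above can be sketched as a gate that admits clients in small pools rather than all at once. This is a minimal sketch under stated assumptions, not Xandr's actual tooling: the class name `StagedWhitelist`, the `pool_size` parameter, and the client identifiers are all hypothetical.

```python
from typing import Iterable, List, Set


class StagedWhitelist:
    """Hypothetical gate that lets clients reconnect one small pool at a time."""

    def __init__(self, clients: Iterable[str], pool_size: int) -> None:
        if pool_size <= 0:
            raise ValueError("pool_size must be positive")
        self._pending: List[str] = list(clients)  # clients still waiting for admission
        self._pool_size = pool_size
        self._allowed: Set[str] = set()           # clients currently whitelisted

    def admit_next_pool(self) -> List[str]:
        """Whitelist the next pool of clients and return it (empty when none remain)."""
        pool = self._pending[: self._pool_size]
        self._pending = self._pending[self._pool_size:]
        self._allowed.update(pool)
        return pool

    def may_connect(self, client_id: str) -> bool:
        """Only whitelisted clients may reconnect; all others are filtered."""
        return client_id in self._allowed

    @property
    def done(self) -> bool:
        """True once every client has been whitelisted."""
        return not self._pending


# Usage: admit five hypothetical applications, two at a time.
gate = StagedWhitelist(["app-a", "app-b", "app-c", "app-d", "app-e"], pool_size=2)
first_pool = gate.admit_next_pool()
```

In practice an operator would presumably wait for the cluster to stabilize after each pool before admitting the next one; that health check is outside the scope of this sketch.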

Next Step(s)

  • Conduct a follow-up investigation to determine the instigating events.
  • Enhance the data streaming service metrics dashboard to improve visibility for monitoring and alerting.
  • Improve data streaming service cluster restart time.
  • Reevaluate data streaming service configuration.
Posted Jul 26, 2021 - 16:45 UTC

Resolved

The incident has been fully resolved. We apologize for the inconvenience this issue may have caused, and thank you for your continued support.

Posted Jul 20, 2021 - 01:18 UTC

Monitoring

We have patched the issue and are monitoring our systems closely. We will provide an update as soon as the issue has been fully resolved.

Posted Jul 20, 2021 - 01:14 UTC

Identified

We have identified the following issue:

  • Component(s): Creative uploads, Creative audit, API
  • Impact(s):
    • Unable to upload or preview creatives
    • Latency, timeouts, and errors in the API
    • Delays in creative audit times. Please confirm that your creatives have exceeded the audit SLA before submitting a ticket to our support team. If you need an urgent audit, we encourage you to use our priority audit option.
  • Severity: Minor Outage
  • Datacenter(s): NYM2

Our engineers are actively working towards a resolution, and we will provide an update as soon as possible. Thank you for your patience.

Posted Jul 19, 2021 - 21:21 UTC