Event Routing Delay
Incident Report for Blues
Postmortem

Summary

Event routing is at the core of what we do at Blues. We take the recent routing delay seriously and we will do better. Our investigation and analysis of this incident have led to multiple improvements to ensure none of the root causes can recur.

Impact

For events received between 8:39AM EST and 11:49AM EST on Feb 8, routing was delayed by up to 11 hours. No event data was lost. We were able to route all impacted events by 7PM EST that same day.

Root Cause

The delay was caused by an incorrect automated response when unexpected route data was encountered during database maintenance. While we had tested the maintenance procedure prior to production, multiple unrelated issues combined to prevent most event routing during the period.

Mitigations

We first addressed the automation that was preventing routing. This code had been in place for over 2 years, and we had not seen this behavior previously. Now, if a similar error were to occur, only the single route in question would be paused.
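
To illustrate the idea of pausing only the affected route, here is a minimal sketch in Python. It is not Notehub's actual code: the Route record, the deliver stand-in, and the example URLs are all hypothetical, and "unexpected route data" is modeled simply as a URL that is not an https string.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Hypothetical route record; field names are illustrative only."""
    name: str
    url: object  # expected to be a str; anything else counts as unexpected data
    paused: bool = False
    pending: list = field(default_factory=list)  # events held for later re-routing


def deliver(route, event):
    """Stand-in for a real delivery call; rejects malformed route data."""
    if not isinstance(route.url, str) or not route.url.startswith("https://"):
        raise ValueError(f"unexpected route data for {route.name!r}")
    return f"{event} -> {route.url}"


def route_event(routes, event):
    """Deliver one event to every route, pausing only the routes that fail."""
    delivered = []
    for route in routes:
        if route.paused:
            # A paused route keeps its events so they can be re-routed later.
            route.pending.append(event)
            continue
        try:
            delivered.append(deliver(route, event))
        except ValueError:
            # Isolate the failure: pause this route only and hold the event.
            route.paused = True
            route.pending.append(event)
    return delivered


routes = [
    Route("good-a", "https://a.example.com/in"),
    Route("bad", None),  # simulates unexpected route data
    Route("good-b", "https://b.example.com/in"),
]
first = route_event(routes, "event-1")
second = route_event(routes, "event-2")
```

In this sketch the failing route is paused and accumulates its events for later re-routing, while the other two routes keep delivering.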

Once routing was re-established, we started the re-routing process to route all impacted events. We also corrected the source of the unexpected data and have added guardrails to prevent this from occurring again. We’ve enhanced our internal systems to identify and diagnose such issues faster.

As with any customer-impacting issue, we examine not only the technical root cause, but also our processes and procedures, looking for opportunities to improve.

Stay In Touch

Customer communication is key during incidents like this. During an incident, we will regularly update status.notehub.io with the latest status. There is a link to this status page on each Notehub.io page. In the future, we will also display an alert banner on Notehub.io to indicate significant service impacts.

To be made aware of customer-impacting events and planned maintenance, please subscribe to alerts at status.notehub.io.

Posted Feb 10, 2023 - 00:36 EST

Resolved
The incident is fully resolved and event routing is taking place normally.
Over the course of the next 24 hours, we will automatically route any events that failed to route initially. Because of this, you may see events arrive at your routing endpoints with significant delays.

*Update*: All events had completed routing as of 7PM EST.

Timeline: A partial event routing outage began at 8:39AM EST. Between 8:47AM EST and 11:38AM EST no events were routed. Event routing then began to resume and was fully restored by 11:49AM EST.

Throughout this period, events were still persisted to Notehub. You can view these events at notehub.io.
Posted Feb 08, 2023 - 14:04 EST
Monitoring
The mitigation appears to have fixed the issue.
Routing is operating normally now.
We will continue to monitor to ensure there are no remaining issues.
Posted Feb 08, 2023 - 11:35 EST
Update
The mitigation appears to have fixed the issue.
Routing is operating normally now.
We will continue to monitor to ensure there are no remaining issues.
Posted Feb 08, 2023 - 11:34 EST
Update
We believe we have identified the issue preventing routing from working.
We are deploying a mitigation now.
Posted Feb 08, 2023 - 11:30 EST
Investigating
Routing is currently not available on Notehub.
We are investigating. Next update by 11:30AM EST.
Posted Feb 08, 2023 - 10:50 EST
This incident affected: Notehub API, Cellular Notehub Handlers, notehub.io UI, and Wifi Notehub Handlers.