Event routing is at the core of what we do at Blues. We take the recent routing delay seriously and we will do better. Our investigation and analysis of this incident have led to multiple improvements to ensure that none of the root causes can recur.
For events received between 8:39AM EST and 11:49AM EST on Feb 8, routing was delayed by up to 11 hours. No event data was lost. We were able to route all impacted events by 7PM EST that same day.
The delay was caused by an incorrect automated response when unexpected route data was encountered during database maintenance. While we had tested the maintenance procedure prior to production, multiple unrelated issues combined to prevent most event routing during the period.
We first addressed the automation that was preventing routing. This code had been in place for over 2 years, and we had not seen this behavior previously. Now, if a similar error were to occur, only the single route in question would be paused.
Once routing was re-established, we started the re-routing process to route all impacted events. We also corrected the source of the unexpected data, and have added guard rails to prevent this from occurring again. We’ve enhanced our internal systems to identify and diagnose such issues faster.
As with any customer-impacting issue, we examine not only the technical root cause, but also our processes and procedures, looking for opportunities to improve.
Customer communication is key during incidents like this. During an incident, we will regularly update status.notehub.io with the latest status. There is a link to this status page on each Notehub.io page. In the future, we will also display an alert banner on Notehub.io to indicate significant service impacts.
To be made aware of customer-impacting events and planned maintenance, please subscribe to alerts at status.notehub.io.