Bugsnag AWS Outage Postmortem
Overview
Date: 2017-02-28
Author: Simon Maynard
Status: Resolved
Summary: An outage across multiple AWS services caused degraded Bugsnag service. Because several AWS services were affected at once, Bugsnag could not implement a complete workaround.
Impact
- 15 minutes of website and Data Access API downtime.
- An estimated 1 hour of events failed to process at the start of the AWS outage.
- 4 hours' worth of events were processed after a delay of up to 16 hours. Notifications were disabled for these delayed events to prevent customers from being alerted about errors that had occurred hours earlier.
- Emails could not be sent by Bugsnag during the outage.
- Source maps, dSYMs, and ProGuard mapping files could not be uploaded.
Root Causes: Widespread AWS outage between 09:45am PST and 3:00pm PST, which brought down the Bugsnag services used to process incoming event data. We saw degraded hard drive performance on our main database cluster; autoscaling was disabled, email delivery was disrupted, S3 was down, and provisioning of new machines was heavily delayed.
Trigger: Outage in AWS US-EAST-1
Resolution: When Bugsnag’s event queues filled their buffers as a result of the AWS outage, we rerouted events to large, newly created servers that could hold a much larger buffer until the events could be processed. Once AWS resolved the issues on their side, we immediately prioritised current events over the large backlog so that customers could see the current state of their applications. We then started a separate task to process the backlog, with alerts disabled for those events, as we didn’t want to trigger alerts for errors that had happened many hours earlier.
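The backlog handling described above can be sketched as a consumer that saves every event but only alerts on current ones. This is a minimal illustration, not our actual pipeline; the function names, the dict-based event shape, and the one-hour staleness threshold are all assumptions for the example.

```python
import time

STALE_THRESHOLD_SECONDS = 60 * 60  # hypothetical cutoff: treat events older than 1 hour as backlog


def save_to_database(event):
    """Placeholder for the real persistence layer."""
    event["saved"] = True


def send_notifications(event):
    """Placeholder for the real alerting system."""
    event["notified"] = True


def process_event(event, now=None):
    """Process one event; suppress alerts for stale (backlogged) events.

    `event` is assumed to carry a `received_at` Unix timestamp set at ingestion.
    Every event is saved, but notifications fire only for events received
    within the staleness window, so customers are not paged about errors
    that happened many hours ago.
    """
    now = now if now is not None else time.time()
    is_stale = (now - event["received_at"]) > STALE_THRESHOLD_SECONDS
    save_to_database(event)
    if not is_stale:
        send_notifications(event)
```

Splitting "save" from "notify" like this is what makes it safe to drain a many-hours-old backlog without a flood of misleading alerts.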
Some of the hard drives attached to our main database cluster could not be read from or written to, and our database primary did not fail over as a result. This caused a period of website and API downtime. We manually failed the database over to a different data center, and the website and API returned to normal.
Detection: An alert came in indicating that our Docker registry was returning 503 responses at 9:47am PST. Our Docker registry deployment uses S3 as a storage backend for our internal images.
Website downtime was detected by Pingdom.
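For context, a Docker registry backed by S3 is configured roughly as below (an illustrative fragment of the registry's config.yml; the region, bucket name, and options are placeholders, not our actual configuration). With S3 unavailable, every image push or pull hits the storage backend and fails, which is why the registry surfaced 503s:

```yaml
# Illustrative Docker registry storage configuration (placeholder values)
storage:
  s3:
    region: us-east-1
    bucket: example-registry-images
    encrypt: true
```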
Lessons Learned
What went well
- Creating new servers successfully prevented machines from filling up and event data from being lost.
- Bugsnag has historically almost always processed all events in real time. We have now made a change such that when queue processing is delayed, the timestamp reflects when the event was received, not when the event is eventually saved to the database.
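The timestamp change in the last point above amounts to stamping events at the front door of the pipeline rather than at persistence time. A minimal sketch, with hypothetical function and field names (the real pipeline's types differ):

```python
import time


def ingest(raw_payload):
    """Stamp an event with the wall-clock time it was received.

    This runs at the front of the queue, before any processing delay.
    """
    return {"payload": raw_payload, "received_at": time.time()}


def persist(event):
    """Save the event, possibly hours later if the queue is backed up.

    The stored timestamp uses the receipt time, not the save time, so
    delayed processing does not shift events forward in customers' timelines.
    """
    return {
        "payload": event["payload"],
        "timestamp": event["received_at"],
    }
```

With this split, an event received at 09:47am but persisted at 11:00pm still appears at 09:47am in the dashboard.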
What didn’t go so well
- Multiple AWS services failed at once. Bugsnag is architected such that, had only one of these services gone down, no degradation would have occurred.
- There was no previously established process for cross-team cooperation on incidents of this magnitude.
- Bugsnag failed to process events in real time
- The status page was not accurately updated due to the outage, so we have made adjustments to prevent this in the future
Action items
- Move forward with the already-in-place plans to switch to a different cloud services provider.
- Update the incident management process to ensure accurate, timely updates to the Bugsnag status page.
- Ensure we have enough of a buffer in the queues to deal with a complete halt of our event processing pipeline. This should give us enough time to provision more resources to cope.
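The last action item above is at heart a capacity calculation. A back-of-the-envelope sketch (the traffic figures, function name, and headroom factor are illustrative assumptions; real numbers would come from production metrics):

```python
def required_buffer_bytes(events_per_second, avg_event_bytes, outage_seconds, headroom=2.0):
    """Estimate queue buffer capacity needed to survive a full pipeline halt.

    events_per_second: sustained inbound event rate
    avg_event_bytes:   average serialized event size
    outage_seconds:    how long processing could be completely halted
    headroom:          over-provisioning factor for traffic spikes
    """
    return int(events_per_second * avg_event_bytes * outage_seconds * headroom)


# Hypothetical example: 10,000 events/s of ~2 KB each, surviving a
# 6-hour halt with 2x headroom works out to roughly 885 GB of buffer.
capacity = required_buffer_bytes(10_000, 2_048, 6 * 3600)
```

The point of the headroom factor is that the buffer must also absorb the extra time needed to provision more resources once the halt is detected.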
Timeline (all times PST)
- 09:45am - AWS outage in US-EAST-1
- 09:47am - Bugsnag alerted of a problem processing event queues; events stopped being saved
- 10:40am - New servers created to store a much larger buffer of events in the interim
- 10:50am - Events saved again
- 11:00am - Website and API down due to degraded hard drive performance
- 11:15am - Website and API restored
- 03:40pm - New events began processing again, and events sent to Bugsnag during the affected time frame queued for processing
- 08:30am (next day) - Event backlog finished processing