Issue with cloud provider
Incident Report for Bugsnag
Postmortem

Bugsnag AWS Outage Postmortem

Overview

Date: 2-28-2017

Author: Simon Maynard

Status: Resolved

Summary: An outage affecting multiple AWS services degraded the Bugsnag service. Because several AWS services failed at once, Bugsnag was unable to implement a complete workaround.

Impact

  • 15 minutes of website and data access API downtime.
  • An estimated 1 hour of events failed to process at the start of the AWS outage.
  • 4 hours' worth of events were processed after a delay of up to 16 hours. Notifications were disabled for these delayed events to avoid alerting customers about errors that had happened hours earlier.
  • Emails could not be sent by Bugsnag during the outage.
  • Source maps, dSYMs, and ProGuard files could not be uploaded.

Root Causes: A widespread AWS outage between 09:45am PST and 3:00pm PST brought down the Bugsnag services used to process incoming event data. We saw degraded hard drive performance on our main database cluster, autoscaling was disabled, email delivery was disrupted, S3 was down, and new machine provisioning was heavily delayed.

Trigger: Outage in AWS US-EAST-1

Resolution: When Bugsnag's event queues filled their buffers as a result of the AWS outage, we rerouted events to large, newly created servers that could hold a much bigger buffer until the events could be processed. Once AWS resolved the underlying issues, we prioritised current events over the large backlog so that customers could see the current state of their applications as quickly as possible. We then started a separate task to process the backlog, with alerts disabled so that customers were not notified about errors that had happened many hours earlier.
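
The sequencing described above can be sketched as a simple two-queue drain. Everything in this sketch is illustrative: the queue names, event fields, and helper functions are assumptions made for the example, not Bugsnag's actual pipeline.

    # Illustrative sketch only; names and helpers are hypothetical.
    from collections import deque

    live_queue = deque()      # events arriving after AWS recovered
    backlog_queue = deque()   # events buffered during the outage

    live_queue.append({"id": "evt-1"})     # example event received after recovery
    backlog_queue.append({"id": "evt-2"})  # example event buffered during the outage

    def save_event(event):
        print("saved", event["id"])        # stand-in for the real database write

    def send_alerts(event):
        print("alerted on", event["id"])   # stand-in for email/chat notifications

    def drain(q, notifications_enabled):
        while q:
            event = q.popleft()
            save_event(event)
            if notifications_enabled:
                send_alerts(event)

    # Current events first, with normal alerting, so dashboards reflect the
    # present state of each application; then the multi-hour backlog in a
    # separate pass with alerts suppressed, so customers are not notified
    # about errors that happened hours ago.
    drain(live_queue, notifications_enabled=True)
    drain(backlog_queue, notifications_enabled=False)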

Some of the hard drives attached to our main database cluster could not be read from or written to, and our database primary did not fail over as a result. This caused a period of website and API access downtime. We manually failed the database over to a different data center, and the website and API returned to normal.
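
For illustration, a manual failover like this can be forced from a small script. The sketch below assumes the cluster is a MongoDB replica set reachable at a made-up address; both the address and the choice of MongoDB are assumptions for the example, not details confirmed by this postmortem.

    # Hypothetical sketch: the URI and the use of MongoDB are assumptions.
    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect

    # Connect directly to the current (degraded) primary.
    client = MongoClient("mongodb://db-primary.example.internal:27017",
                         directConnection=True)
    try:
        # Ask the primary to step down for 300 seconds, allowing a healthy
        # member in another data center to be elected primary.
        client.admin.command("replSetStepDown", 300)
    except AutoReconnect:
        # The primary closes connections when it steps down; this is expected.
        pass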

Detection: At 9:47am PST, an alert indicated that our Docker registry was returning 503 responses. Our Docker registry deployment uses S3 as the storage backend for our internal images, so the S3 outage made the registry unavailable.

Website downtime was detected by Pingdom.
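
A monitoring check along the lines of the registry alert above might look like the sketch below; the registry URL and the alert hook are made-up assumptions, not Bugsnag's actual monitoring.

    # Hypothetical monitoring check; the registry URL and alert hook are made up.
    import urllib.request
    import urllib.error

    REGISTRY_URL = "https://registry.example.internal/v2/"  # assumed internal address

    def alert(message):
        print("ALERT:", message)  # stand-in for the real paging/alerting system

    try:
        urllib.request.urlopen(REGISTRY_URL, timeout=5)
    except urllib.error.HTTPError as err:
        if err.code == 503:
            # The registry stores image blobs in S3, so S3 being down surfaces
            # here as 503 responses from the registry itself.
            alert("Docker registry returning 503; check the S3 storage backend")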

Lessons Learned

What went well

  • Creating new servers prevented existing machines from filling up and event data from being lost.
  • Bugsnag has historically almost always processed all events in real time. We have now made a change so that when queue processing is delayed, the timestamp reflects when the event was received, not when the event is eventually saved to the database (see the sketch below).
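
A minimal sketch of that timestamping change follows. The field names and the in-memory "database" are illustrative assumptions, not Bugsnag's actual schema; the point is that the received-at time is captured at ingestion and carried with the event, so delayed processing does not shift the timestamp shown on the dashboard.

    # Illustrative only: field names and the in-memory database are hypothetical.
    from datetime import datetime, timezone

    def ingest(payload):
        # Stamp the event the moment it is received, before it enters any queue.
        return {"payload": payload, "received_at": datetime.now(timezone.utc)}

    def save(event, database):
        # Hours may have passed in the queue; the dashboard timestamp still
        # uses received_at, not the time of this database write.
        event["processed_at"] = datetime.now(timezone.utc)
        event["display_timestamp"] = event["received_at"]
        database.append(event)

    database = []
    save(ingest({"error_class": "RuntimeError"}), database)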

What didn’t go so well

  • Multiple AWS services failed at once. Bugsnag is set up so that, had only one of those services gone down, no degradation would have occurred.
  • There was no previously established process for cross-team cooperation during incidents of this magnitude.
  • Bugsnag failed to process events in real time.
  • The status page was not accurately updated due to the outage; we have made adjustments to prevent this in the future.

Action items

  • Move forward with existing plans to switch to a different cloud services provider.
  • Update the incident management process to ensure accurate, timely updates to the Bugsnag status page.
  • Ensure we have a large enough buffer in the queues to deal with a complete halt of our event processing pipeline, giving us enough time to provision more resources to cope (a rough sizing example follows this list).
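
The buffer sizing is simple arithmetic: peak ingest rate multiplied by the time needed to provision replacement capacity, times the average event size. The numbers in the sketch below are made up for illustration and are not Bugsnag figures.

    # Hypothetical numbers, for illustration only.
    peak_events_per_second = 5_000            # assumed peak ingest rate
    avg_event_size_bytes = 10 * 1024          # assumed average payload size
    provisioning_headroom_seconds = 2 * 3600  # assumed time to bring up new servers

    required_events = peak_events_per_second * provisioning_headroom_seconds
    required_bytes = required_events * avg_event_size_bytes

    print(f"buffer must hold {required_events:,} events "
          f"(~{required_bytes / 1024**3:.0f} GiB)")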

Timeline (all times PST)

  • 09:45am - AWS outage in US-EAST-1
  • 09:47am - Bugsnag alerted to a problem processing event queues; events stopped being saved
  • 10:40am - New servers created to store a much larger buffer of events in the interim
  • 10:50am - Events saved again
  • 11:00am - Website and API down due to degraded hard drive performance
  • 11:15am - Website and API restored
  • 03:40pm - New events began processing again, and events sent to Bugsnag during the affected time frame were queued for processing
  • 08:30am (Mar 1) - Event backlog finished processing
Posted Mar 01, 2017 - 22:43 UTC

Resolved
This incident has been resolved.
Posted Mar 01, 2017 - 00:19 UTC
Monitoring
Bugsnag was affected by the widespread AWS outage that occurred today. Error report processing was interrupted between 9:45am PT and 3:40pm PT. Error report processing has now been restored, and errors are being processed in real time. We're currently working on processing historical events that were queued during the outage. We'll follow up with a full post-mortem soon.

More details about AWS outage: https://status.aws.amazon.com
Posted Mar 01, 2017 - 00:18 UTC
Update
We're recovering from a global outage of our cloud hosting provider, AWS. Error reports from the past 3 hours could potentially be missing from your dashboard as we catch up on queue processing. You can track the AWS outage here - https://status.aws.amazon.com - and we'll continue to update the status page with more information.
Posted Feb 28, 2017 - 22:09 UTC
Update
An issue with our cloud provider is affecting error processing and parts of the Bugsnag service.
Posted Feb 28, 2017 - 18:29 UTC
Investigating
We are currently investigating this issue.
Posted Feb 28, 2017 - 18:04 UTC