Incident: At 2023-12-14 23:40 UTC, an engineer made a configuration change that caused downtime. We reverted the change, but determined that some application servers had entered a bad state from which they could not recover automatically. We identified and restarted the affected servers; to avoid overloading them, traffic was allowed to return incrementally as the servers became ready. Starting at 2023-12-15 00:22, we delayed background actions to aid recovery.
Impact: This resulted in full application and API downtime across all regions. In the US region, application downtime was resolved for most users between 2023-12-15 00:13 and 00:30; for the EU, Japan, and Australia regions, resolution began around 00:40. API traffic was restored between 00:36 and 00:43, and application and API availability was fully restored for all users by 2023-12-15 00:49. Background actions were delayed from 2023-12-15 00:22 through 01:08, with full resolution by 01:30.
Moving forward: This incident uncovered a bug that prevented our backend servers from restarting automatically. We have action items to fix this bug, improve configuration validation and handling, and enhance our observability and operational tooling. In the medium term, we are working to create more independent failure domains that limit the scope of unexpected downtime. We will follow our 5 Whys process to identify other potential action items.
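To illustrate the validation improvements mentioned above, here is a minimal sketch of a validate-before-apply flow. The required keys, field names, and limits are assumptions for illustration only, not our actual configuration system:

```python
# Minimal sketch of validate-before-apply config handling.
# The required keys and limits below are hypothetical, not our real schema.
import json
import sys

REQUIRED_KEYS = {"max_connections", "request_timeout_seconds"}

def validate(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config is safe to apply."""
    problems = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if config.get("max_connections", 1) <= 0:
        problems.append("max_connections must be positive")
    if config.get("request_timeout_seconds", 1) <= 0:
        problems.append("request_timeout_seconds must be positive")
    return problems

if __name__ == "__main__":
    config = json.load(open(sys.argv[1]))
    problems = validate(config)
    if problems:
        print("refusing to apply config:", *problems, sep="\n  ")
        sys.exit(1)
    print("config validated; safe to apply")
```

The point of this shape is that a change which fails validation is rejected before it reaches production, rather than being rolled back after it has caused downtime.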
Our metric is a weighted average of the uptime experienced by users at each data center; the number of minutes of downtime shown reflects this weighted average.
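For illustration, here is a minimal sketch of how such a weighted average could be computed. The data-center names, weights, and downtime figures are hypothetical placeholders, not figures from this incident:

```python
# Hypothetical example of the weighted downtime metric described above.
# Each data center's downtime is weighted by the share of users it serves;
# all weights and minutes below are made-up placeholder values.
downtime_by_datacenter = {
    # name: (downtime_minutes, share_of_users)
    "us-1": (30, 0.4),
    "us-2": (30, 0.2),
    "eu-1": (60, 0.2),
    "ap-1": (60, 0.2),
}

weighted_downtime = sum(
    minutes * share for minutes, share in downtime_by_datacenter.values()
)
print(f"Reported downtime: {weighted_downtime:.0f} minutes")  # 42 minutes here
```

In other words, a data center serving a larger share of users contributes proportionally more to the reported downtime minutes than one serving a smaller share.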