Incident: Starting at 2025-02-05 21:05 UTC, a configuration change caused a large increase in server log volume, overloading our logging infrastructure and causing server restarts. This triggered a cascading failure as the remaining servers became overloaded. The configuration change was reverted, and the application recovered as servers restarted, with full recovery by approximately 21:30 UTC on 2025-02-05.
Impact: Users saw errors when loading our application and when loading data within it. No customer data was lost.
Moving forward: We have removed the log source that caused the overload, and we are improving the monitoring and resilience of the system that triggered the failure. We will also move to staged configuration rollouts to reduce or avoid customer impact from similar issues.
Our uptime metric is a weighted average of the uptime experienced by users at each data center; the number of minutes of downtime shown reflects this weighted average.
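To make the weighting concrete, the sketch below shows one way such a per-data-center weighted average could be computed. The data center names, user counts, and per-data-center downtime figures are hypothetical and chosen purely for illustration; they are not actual incident data.

```python
# Illustrative sketch only: computing downtime as a weighted average across
# data centers, weighted by each data center's share of users.
# All names and numbers below are hypothetical.

# (data center, active users, minutes of downtime observed during the incident)
data_centers = [
    ("dc-east", 50_000, 25),
    ("dc-west", 30_000, 10),
    ("dc-eu",   20_000, 0),
]

total_users = sum(users for _, users, _ in data_centers)

# Weight each data center's downtime by its share of total users.
weighted_downtime = sum(
    (users / total_users) * downtime for _, users, downtime in data_centers
)

print(f"Weighted downtime: {weighted_downtime:.1f} minutes")
# With these hypothetical numbers: 0.5*25 + 0.3*10 + 0.2*0 = 15.5 minutes
```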