Incident: On 2024-10-10, between 01:23 and 02:31 UTC, a software change overloaded a shared resource, leading to application errors. The change was rolled back; most users recovered around 02:31 UTC, with full recovery by 02:39 UTC.
Impact: Users were unable to start new sessions on the web and desktop apps. The mobile apps and existing sessions were unaffected. No data was lost.
Next Steps: We will conduct a retrospective using the 5 Whys method to improve monitoring, tooling, and playbooks. We will also update our software to manage load more efficiently and reduce reliance on globally shared resources to prevent similar issues in the future.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
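As a rough illustration of how a weighted-average downtime figure can be computed, here is a minimal sketch. The report does not say what the weights are or give per-data-center figures, so the weighting by user count, the function name, and every number below are hypothetical and chosen only to show the arithmetic.

```python
def weighted_downtime_minutes(per_data_center):
    """Return the weighted average of downtime minutes.

    per_data_center: list of (weight, downtime_minutes) pairs, one per
    data center. The weight is ASSUMED here to be a user count; the
    actual weighting used by the metric is not stated in the report.
    """
    total_weight = sum(weight for weight, _ in per_data_center)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weight * minutes for weight, minutes in per_data_center)
    return weighted_sum / total_weight


# Hypothetical figures, not taken from the incident:
# (users served, minutes of downtime those users experienced)
example = [
    (500, 68),  # data center A
    (300, 76),  # data center B
    (200, 0),   # data center C, unaffected
]

print(round(weighted_downtime_minutes(example), 1))  # 56.8
```

Under this kind of weighting, a data center that saw no impact pulls the reported figure below the wall-clock length of the incident, which is why the displayed minutes can differ from the incident window.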