Incident: Around 2024-06-06 21:41 UTC, a software deployment began sending traffic to three regions (AU, JP, EU), which triggered a version incompatibility bug for new sessions started after that point. The error rate increased after a few minutes, and our engineers were alerted to these errors at 22:07 UTC. After investigating, we rolled back the new version; the rollback completed around 23:08 UTC.
Impact: Users who initiated new sessions during this timeframe saw an elevated rate of crashes. Sessions established before this timeframe were not affected.
Moving forward: In the short term, we are changing how we deploy software so that all global regions receive software deployments at the same time, with validation; this change could have prevented this incident. We will also update our incident response training and monitoring to improve response time. In the long term, we have a project to ensure that client and server versions are consistent for the software that encountered this incident’s error, which would allow us to avoid this class of issues.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
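As a rough illustration of how a weighted-average downtime figure can be computed, here is a minimal sketch. The weighting scheme (number of users per data center), the function name, and all of the numbers are hypothetical assumptions for illustration; the report does not state the actual weights or per-data-center values.

```python
def weighted_downtime_minutes(data_centers):
    """Return the weighted average of downtime across data centers.

    `data_centers` is a list of (weight, downtime_minutes) pairs.
    The weight is assumed here to be a user count per data center;
    the real metric's weighting is not specified in the report.
    """
    total_weight = sum(weight for weight, _ in data_centers)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(weight * minutes for weight, minutes in data_centers)
    return weighted_sum / total_weight


# Hypothetical example: three data centers with made-up user counts
# and made-up downtime minutes.
example = [
    (500, 0.0),   # unaffected data center
    (300, 60.0),  # affected for 60 minutes
    (200, 30.0),  # affected for 30 minutes
]
print(weighted_downtime_minutes(example))
```

Under this scheme, a data center that serves more users contributes more to the reported figure, so the reported number can be lower than the wall-clock duration of the incident when only some data centers were affected.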