Incident: On 2024-02-07, a deployment of new code introduced a performance regression for automations and background actions. During peak traffic in our European data center, from 03:56 to 04:54 UTC, the regression caused our background job system to hit its autoscaling safety limits and fall behind incoming work. We reverted the deployment and raised our scaling limits to speed recovery.
Impact: During the incident, customers in our European data center experienced delayed background actions, such as rules, email notifications, imports, and exports. No customer data was lost.
Moving forward: We will improve our monitoring and playbooks so that we can identify regressions like this more quickly and respond promptly. We have identified the root cause of this particular performance regression and are resolving it.
Our uptime metric is a weighted average of the uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
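As a minimal sketch of how such a user-weighted downtime figure could be computed: the data-center names, user counts, downtime minutes, and the choice of weighting by user count below are illustrative assumptions, not the published methodology.

```python
# Hypothetical sketch: downtime minutes averaged across data centers,
# weighted by the number of users served at each one.

def weighted_downtime_minutes(centers):
    """Return downtime minutes as a user-weighted average across data centers."""
    total_users = sum(c["users"] for c in centers)
    return sum(c["downtime_minutes"] * c["users"] for c in centers) / total_users

# Assumed example figures; the 58 minutes corresponds to 03:56-04:54 UTC.
centers = [
    {"name": "EU", "users": 40_000, "downtime_minutes": 58},  # affected region
    {"name": "US", "users": 60_000, "downtime_minutes": 0},   # unaffected region
]

print(f"Weighted downtime: {weighted_downtime_minutes(centers):.1f} minutes")
# -> 23.2 minutes: the regional outage is diluted by users in unaffected regions.
```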