Incident: Around 2023-02-28 20:31 UTC, the servers responsible for reactivity began to fail due to unexpected input. After we reverted the change that triggered the failures, most servers recovered between 21:30 and 22:00 UTC. In some cases stale data was displayed until caches were cleared, which finished at 23:05 UTC. Approximately 1% of users continued to see reactivity failures and stale data until around 2023-03-01 00:04 UTC.
Impact: While the reactivity servers were down, API writes failed and changes were not propagated to other tabs. After the reactivity servers recovered, stale data was in some cases displayed within our applications until the caches were fully cleared. No customer data was lost.
Moving forward: We are hardening the application servers that crashed so they are more resilient to unexpected input, and making tooling changes to reduce time to resolution for this class of incident. Architectural changes already in progress will provide smaller failure domains, which will reduce impact and enable faster resolution for this class of failure. We use the 5 Whys approach to identify technical, operational, and organizational changes that reduce the likelihood and severity of incidents.
Our uptime metric is a weighted average of the uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
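As a rough illustration of how such a metric can be computed, the sketch below weights each data center's downtime by its share of users. The data center names, user counts, and downtime minutes are hypothetical illustration values, not figures from this incident, and the weighting scheme is an assumption about how "weighted average" is defined here.

```python
# Hypothetical sketch: downtime minutes weighted by users per data center.
# All values below are illustrative, not actual incident data.

def weighted_downtime(downtime_minutes, user_counts):
    """Average downtime in minutes, weighted by the number of users
    served by each data center."""
    total_users = sum(user_counts.values())
    return sum(
        downtime_minutes[dc] * user_counts[dc] for dc in downtime_minutes
    ) / total_users

downtime = {"dc-east": 90, "dc-west": 30, "dc-eu": 0}   # minutes down
users = {"dc-east": 5000, "dc-west": 3000, "dc-eu": 2000}

print(weighted_downtime(downtime, users))
```

A data center serving more users contributes proportionally more to the reported downtime figure, so an outage confined to a lightly used region moves the metric less than the same outage in a heavily used one.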