Incident: A change to how we load data from our databases led to increased server memory usage. During a period of heavy traffic, the increased memory pressure exceeded a critical threshold. This caused some servers to become overloaded, which resulted in slow or unresponsive request handling and retries. We reverted to a prior deployment and observed system health recover.
Impact: Between 13:47 and 15:20 UTC on 2024-03-07, attempts to create or edit data in Asana were delayed by up to a few minutes, about 1% of web sessions crashed, and some API requests failed or were delayed. Background actions such as automations and email notifications were also delayed.
Moving forward: We have identified and reverted the problematic change and are improving our memory usage monitoring so that we can identify regressions more quickly, before they cause user impact. Additionally, we discovered an issue with a safety measure designed to prevent excessive memory usage. We will fix it to prevent similar memory pressure issues in the future.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
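As a rough illustration of how a weighted average of this kind can be computed, here is a minimal Python sketch. The function name, the choice of weights (for example, each data center's share of users), and the sample numbers are all assumptions made for illustration. They are not the actual weighting or figures behind the reported metric.

```python
from typing import Sequence


def weighted_downtime_minutes(
    downtime_minutes: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Return the weighted average of per-data-center downtime.

    Hypothetical helper: the weights are assumed to represent each
    data center's share of users. The real weighting is not stated
    in this report.
    """
    if len(downtime_minutes) != len(weights):
        raise ValueError("downtime_minutes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(d * w for d, w in zip(downtime_minutes, weights))
    return weighted_sum / total_weight


# Made-up example: one data center with 60 minutes of downtime and
# weight 1, and another with no downtime and weight 3.
example = weighted_downtime_minutes([60.0, 0.0], [1.0, 3.0])
print(example)  # 15.0
```

Under these assumed inputs, the reported figure is lower than the downtime at the affected data center, because users at the unaffected data center saw no downtime.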