Delayed writes and API performance issues
Incident Report for Asana
Postmortem

Incident

On 2023-12-20, an engineer made a configuration change that caused a large amount of unexpected logging. The following day during peak hours, at 14:13 UTC on 2023-12-21, the excess logging caused resource saturation that led to slow requests and timeouts. The configuration change was reverted, and related infrastructure was scaled up to aid recovery. After the initial recovery, some users experienced a second period of degraded performance caused by a retry mechanism that confirms event logs are recorded.

Impact

On 2023-12-21, between 14:13 UTC and 14:18 UTC, about 20% of Asana users in the US region experienced forced reloads, and existing sessions experienced delayed writes and API performance issues. Between 15:35 UTC and 15:45 UTC, about 5% of Asana users experienced delayed writes and API performance issues. Application uptime and performance were fully restored by 2023-12-21 15:46 UTC. No data was lost.

Moving Forward

The incident uncovered a gap in how we monitor logging volume and its effect on server performance during high-traffic periods. Moving forward, we have action items to make our infrastructure more resilient to unexpected increases in logging volume, and to ensure our monitoring lets us detect and prevent similar issues proactively.
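
As an illustration only, the kind of log-volume check described above might look like the following minimal sketch. It does not reflect Asana's actual monitoring; the class name `LogVolumeMonitor`, the window counts, and the `spike_factor` threshold are all hypothetical. It compares each window's log line count against a rolling baseline of recent windows and flags a spike when the count exceeds a multiple of that baseline.

```python
from collections import deque


class LogVolumeMonitor:
    """Flags windows whose log line count far exceeds a rolling baseline.

    Hypothetical sketch: the parameters below are illustrative assumptions,
    not values taken from the incident report.
    """

    def __init__(self, baseline_windows: int = 12, spike_factor: float = 3.0):
        # Keep only the most recent `baseline_windows` counts.
        self.history = deque(maxlen=baseline_windows)
        self.spike_factor = spike_factor

    def observe(self, lines_in_window: int) -> bool:
        """Record one window's log count; return True if it looks like a spike."""
        is_spike = False
        # Only judge spikes once a full baseline has been collected.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = lines_in_window > baseline * self.spike_factor
        # Fold only non-spike windows into the baseline, so a sustained
        # surge does not quietly become the new normal.
        if not is_spike:
            self.history.append(lines_in_window)
        return is_spike
```

A check like this would run ahead of peak hours, so that a surge introduced by a change the previous day could be noticed before traffic amplifies it.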

Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
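
To make the weighted average concrete, here is a small sketch of one way such a figure could be computed. The weighting scheme is an assumption (the report does not say how data centers are weighted), and the function name and the numbers in the usage note are hypothetical.

```python
def weighted_downtime_minutes(data_centers):
    """Return the weighted average downtime in minutes.

    `data_centers` is a list of (weight, downtime_minutes) pairs. The weight
    is assumed here to be something like a data center's user count; the
    report does not specify the actual weighting.
    """
    total_weight = sum(weight for weight, _ in data_centers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weight * minutes for weight, minutes in data_centers)
    return weighted_sum / total_weight
```

For example, with two hypothetical data centers weighted 60 and 40, where only the first saw 5 minutes of downtime, `weighted_downtime_minutes([(60, 5.0), (40, 0.0)])` gives 3.0 minutes.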

Posted Dec 22, 2023 - 23:32 UTC

Resolved
We are currently investigating this issue.
Posted Dec 20, 2023 - 22:13 UTC