Incident: An off-peak operational task was suspended partway through due to an error in our database upgrade script. This left our system in a configuration where caching was unavailable for some components, resulting in degraded performance during regional peak traffic. Additional, unrelated errors in the Amazon Web Services availability zone for the region increased the time to resolution. The net effect was that all EU traffic was slowed for approximately 10 hours.
Impact: At times, Asana was largely unavailable for new logins; only existing sessions could be used, and some of those sessions experienced degraded performance. For a period of time, Asana was fully inaccessible. No customer data was lost.
Moving forward: This incident involved interactions across multiple teams and systems. Our 5 Whys analysis of the event identified operational, organizational, and service changes to reduce the likelihood of future incidents and to decrease time to resolution. We've recently staffed a Production Engineering team whose goal is to improve processes and systems across teams.
Our uptime metric is a weighted average of the uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
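As a minimal sketch of how such a weighted average can be computed, here is one plausible formulation. It assumes each data center is weighted by the share of users it serves; the weighting scheme, data center names, and figures below are all hypothetical illustrations, not actual Asana data.

```python
# Sketch of a user-weighted downtime metric.
# Assumption: each data center is weighted by the share of users it serves;
# the names and numbers below are hypothetical, for illustration only.

datacenters = [
    # (name, users served, minutes of downtime in the period)
    ("us-east", 60_000, 0),
    ("eu-west", 25_000, 600),  # ~10 hours of degraded service
    ("ap-south", 15_000, 0),
]

total_users = sum(users for _, users, _ in datacenters)

# Weighted average: each data center's downtime counts in proportion
# to the fraction of users it serves.
weighted_downtime = sum(
    (users / total_users) * minutes for _, users, minutes in datacenters
)

print(f"Weighted downtime: {weighted_downtime:.1f} minutes")  # 150.0 minutes
```

Under these assumed numbers, a 10-hour outage affecting a region with a quarter of all users would register as 150 minutes of weighted downtime, rather than the full 600 minutes experienced in that region.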