Incident: Around 15:50 UTC, a caching system entered an overload state due to network connection-tracking saturation within our hosting environment. Connections from servers to this cache timed out, and the resulting request queuing created a cascading failure. Reducing load via application changes allowed recovery, and we then made configuration changes to provide additional capacity.
Impact: Asana was unavailable in all regions for up to 16 minutes, followed by a partial outage for an additional 6 minutes.
Moving forward: We've updated the configuration to prevent this overload, and we are adding monitoring to detect this type of saturation before it causes failures.
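A saturation check of this kind can be sketched as follows. This is a minimal illustration only, assuming conntrack-style counters (current entries vs. a table maximum); the threshold and counter names are hypothetical, not Asana's actual monitoring configuration.

```python
# Sketch of a connection-tracking saturation alert.
# current_entries / max_entries approximates how full the tracking table is;
# alerting below 100% gives time to act before new connections start failing.
# The 0.8 threshold is an illustrative choice, not a documented value.

def saturation_ratio(current_entries: int, max_entries: int) -> float:
    """Fraction of the connection-tracking table currently in use."""
    return current_entries / max_entries

def should_alert(current_entries: int, max_entries: int,
                 threshold: float = 0.8) -> bool:
    """Fire an alert once the table crosses the saturation threshold."""
    return saturation_ratio(current_entries, max_entries) >= threshold
```

In practice these counters would be scraped from the host (for example, Linux exposes them under `/proc/sys/net/netfilter/`) and fed to an alerting system.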
Our availability metric is a weighted average of the uptime experienced by users at each data center; the downtime minutes reported above reflect this weighted average.
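The weighted-average calculation can be illustrated with a small sketch. The data-center names, user counts, and outage durations below are hypothetical, chosen only to show how per-data-center downtime is weighted by user population.

```python
# Illustrative weighted-average downtime computation.
# Each data center's downtime (in minutes) is weighted by the number of
# users served there, so the reported figure reflects user experience
# rather than a simple per-site average.

def weighted_downtime(outage_minutes: dict[str, float],
                      users: dict[str, int]) -> float:
    """User-weighted average of downtime minutes across data centers."""
    total_users = sum(users.values())
    return sum(outage_minutes[dc] * users[dc] for dc in outage_minutes) / total_users

minutes = weighted_downtime(
    {"dc-a": 22.0, "dc-b": 16.0},  # hypothetical outage durations
    {"dc-a": 3, "dc-b": 1},        # hypothetical relative user counts
)
# With three times as many users in dc-a, the result is pulled toward 22.
```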