Incident: From 2023-03-23 13:30 UTC until 2023-03-23 15:10 UTC, all databases responsible for storing data related to Asana users were periodically failing over to their standby instances. We run these databases using AWS RDS Multi-AZ, and AWS triggers these failovers based on health signals it monitors. We are still working with AWS to investigate and determine the full root cause, but our current best theory is that elevated DNS lookup latency resulted in database connection saturation: the increase in DNS lookup latency caused connections to the database to pile up, eventually overloading the databases and triggering the automated RDS failover system. We recovered after deploying a configuration change to disable DNS hostname lookups.
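To illustrate the pile-up mechanism with a back-of-the-envelope sketch (the numbers below are hypothetical, not measurements from this incident): when a database resolves each connecting client's hostname, a DNS lookup sits on the connection-establishment path, and by Little's law the average number of in-flight connection attempts is roughly the connection arrival rate times the per-connection setup time. A spike in lookup latency therefore multiplies concurrent connections even when request traffic is unchanged:

```python
def inflight_connections(arrival_rate_per_s: float, setup_time_s: float) -> float:
    """Little's law: average concurrent connection attempts = arrival rate x setup time."""
    return arrival_rate_per_s * setup_time_s

# Hypothetical numbers, for illustration only.
RATE = 200.0           # new connections per second
FAST_DNS = 0.002       # 2 ms lookup: connection setup stays cheap
SLOW_DNS = 2.0         # 2 s lookup during elevated DNS latency
MAX_CONNECTIONS = 500  # hypothetical database connection limit

normal = inflight_connections(RATE, FAST_DNS)    # well under 1 concurrent attempt
degraded = inflight_connections(RATE, SLOW_DNS)  # hundreds of concurrent attempts

# With slow lookups, pending connections alone approach the database's limit.
print(normal, degraded, degraded > 0.5 * MAX_CONNECTIONS)
```

This is also why disabling hostname lookups was an effective mitigation: it removes the slow step from connection setup entirely rather than speeding it up.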
Impact: During the incident, users of the web app were unable to load or use Asana for 1 to 5 minutes while a database was unavailable during failover. Users may have experienced these periodic outages multiple times during the 100-minute event. API users experienced similar downtime.
Moving forward: We’re working with the AWS RDS team to fully understand the root cause of the issue. We are also making configuration changes, such as disabling DNS hostname lookups on our databases, so that we will no longer be as susceptible to increased DNS latency; we believe our database use case does not need to resolve client hostnames. In the long term, we’ll revisit our database connection strategy to make more use of connection pooling, which will reduce our dependence on establishing many short-lived database connections.
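As a sketch of the pooling direction (a minimal illustration, not our actual implementation; the `Connection` stand-in and pool size here are hypothetical): a pool pays connection-setup cost, including any DNS lookup, once at startup, then hands the same connections out for reuse, so steady-state request traffic no longer establishes connections on the hot path:

```python
import queue

class Connection:
    """Stand-in for a real database connection; counts how many times we 'connect'."""
    opened = 0

    def __init__(self):
        Connection.opened += 1  # a real connection would do DNS + TCP + auth here

class ConnectionPool:
    def __init__(self, size: int):
        # Establish all connections up front, once.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(Connection())

    def acquire(self, timeout: float = 5.0) -> Connection:
        # Under load, callers briefly wait for a free connection
        # instead of opening a new one.
        return self._pool.get(timeout=timeout)

    def release(self, conn: Connection) -> None:
        self._pool.put(conn)

pool = ConnectionPool(size=5)
for _ in range(100):      # 100 requests...
    conn = pool.acquire()
    # ... run a query on conn ...
    pool.release(conn)

print(Connection.opened)  # ...but only 5 connections ever opened
```

The design point is that a slow dependency in connection setup (like DNS) then degrades only pool warm-up, not every request.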