On 06/Oct/2020, between 10:58 AM and 02:42 PM UTC, customers using cloud versions of Jira Software, Jira Service Management, and Jira Core were unable to use their Jira instances due to an elevated number of 5XX errors. The event was triggered by a loss of connectivity to one of our database servers on AWS and impacted a limited number of Europe-based customers. The incident was detected within 129 minutes by our automated monitoring system and mitigated by replacing Jira EC2 instances (both web server and background processing nodes), which put our systems into a known good state. The total time to resolution was approximately 3 hours and 44 minutes. Incident history can be tracked at https://jira-service-management.status.atlassian.com/incidents/1c8nqj1p9zdq
TECHNICAL REASONS
The issue was caused by a forced database failover, which occurred because the previous ‘primary’ host was not responding. After the failover, Jira was not able to obtain new database connections from the connection pool. As a result, none of the above products could connect to the underlying database server, and their users received HTTP errors.
ROOT CAUSE
More specifically, our connection pool implementation had a bug that could prevent clients from obtaining new database connections in very particular situations, such as a forced failover.
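For readers unfamiliar with this failure mode, the sketch below illustrates one common defensive technique: validating idle connections when they are borrowed, so that connections left pointing at a failed primary are discarded and replaced. It is a toy illustration only and does not represent Jira's connection pool or the specific bug involved. The class name `ValidatingPool` and the `connect` and `is_alive` callbacks are hypothetical names introduced for this example.

```python
import queue
from typing import Callable, Generic, TypeVar

C = TypeVar("C")


class ValidatingPool(Generic[C]):
    """Toy pool that health-checks each idle connection before handing it out.

    Illustrative assumptions: `connect` opens a connection to whatever host is
    currently primary, and `is_alive` is a cheap check (for example, a
    "SELECT 1" round trip). The total number of open connections is not
    capped here; a real pool would also enforce that limit.
    """

    def __init__(self, connect: Callable[[], C],
                 is_alive: Callable[[C], bool], size: int) -> None:
        self._connect = connect
        self._is_alive = is_alive
        self._idle: "queue.LifoQueue[C]" = queue.LifoQueue(maxsize=size)

    def acquire(self) -> C:
        # Drain idle connections until a healthy one is found.
        while True:
            try:
                conn = self._idle.get_nowait()
            except queue.Empty:
                break
            if self._is_alive(conn):
                return conn
            # Stale connection (e.g. bound to the old primary): drop it.
        # Nothing usable was idle, so open a fresh connection.
        return self._connect()

    def release(self, conn: C) -> None:
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            pass  # idle queue is full; the extra connection is discarded
```

In a simulated failover, a connection opened against the old primary fails the `is_alive` check on the next `acquire()` call, and the pool returns a new connection to the current primary instead of handing back the stale one.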
To prevent such incidents in the future, we will do the following:
While we have a number of monitoring logs and alerts in place, this specific issue wasn’t identified as early as we would have liked because we were missing specific metrics and alerts.
Additionally, we have introduced changes meant to:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support