Jira Outage
Incident Report for Jira Service Management
Postmortem

SUMMARY

On 06/Oct/2020, between 10:58 AM and 02:42 PM UTC, customers using cloud versions of Jira Software, Jira Service Management, and Jira Core were unable to use their Jira instances due to an elevated number of 5XX errors. The event was triggered by a loss of connectivity to one of our database servers on AWS and impacted a limited number of European-based customers. The incident was detected within 129 minutes by our automated monitoring system and mitigated by replacing Jira EC2 instances (both web server and background processing nodes), which put our systems into a known good state. The total time to resolution was 3 hours and 44 minutes. Incident history can be tracked at https://jira-service-management.status.atlassian.com/incidents/1c8nqj1p9zdq

TECHNICAL REASONS

The issue was caused by a forced database failover, which occurred because the previous ‘primary’ host was not responding. After the failover, Jira was not able to obtain new database connections from the connection pool. As a result, none of the above products could connect to the underlying database server, and their users received HTTP errors.

ROOT CAUSE

More specifically, our connection pool implementation had a bug that could prevent clients from obtaining a new database connection in very particular situations, such as a forced failover.
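
This report does not describe the bug itself. As a general illustration of this failure class, the following is a minimal sketch of a pool that validates idle connections when they are borrowed and replaces stale ones, so that connections opened to a host that is no longer primary do not block new work. The class name ValidatingPool, its methods, and the liveness check are hypothetical and are not taken from Jira's connection pool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of validate-on-borrow pooling; not Jira's implementation.
 * Size limits, timeouts, and closing of discarded connections are omitted.
 */
final class ValidatingPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;   // opens a connection to the current primary
    private final Predicate<C> isAlive;  // cheap liveness check (assumed to exist)

    ValidatingPool(Supplier<C> factory, Predicate<C> isAlive) {
        this.factory = factory;
        this.isAlive = isAlive;
    }

    /** Returns a live idle connection if one exists, otherwise opens a new one. */
    synchronized C borrow() {
        C candidate;
        while ((candidate = idle.pollFirst()) != null) {
            if (isAlive.test(candidate)) {
                return candidate;
            }
            // Stale connection (for example, opened before a failover): drop it.
        }
        return factory.get();
    }

    /** Hands a connection back to the pool for reuse. */
    synchronized void release(C connection) {
        idle.addLast(connection);
    }

    /** Number of connections currently waiting in the pool. */
    synchronized int idleCount() {
        return idle.size();
    }
}
```

In this sketch, a connection that fails the liveness check after a failover is discarded, and borrow() falls through to the factory instead of returning the dead connection or failing.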

REMEDIAL ACTIONS PLAN & NEXT STEPS

To prevent such incidents in the future, we will do the following:

  • Fix the connection pool bug that was preventing clients from obtaining a new database connection
  • Conclude tests (also known as Wargames) that simulate an RDS failover due to an issue with the underlying hardware.

While we have a number of monitoring logs and alerts in place, this specific issue wasn’t identified as early as we would have liked because we were missing specific metrics and alerts. To close this gap, we will:

  • Create a detector monitoring very high error rates over short periods of time
  • Add metrics for other failures when attempting to acquire a connection from the database server
  • Log IP information when a connection cannot be obtained for a particular database server
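
A detector for very high error rates over short periods can be sketched as a sliding-window counter. The example below is illustrative only: the class name ErrorRateDetector, the window length, the threshold, and the minimum sample count are assumptions and do not describe Atlassian's monitoring system.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window detector for the share of 5XX responses. */
final class ErrorRateDetector {
    private static final class Sample {
        final long timestampMillis;
        final boolean isError;

        Sample(long timestampMillis, boolean isError) {
            this.timestampMillis = timestampMillis;
            this.isError = isError;
        }
    }

    private final Deque<Sample> window = new ArrayDeque<>();
    private final long windowMillis;  // how far back to look
    private final double threshold;   // error share that triggers an alert (0.0 to 1.0)
    private final int minSamples;     // avoid alerting on very low traffic
    private int errorCount = 0;

    ErrorRateDetector(long windowMillis, double threshold, int minSamples) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
        this.minSamples = minSamples;
    }

    /**
     * Records one response and returns true when the share of 5XX responses
     * within the window has reached the threshold.
     */
    synchronized boolean record(long nowMillis, int httpStatus) {
        boolean isError = httpStatus >= 500 && httpStatus <= 599;
        window.addLast(new Sample(nowMillis, isError));
        if (isError) {
            errorCount++;
        }
        // Evict samples that have fallen out of the window.
        while (!window.isEmpty()
                && window.peekFirst().timestampMillis <= nowMillis - windowMillis) {
            if (window.removeFirst().isError) {
                errorCount--;
            }
        }
        return window.size() >= minSamples
                && (double) errorCount / window.size() >= threshold;
    }
}
```

For example, new ErrorRateDetector(60_000, 0.5, 4) would flag the point at which at least half of the responses recorded in the last minute were 5XX errors, once at least four responses are in the window.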

Additionally, we have introduced changes meant to:

  • Prepare a JSRE (Jira Site Reliability Engineering) runbook to debug DNS-related connection issues.
  • Roll out DNS cache setting changes and monitor their impact
  • Improve both the JVM and Linux kernel DNS refresh settings.
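
On the JVM side, DNS caching is controlled by the standard security properties networkaddress.cache.ttl and networkaddress.cache.negative.ttl. The sketch below shows one way to set them programmatically. The helper class DnsCacheSettings and the example values are assumptions, not the settings Atlassian rolled out, and the call is assumed to run early in startup, before the JVM performs any name lookups.

```java
import java.security.Security;

/** Hypothetical helper; the TTL values passed in are illustrative only. */
final class DnsCacheSettings {
    private DnsCacheSettings() {}

    /**
     * Sets how long (in seconds) the JVM caches successful and failed
     * hostname lookups. Shorter values make the JVM re-resolve names sooner.
     */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Cache lifetime for successful lookups.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // Cache lifetime for failed lookups.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

Calling DnsCacheSettings.apply(5, 1) at startup, for instance, would ask the JVM to cache successful lookups for 5 seconds and failed lookups for 1 second. The same properties can also be set in the JVM's java.security configuration file.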

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jan 28, 2021 - 12:23 UTC

Resolved
Sometime around 2020-10-06 10:54 UTC, we experienced a temporary outage for some Jira Service Desk Cloud customers. The issue has been resolved and the service is operating normally.
Posted Oct 06, 2020 - 15:40 UTC
Monitoring
We have identified the root cause of the temporary outage and have mitigated the problem. We are now monitoring closely.
Posted Oct 06, 2020 - 14:53 UTC
Investigating
We are investigating an issue where the application is not accessible that is impacting some Jira Service Desk Cloud customers. We will provide more details within the next hour.
Posted Oct 06, 2020 - 13:43 UTC