How Atlassian offers resiliency
Our products run in a Platform as a Service (PaaS) environment that is divided into two main sets of infrastructure, which we refer to as Micros and non-Micros. Jira, Confluence, Statuspage, Access, and Bitbucket run on the Micros platform, while Jira Align, Opsgenie, and Trello run on the non-Micros platform.
We strive to minimize the impact on customers in the event of a disruption. We use multiple geographically distributed data centers, maintain a comprehensive backup program, and gain assurance by regularly testing our disaster recovery and business continuity plans.
This page provides an overview of how we manage the customer data lifecycle, including backups that use native Amazon Web Services (AWS) features to ensure the availability of our services, how we regularly test our disaster recovery plans, and how we approach the continuous improvement of our disaster recovery and business continuity plans.
How we manage backups
First things first: infrastructure and databases
Broadly speaking, Atlassian is divided into two main sets of infrastructure on which our products run: a Platform-as-a-Service (PaaS) environment, known internally as Micros, and a non-Micros environment. Products that run on our Micros platform include Jira, Confluence, Statuspage, Bitbucket, and Atlassian Access, and products that run in non-Micros environments include Opsgenie and Trello. To keep things simple, this document focuses primarily on our largest products: Jira, Confluence, and Bitbucket.
Jira and Confluence Cloud are hosted in multiple AWS Regions (specifically US East, US West, Ireland, Frankfurt, Singapore, and Sydney, with plans to expand to other regions later), leveraging AWS Infrastructure as a Service (IaaS) offerings. Jira and Confluence Cloud use logically separate relational databases for each product instance, while attachments in Jira or Confluence Cloud are stored on our document storage platform ("Media Platform"), which ultimately stores them in Amazon S3.
Backups
Atlassian understands that everything your business does creates data, and without your data you don't have a business. In line with our "Don't #@!% the Customer" value, protecting your data against loss is very important to us, and we have an extensive backup program.
For Jira and Confluence Cloud, Atlassian uses the Amazon Relational Database Service (RDS) snapshot feature to create automated daily backups of each RDS instance. Amazon RDS snapshots are retained for 30 days with point-in-time restore support and are encrypted with AES-256.
Note for Jira Align: Amazon RDS snapshots are retained for 35 days.
For Bitbucket, data is replicated to a different AWS Region and separate backups are performed daily within each Region.
Atlassian tests backups for recovery on a quarterly basis, and any issues identified during these tests are raised as Jira tickets so that every issue is tracked through to resolution.
For more information, see our Frequently asked questions about data storage.
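The retention windows described above (30 days for Jira and Confluence Cloud, 35 days for Jira Align) can be illustrated with a little date arithmetic. The following is a minimal sketch, not Atlassian's tooling and not an AWS API call: the dictionary keys and function names are hypothetical, and only the day counts come from the text.

```python
from datetime import datetime, timedelta, timezone

# Retention windows in days, taken from the text above.
# The keys are hypothetical labels chosen for this illustration.
RETENTION_DAYS = {"jira_confluence": 30, "jira_align": 35}


def earliest_restore_point(now: datetime, product: str) -> datetime:
    """Return the oldest moment still covered by the retention window."""
    return now - timedelta(days=RETENTION_DAYS[product])


def is_restorable(target: datetime, now: datetime, product: str) -> bool:
    """True if `target` lies inside the retention window that ends at `now`."""
    return earliest_restore_point(now, product) <= target <= now


# Hypothetical dates, used only to exercise the functions.
now = datetime(2024, 3, 31, tzinfo=timezone.utc)
target = datetime(2024, 2, 28, tzinfo=timezone.utc)
print(is_restorable(target, now, "jira_confluence"))  # False: older than 30 days
print(is_restorable(target, now, "jira_align"))       # True: within 35 days
```

The same target date falls outside the 30-day window but inside the 35-day one, which is the practical difference the Jira Align note describes.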
How we use multiple data centers and availability zones to achieve high availability
Because hurricanes, earthquakes, and tsunamis are remote but non-zero risks, it is imperative that data be backed up (and replicated) across multiple geographic locations so that it can be recovered no matter what.
Atlassian does this by leveraging AWS's highly available data center facilities in various regions of the world. Each AWS Region is a separate geographic location with multiple isolated locations called Availability Zones (AZs). For example, us-west-1 (Northern California, on the west coast of the United States) is a Region that contains Availability Zones such as us-west-1a and us-west-1b, which sit in the same general area but are physically isolated from one another.
Each AZ is designed to be isolated from failures in other AZs and provide low-cost, low-latency network connectivity to other AZs in the same region. This multi-zone high availability is the first line of defense and means that services running in Multi-AZ deployments should be able to withstand an AZ outage.
Jira and Confluence use the Multi-AZ deployment mode for Amazon RDS. In a Multi-AZ deployment, Amazon RDS provisions a synchronous standby replica and manages it in another AZ in the same region to provide redundancy and failover capability. AZ failover is automated and typically takes 60-120 seconds, allowing database operations to resume as quickly as possible without administrator intervention. These concepts of region, availability zone, and replication are highlighted in the following diagrams. Opsgenie, Statuspage, Trello, and Jira Align use similar deployment strategies with slight differences in replication time and failover time.
How we determine recovery time and recovery point objectives
In an ideal world, we would never lose important business data. In practice, however, a system with no risk of data loss is either unachievable or prohibitively expensive. While Atlassian has long had a cultural expectation of zero data loss and of automatically surviving an Availability Zone failure, business continuity planning requires setting a "recovery time objective" (RTO) and a "recovery point objective" (RPO) that strike the right balance between costs, benefits, and risks.
RTO is the period after an incident within which the business process (or system) must be recovered and operational again. RPO is effectively the amount of data that the organization accepts it may lose in a recovery operation. As a simple example, if you take daily backups, have an incident at the end of the day, and recover from the most recent backup (taken the day before), you lose one day of data. That one-day window is the RPO.
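The daily-backup example can be worked through in a few lines of Python. This is a minimal sketch: the function name and the timestamps are invented for illustration and do not describe any Atlassian system.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Return how much recent data is lost when restoring from `last_backup`."""
    if incident < last_backup:
        raise ValueError("incident must not precede the last backup")
    return incident - last_backup


# Hypothetical timestamps: a backup taken at midnight, an incident at 23:00.
last_backup = datetime(2024, 1, 1, 0, 0)
incident = datetime(2024, 1, 1, 23, 0)

loss = data_loss_window(last_backup, incident)
print(loss)                        # 23:00:00
print(loss < timedelta(hours=24))  # True: within a one-day RPO
```

With one backup per day, the worst case approaches 24 hours of lost data, which is why a daily schedule corresponds to an RPO of about one day.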
Our business impact and risk assessments help our teams set customized RTO and RPO targets based on customer requirements and the potential impact of a disruption.
More specifically, we divide our services into easy-to-understand groups that we call tiers. Three tiers (Tiers 1, 2, and 3) are defined for customer-facing products and services, Atlassian business systems, and internal tools, and an underlying tier (Tier 0) sets an even higher availability standard for the critical components on which everything else depends.
For each tier, we have defined binding objectives by, among other things, reviewing business impact assessments and typical usage scenarios for the services we create. Our service tiers help determine the availability, reliability, RTO, and RPO objectives detailed in the table below.
| | Tier 0 | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|---|
| Description | Critical infrastructure and service components. Our Tier 0 services are the foundation of all other services and are critical to the delivery of our products. | Our Tier 1 services are generally our products or are directly related to the delivery of our products. | Tier 2 services are non-critical or internally focused. | Tier 3 services are non-critical or internally focused. |
| Sample services | AWS Micros server platform · Network core | Jira and Confluence Cloud · Bitbucket · Jira Align · Trello · Opsgenie | Image effect · CAC | Obtaining analytics and/or BI data |
| RPO* | <1 hour | <1 hour | <8 hours | <24 hours |
| RTO** | <4 hours | <6 hours | <24 hours | <72 hours |
*RPO: Recovery Point Objective, the data loss accepted in the event of a disaster
**RTO: Recovery Time Objective, the time to restore services in the event of a disaster
At Atlassian, we hold service owners accountable for ensuring that the relevant service meets its RPO and RTO target.
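As a rough illustration of how a service owner might compare the results of a recovery test against the tier targets in the table, here is a minimal Python sketch. The RPO and RTO values are copied from the table; the class, dictionary, and function names are hypothetical and do not describe Atlassian's internal tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierObjectives:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated time to restore service


# Targets taken from the tier table above.
TIERS = {
    0: TierObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4)),
    1: TierObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=6)),
    2: TierObjectives(rpo=timedelta(hours=8), rto=timedelta(hours=24)),
    3: TierObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=72)),
}


def meets_objectives(tier: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """True if a measured recovery stays strictly under the tier's RPO and RTO."""
    target = TIERS[tier]
    return data_loss < target.rpo and downtime < target.rto


# Hypothetical test result: 30 minutes of data lost, 5 hours to restore.
measured_loss = timedelta(minutes=30)
measured_downtime = timedelta(hours=5)
print(meets_objectives(1, measured_loss, measured_downtime))  # True
print(meets_objectives(0, measured_loss, measured_downtime))  # False: RTO is <4 hours
```

The comparison is strict because the table states each target as "less than" a duration; the same measured recovery passes for Tier 1 but fails the tighter Tier 0 recovery time.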
How we test disaster recovery
Atlassian performs regular disaster recovery testing and strives for continuous improvement through our Disaster Recovery (DR) program. This is to ensure that customer data and services are reliable and resilient. We perform scheduled and ad hoc tests, including the following items:
Documentation - For customer-facing/critical services (including Tier 0 and Tier 1), supporting documentation is reviewed quarterly for accuracy, completeness, and timeliness. All identified issues are documented and result in an internal Jira ticket, so that each issue can be tracked through to resolution.
Processes - The actual technical backup/restore processes for critical/customer-facing services (including Tier 0 and Tier 1) are also tested quarterly to determine whether RTO and RPO targets (based on service tier classification) are being met. All issues identified by these tests are raised as Jira tickets and tracked until they are resolved.
Resiliency and failover - Periodic and ad hoc AZ resiliency testing is performed to ensure that Atlassian can handle an AZ failure with minimal downtime. While we understand that a complete region failure is highly unlikely, we also regularly test region failovers and continue to build our regional resiliency.
Monitoring - Site reliability engineering (SRE) teams and product development teams continually monitor a variety of service metrics to ensure that users have excellent experiences. Automatic alerts notify SRE team members when certain service metric thresholds are exceeded, allowing immediate action to be taken within our incident response processes.
Disaster Recovery Dashboard - A DR dashboard is maintained internally so that Jira tickets related to monitoring, maintaining, and testing critical or customer-facing services (including Tier 0 and Tier 1) can be tracked centrally, ensuring that documentation reviews and backup/recovery process tests are completed on time.
DR tests and simulations - DR tests are performed annually and on an ad hoc basis. As part of our DR testing, we run simulation exercises to help DR teams work through different potential incident scenarios. Tabletop exercises test different scenarios and identify gaps in our recovery processes. Tabletop exercise scenarios include earthquakes, fires, natural disasters, recovery exercises, and tests. After DR testing is performed, the test results are collected, analyzed, and discussed to determine the scope of the next continuous improvement steps. Improvement efforts are captured in a Jira ticket and tracked to resolution.
Atlassian recognizes that while our testing and processes are technically rigorous, we still place a premium on having exceptional people bring it all together. Accordingly, Atlassian includes the following people elements in our DR program:
Site Reliability Engineers ("SREs") - SREs commit to regular DR meetings and represent their critical services. They identify DR gaps with our risk and compliance team and focus on remediation as needed.
Disaster Recovery Champions - Within each product/service team (including underlying services), DR Champions are appointed to oversee and manage the DR implementation within that product/service to ensure it meets service tier requirements.
Leadership - We maintain the participation and ongoing commitment of executives and senior management in our DR processes. With leadership involved, Atlassian considers both business and technical factors in its resiliency strategy.
Other broader business continuity measures and plans
Atlassian strives to maintain strong Business Continuity ("BC") and DR capabilities to ensure that the impact to our customers is minimized in the event of a business interruption. The key tenets of our BC and DR program include:
Continuous improvement - Atlassian strives to improve resiliency through operational efficiencies, automation, new technologies, and best practices.
Assurance through testing - Atlassian recognizes that through regularly scheduled testing and continuous improvement, we can achieve optimal resiliency.
Dedicated resources - Atlassian has dedicated individuals and teams to ensure our customer-facing products receive the attention they need to enable BC and DR. Atlassian has the right resources in place to support our steering committee, risk assessments, business impact analysis, testing, and of course, real-world incidents.
In summary
Atlassian combines best-in-class technologies with ongoing testing and validation to ensure our customers' data is highly available, reliable, and resilient. We operate several geographically dispersed data centers, have an extensive backup program, and gain assurance through regular testing of our disaster recovery and business continuity plans. To top it all off, we have exceptional people and dedicated resources bringing our processes together.
Want to dig deeper?
- Trust @Atlassian
- Security at Atlassian
- Atlassian Security Practices
- Atlassian architecture and operational practices
- Compliance at Atlassian
- SOC 2 Reports (System and Organizational Controls).
- ISO/IEC 27001 and ISO/IEC 27018 certified
- status page