Domain 3: Design Business Continuity Solutions

Recovery Objectives: RPO, RTO & SLA

Before designing backup or DR, you need to know the targets. RPO, RTO, and SLA are the numbers that drive every business continuity architecture decision.

Understanding recovery objectives

Simple explanation

Imagine your house floods. Two questions: How much stuff can you afford to lose? (RPO: Recovery Point Objective) and How quickly do you need to be back in a liveable house? (RTO: Recovery Time Objective).

RPO = how much data loss is acceptable. RPO of 1 hour means you can lose up to 1 hour of data. RPO of 0 means zero data loss.

RTO = how long downtime is acceptable. RTO of 4 hours means the service must be back within 4 hours. RTO of 0 means no downtime.

SLA = the availability percentage Azure commits to, backed by service credits if it is missed. 99.99% allows about 4.38 minutes of downtime per month.
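The downtime figure follows directly from the percentage. A minimal sketch of the arithmetic, assuming an average month of 730 hours (8,760 hours / 12); the function name and constant are illustrative, not part of any Azure tooling:

```python
HOURS_PER_MONTH = 730  # assumption: average month (8,760 hours / 12)


def allowed_downtime_minutes(sla_percent: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime budget, in minutes, implied by an availability percentage."""
    unavailable_fraction = (100.0 - sla_percent) / 100.0
    return unavailable_fraction * hours * 60


if __name__ == "__main__":
    # Print the monthly downtime budget for a few common availability levels
    for sla in (99.9, 99.95, 99.99, 99.999):
        print(f"{sla}% -> {allowed_downtime_minutes(sla):.2f} min/month")
```

With these assumptions, 99.99% works out to roughly 4.38 minutes per month, matching the figure above.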

RPO and RTO in practice

| RPO Target | What It Means | Technology Required | Cost |
| --- | --- | --- | --- |
| 0 (zero data loss) | Every transaction must survive a failure | Synchronous replication (same-region AZ) | Highest |
| < 5 minutes | Near-real-time replication | Asynchronous replication (geo-replication) | High |
| < 1 hour | Recent state preserved | Frequent automated backups | Medium |
| < 24 hours | Yesterday’s data preserved | Daily backups | Low |
| < 7 days | Weekly checkpoint | Weekly backups with retention | Lowest |

| RTO Target | What It Means | Technology Required | Cost |
| --- | --- | --- | --- |
| 0 (no downtime) | Instant failover, always active | Active-active deployment, multi-region | Highest |
| < 15 minutes | Automated failover | Hot standby, auto-failover groups | High |
| < 4 hours | Manual failover with prepared runbook | Warm standby, Azure Site Recovery | Medium |
| < 24 hours | Restore from backup | Backup + restore procedures, cold standby | Low |
| < 72 hours | Extended recovery acceptable | Archive restore, rebuild from scratch | Lowest |
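
For the backup-based rows, a rough check of a schedule against its targets can be sketched as follows. This is a simplified model, not an Azure API: it assumes the worst-case data loss equals the interval between backups (ignoring how long a backup takes) and that downtime equals the restore duration (ignoring detection and decision time). All names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # A failure just before the next backup loses everything written
    # since the previous one, so the interval is the worst case.
    return backup_interval


def meets_objectives(
    backup_interval: timedelta,
    restore_duration: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> bool:
    """Return True when the schedule satisfies both RPO and RTO in this simplified model."""
    return worst_case_data_loss(backup_interval) <= rpo and restore_duration <= rto


if __name__ == "__main__":
    # Daily backups cannot satisfy a 1-hour RPO, however fast the restore is
    print(meets_objectives(
        backup_interval=timedelta(hours=24),
        restore_duration=timedelta(hours=2),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=4),
    ))
```

A real assessment would also add time to detect the failure and decide to restore, which counts against the RTO.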

🏦 Elena’s recovery tiers: FinSecure Bank classifies workloads into tiers:

| Tier | Workload | RPO | RTO | Solution |
| --- | --- | --- | --- | --- |
| Tier 1 (Critical) | Trading platform | 0 | 0 | Active-active, synchronous replication |
| Tier 2 (Important) | Customer portal | 5 min | 15 min | Geo-replication, auto-failover |
| Tier 3 (Standard) | Internal reporting | 1 hour | 4 hours | Azure Backup, ASR warm standby |
| Tier 4 (Non-critical) | Dev/test environments | 24 hours | 24 hours | Daily backup, rebuild from IaC |

Exam tip: Not everything needs Tier 1 protection

A common exam trap is designing maximum protection for everything. The correct answer weighs business impact against cost. If the scenario says “internal reporting dashboard”, it doesn’t need an active-active multi-region deployment. Match the investment to the impact of downtime.
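
One way to express "match the investment to the impact" is to pick the cheapest tier that still satisfies the stated RPO and RTO. The sketch below uses the figures from FinSecure's tier table; the `Tier` class and `assign_tier` function are hypothetical helpers for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpo_minutes: int  # maximum data loss the tier's solution allows
    rto_minutes: int  # maximum downtime the tier's solution allows


# Ordered from least to most expensive, using the figures in the tier table
TIERS = [
    Tier("Tier 4 (Non-critical)", 24 * 60, 24 * 60),
    Tier("Tier 3 (Standard)", 60, 4 * 60),
    Tier("Tier 2 (Important)", 5, 15),
    Tier("Tier 1 (Critical)", 0, 0),
]


def assign_tier(required_rpo_minutes: int, required_rto_minutes: int) -> Tier:
    """Return the cheapest tier whose RPO and RTO both meet the requirement."""
    for tier in TIERS:
        if (tier.rpo_minutes <= required_rpo_minutes
                and tier.rto_minutes <= required_rto_minutes):
            return tier
    raise ValueError("Requirements must be non-negative")


if __name__ == "__main__":
    print(assign_tier(0, 0).name)     # trading platform: zero loss, zero downtime
    print(assign_tier(60, 240).name)  # internal reporting: 1 hour RPO, 4 hour RTO
```

Searching from the cheapest tier upward means a workload only lands in Tier 1 when nothing cheaper meets its targets.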

Composite SLA calculation

When your architecture depends on multiple Azure services in series (every one must be up for a request to succeed), the composite SLA is the product of the individual SLAs:

| Service | Individual SLA |
| --- | --- |
| Azure App Service | 99.95% |
| Azure SQL Database | 99.99% |
| Azure Storage (GRS) | 99.99% |

Composite SLA = 0.9995 x 0.9999 x 0.9999 ≈ 0.9993, or 99.93% (about 30 minutes of downtime per month)
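
The same multiplication can be sketched in a few lines. The SLA values are the ones listed in the table; the 730-hour month and the function name are assumptions made for illustration.

```python
import math
from typing import Iterable

HOURS_PER_MONTH = 730  # assumption: average month (8,760 hours / 12)


def composite_sla(slas: Iterable[float]) -> float:
    """Multiply the SLAs of services that must all be up (serial dependency)."""
    return math.prod(slas)


if __name__ == "__main__":
    services = {
        "Azure App Service": 0.9995,
        "Azure SQL Database": 0.9999,
        "Azure Storage (GRS)": 0.9999,
    }
    total = composite_sla(services.values())
    downtime_minutes = (1 - total) * HOURS_PER_MONTH * 60
    print(f"Composite SLA: {total * 100:.2f}%")
    print(f"Downtime budget: {downtime_minutes:.1f} min/month")
```

Because every factor is below 1, each service added to the chain lowers the product.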

Improving composite SLA

| Technique | How It Helps |
| --- | --- |
| Availability Zones | Survives a data centre failure; increases the individual service SLA |
| Multi-region deployment | Survives a regional failure; can achieve 99.999%+ |
| Redundant paths | If component A fails, component B handles traffic |
| Queue-based decoupling | Services communicate asynchronously, so one failure doesn’t cascade |

🏗️ Priya’s SLA design: GlobalTech needs 99.99% availability for its customer portal. A single-region deployment gives 99.93%. Priya added:

  • Multi-region App Service with Traffic Manager failover
  • SQL auto-failover group (secondary in paired region)
  • Result: individual failures don’t cause customer-visible downtime
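
Why a second region helps can be sketched with the standard formula for redundant components: the system is down only when every copy is down at once. This is an idealised model that assumes the regions fail independently and that failover is instant and always succeeds. The function name is hypothetical.

```python
def redundant_sla(single_sla: float, copies: int = 2) -> float:
    """Availability of N independent copies where any one copy can serve traffic."""
    # Probability that all copies are down at the same time
    all_down = (1 - single_sla) ** copies
    return 1 - all_down


if __name__ == "__main__":
    single_region = 0.9993  # composite SLA of one regional stack, from the text
    two_regions = redundant_sla(single_region, copies=2)
    print(f"Two independent regions: {two_regions * 100:.5f}%")
```

Any routing layer placed in front of the regions (such as Traffic Manager) is a serial dependency, so its own SLA multiplies into this figure and can become the limiting factor.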

Well-Architected Framework connection

The Reliability pillar centres on meeting recovery objectives:

  • Design for failure at every layer
  • Quantify RPO/RTO/SLA before choosing technologies
  • Test recovery procedures regularly (failover drills, chaos engineering)
  • Balance availability investment with business value

Cost Optimisation: Every additional “nine” of availability costs significantly more. Going from 99.9% to 99.99% might double your infrastructure cost. The architect must justify each level.

Backup vs DR vs HA: the three pillars

| Concept | Purpose | Scope | Example |
| --- | --- | --- | --- |
| Backup | Recover from data loss or corruption | Data recovery (RPO) | Azure Backup restoring a VM from yesterday |
| Disaster Recovery (DR) | Resume operations after regional/site failure | Service recovery (RPO + RTO) | Azure Site Recovery failing over to secondary region |
| High Availability (HA) | Prevent downtime from component failures | Continuous operation (uptime) | Availability Zones: app survives a data centre failure |

Critical distinction: Backup is NOT DR. DR is NOT HA. A backup protects your data but doesn’t keep the service running. DR gets you running again after a failure. HA prevents the failure from being visible to users. You need all three, each designed to match your recovery objectives.

Knowledge check

Question

What does RPO measure?

Answer

Recovery Point Objective: the maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means you can lose up to 1 hour of data. RPO of 0 means zero data loss (requires synchronous replication). Lower RPO = more frequent backups or real-time replication = higher cost.

Question

How do you calculate composite SLA for a multi-service architecture?

Answer

Multiply the individual SLAs together. Example: 99.95% x 99.99% x 99.99% = 99.93%. Each additional service in the chain REDUCES the composite SLA. To improve it: add redundancy (multi-region), use availability zones, or decouple with queues.

Question

What's the difference between backup, DR, and HA?

Answer

Backup: recovers DATA after loss/corruption (answers RPO). DR: resumes SERVICE after site/region failure (answers RPO + RTO). HA: prevents downtime from component failures (answers uptime SLA). You need all three: backup alone doesn't keep services running, and HA alone doesn't protect against data corruption.

Knowledge Check

🏦 FinSecure Bank's trading platform requires zero data loss and zero downtime. Their internal reporting dashboard can tolerate up to 4 hours of downtime and 1 hour of data loss. What recovery tiers should Elena assign?

🏗️ Priya's customer portal uses App Service (99.95%), Azure SQL Database (99.99%), and Blob Storage (99.99%). The business requires 99.99% availability. The current composite SLA is 99.93%. What should Priya add to meet the target?


Next up: Recovery targets are set, so now let’s design the backup solution: Backup & Recovery for Compute.