Recovery Objectives: RPO, RTO & SLA

Understanding recovery objectives

Simple explanation

Imagine your house floods. Two questions matter: how much stuff can you afford to lose (RPO, Recovery Point Objective), and how quickly do you need to be back in a liveable house (RTO, Recovery Time Objective)?

RPO = how much data loss is acceptable. RPO of 1 hour means you can lose up to 1 hour of data. RPO of 0 means zero data loss.

RTO = how long downtime is acceptable. RTO of 4 hours means the service must be back within 4 hours. RTO of 0 means no downtime.

SLA = the availability percentage Azure commits to for a service. 99.99% allows about 4.38 minutes of downtime per month (averaged over a 30.44-day month).
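
To see where a figure like 4.38 minutes comes from, the downtime budget can be computed directly from the SLA percentage. The sketch below is illustrative only: the function name `downtime_minutes_per_month` and the average month length (365.25 / 12 days) are assumptions, not part of any Azure tooling.

```python
# Illustrative helper: converts an SLA percentage into a monthly downtime budget.
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # assumption: average month, about 30.44 days


def downtime_minutes_per_month(sla_percent: float) -> float:
    """Return the maximum downtime (in minutes) an SLA allows in an average month."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    minutes_in_month = AVERAGE_DAYS_PER_MONTH * 24 * 60
    return minutes_in_month * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Print the budget for a few common SLA levels.
    for sla in (99.9, 99.95, 99.99, 99.999):
        print(f"{sla}% -> {downtime_minutes_per_month(sla):.2f} minutes/month")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% is so demanding.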

RPO and RTO in practice

RPO Target	What It Means	Technology Required	Cost
0 (zero data loss)	Every transaction must survive a failure	Synchronous replication (same-region AZ)	Highest
< 5 minutes	Near-real-time replication	Asynchronous replication (geo-replication)	High
< 1 hour	Recent state preserved	Frequent automated backups	Medium
< 24 hours	Yesterday’s data preserved	Daily backups	Low
< 7 days	Weekly checkpoint	Weekly backups with retention	Lowest
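
For the backup-based rows, a rough rule of thumb is that the worst-case data loss equals the gap between two successful backups. The sketch below checks a schedule against an RPO on that simplifying assumption (it ignores backup duration and failed jobs); `meets_rpo` is a hypothetical helper name used only for illustration.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if the worst-case loss (one full interval) fits within the RPO.

    Simplifying assumption: a failure just before the next backup loses
    everything written since the previous one, i.e. one full interval.
    """
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    return backup_interval <= rpo


if __name__ == "__main__":
    # Daily backups against a 1-hour RPO, then 30-minute backups against the same RPO.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))
    print(meets_rpo(timedelta(minutes=30), timedelta(hours=1)))
```

Under this model no backup schedule can satisfy an RPO of 0, which matches the first row of the table: zero data loss needs synchronous replication rather than backups.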

RTO Target	What It Means	Technology Required	Cost
0 (no downtime)	Instant failover, always active	Active-active deployment, multi-region	Highest
< 15 minutes	Automated failover	Hot standby, auto-failover groups	High
< 4 hours	Manual failover with prepared runbook	Warm standby, Azure Site Recovery	Medium
< 24 hours	Restore from backup	Backup + restore procedures, cold standby	Low
< 72 hours	Extended recovery acceptable	Archive restore, rebuild from scratch	Lowest

🏦 Elena’s recovery tiers: FinSecure Bank classifies workloads into tiers:

Tier	Workload	RPO	RTO	Solution
Tier 1 (Critical)	Trading platform	0	0	Active-active, synchronous replication
Tier 2 (Important)	Customer portal	5 min	15 min	Geo-replication, auto-failover
Tier 3 (Standard)	Internal reporting	1 hour	4 hours	Azure Backup, ASR warm standby
Tier 4 (Non-critical)	Dev/test environments	24 hours	24 hours	Daily backup, rebuild from IaC
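
The tier table can be read as a lookup: pick the cheapest tier whose RPO and RTO are both at least as strict as the workload requires. The following is a minimal sketch of that reading, with tier targets copied from the table above; the `Tier` type and the `assign_tier` function are hypothetical names, and a real classification would also weigh business impact and cost.

```python
from datetime import timedelta
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    rpo: timedelta
    rto: timedelta


# Tier targets from the table above, ordered from cheapest to most expensive.
TIERS = [
    Tier("Tier 4 (Non-critical)", timedelta(hours=24), timedelta(hours=24)),
    Tier("Tier 3 (Standard)", timedelta(hours=1), timedelta(hours=4)),
    Tier("Tier 2 (Important)", timedelta(minutes=5), timedelta(minutes=15)),
    Tier("Tier 1 (Critical)", timedelta(0), timedelta(0)),
]


def assign_tier(required_rpo: timedelta, required_rto: timedelta) -> Tier:
    """Return the cheapest tier whose RPO and RTO both satisfy the requirement."""
    if required_rpo < timedelta(0) or required_rto < timedelta(0):
        raise ValueError("RPO and RTO must be non-negative")
    for tier in TIERS:
        if tier.rpo <= required_rpo and tier.rto <= required_rto:
            return tier
    # Tier 1 (0, 0) satisfies every non-negative requirement, so this is a safe fallback.
    return TIERS[-1]


if __name__ == "__main__":
    print(assign_tier(timedelta(0), timedelta(0)).name)              # trading platform
    print(assign_tier(timedelta(hours=1), timedelta(hours=4)).name)  # internal reporting
```

Searching from the cheapest tier upward encodes the cost principle directly: a workload only lands in Tier 1 when nothing cheaper meets its targets.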

Exam tip: Not everything needs Tier 1 protection

A common exam trap is designing maximum protection for everything. The correct answer weighs business impact against cost. If the scenario says “internal reporting dashboard”, it doesn’t need an active-active multi-region deployment. Match the investment to the impact of downtime.

Composite SLA calculation

When your architecture depends on multiple Azure services in series (every one must be up for a request to succeed), the composite SLA is the product of the individual SLAs:

Service	Individual SLA
Azure App Service	99.95%
Azure SQL Database	99.99%
Azure Storage (GRS)	99.99%

Composite SLA = 0.9995 x 0.9999 x 0.9999 ≈ 0.9993, or 99.93% (about 30 minutes of downtime per month)
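
The multiplication above generalises to any number of services in series. Here is a small sketch of it; `composite_sla` is a hypothetical helper, and it assumes every listed service is required for a request to succeed and that failures are independent.

```python
import math


def composite_sla(slas_percent):
    """Multiply individual SLAs (given as percentages) for services in series."""
    fractions = [sla / 100 for sla in slas_percent]
    if any(not 0 <= fraction <= 1 for fraction in fractions):
        raise ValueError("each SLA must be between 0 and 100")
    return math.prod(fractions) * 100


if __name__ == "__main__":
    # App Service, Azure SQL Database, Azure Storage (figures from the table above).
    print(f"{composite_sla([99.95, 99.99, 99.99]):.2f}%")  # prints 99.93%
```

Because every factor is at most 1, adding another serial dependency can only lower the result, never raise it.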

Improving composite SLA

Technique	How It Helps
Availability Zones	Survives data centre failure — increases individual service SLA
Multi-region deployment	Survives regional failure; combined availability can approach 99.999% or higher in theory, limited by the SLA of the routing and failover layer
Redundant paths	If component A fails, component B handles traffic
Queue-based decoupling	Services communicate asynchronously — one failure doesn’t cascade

🏗️ Priya’s SLA design: GlobalTech needs 99.99% for their customer portal. A single-region deployment gives 99.93%. Priya added:

Multi-region App Service with Traffic Manager failover
SQL auto-failover group (secondary in paired region)
Result: individual failures don’t cause customer-visible downtime
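
A rough way to estimate the effect of a second region is to treat the two regional stacks as redundant copies: the portal is down only when both are down at once. The sketch below makes several simplifying assumptions that are not from the scenario: each region carries the full 99.93% single-region composite, regions fail independently, failover is instantaneous, and the routing layer (Traffic Manager) has a 99.99% SLA. The function names are hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of independent redundant copies where any one copy suffices."""
    if not 0 <= single <= 1:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


def multi_region_sla(region: float, copies: int, router: float) -> float:
    """Redundant regions behind a routing layer that is itself a serial dependency."""
    return redundant_availability(region, copies) * router


if __name__ == "__main__":
    region = 0.9995 * 0.9999 * 0.9999  # single-region composite from the text
    router = 0.9999                    # assumption: routing layer SLA
    print(f"two regions only: {redundant_availability(region, 2):.6%}")
    print(f"end to end:       {multi_region_sla(region, 2, router):.6%}")
```

Under these assumptions the two regions together exceed 99.999%, while the end-to-end figure lands at roughly 99.99% (marginally under), because the routing layer's own SLA acts as a ceiling on the whole design.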

Well-Architected Framework connection

The Reliability pillar centres on meeting recovery objectives:

Design for failure at every layer
Quantify RPO/RTO/SLA before choosing technologies
Test recovery procedures regularly (failover drills, chaos engineering)
Balance availability investment with business value

Cost Optimisation: Every additional “nine” of availability costs significantly more. 99.9% → 99.99% might double your infrastructure cost. The architect must justify each level.

Backup vs DR vs HA — the three pillars

Concept	Purpose	Scope	Example
Backup	Recover from data loss or corruption	Data recovery (RPO)	Azure Backup restoring a VM from yesterday
Disaster Recovery (DR)	Resume operations after regional/site failure	Service recovery (RPO + RTO)	Azure Site Recovery failing over to secondary region
High Availability (HA)	Prevent downtime from component failures	Continuous operation (uptime)	Availability Zones — app survives data centre failure

Critical distinction: Backup is NOT DR. DR is NOT HA. A backup protects your data but doesn’t keep the service running. DR gets you running again after a failure. HA prevents the failure from being visible to users. You need all three — designed to match your recovery objectives.

Knowledge check

Question

What does RPO measure?

Answer

Recovery Point Objective — the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can lose up to 1 hour of data. RPO of 0 means zero data loss (requires synchronous replication). Lower RPO = more frequent backups or real-time replication = higher cost.

Question

How do you calculate composite SLA for a multi-service architecture?

Answer

Multiply the individual SLAs together. Example: 99.95% x 99.99% x 99.99% = 99.93%. Each additional service in the chain REDUCES the composite SLA. To improve it: add redundancy (multi-region), use availability zones, or decouple with queues.

Question

What's the difference between backup, DR, and HA?

Answer

Backup: recovers DATA after loss/corruption (answers RPO). DR: resumes SERVICE after site/region failure (answers RPO + RTO). HA: prevents downtime from component failures (answers uptime SLA). You need all three — backup alone doesn't keep services running, and HA alone doesn't protect against data corruption.

Knowledge Check

🏦 FinSecure Bank's trading platform requires zero data loss and zero downtime. Their internal reporting dashboard can tolerate up to 4 hours of downtime and 1 hour of data loss. What recovery tiers should Elena assign?

🏗️ Priya's customer portal uses App Service (99.95%), Azure SQL Database (99.99%), and Blob Storage (99.99%). The business requires 99.99% availability. The current composite SLA is 99.93%. What should Priya add to meet the target?

Next up: with recovery targets set, the next step is designing the backup solution in Backup & Recovery for Compute.