
HA/DR Strategy: RPO, RTO, and Architecture

Plan high availability and disaster recovery strategies based on RPO/RTO requirements. Evaluate solutions for hybrid and Azure-only deployments.

Planning for disaster

Simple explanation

HA/DR is insurance for your database.

High Availability (HA) keeps things running during small failures: a server restarts, a disk fails. Like having a spare tyre in the boot.

Disaster Recovery (DR) keeps things running during big failures: an entire data centre goes down. Like having a second car at a different location.

RPO (Recovery Point Objective) = how much data can you afford to lose? "We can lose up to 5 minutes of transactions."

RTO (Recovery Time Objective) = how fast must you recover? "We need to be back online within 1 hour."

RPO and RTO explained

| Metric | Question It Answers | Measured In | Example |
|---|---|---|---|
| RPO | How much data loss is acceptable? | Time (seconds, minutes, hours) | RPO = 5 min → lose at most 5 minutes of transactions |
| RTO | How long can the system be down? | Time (minutes, hours) | RTO = 1 hour → must be back online within 60 minutes |
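
As a worked example of the RPO arithmetic, the sketch below checks whether a periodic backup schedule can satisfy a given RPO. It rests on a simplifying assumption that is not stated in the table: a failure just before the next backup loses one full interval of transactions, and the time taken to write the backup is ignored. The function names `worst_case_data_loss` and `satisfies_rpo` are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: a failure right before the next backup loses one full interval."""
    return backup_interval


def satisfies_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss for this schedule fits within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


# RPO target taken from the example in the table above.
rpo = timedelta(minutes=5)

# Hypothetical log-backup schedules to compare against the target.
for interval_minutes in (15, 5, 1):
    interval = timedelta(minutes=interval_minutes)
    print(f"every {interval_minutes} min -> meets RPO: {satisfies_rpo(interval, rpo)}")
```

Under these assumptions, a 15-minute schedule misses a 5-minute RPO, while 5-minute and 1-minute schedules meet it.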

RPO/RTO by solution

HA/DR Solutions: RPO and RTO
| Solution | RPO | RTO | Platform | Automatic Failover? |
|---|---|---|---|---|
| Built-in HA (local redundancy) | 0 (synchronous) | < 30 sec | SQL DB, MI | Yes |
| Zone-redundant HA | 0 (synchronous) | < 30 sec | SQL DB, MI | Yes |
| Active geo-replication | < 5 sec | < 30 sec (manual failover) | SQL DB only | No (manual) |
| Failover groups | < 5 sec | < 1 hour (automatic) | SQL DB, MI | Yes |
| Always On AG (sync) | 0 | < 1 min | SQL VMs, MI | Yes (with listener) |
| Always On AG (async) | Minutes | Minutes to hours | SQL VMs | Manual |
| Log shipping | Minutes to hours | Minutes to hours | SQL VMs | Manual |
| Backup/restore | Hours (depends on backup frequency) | Hours | All | Manual |
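
One way to use the table is as a lookup: given a platform and target RPO/RTO, list the solutions whose stated bounds fit. The sketch below encodes only the rows that have numeric bounds, treating each "<" value as an upper bound in seconds; rows described only as "minutes" or "hours" are left out rather than guessed. The `Solution` class and `candidates` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    rpo_seconds: float      # upper bound on data loss, from the table
    rto_seconds: float      # upper bound on downtime, from the table
    platforms: frozenset
    automatic: bool         # does failover happen without manual action?


# Numeric rows copied from the table above; vague rows are intentionally omitted.
SOLUTIONS = [
    Solution("Built-in HA (local redundancy)", 0, 30, frozenset({"SQL DB", "MI"}), True),
    Solution("Zone-redundant HA", 0, 30, frozenset({"SQL DB", "MI"}), True),
    Solution("Active geo-replication", 5, 30, frozenset({"SQL DB"}), False),
    Solution("Failover groups", 5, 3600, frozenset({"SQL DB", "MI"}), True),
    Solution("Always On AG (sync)", 0, 60, frozenset({"SQL VMs", "MI"}), True),
]


def candidates(platform, max_rpo_seconds, max_rto_seconds, need_automatic=False):
    """Return names of solutions whose table bounds fit the requirements."""
    return [
        s.name
        for s in SOLUTIONS
        if platform in s.platforms
        and s.rpo_seconds <= max_rpo_seconds
        and s.rto_seconds <= max_rto_seconds
        and (s.automatic or not need_automatic)
    ]


# Hypothetical requirement: SQL DB, RPO of 5 s, RTO of 60 s, automatic failover.
print(candidates("SQL DB", 5, 60, need_automatic=True))
```

This filter only compares the numbers. It does not capture failure scope (a single node, a zone, or a whole region), so a result that fits the RPO/RTO targets may still be an HA option rather than a DR option.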

Azure-specific HA/DR solutions

Built-in high availability

Every Azure SQL Database and Managed Instance (MI) comes with built-in HA; no configuration is needed:

| Tier | HA Architecture | Replicas |
|---|---|---|
| General Purpose | Remote storage with compute failover | 1 primary (failover to standby) |
| Business Critical | Local SSD with Always On AG | 1 primary + 1-3 readable secondary replicas |
| Hyperscale | Page server architecture | 0-4 HA replicas (named replicas are configured separately for read scale-out) |

Zone-redundant deployments

  • Spread replicas across availability zones in the same region
  • Protects against data centre (zone) failures
  • Available for SQL DB (Premium, Business Critical, Hyperscale) and MI (Business Critical)

Hybrid HA/DR

Kenji's hybrid strategy for NorthStar:

| Scenario | Solution |
|---|---|
| On-prem SQL Server + Azure SQL VM | Distributed AG spanning on-prem and Azure VM |
| On-prem SQL Server + Azure SQL MI | MI link for near real-time replication |
| On-prem backup to cloud | Backup to Azure Blob Storage via BACKUP TO URL |
| Gradual migration with DR | Log shipping to Azure VM during migration |

Managed Instance link

The MI link creates a near real-time replication connection between on-prem SQL Server (or Azure VM) and Azure SQL Managed Instance:

  • Uses distributed availability group technology
  • One-way replication: on-prem → MI (readable secondary)
  • Can be used for DR (failover to MI if on-prem fails)
  • Can be used for migration (cutover to MI when ready)
  • SQL Server 2016+ supported as source

Testing HA/DR

A plan you've never tested is a plan you can't rely on. Kenji's testing procedures:

| Test | How | Frequency |
|---|---|---|
| Planned failover | Initiate failover group failover to secondary region | Quarterly |
| Backup restore | Restore a recent backup to a test server and validate | Monthly |
| Point-in-time restore | Restore to a specific time, verify data integrity | Quarterly |
| DR drill | Simulate primary region failure, verify applications connect to secondary | Annually |
| Runbook validation | Walk through DR runbook steps with the team | Semi-annually |

Testing checklist:

  1. Define success criteria before testing
  2. Notify stakeholders of the test window
  3. Verify application connectivity after failover
  4. Measure actual RTO (was it within target?)
  5. Verify data integrity (was RPO met?)
  6. Document results and update the DR runbook
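
Steps 4 and 5 of the checklist can be sketched as a small calculation over timestamps recorded during a drill. This is a minimal illustration, not a prescribed procedure: the function `evaluate_drill`, its parameter names, and the sample timestamps are all hypothetical, and it assumes the data-loss window is the gap between the last transaction found on the secondary and the moment of failure.

```python
from datetime import datetime, timedelta


def evaluate_drill(failure_at, last_recovered_txn_at, online_at, rpo_target, rto_target):
    """Compare measured downtime and data loss from a drill with the targets."""
    actual_rto = online_at - failure_at                    # time spent offline
    actual_data_loss = failure_at - last_recovered_txn_at  # transactions lost, as a time span
    return {
        "actual_rto": actual_rto,
        "rto_met": actual_rto <= rto_target,
        "actual_data_loss": actual_data_loss,
        "rpo_met": actual_data_loss <= rpo_target,
    }


# Hypothetical drill timestamps; targets are example values, not recommendations.
result = evaluate_drill(
    failure_at=datetime(2024, 1, 15, 2, 0, 0),
    last_recovered_txn_at=datetime(2024, 1, 15, 1, 59, 57),
    online_at=datetime(2024, 1, 15, 2, 0, 40),
    rpo_target=timedelta(seconds=5),
    rto_target=timedelta(hours=1),
)
for key, value in result.items():
    print(f"{key}: {value}")
```

The returned dictionary is the kind of record that could be pasted into the DR runbook as the documented result of the test.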

Question

What is the difference between RPO and RTO?

Answer

RPO = how much data loss is acceptable (measured in time). RTO = how long the system can be down before recovery (measured in time). Lower values = higher cost and complexity.

Question

What HA architecture does Azure SQL Database Business Critical tier use?

Answer

Always On Availability Group with local SSD storage. 1 primary + up to 3 readable secondary replicas. Synchronous replication within the cluster. RPO = 0.

Question

What is the MI link used for?

Answer

Near real-time replication from on-prem SQL Server (or Azure VM) to Azure SQL Managed Instance. Uses distributed AG technology. One-way replication for DR or migration purposes.

Knowledge Check

NorthStar's ERP system requires RPO of 0 (zero data loss) and RTO under 1 minute. The database runs on Azure SQL Managed Instance. Which HA solution meets these requirements?

Knowledge Check

Kenji needs DR for on-premises SQL Server 2019 to Azure, with the ability to fail over to Azure if the data centre goes down. What should he implement?
