Disaster Recovery
RTO vs RPO
| Metric | Definition | Example |
|---|---|---|
| RTO | Recovery Time Objective | 4 hours |
| RPO | Recovery Point Objective | 1 hour |
RTO: How long until service is restored? RPO: How much data can we lose?
DR Strategies
| Strategy | RTO | Cost |
|---|---|---|
| Backup & Restore | Hours | $ |
| Pilot Light | Minutes | $$ |
| Warm Standby | Minutes | $$$ |
| Multi-Site Active | Seconds | $$$$ |
Backup Strategy
Database
bash
# PostgreSQL pg_dump -h host -U user dbname | gzip > backup.sql.gz # MySQL mysqldump -h host -u user -p dbname | gzip > backup.sql.gz
Automated Backups
hcl
# AWS RDS
resource "aws_db_instance" "main" {
backup_retention_period = 7
backup_window = "03:00-04:00"
# Cross-region replica for DR
replicate_source_db = aws_db_instance.primary.arn
}
DR Runbook
Failover Steps
- •Detect - Monitor alerts for primary failure
- •Assess - Confirm failure, estimate recovery
- •Decide - Failover if RTO exceeded
- •Execute - Run failover procedure
- •Verify - Test functionality
- •Communicate - Update stakeholders
Failback Steps
- •Verify primary is healthy
- •Sync data from secondary
- •Switch traffic back
- •Monitor closely
Testing
| Test Type | Frequency |
|---|---|
| Backup restore | Monthly |
| Failover drill | Quarterly |
| Full DR test | Annually |
Multi-Region
code
Primary (us-east-1) Secondary (us-west-2)
┌─────────────────┐ ┌─────────────────┐
│ App + DB │ ──sync── │ DB Replica │
└─────────────────┘ └─────────────────┘
│ │
└──────── Route 53 ──────────┘
(failover)