Disaster Recovery Blueprints
Overview
Disaster Recovery (DR) ensures business continuity when primary resources fail. This skill covers multi-region architectures, backup strategies, and automated failover patterns.
mermaid
graph TB
  subgraph "Primary Region (us-east-1)"
    ALB1[ALB]
    EKS1[EKS Cluster]
    RDS1[(RDS Primary)]
    S3_1[S3 Bucket]
  end
  subgraph "DR Region (us-west-2)"
    ALB2[ALB - Standby]
    EKS2[EKS Cluster - Standby]
    RDS2[(RDS Read Replica)]
    S3_2[S3 Replica]
  end
  Route53[Route 53<br/>Health Checks] --> ALB1
  Route53 -.->|Failover| ALB2
  RDS1 -->|Async Replication| RDS2
  S3_1 -->|Cross-Region Replication| S3_2
DR Strategies
| Strategy | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Non-critical workloads |
| Pilot Light | 10-30 min | Minutes | $$ | Core systems |
| Warm Standby | Minutes | Seconds | $$$ | Important workloads |
| Active-Active | Near zero | Near zero | $$$$ | Mission-critical |
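One way to read the table is as a lookup: given a target RTO and RPO, pick the cheapest strategy whose worst-case figures still meet both. The sketch below is a minimal illustration. The numeric bounds are assumptions standing in for the table's qualitative values ("Hours", "Minutes"), and `pickStrategy` is a hypothetical helper, not part of any AWS tooling.

```typescript
// Illustrative worst-case bounds in seconds. These numbers are assumptions
// chosen to mirror the table's qualitative values; replace them with your own.
interface Strategy {
  name: string;
  maxRtoSeconds: number;
  maxRpoSeconds: number;
  costTier: number; // 1 = $, 4 = $$$$
}

const STRATEGIES: Strategy[] = [
  { name: "Backup & Restore", maxRtoSeconds: 8 * 3600, maxRpoSeconds: 8 * 3600, costTier: 1 },
  { name: "Pilot Light", maxRtoSeconds: 30 * 60, maxRpoSeconds: 15 * 60, costTier: 2 },
  { name: "Warm Standby", maxRtoSeconds: 10 * 60, maxRpoSeconds: 60, costTier: 3 },
  { name: "Active-Active", maxRtoSeconds: 5, maxRpoSeconds: 5, costTier: 4 },
];

// Returns the cheapest strategy whose assumed bounds satisfy both targets,
// or undefined when even the most expensive tier cannot meet them.
function pickStrategy(targetRtoSeconds: number, targetRpoSeconds: number): Strategy | undefined {
  return [...STRATEGIES]
    .sort((a, b) => a.costTier - b.costTier)
    .find((s) => s.maxRtoSeconds <= targetRtoSeconds && s.maxRpoSeconds <= targetRpoSeconds);
}

console.log(pickStrategy(3600, 1800)?.name); // Pilot Light
console.log(pickStrategy(900, 120)?.name); // Warm Standby
```

Sorting by cost before filtering keeps the choice biased toward the least expensive tier, which matches the practice of letting RTO/RPO targets drive the architecture rather than the other way around.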
Best Practices
- Define RTO/RPO first: these targets drive every architecture decision
- Automate failover: manual steps increase RTO
- Test regularly: run DR drills at least quarterly
- Document runbooks: keep failover procedures clear and current
- Monitor replication lag: alert before lag exceeds the RPO target
- Use Route 53 health checks: enable automatic DNS failover
- Version infrastructure: deploy both regions from the same IaC
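The replication-lag practice reduces to comparing observed lag with the RPO target. The sketch below assumes lag samples arrive in seconds from whatever monitoring system is in use (for RDS read replicas, the CloudWatch `ReplicaLag` metric is reported in seconds). `evaluateReplicationLag` and the 80% warning ratio are hypothetical choices for illustration.

```typescript
type LagStatus = "ok" | "warning" | "violation";

// Classifies the worst lag sample in a window against the RPO target.
// warnRatio is an assumed early-warning threshold, not an AWS default.
function evaluateReplicationLag(
  lagSamplesSeconds: number[],
  rpoSeconds: number,
  warnRatio = 0.8,
): LagStatus {
  if (lagSamplesSeconds.length === 0) {
    // No data usually means replication or monitoring is broken,
    // so treat it as a violation rather than as healthy.
    return "violation";
  }
  const worst = Math.max(...lagSamplesSeconds);
  if (worst > rpoSeconds) return "violation";
  if (worst >= rpoSeconds * warnRatio) return "warning";
  return "ok";
}

console.log(evaluateReplicationLag([2, 5, 3], 60)); // ok
console.log(evaluateReplicationLag([10, 52, 30], 60)); // warning
console.log(evaluateReplicationLag([10, 75], 60)); // violation
```

Treating an empty sample window as a violation is a deliberate choice here: a silent metric is often the first symptom of broken replication.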
Example 1: Terraform - Multi-Region RDS with Failover
Complete DR setup with RDS read replica and Route 53 failover.
📁 Location: terraform/examples/disaster-recovery/
Key Features
hcl
# Excerpt: required arguments such as instance_class, storage, and credentials are omitted.
# Primary RDS instance
resource "aws_db_instance" "primary" {
  provider                = aws.primary
  identifier              = "${local.name_prefix}-primary"
  engine                  = "postgres"
  backup_retention_period = 7    # Automated backups must be enabled to create read replicas
  multi_az                = true # HA within region
}

# Cross-region read replica (can be promoted)
resource "aws_db_instance" "replica" {
  provider            = aws.dr
  identifier          = "${local.name_prefix}-replica"
  replicate_source_db = aws_db_instance.primary.arn # Cross-region replicas must reference the source by ARN

  # Keep backups enabled so the instance is ready to promote on failover
  backup_retention_period = 7
}

# Route 53 health check for automatic failover
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  type              = "HTTPS"
  port              = 443
  failure_threshold = 3 # Consecutive failed checks before the endpoint is marked unhealthy
}
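With `failure_threshold = 3`, the health check reports unhealthy only after three consecutive failed probes, so detection time also depends on the probe interval. The sketch below gives a rough estimate. It assumes Route 53's default 30-second `request_interval` (not set in the excerpt) and a record TTL supplied by the caller. `estimateDnsFailoverSeconds` is a hypothetical helper that ignores health-checker aggregation and resolver caching quirks.

```typescript
// Rough estimate of how long DNS failover takes to reach clients:
// time to detect the failure plus time for cached records to expire.
function estimateDnsFailoverSeconds(
  failureThreshold: number,
  requestIntervalSeconds: number,
  recordTtlSeconds: number,
): number {
  const detectionSeconds = failureThreshold * requestIntervalSeconds;
  return detectionSeconds + recordTtlSeconds;
}

// failure_threshold = 3 comes from the example; the 30 s interval and 60 s TTL are assumptions.
console.log(estimateDnsFailoverSeconds(3, 30, 60)); // 150
// A 10 s interval shortens detection.
console.log(estimateDnsFailoverSeconds(3, 10, 60)); // 90
```

This figure covers DNS only. Promoting the read replica and repointing the application add to it, so the total should be checked against the RTO target.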
Example 2: CDK - S3 Cross-Region Replication + Route 53
Multi-region storage with automatic replication.
📁 Location: cdk/examples/disaster-recovery/
Key Features
typescript
// DR bucket in another region (declared first so the primary can reference it)
const drBucket = new s3.Bucket(drStack, 'DrBucket', {
  bucketName: `${props.projectName}-dr-${drStack.account}`,
  versioned: true, // Versioning is required on the destination as well
});
// Primary bucket with a cross-region replication rule.
// Assumes an aws-cdk-lib v2 release that supports the `replicationRules` prop.
const primaryBucket = new s3.Bucket(this, 'PrimaryBucket', {
  bucketName: `${props.projectName}-primary-${this.account}`,
  versioned: true, // Required for replication
  encryption: s3.BucketEncryption.S3_MANAGED,
  replicationRules: [{ destination: drBucket }],
});
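// Route 53 health check on the primary load balancer
const healthCheck = new route53.HealthCheck(this, 'HealthCheck', {
  type: route53.HealthCheckType.HTTPS,
  fqdn: primaryAlb.loadBalancerDnsName,
});
// Primary failover record
const primaryRecord = new route53.ARecord(this, 'FailoverRecord', {
  zone: hostedZone,
  recordName: 'app',
  target: route53.RecordTarget.fromAlias(
    new route53targets.LoadBalancerTarget(primaryAlb)
  ),
  healthCheck: healthCheck,
});
// The props above do not set a failover routing policy, so set it on the
// underlying CloudFormation resource. A matching SECONDARY record for the
// DR load balancer is still needed for failover to take effect.
const cfnRecord = primaryRecord.node.defaultChild as route53.CfnRecordSet;
cfnRecord.addPropertyOverride('SetIdentifier', 'primary');
cfnRecord.addPropertyOverride('Failover', 'PRIMARY');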
RTO/RPO Quick Reference
| Component | Backup | Replication | RTO Target |
|---|---|---|---|
| RDS | Daily snapshots | Cross-region replica | 15 min |
| S3 | Versioned | CRR (async) | 5 min |
| EKS | Config in Git | Multi-region | 30 min |
| Secrets | N/A | Replicated | 5 min |
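The per-component targets combine differently depending on how recovery is orchestrated. The sketch below is a small worked example using the table's RTO values. `parallelRto` and `sequentialRto` are hypothetical helpers, and real recoveries usually fall somewhere between the two because of dependencies (for example, EKS workloads waiting on the database).

```typescript
interface ComponentTarget {
  component: string;
  rtoMinutes: number;
}

// RTO targets copied from the quick-reference table.
const targets: ComponentTarget[] = [
  { component: "RDS", rtoMinutes: 15 },
  { component: "S3", rtoMinutes: 5 },
  { component: "EKS", rtoMinutes: 30 },
  { component: "Secrets", rtoMinutes: 5 },
];

// If every component recovers in parallel, the slowest one sets the overall RTO.
function parallelRto(items: ComponentTarget[]): number {
  return items.reduce((max, t) => Math.max(max, t.rtoMinutes), 0);
}

// If recovery steps run one after another, the targets add up.
function sequentialRto(items: ComponentTarget[]): number {
  return items.reduce((sum, t) => sum + t.rtoMinutes, 0);
}

console.log(parallelRto(targets)); // 30
console.log(sequentialRto(targets)); // 55
```

The gap between 30 and 55 minutes is one argument for automating failover: manual runbooks tend to execute steps sequentially.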
Validation Checklist
- RTO/RPO targets defined and documented
- Cross-region replication enabled for data stores
- Route 53 health checks configured
- Failover runbook documented
- DR drills scheduled quarterly
- Replication lag monitoring and alerts
- Same IaC code deploys to both regions
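The checklist can also be evaluated mechanically, for example as a gate in a periodic review job. The sketch below is one possible shape. The `DrReadiness` fields, the `failingChecks` helper, and the 92-day drill window are all hypothetical; how each flag is populated is left to the surrounding tooling.

```typescript
// Hypothetical snapshot of DR readiness; field names are illustrative.
interface DrReadiness {
  rtoRpoDocumented: boolean;
  crossRegionReplicationEnabled: boolean;
  healthChecksConfigured: boolean;
  runbookDocumented: boolean;
  lastDrillDate: Date | null; // null if no drill has ever been run
  lagAlertsConfigured: boolean;
  sameIacForBothRegions: boolean;
}

const MAX_DRILL_AGE_DAYS = 92; // assumed length of a quarter

// Returns the checklist items that are not currently satisfied.
function failingChecks(state: DrReadiness, now: Date): string[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  const drillAgeDays =
    state.lastDrillDate === null
      ? Number.POSITIVE_INFINITY
      : (now.getTime() - state.lastDrillDate.getTime()) / msPerDay;

  const checks: Array<[string, boolean]> = [
    ["RTO/RPO targets defined and documented", state.rtoRpoDocumented],
    ["Cross-region replication enabled for data stores", state.crossRegionReplicationEnabled],
    ["Route 53 health checks configured", state.healthChecksConfigured],
    ["Failover runbook documented", state.runbookDocumented],
    ["DR drills scheduled quarterly", drillAgeDays <= MAX_DRILL_AGE_DAYS],
    ["Replication lag monitoring and alerts", state.lagAlertsConfigured],
    ["Same IaC code deploys to both regions", state.sameIacForBothRegions],
  ];
  return checks.filter(([, passed]) => !passed).map(([label]) => label);
}

const example: DrReadiness = {
  rtoRpoDocumented: true,
  crossRegionReplicationEnabled: true,
  healthChecksConfigured: true,
  runbookDocumented: true,
  lastDrillDate: new Date("2024-01-15T00:00:00Z"),
  lagAlertsConfigured: true,
  sameIacForBothRegions: true,
};

// The last drill is more than a quarter old, so only that item is reported.
console.log(failingChecks(example, new Date("2024-06-30T00:00:00Z")));
```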
Related Skills
- Network Segmentation: multi-region VPC
- Remote State Boundaries: regional state
- Blue/Green Canary: safe deployments