Disaster Recovery Blueprints

Multi-region DR patterns with automated failover, backup strategies, and RTO/RPO targets

SKILL.md
--- frontmatter
name: Disaster Recovery Blueprints
description: Multi-region DR patterns with automated failover, backup strategies, and RTO/RPO targets

Disaster Recovery Blueprints

Overview

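Disaster recovery (DR) preserves business continuity when resources in the primary region fail. Two targets frame every design: the recovery time objective (RTO), the maximum acceptable downtime, and the recovery point objective (RPO), the maximum acceptable window of data loss. This skill covers multi-region architectures, backup strategies, and automated failover patterns.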

mermaid
graph TB
    subgraph "Primary Region (us-east-1)"
        ALB1[ALB]
        EKS1[EKS Cluster]
        RDS1[(RDS Primary)]
        S3_1[S3 Bucket]
    end
    
    subgraph "DR Region (us-west-2)"
        ALB2[ALB - Standby]
        EKS2[EKS Cluster - Standby]
        RDS2[(RDS Read Replica)]
        S3_2[S3 Replica]
    end
    
    Route53[Route 53<br/>Health Checks] --> ALB1
    Route53 -.->|Failover| ALB2
    RDS1 -->|Async Replication| RDS2
    S3_1 -->|Cross-Region Replication| S3_2

DR Strategies

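| Strategy | RTO | RPO | Cost | Use Case |
| --- | --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | $ | Non-critical workloads |
| Pilot Light | 10-30 min | Minutes | $$ | Core systems |
| Warm Standby | Minutes | Seconds | $$$ | Important workloads |
| Active-Active | Near zero | Near zero | $$$$ | Mission-critical |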
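
The table can be read as a lookup: pick the cheapest strategy whose achievable RTO and RPO both fit the targets. The sketch below is one way to express that. The numeric bounds per strategy are illustrative assumptions (the table only gives orders of magnitude), and `cheapestStrategy` is a hypothetical helper, not part of any AWS tooling.

```typescript
interface Strategy {
  name: string;
  rtoMinutes: number; // assumed worst-case recovery time
  rpoMinutes: number; // assumed worst-case data-loss window
  cost: string;
}

// Ordered from cheapest to most expensive. The numbers are assumptions
// chosen to match the rough magnitudes in the table.
const strategies: Strategy[] = [
  { name: "Backup & Restore", rtoMinutes: 240, rpoMinutes: 240, cost: "$" },
  { name: "Pilot Light", rtoMinutes: 30, rpoMinutes: 5, cost: "$$" },
  { name: "Warm Standby", rtoMinutes: 5, rpoMinutes: 1, cost: "$$$" },
  { name: "Active-Active", rtoMinutes: 0, rpoMinutes: 0, cost: "$$$$" },
];

// Return the first (cheapest) strategy that satisfies both targets.
function cheapestStrategy(targetRtoMinutes: number, targetRpoMinutes: number): Strategy {
  const match = strategies.find(
    (s) => s.rtoMinutes <= targetRtoMinutes && s.rpoMinutes <= targetRpoMinutes,
  );
  // Fall back to the strongest tier if nothing else fits.
  return match ?? strategies[strategies.length - 1];
}

console.log(cheapestStrategy(60, 15).name); // Pilot Light
console.log(cheapestStrategy(10, 1).name); // Warm Standby
```

Ordering the list by cost keeps the selection rule to a single `find` call; tightening either target moves the result down the table.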

Best Practices

  1. Define RTO/RPO first - These targets drive every architecture decision
  2. Automate failover - Each manual step adds to RTO
  3. Test regularly - Run DR drills quarterly
  4. Document runbooks - Keep failover procedures clear and current
  5. Monitor replication lag - Alert before lag exceeds the RPO target
  6. Use Route 53 health checks - They trigger automatic DNS failover
  7. Version infrastructure - The same IaC should deploy to both regions
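
Practice 5 comes down to comparing observed replication lag with the RPO target. The following is a minimal sketch of that comparison, assuming lag samples are already available in seconds (for example from a metrics export). The `classifyLag` and `alertsFor` names and the 80% warning ratio are assumptions made for illustration.

```typescript
type LagStatus = "ok" | "warning" | "violation";

// Classify one lag reading against the RPO target.
// warnRatio is an assumed early-warning margin (80% of the RPO by default).
function classifyLag(lagSeconds: number, rpoSeconds: number, warnRatio = 0.8): LagStatus {
  if (lagSeconds > rpoSeconds) return "violation";
  if (lagSeconds >= rpoSeconds * warnRatio) return "warning";
  return "ok";
}

interface LagSample {
  source: string; // hypothetical label, e.g. the replica identifier
  lagSeconds: number;
}

// Build one alert message for every sample that is not "ok".
function alertsFor(samples: LagSample[], rpoSeconds: number): string[] {
  return samples
    .map((sample) => ({ sample, status: classifyLag(sample.lagSeconds, rpoSeconds) }))
    .filter(({ status }) => status !== "ok")
    .map(
      ({ sample, status }) =>
        `${status.toUpperCase()}: ${sample.source} lag ${sample.lagSeconds}s (RPO ${rpoSeconds}s)`,
    );
}

const samples: LagSample[] = [
  { source: "rds-replica", lagSeconds: 12 },
  { source: "rds-replica", lagSeconds: 52 },
  { source: "rds-replica", lagSeconds: 75 },
];
console.log(alertsFor(samples, 60));
```

Warning before the RPO is actually breached leaves time to react while the data-loss window is still within target.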

Example 1: Terraform - Multi-Region RDS with Failover

Complete DR setup with RDS read replica and Route 53 failover.

📁 Location: terraform/examples/disaster-recovery/

Key Features

hcl
# Primary RDS instance (excerpt: instance class, storage, and credentials omitted)
resource "aws_db_instance" "primary" {
  provider                = aws.primary
  identifier              = "${local.name_prefix}-primary"
  engine                  = "postgres"
  backup_retention_period = 7    # automated backups must be enabled to create read replicas
  multi_az                = true # HA within the region
}

# Cross-region read replica (can be promoted on failover)
resource "aws_db_instance" "replica" {
  provider            = aws.dr
  identifier          = "${local.name_prefix}-replica"
  replicate_source_db = aws_db_instance.primary.arn # cross-region replicas reference the source by ARN

  # Keep backups enabled so the instance is ready once promoted
  backup_retention_period = 7
}

# Route 53 health check that drives automatic DNS failover
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  type              = "HTTPS"
  port              = 443
  failure_threshold = 3 # consecutive failed checks before the endpoint is marked unhealthy
}
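
To see how `failure_threshold = 3` affects RTO, it helps to model the consecutive-result rule the health check applies. The sketch below is a simplified simulation, not Route 53's implementation: `EndpointStatus` and `detectionSeconds` are hypothetical names, and the 30-second check interval is an assumption about the configured request interval.

```typescript
// Status flips only after `threshold` consecutive results that contradict
// the current status (in either direction).
class EndpointStatus {
  private healthy = true;
  private contraryStreak = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record one check result and return the resulting status (true = healthy).
  record(checkPassed: boolean): boolean {
    if (checkPassed === this.healthy) {
      this.contraryStreak = 0;
    } else {
      this.contraryStreak += 1;
      if (this.contraryStreak >= this.threshold) {
        this.healthy = checkPassed;
        this.contraryStreak = 0;
      }
    }
    return this.healthy;
  }
}

// Rough detection time: threshold multiplied by the interval between checks.
// The 30-second default is an assumption; DNS TTL adds further delay on top.
function detectionSeconds(failureThreshold: number, requestIntervalSeconds = 30): number {
  return failureThreshold * requestIntervalSeconds;
}

const status = new EndpointStatus(3);
const history = [false, false, false, true].map((passed) => status.record(passed));
console.log(history); // [ true, true, false, false ]
console.log(detectionSeconds(3)); // 90
```

Under these assumptions roughly 90 seconds of the RTO budget is spent just detecting the outage, before DNS caching and database promotion are counted.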

Example 2: CDK - S3 Cross-Region Replication + Route 53

Multi-region storage with automatic replication.

📁 Location: cdk/examples/disaster-recovery/

Key Features

typescript
// Primary bucket with replication
const primaryBucket = new s3.Bucket(this, 'PrimaryBucket', {
  bucketName: `${props.projectName}-primary-${this.account}`,
  versioned: true,  // Required for replication
  encryption: s3.BucketEncryption.S3_MANAGED,
});

// DR bucket in another region
const drBucket = new s3.Bucket(drStack, 'DrBucket', {
  bucketName: `${props.projectName}-dr-${drStack.account}`,
  versioned: true,
});

// Cross-region replication rule
primaryBucket.addReplicationRule({
  destination: drBucket,
  replicateEncryptedObjects: true,
});

// Route 53 failover routing
const healthCheck = new route53.HealthCheck(this, 'HealthCheck', {
  type: route53.HealthCheckType.HTTPS,
  fqdn: primaryAlb.loadBalancerDnsName,
});

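// Primary record of the failover pair. A matching secondary record that targets
// the DR load balancer, and a PRIMARY/SECONDARY failover policy on each record,
// are also required; this excerpt omits them.
new route53.ARecord(this, 'FailoverRecord', {
  zone: hostedZone,
  recordName: 'app',
  target: route53.RecordTarget.fromAlias(
    new route53targets.LoadBalancerTarget(primaryAlb)
  ),
  setIdentifier: 'primary',
  healthCheck: healthCheck,
});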

RTO/RPO Quick Reference

| Component | Backup Frequency | Replication | RTO Target |
| --- | --- | --- | --- |
| RDS | Daily snapshots | Cross-region replica | 15 min |
| S3 | Versioned | CRR (async) | 5 min |
| EKS | Config in Git | Multi-region | 30 min |
| Secrets | N/A | Replicated | 5 min |
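
These targets are most useful when a drill's measured recovery times are checked against them. A minimal sketch of that check follows. The target values come from the table; the measured values, `evaluateDrill`, and `overallRecoveryMinutes` are hypothetical and only illustrate the comparison.

```typescript
// RTO targets in minutes, copied from the quick-reference table.
const rtoTargetMinutes: Record<string, number> = {
  RDS: 15,
  S3: 5,
  EKS: 30,
  Secrets: 5,
};

interface DrillResult {
  component: string;
  measuredMinutes: number;
  targetMinutes: number;
  withinTarget: boolean;
}

// Compare measured recovery times with the targets, component by component.
function evaluateDrill(measured: Record<string, number>): DrillResult[] {
  return Object.entries(measured).map(([component, measuredMinutes]) => {
    const targetMinutes = rtoTargetMinutes[component];
    if (targetMinutes === undefined) {
      throw new Error(`No RTO target defined for ${component}`);
    }
    return {
      component,
      measuredMinutes,
      targetMinutes,
      withinTarget: measuredMinutes <= targetMinutes,
    };
  });
}

// If components recover in parallel, the slowest one sets the overall time.
function overallRecoveryMinutes(results: DrillResult[]): number {
  return Math.max(0, ...results.map((r) => r.measuredMinutes));
}

// Hypothetical drill measurements, in minutes.
const drill = evaluateDrill({ RDS: 12, S3: 3, EKS: 41, Secrets: 2 });
const misses = drill.filter((r) => !r.withinTarget).map((r) => r.component);
console.log(misses, overallRecoveryMinutes(drill));
```

With the sample figures only EKS misses its target, and because it is also the slowest component it sets the overall recovery time of 41 minutes.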

Validation Checklist

  • RTO/RPO targets defined and documented
  • Cross-region replication enabled for data stores
  • Route 53 health checks configured
  • Failover runbook documented
  • DR drills scheduled quarterly
  • Replication lag monitoring and alerts
  • Same IaC code deploys to both regions

Related Skills