Backup/Restore Runbook Generator

Create reliable disaster recovery procedures for your databases.

Backup Strategy

markdown

# Database Backup Strategy

## Backup Types

### 1. Full Backup (Daily)

- **When**: 2:00 AM UTC
- **Retention**: 30 days
- **Storage**: S3 `s3://backups/full/`
- **Size**: ~50 GB
- **Duration**: ~45 minutes

### 2. Incremental Backup (Hourly)

- **When**: Every hour
- **Retention**: 7 days
- **Storage**: S3 `s3://backups/incremental/`
- **Size**: ~500 MB
- **Duration**: ~5 minutes

### 3. Transaction Log Backup (Every 15 min)

- **When**: Every 15 minutes
- **Retention**: 3 days
- **Storage**: S3 `s3://backups/wal/`
- **Point-in-time recovery capability**

## Backup Automation

### PostgreSQL

```bash
#!/bin/bash
# scripts/backup-postgres.sh

set -e

# Configuration
DB_NAME="production"
DB_USER="postgres"
DB_HOST="postgres.example.com"
BACKUP_DIR="/var/backups/postgres"
S3_BUCKET="s3://my-backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

# Create backup directory
mkdir -p $BACKUP_DIR

echo "🔄 Starting backup: $FILENAME"

# Full backup with pg_dump
pg_dump \
  --host=$DB_HOST \
  --username=$DB_USER \
  --dbname=$DB_NAME \
  --format=custom \
  --compress=9 \
  --file=$BACKUP_DIR/$FILENAME \
  --verbose

# Verify backup
if [ -f "$BACKUP_DIR/$FILENAME" ]; then
  SIZE=$(du -h "$BACKUP_DIR/$FILENAME" | cut -f1)
  echo "✅ Backup created: $SIZE"
else
  echo "❌ Backup failed"
  exit 1
fi

# Upload to S3
echo "📤 Uploading to S3..."
aws s3 cp $BACKUP_DIR/$FILENAME $S3_BUCKET/ \
  --storage-class STANDARD_IA

# Verify upload
if aws s3 ls $S3_BUCKET/$FILENAME; then
  echo "✅ Uploaded to S3"
else
  echo "❌ S3 upload failed"
  exit 1
fi

# Cleanup old local backups (keep last 7 days)
find $BACKUP_DIR -type f -name "*.sql.gz" -mtime +7 -delete
echo "🗑️  Cleaned up old local backups"

# Send notification
curl -X POST $SLACK_WEBHOOK \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"✅ Database backup complete: $FILENAME ($SIZE)\"}"

echo "✅ Backup complete!"
```

MySQL

bash

#!/bin/bash
# scripts/backup-mysql.sh

set -e

DB_NAME="production"
DB_USER="root"
DB_PASSWORD=$MYSQL_PASSWORD
DATE=$(date +%Y%m%d_%H%M%S)
FILENAME="${DB_NAME}_${DATE}.sql.gz"

echo "🔄 Starting MySQL backup..."

# Backup with mysqldump
mysqldump \
  --user=$DB_USER \
  --password=$DB_PASSWORD \
  --single-transaction \
  --quick \
  --lock-tables=false \
  --databases $DB_NAME \
  | gzip > /var/backups/mysql/$FILENAME

# Upload to S3
aws s3 cp /var/backups/mysql/$FILENAME s3://my-backups/mysql/

echo "✅ Backup complete!"

Restore Procedures

Full Restore

bash

#!/bin/bash
# scripts/restore-postgres.sh

set -e

BACKUP_FILE=$1
RESTORE_DB="production_restored"

if [ -z "$BACKUP_FILE" ]; then
  echo "Usage: ./restore-postgres.sh <backup-file>"
  exit 1
fi

echo "🔄 Starting restore from: $BACKUP_FILE"

# 1. Download from S3
echo "📥 Downloading backup..."
aws s3 cp s3://my-backups/postgres/$BACKUP_FILE /tmp/

# 2. Create new database
echo "🗄️  Creating database..."
psql -h $DB_HOST -U postgres -c "CREATE DATABASE $RESTORE_DB;"

# 3. Restore backup
echo "🔄 Restoring data..."
pg_restore \
  --host=$DB_HOST \
  --username=postgres \
  --dbname=$RESTORE_DB \
  --verbose \
  /tmp/$BACKUP_FILE

# 4. Verify restore
echo "✅ Verifying restore..."
TABLE_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "  Tables restored: $TABLE_COUNT"

ROW_COUNT=$(psql -h $DB_HOST -U postgres -d $RESTORE_DB -t -c "SELECT COUNT(*) FROM users;")
echo "  User rows: $ROW_COUNT"

echo "✅ Restore complete!"
echo "  Database: $RESTORE_DB"
echo "  To use: UPDATE application config to point to $RESTORE_DB"

Point-in-Time Recovery (PITR)

bash

#!/bin/bash
# scripts/pitr-restore.sh

TARGET_TIME=$1  # Format: 2024-01-15 14:30:00

echo "🔄 Point-in-Time Restore to: $TARGET_TIME"

# 1. Restore base backup
echo "📦 Restoring base backup..."
pg_basebackup -D /var/lib/postgresql/data -X stream

# 2. Configure recovery
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'aws s3 cp s3://my-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF

# 3. Start PostgreSQL
echo "🚀 Starting PostgreSQL in recovery mode..."
systemctl start postgresql

# 4. Wait for recovery
while ! pg_isready; do
  echo "  Waiting for recovery..."
  sleep 5
done

echo "✅ PITR complete!"

Validation Checks

bash

#!/bin/bash
# scripts/validate-restore.sh

DB=$1

echo "🔍 Validating restore..."

# 1. Check table count
TABLES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
echo "Tables: $TABLES"

if [ "$TABLES" -lt 10 ]; then
  echo "❌ Too few tables restored"
  exit 1
fi

# 2. Check row counts
for table in users products orders; do
  ROWS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM $table;")
  echo "  $table: $ROWS rows"

  if [ "$ROWS" -lt 1 ]; then
    echo "❌ Table $table is empty"
    exit 1
  fi
done

# 3. Check constraints
CONSTRAINTS=$(psql -d $DB -t -c "SELECT COUNT(*) FROM information_schema.table_constraints WHERE constraint_type='FOREIGN KEY';")
echo "Foreign keys: $CONSTRAINTS"

# 4. Check indexes
INDEXES=$(psql -d $DB -t -c "SELECT COUNT(*) FROM pg_indexes WHERE schemaname='public';")
echo "Indexes: $INDEXES"

# 5. Test query performance
START=$(date +%s%N)
psql -d $DB -c "SELECT COUNT(*) FROM users WHERE email LIKE '%@example.com%';" > /dev/null
END=$(date +%s%N)
DURATION=$(( (END - START) / 1000000 ))
echo "Query performance: ${DURATION}ms"

if [ "$DURATION" -gt 1000 ]; then
  echo "⚠️  Slow query - missing indexes?"
fi

echo "✅ Validation complete!"

Disaster Recovery Runbook

markdown

# Disaster Recovery Runbook

## Incident Response

### 1. Assess Situation (5 minutes)

- [ ] Identify incident severity (P0/P1/P2)
- [ ] Determine data loss window
- [ ] Notify stakeholders

**Contacts:**

- DBA On-Call: [phone]
- Engineering Lead: [phone]
- CTO: [phone]

### 2. Stop the Bleeding (10 minutes)

- [ ] Enable maintenance mode
- [ ] Stop writes to corrupted database
- [ ] Preserve evidence (logs, backups)

```bash
# Enable maintenance mode
kubectl scale deployment/api --replicas=0
```

3. Identify Recovery Point (15 minutes)

• Determine last good backup
• Check backup integrity
• Calculate data loss

bash

# List available backups
aws s3 ls s3://my-backups/postgres/ | tail -20

# Check backup size
aws s3 ls s3://my-backups/postgres/production_20240115_020000.sql.gz --human-readable

4. Prepare Recovery Environment (30 minutes)

• Spin up new database instance
• Configure networking
• Test connectivity

bash

# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier production-recovery \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --master-username postgres \
  --master-user-password [secure-password]

5. Execute Restore (1-2 hours)

• Download backup from S3
• Run restore script
• Apply transaction logs (if PITR)
• Verify data integrity

bash

# Run restore
./scripts/restore-postgres.sh production_20240115_020000.sql.gz

# Validate
./scripts/validate-restore.sh production_restored

6. Validate and Test (30 minutes)

• Run validation scripts
• Test critical queries
• Verify row counts
• Check data consistency

7. Cutover (15 minutes)

• Update application config
• Point DNS to new database
• Disable maintenance mode
• Monitor for errors

bash

# Update connection string
kubectl set env deployment/api DATABASE_URL=postgresql://...

# Scale up
kubectl scale deployment/api --replicas=3

8. Post-Recovery (1 hour)

• Monitor system health
• Verify user reports
• Document incident
• Schedule postmortem

Recovery Time Objective (RTO)

Scenario	Target	Actual
Full restore	2 hours	[measured]
PITR restore	3 hours	[measured]
Region failover	15 minutes	[measured]

Recovery Point Objective (RPO)

Backup Type	Data Loss Window
Full backup	24 hours
Incremental	1 hour
Transaction logs	15 minutes

code


## Automated Backup Monitoring

```typescript
// scripts/monitor-backups.ts
import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

async function checkBackupHealth() {
  const bucket = 'my-backups';
  const prefix = 'postgres/';

  // List recent backups
  const command = new ListObjectsV2Command({
    Bucket: bucket,
    Prefix: prefix,
    MaxKeys: 10,
  });

  const response = await s3.send(command);
  const backups = response.Contents || [];

  // Check last backup age
  const latestBackup = backups[0];
  const age = Date.now() - new Date(latestBackup.LastModified!).getTime();
  const ageHours = age / (1000 * 60 * 60);

  if (ageHours > 25) {
    console.error('❌ No backup in last 24 hours!');
    // Send alert
    await sendSlackAlert('No recent database backup!');
    process.exit(1);
  }

  // Check backup size
  const size = latestBackup.Size! / (1024 * 1024 * 1024); // GB
  if (size < 10) {
    console.error('⚠️  Backup size suspiciously small');
  }

  console.log('✅ Backup health check passed');
  console.log(`  Latest: ${latestBackup.Key}`);
  console.log(`  Age: ${ageHours.toFixed(1)} hours`);
  console.log(`  Size: ${size.toFixed(2)} GB`);
}

checkBackupHealth();

Role Assignments

markdown

## DR Team Roles

### Database Administrator (Primary)

- Execute restore procedures
- Verify data integrity
- Monitor recovery progress

### Engineering Lead

- Coordinate response
- Communicate with stakeholders
- Make cutover decisions

### DevOps Engineer

- Provision infrastructure
- Update application configs
- Monitor system health

### Product Manager

- Assess business impact
- Prioritize recovery
- Customer communication

## Escalation Path

1. DBA on-call →
2. Engineering Lead →
3. CTO →
4. CEO (P0 incidents only)

Best Practices

•Test restores regularly: Quarterly DR drills
•Automate backups: Never rely on manual processes
•Multiple locations: Cross-region backup storage
•Monitor backup health: Alert on failures
•Document procedures: Keep runbook updated
•Encrypt backups: Protect sensitive data
•Version control: Track backup script changes

backup-restore-runbook-generator

Backup/Restore Runbook Generator

Backup Strategy

MySQL

Restore Procedures

Full Restore

Point-in-Time Recovery (PITR)

Validation Checks

Disaster Recovery Runbook

3. Identify Recovery Point (15 minutes)

4. Prepare Recovery Environment (30 minutes)

5. Execute Restore (1-2 hours)

6. Validate and Test (30 minutes)

7. Cutover (15 minutes)

8. Post-Recovery (1 hour)

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

Role Assignments

Best Practices

Output Checklist