Deployment Runbook
Overview
This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.
When to Use This Skill
- •Planning production deployments
- •Executing staged rollouts (canary, blue-green)
- •Performing post-deployment health checks
- •Rolling back failed deployments
- •Troubleshooting deployment issues
- •Establishing deployment best practices
- •Complementing the devops-automation agent for deployments
Pre-Deployment Checklist
Before any production deployment:
- • Code Review: All changes reviewed and approved
- • Tests Pass: CI/CD pipeline green
  - •Unit tests: ✓
  - •Integration tests: ✓
  - •E2E tests: ✓
- • Database Migrations: Tested in staging
  - •Backward compatible
  - •Rollback script prepared
- • Configuration: Environment variables verified
  - •Secrets rotated if needed
  - •Feature flags configured
- • Monitoring: Dashboards and alerts ready
  - •Error tracking enabled
  - •Performance monitoring active
  - •Log aggregation configured
- • Communication: Stakeholders notified
  - •Deployment window announced
  - •On-call engineer assigned
  - •Rollback plan documented
- • Backups: Recent backup verified
  - •Database backed up < 1 hour ago
  - •Backup restoration tested
- • Capacity: Resources scaled appropriately
  - •Auto-scaling configured
  - •Rate limits reviewed
  - •CDN cache warmed
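Checklist items with a numeric threshold can be verified by a script rather than by eye. A minimal sketch for the "backed up < 1 hour ago" item, assuming the backup tooling can report a timezone-aware completion timestamp (the function name `backup_is_fresh` is hypothetical and not part of this skill's scripts):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the checklist: "Database backed up < 1 hour ago".
MAX_BACKUP_AGE = timedelta(hours=1)


def backup_is_fresh(backup_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the backup finished less than MAX_BACKUP_AGE ago.

    Both timestamps must be timezone-aware. A backup dated in the future
    is treated as not fresh, since it points to a clock or data problem.
    """
    now = now or datetime.now(timezone.utc)
    age = now - backup_time
    return timedelta(0) <= age < MAX_BACKUP_AGE
```

A pre-deployment script could call this and abort when it returns False.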
Deployment Strategies
1. Blue-Green Deployment
Best for: Zero-downtime deployments, easy rollbacks
Process:
- •Deploy to inactive (green) environment
- •Run health checks on green
- •Switch traffic from blue to green
- •Monitor for issues
- •Keep blue as instant rollback option
Commands:
# Deploy to green environment
./deploy.sh --env green
# Run health checks
python3 scripts/health_check.py --env green
# Switch traffic (gradual)
./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100
# If issues: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
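The gradual switch above can be wrapped in a loop that checks health after each step and sends all traffic back to blue on the first failure. This is a sketch only: `set_green_percentage` and `is_healthy` are hypothetical callables standing in for `switch_traffic.sh` and `scripts/health_check.py`, whose real interfaces may differ.

```python
from typing import Callable, Sequence


def shift_traffic(
    set_green_percentage: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (10, 50, 100),
) -> bool:
    """Move traffic to green step by step, checking health after each step.

    Returns True if every step passed its health check. On the first
    failed check, all traffic is sent back to blue and False is returned.
    """
    for percentage in steps:
        set_green_percentage(percentage)
        if not is_healthy():
            set_green_percentage(0)  # 0% on green means 100% back on blue
            return False
    return True


# Usage with stand-in callables that only record what was requested.
history = []
ok = shift_traffic(history.append, lambda: True)
print(ok, history)  # True [10, 50, 100]
```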
2. Canary Deployment
Best for: Risk-averse deployments, gradual rollouts
Process:
- •Deploy to small subset of servers (5-10%)
- •Monitor metrics closely
- •Gradually increase percentage
- •Roll back if metrics degrade
Monitoring During Canary:
- •Error rate < baseline + 1%
- •Response time < baseline + 10%
- •Success rate > 99.9%
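The canary thresholds can be encoded as a single comparison against baseline metrics. The sketch below makes two assumptions that the list leaves open: "baseline + 1%" for error rate is read as one percentage point, and "baseline + 10%" for response time is read as a relative increase. The `Metrics` type and `canary_is_healthy` name are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, e.g. 0.002 = 0.2%
    response_time_ms: float  # e.g. mean or p95 latency in milliseconds
    success_rate: float      # fraction of successful requests


def canary_is_healthy(baseline: Metrics, canary: Metrics) -> bool:
    """Apply the three canary thresholds; all must hold."""
    # Assumption: "+ 1%" means one percentage point above baseline.
    error_ok = canary.error_rate < baseline.error_rate + 0.01
    # Assumption: "+ 10%" means 10% slower than baseline, relatively.
    latency_ok = canary.response_time_ms < baseline.response_time_ms * 1.10
    success_ok = canary.success_rate > 0.999
    return error_ok and latency_ok and success_ok
```

If the team defines the thresholds differently, only the two commented lines need to change.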
3. Rolling Deployment
Best for: Standard updates, resource-constrained environments
Process:
- •Take one instance out of load balancer
- •Deploy new version
- •Run health checks
- •Add back to load balancer
- •Repeat for remaining instances
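The rolling process can be sketched as a loop over instances. The four callables are hypothetical hooks for whatever load balancer and deploy tooling is in use. Halting at the first unhealthy instance is one possible policy, chosen here so that a bad build reaches at most one instance.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    instances: Iterable[str],
    remove_from_lb: Callable[[str], None],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    add_to_lb: Callable[[str], None],
) -> List[str]:
    """Update instances one at a time and stop at the first failed check.

    Returns the instances that were updated and put back into rotation.
    """
    updated = []
    for instance in instances:
        remove_from_lb(instance)
        deploy(instance)
        if not is_healthy(instance):
            # Leave the unhealthy instance out of rotation and halt.
            break
        add_to_lb(instance)
        updated.append(instance)
    return updated
```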
Deployment Workflow
Phase 1: Pre-Deployment (T-30 minutes)
# 1. Verify staging environment
./verify_staging.sh
# 2. Create deployment tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3
# 3. Trigger production build
./build_production.sh --tag v1.2.3
# 4. Backup database
./backup_db.sh --environment production
# 5. Notify team
./notify_slack.sh "🚀 Starting deployment v1.2.3 in 30 minutes"
Phase 2: Deployment (T-0)
# 1. Enable maintenance mode (if needed)
./maintenance_mode.sh --enable
# 2. Run database migrations
./run_migrations.sh --environment production
# 3. Deploy application
./deploy.sh --environment production --version v1.2.3
# 4. Disable maintenance mode
./maintenance_mode.sh --disable
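When these steps are run from a script, it helps to stop at the first failing command so that, for example, the application is not deployed on top of a failed migration. A minimal sketch, assuming each step signals failure with a non-zero exit code (the `run_steps` helper is hypothetical):

```python
import subprocess
from typing import Callable, Optional, Sequence


def run_steps(
    steps: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> Optional[Sequence[str]]:
    """Run commands in order and stop at the first non-zero exit code.

    Returns the command that failed, or None if every command succeeded.
    `run` defaults to subprocess.call and can be replaced for dry runs.
    """
    for command in steps:
        if run(command) != 0:
            return command
    return None


# The Phase 2 commands expressed as argument lists (not executed here).
PHASE_2 = [
    ["./maintenance_mode.sh", "--enable"],
    ["./run_migrations.sh", "--environment", "production"],
    ["./deploy.sh", "--environment", "production", "--version", "v1.2.3"],
    ["./maintenance_mode.sh", "--disable"],
]
```

Calling `run_steps(PHASE_2)` would execute the real scripts. What to do with maintenance mode after a failure is left to the operator.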
Phase 3: Post-Deployment Health Checks
# Run comprehensive health checks
python3 scripts/health_check.py --environment production
# Expected output:
# ✓ API health endpoint responding
# ✓ Database connectivity OK
# ✓ Cache layer accessible
# ✓ External services reachable
# ✓ Error rate within threshold
# ✓ Response time within SLA
Phase 4: Monitoring (T+30 minutes)
Monitor these metrics:
Application Metrics:
- •Error rate: < 0.1%
- •Response time (p95): < 200ms
- •Request throughput: within expected range
- •Success rate: > 99.9%
Infrastructure Metrics:
- •CPU utilization: < 70%
- •Memory usage: < 80%
- •Disk I/O: normal patterns
- •Network latency: < 50ms
Business Metrics:
- •Conversion rate: no significant drop
- •User signups: within expected range
- •Transaction volume: normal patterns
Rollback Procedures
When to Roll Back
Roll back immediately if:
- •Error rate > 1%
- •Critical functionality broken
- •Data corruption detected
- •Security vulnerability introduced
- •Performance degradation > 50%
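The rollback criteria can be expressed as one decision function so that alerts and humans apply the same rule. This sketch assumes "performance degradation > 50%" is measured as p95 latency relative to a pre-deployment baseline. It folds the three non-numeric criteria (broken functionality, data corruption, security vulnerability) into a single `critical_failure` flag set by whoever detects them. All names are illustrative.

```python
def should_roll_back(
    error_rate_pct: float,
    p95_ms: float,
    baseline_p95_ms: float,
    critical_failure: bool = False,
) -> bool:
    """Return True if any rollback criterion is met."""
    if critical_failure:
        return True
    if error_rate_pct > 1.0:
        return True
    # Assumption: degradation is the relative increase in p95 latency.
    return p95_ms > baseline_p95_ms * 1.5
```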
Rollback Methods
Method 1: Traffic Switch (Fastest)
# Blue-green: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
# Verification
python3 scripts/health_check.py --environment production
Method 2: Version Revert
# Deploy previous version
./deploy.sh --environment production --version v1.2.2
# Run health checks
python3 scripts/health_check.py --environment production
Method 3: Database Rollback
# If migrations were applied
./rollback_migration.sh --environment production --steps 1
# Restore from backup (last resort)
./restore_db.sh --backup latest --environment production
Post-Rollback
- •Verify system health
python3 scripts/health_check.py --environment production
- •Notify stakeholders
./notify_slack.sh "⚠️ Deployment v1.2.3 rolled back. System stable on v1.2.2"
- •Create postmortem
  - •What went wrong?
  - •Why didn't we catch it?
  - •How do we prevent recurrence?
Health Check Script
Use the included health check script:
# Run all checks
python3 scripts/health_check.py --env production
# Run specific check
python3 scripts/health_check.py --env production --check api
# Verbose output
python3 scripts/health_check.py --env production --verbose
See scripts/health_check.py for implementation.
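For orientation, the core of such a script might look like the sketch below. It is not the contents of `scripts/health_check.py`. It only illustrates one way to run named checks, print a ✓ or ✗ line for each, and report overall success. The `run_checks` function and the example check names are assumptions.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]], verbose: bool = False) -> bool:
    """Run every check, print one line per result, return True if all pass."""
    all_passed = True
    for name, check in checks.items():
        try:
            passed = bool(check())
            detail = ""
        except Exception as exc:  # a check that crashes counts as a failure
            passed = False
            detail = f" ({exc})"
        all_passed = all_passed and passed
        mark = "✓" if passed else "✗"
        print(f"{mark} {name}{detail if verbose else ''}")
    return all_passed


# Stand-in checks; real ones would query the API, database, cache, etc.
example_checks = {
    "API health endpoint responding": lambda: True,
    "Database connectivity OK": lambda: True,
}
```

A command-line wrapper would typically exit non-zero when `run_checks` returns False so that pipelines can gate on the result.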
Troubleshooting Guide
Issue: Deployment Hangs
Symptoms:
- •Deployment script doesn't complete
- •Services not starting
Diagnosis:
# Check service logs
kubectl logs -f deployment/app-name
# Check events
kubectl get events --sort-by='.lastTimestamp'
Resolution:
- •Increase timeout values
- •Check resource constraints
- •Verify image pull secrets
Issue: High Error Rate Post-Deployment
Symptoms:
- •Error rate spike
- •500 errors in logs
Diagnosis:
# Check application logs
tail -f /var/log/app/error.log
# Check error distribution
grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
Resolution:
- •Check configuration changes
- •Verify environment variables
- •Review recent code changes
- •Consider immediate rollback
Issue: Database Connection Failures
Symptoms:
- •"Connection refused" errors
- •Timeout errors
Diagnosis:
# Test database connectivity
python3 scripts/test_db_connection.py
# Check connection pool
psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"
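Before looking at credentials or pool settings, it can help to confirm that the database port is reachable at all. The following is a sketch of a plain TCP probe, not the contents of `scripts/test_db_connection.py`. The default port 5432 assumes PostgreSQL, and `can_connect` is a hypothetical helper.

```python
import socket


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout`.

    This only proves network reachability (DNS, firewall, listener);
    it does not authenticate or run a query.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

A False result points toward connection strings or firewall rules. A True result followed by application errors points toward credentials or pool exhaustion.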
Resolution:
- •Verify connection strings
- •Check firewall rules
- •Increase connection pool size
- •Verify credentials
Communication Templates
Pre-Deployment Announcement
🚀 **Production Deployment Scheduled**
**Version**: v1.2.3
**Time**: 2024-01-15 14:00 UTC (in 30 minutes)
**Duration**: ~15 minutes
**Impact**: No expected downtime
**Changes**:
- Feature: New user dashboard
- Fix: Payment processing bug
- Performance: API response time improvements
**Rollback Plan**: Blue-green switch (instant)
**On-Call**: @engineer-name
Deployment Success
✅ **Deployment Complete**
**Version**: v1.2.3
**Status**: Successful
**Duration**: 12 minutes
**Health Checks**: All passing ✓
**Metrics**: Within normal range
**Next Check**: T+30 minutes
Monitoring dashboard: [link]
Deployment Rollback
⚠️ **Deployment Rolled Back**
**Version**: v1.2.3 → v1.2.2 (rollback)
**Reason**: Elevated error rate (2.1%)
**Status**: System stable on v1.2.2
**Action Items**:
- [ ] Root cause analysis
- [ ] Fix identified issue
- [ ] Re-test in staging
- [ ] Schedule re-deployment
Incident report: [link]
Resources
scripts/
- •health_check.py: Comprehensive deployment health checks
- •test_db_connection.py: Database connectivity verification
references/
- •deployment-checklist.md: Detailed pre/post deployment checklist
- •monitoring-guide.md: Metrics to monitor during deployments
Best Practices
- •Always deploy during low-traffic windows
- •Never deploy on Fridays (unless critical hotfix)
- •Keep deployments small (< 200 lines changed)
- •Monitor for 30+ minutes post-deployment
- •Document every rollback with postmortem
- •Test rollback procedure in staging first
- •Use feature flags for risky changes
- •Automate health checks (don't rely on manual verification)
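Some of these practices can be enforced by tooling rather than memory. As a small illustration, the Friday rule could be checked at the start of a deploy script. The `deploy_allowed` function is hypothetical, and teams in other time zones or with other freeze days would adjust the weekday test.

```python
from datetime import date


def deploy_allowed(day: date, critical_hotfix: bool = False) -> bool:
    """Block Friday deployments unless this is a critical hotfix."""
    FRIDAY = 4  # date.weekday() counts Monday as 0
    return critical_hotfix or day.weekday() != FRIDAY
```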
Quick Reference
Emergency Rollback:
./switch_traffic.sh --from green --to blue --percentage 100
Health Check:
python3 scripts/health_check.py --env production
View Logs:
kubectl logs -f deployment/app-name --tail=100
Check Metrics:
curl https://metrics.example.com/api/health