Rollback Strategy Advisor
Provide safe and effective rollback strategies for failed deployments.
Core Capabilities
This skill helps recover from failed deployments by:
- Assessing failure impact - Identifying what failed and what needs rollback
- Recommending rollback strategy - Choosing the appropriate approach based on failure type
- Providing step-by-step guidance - Clear procedural instructions for execution
- Validating rollback success - Ensuring system returns to stable state
- Preventing data loss - Protecting critical data during rollback operations
Rollback Strategy Workflow
Step 1: Assess the Failure
Understand what failed and the scope of impact.
Key Questions:
- What component failed? (Application, database, infrastructure, configuration)
- When did the failure occur? (During deployment, post-deployment, gradual degradation)
- What is the current system state? (Partially deployed, fully deployed, crashed)
- Is the system serving traffic? (Production load, maintenance mode, offline)
- Are there data changes? (Database migrations applied, user data modified)
Gather Information:
# Check deployment status
docker ps -a                              # Container status
docker logs <container-name> --tail 100   # Recent logs
# Check system health
curl http://localhost:8080/health         # Health endpoint
docker stats --no-stream                  # Resource usage (single snapshot instead of a live stream)
# Identify deployment artifacts
docker images | grep <app-name>           # Available images
git log --oneline -10                     # Recent commits
Output: Failure Assessment
Component: Application container
Failure Time: 5 minutes post-deployment
System State: New version deployed, returning 500 errors
Traffic: Receiving production traffic (degraded service)
Data Changes: No database migrations in this deployment
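The information-gathering commands above are easier to reuse in the incident report if their output is saved as it is collected. The following is a minimal bash sketch of that idea; the `capture` helper and the `REPORT` variable are hypothetical names introduced here for illustration, not part of Docker or any other tool.

```shell
# Sketch: append the output of each diagnostic command to one report file.
# REPORT and capture() are hypothetical names used only for this example.
REPORT="${REPORT:-failure_assessment_$(date +%Y%m%d_%H%M%S).txt}"

# capture LABEL COMMAND [ARGS...]
# Appends a labelled section with the command's stdout and stderr to $REPORT.
# A failing command is recorded in the report instead of aborting the capture.
capture() {
  local label=$1
  shift
  {
    printf '=== %s ===\n' "$label"
    "$@" 2>&1 || printf '(command exited with status %s)\n' "$?"
    printf '\n'
  } >> "$REPORT"
}

# Example calls (run on the affected host):
# capture "Container status" docker ps -a
# capture "Resource usage"   docker stats --no-stream
# capture "Recent commits"   git log --oneline -10
```

Recording failures rather than stopping on them is deliberate here: during an incident, a command that cannot run is itself useful evidence.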
Step 2: Choose Rollback Strategy
Select the appropriate strategy based on failure type and system state.
Strategy Decision Tree:
Is database migration involved?
├─ YES → See "Database Rollback Strategy" (Step 3.4)
└─ NO → Continue
   Is infrastructure changed?
   ├─ YES → See "Infrastructure Rollback Strategy" (Step 3.3)
   └─ NO → Continue
      Is configuration changed?
      ├─ YES → See "Configuration Rollback Strategy" (Step 3.2)
      └─ NO → Application Code Rollback (Step 3.1)
Common Strategies:
| Failure Type | Strategy | Risk Level | Downtime |
|---|---|---|---|
| Application code bug | Redeploy previous image | Low | Seconds |
| Configuration error | Restore previous config | Low | Seconds |
| Infrastructure change | Revert compose file | Medium | Minutes |
| Database migration | Reverse migration + app rollback | High | Minutes |
| Multiple components | Sequential rollback (reverse order) | High | Minutes |
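The decision tree in this step can also be written as a small bash function, which may help when the choice is made from a script or checklist. This is only a sketch: the function name `choose_strategy` and its yes/no arguments are assumptions made for this example.

```shell
# choose_strategy DB_MIGRATION INFRA_CHANGED CONFIG_CHANGED
# Each argument is "yes" or "no". Prints the step to follow, checking the
# questions in the same order as the decision tree (database first).
choose_strategy() {
  local db=$1 infra=$2 config=$3
  if [ "$db" = "yes" ]; then
    echo "Step 3.4: Database rollback"
  elif [ "$infra" = "yes" ]; then
    echo "Step 3.3: Infrastructure rollback"
  elif [ "$config" = "yes" ]; then
    echo "Step 3.2: Configuration rollback"
  else
    echo "Step 3.1: Application code rollback"
  fi
}

# Example: a deployment that changed configuration only.
# choose_strategy no no yes
```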
Step 3: Execute Rollback
Perform the rollback with validation at each step.
Step 3.1: Application Code Rollback
Revert to the previous working application version.
Standard Procedure:
# 1. Identify previous working version
docker images | grep <app-name>
# Look for the previous tag (e.g., v1.2.3 if current is v1.2.4)
# 2. Stop current containers
docker-compose stop <service-name>
# 3. Update docker-compose.yml to previous version
# Change: image: myapp:v1.2.4 → image: myapp:v1.2.3
# 4. Start with previous version
docker-compose up -d <service-name>
# 5. Validate rollback (see Step 4)
curl http://localhost:8080/health
docker-compose logs --tail 50 <service-name>
Fast Rollback (if compose file unchanged):
# Option A: run the previous image tag directly
docker-compose stop <service-name>
docker run -d --name <service-name> \
  --network <network-name> \
  -p 8080:8080 \
  myapp:v1.2.3
# Option B: update the compose file and restart
sed -i 's/myapp:v1\.2\.4/myapp:v1.2.3/g' docker-compose.yml
docker-compose up -d <service-name>
Considerations:
- Keep previous images available (don't prune immediately after deploy)
- Tag images with version numbers or git commit SHAs
- Test the previous version still works in staging first if possible
- Monitor resource usage during rollback
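When images are tagged with version numbers, the previous tag can be looked up instead of read off by eye. The sketch below assumes version-like tags (such as v1.2.3) and a `sort` that supports `-V` (GNU coreutils); `previous_tag` is a hypothetical helper name. Note that "next lowest version" is not necessarily "last known-good version", so the result still needs a human check.

```shell
# previous_tag CURRENT_TAG
# Reads one tag per line on stdin and prints the highest tag that sorts
# directly below CURRENT_TAG in version order. Returns 1 if there is none.
previous_tag() {
  local current=$1
  # Add CURRENT_TAG to the list so it is found even if it was not supplied.
  { cat; printf '%s\n' "$current"; } |
    sort -V -u |
    awk -v cur="$current" '
      $0 == cur {
        if (prev != "") {
          print prev
          found = 1
        }
        exit
      }
      { prev = $0 }
      END { exit !found }
    '
}

# Example (filters out non-version tags such as "latest" first):
# docker images --format '{{.Tag}}' myapp | grep '^v[0-9]' | previous_tag v1.2.4
```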
Step 3.2: Configuration Rollback
Restore previous configuration files or environment variables.
Configuration File Rollback:
# 1. Locate configuration backup or git history
git log -- config/app.conf
git show HEAD~1:config/app.conf > config/app.conf
# 2. Update mounted config in docker-compose.yml if needed
# Ensure volume mount points to correct config
# 3. Restart services to load previous config
docker-compose restart <service-name>
# 4. Validate configuration loaded correctly
docker exec <service-name> cat /app/config/app.conf
curl http://localhost:8080/health
Environment Variable Rollback:
# 1. Edit docker-compose.yml to restore previous env vars
# Update the environment section or .env file
# 2. Recreate container so it picks up the restored env vars
docker-compose up -d --force-recreate <service-name>
# 3. Verify environment variables
docker exec <service-name> env | grep APP_
Feature Flag Rollback:
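# If using feature flags, disable the problematic feature
# Update flag config or environment variable
# Example: FEATURE_NEW_CHECKOUT=false
# A flag read from a mounted config file only needs a restart:
docker-compose restart <service-name>
# A flag set in the environment section or .env file needs the container recreated,
# because restart does not apply changes made to the compose configuration:
docker-compose up -d --force-recreate <service-name>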
Considerations:
- Keep configuration in version control (git)
- Use .env files for environment-specific configs
- Backup configs before deployment
- Validate config syntax before applying
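For configuration that is not tracked in git, the "backup configs before deployment" point can be covered with a pair of small bash helpers. This is a sketch under that assumption; `backup_config` and `restore_config` are hypothetical names, and the backup file naming is just one possible convention.

```shell
# backup_config FILE
# Copies FILE to FILE.bak.<timestamp> and prints the backup path.
backup_config() {
  local file=$1 backup
  backup="${file}.bak.$(date +%Y%m%d_%H%M%S)"
  cp -p -- "$file" "$backup" && printf '%s\n' "$backup"
}

# restore_config BACKUP FILE
# Restores FILE from BACKUP, refusing to use a missing or empty backup.
restore_config() {
  local backup=$1 file=$2
  if [ ! -s "$backup" ]; then
    echo "restore_config: backup '$backup' is missing or empty" >&2
    return 1
  fi
  cp -p -- "$backup" "$file"
}

# Example:
# bak=$(backup_config config/app.conf)    # before deploying
# restore_config "$bak" config/app.conf   # if the deployment fails
```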
Step 3.3: Infrastructure Rollback
Revert infrastructure changes like network configurations, volume mounts, or docker-compose structure.
Docker Compose Rollback:
# 1. Restore previous docker-compose.yml from git
git checkout HEAD~1 -- docker-compose.yml
# 2. Recreate infrastructure
docker-compose down
docker-compose up -d
# 3. Validate all services running
docker-compose ps
docker-compose logs --tail 50
Network Configuration Rollback:
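# If network configuration changed
# 1. Disconnect containers from the new network, then remove it
#    (a network with attached containers cannot be removed)
docker network disconnect <new-network> <container-name>
docker network rm <new-network>
# 2. Recreate previous network
docker network create --driver bridge <old-network>
# 3. Reconnect containers
docker network connect <old-network> <container-name>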
Volume Rollback:
# If volume mounts changed (be careful with data!)
# 1. Stop services
docker-compose stop
# 2. Update docker-compose.yml volume configuration
git checkout HEAD~1 -- docker-compose.yml
# 3. Restart services
docker-compose up -d
# Note: Data in volumes persists, only mount configuration changes
Considerations:
- Infrastructure changes may affect multiple services
- Test in staging environment first if possible
- Document infrastructure dependencies
- Consider using infrastructure-as-code tools (Terraform)
Step 3.4: Database Rollback
Reverse database migrations and restore schema to previous state.
Migration Rollback (with Migration Tool):
# Using Alembic (Python); run it in the container that has Alembic and the migration scripts
docker exec <app-container> alembic downgrade -1           # Rollback one migration
docker exec <app-container> alembic downgrade <revision>   # Rollback to specific revision
# Using Flyway (Java); undo needs undo migrations and is not available in every Flyway edition
docker exec <app-container> flyway undo                    # Rollback last migration
# Using Django; <migration> is the migration to return to, not the one being undone
docker exec <app-container> python manage.py migrate <app> <migration>
# Using Rails
docker exec <app-container> rails db:rollback STEP=1
Manual Migration Rollback:
# 1. Identify the migration to reverse
docker exec <db-container> psql -U user -d dbname -c "\d+"   # List tables
# 2. Execute reverse migration SQL
docker exec <db-container> psql -U user -d dbname -f /migrations/rollback_v1.2.4.sql
# 3. Verify schema state
docker exec <db-container> psql -U user -d dbname -c "\d table_name"
Database Rollback with Application:
# CRITICAL: Stop the application first, roll back the database second, and start
# the previous application version last, so that neither app version runs
# against a schema it does not expect
# 1. Stop application (prevent new requests)
docker-compose stop app-service
# 2. Backup current database state
docker exec <db-container> pg_dump -U user dbname > backup_$(date +%Y%m%d_%H%M%S).sql
# 3. Rollback migration in a one-off container of the current (new) image,
#    assuming Alembic and the migration scripts ship in the app image
docker-compose run --rm app-service alembic downgrade -1
# 4. Rollback application to version compatible with old schema
sed -i 's/myapp:v1\.2\.4/myapp:v1.2.3/g' docker-compose.yml
docker-compose up -d app-service
# 5. Validate
curl http://localhost:8080/health
docker-compose logs --tail 50 app-service
Considerations:
- Always backup before rollback - Database changes are risky
- Coordinate app and DB rollback carefully
- Test rollback migrations in staging
- Consider data loss implications (irreversible data changes)
- For destructive migrations (dropped columns), may need data restore from backup
- Use database versioning tools (Alembic, Flyway, Liquibase)
See references/database_rollback_patterns.md for detailed migration rollback examples and data preservation strategies.
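Because the order of the database steps matters, it can help to run them through a wrapper that stops at the first failure, so that, for example, a failed backup prevents the migration downgrade from running. The bash sketch below shows one way to do that; `run_step` is a hypothetical helper, and the container and service names in the usage comment are placeholders.

```shell
# run_step DESCRIPTION COMMAND [ARGS...]
# Runs one step, reports the result, and returns the command's exit status
# so that steps chained with && stop at the first failure.
run_step() {
  local desc=$1
  shift
  echo ">> $desc"
  if "$@"; then
    echo "   ok"
  else
    local status=$?
    echo "   FAILED (exit $status): not running later steps" >&2
    return "$status"
  fi
}

# Example wiring (names are placeholders; adjust to the actual setup):
# run_step "Stop application"   docker-compose stop app-service &&
# run_step "Reverse migration"  docker-compose run --rm app-service alembic downgrade -1 &&
# run_step "Start old version"  docker-compose up -d app-service
```

Chaining with `&&` keeps the sequence readable while making the "validate each step" rule mechanical rather than a matter of discipline.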
Step 4: Validate Rollback Success
Confirm the system is working correctly after rollback.
Health Checks:
# 1. Container health
docker ps            # All containers running?
docker-compose ps    # Services in "Up" state?
# 2. Application health
curl http://localhost:8080/health
curl -I http://localhost:8080    # HTTP status code
# 3. Service logs (2>&1 so that messages written to stderr are also searched)
docker logs <service-name> --tail 100 2>&1 | grep ERROR
docker logs <service-name> --tail 100 2>&1 | grep WARN
# 4. Database connectivity (assumes the psql client is installed in the app image)
docker exec <app-container> psql -h <db-host> -U user -d dbname -c "SELECT 1;"
# 5. Resource usage
docker stats --no-stream
Functional Testing:
# Test critical user flows
curl -X POST http://localhost:8080/api/login -H 'Content-Type: application/json' -d '{"user":"test","pass":"test"}'
curl http://localhost:8080/api/users/1
# Run smoke tests if available
docker exec <app-container> pytest tests/smoke/
# Check monitoring dashboards
# - Response times back to normal?
# - Error rates dropped?
# - Traffic being served?
Validation Checklist:
- ✓ All containers running
- ✓ Health endpoints returning 200
- ✓ No error spikes in logs
- ✓ Database queries executing
- ✓ Critical API endpoints responding
- ✓ Monitoring shows normal metrics
- ✓ Users can access the application
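A service often needs a few seconds after a rollback before its health endpoint responds, so a single `curl` can report a false failure. The following bash sketch retries a check a fixed number of times; `retry` is a hypothetical helper name, and the attempt count and delay in the usage comment are illustrative values only.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "retry: '$*' still failing after $attempts attempts" >&2
  return 1
}

# Example: allow roughly one minute for the health endpoint to come back.
# curl -f makes HTTP error responses (status 400 and above) count as failures.
# retry 12 5 curl -fsS -o /dev/null http://localhost:8080/health
```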
Step 5: Document and Communicate
Record the incident and inform stakeholders.
Incident Report Template:
## Deployment Rollback - [Date/Time]
**Summary:** Brief description of what failed and rollback action taken
**Timeline:**
- [Time] - Deployment started (v1.2.4)
- [Time] - Failure detected (500 errors)
- [Time] - Rollback initiated
- [Time] - Rollback completed
- [Time] - System validated stable
**Root Cause:** What caused the deployment to fail
**Rollback Actions:**
1. Stopped application service
2. Reverted docker-compose.yml to v1.2.3
3. Restarted service
4. Validated health checks
**Impact:**
- Downtime: X minutes
- Affected users: Y requests failed
- Data loss: None
**Follow-up Actions:**
- [ ] Fix root cause in v1.2.5
- [ ] Add test coverage for failure scenario
- [ ] Update deployment checklist
- [ ] Review rollback procedure effectiveness
Communication:
Team notification (Slack/email):
🚨 Deployment Rollback Completed
We rolled back the v1.2.4 deployment due to [issue].
System is now stable on v1.2.3.
Impact: X minutes downtime
Status: Fully operational
Next steps: Root cause analysis, fix in v1.2.5
For details see: [link to incident report]
Step 6: Prevent Future Failures
Analyze the incident and improve deployment practices.
Post-Incident Review:
- What went wrong?
  - Code bug not caught in testing
  - Configuration incompatibility
  - Missing database index caused performance degradation
  - Infrastructure resource limits exceeded
- Why wasn't it caught earlier?
  - Insufficient test coverage
  - Staging environment differs from production
  - Load testing not performed
  - Migration not tested with production data volume
- What can prevent this?
  - Add integration test for failure scenario
  - Improve staging/production parity
  - Implement canary deployments
  - Add automated rollback triggers
  - Enhance monitoring and alerting
Deployment Improvements:
# Implement health checks in docker-compose.yml
# (the test command runs inside the container, so curl must exist in the image)
services:
  app:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Rollback Automation:
#!/bin/bash
# rollback.sh - Quick rollback to a previous version
PREVIOUS_VERSION=$1
if [ -z "$PREVIOUS_VERSION" ]; then
  echo "Usage: ./rollback.sh <version>"
  exit 1
fi
# Stop at the first failing command instead of continuing the rollback
set -e
echo "Rolling back to $PREVIOUS_VERSION..."
docker-compose stop app
sed -i "s/myapp:.*$/myapp:$PREVIOUS_VERSION/g" docker-compose.yml
docker-compose up -d app
docker-compose ps
echo "Rollback complete. Check logs: docker-compose logs app"
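Editing the compose file with a broad `sed` pattern can silently do nothing (if the image name is misspelled) or touch more than intended. As a more defensive sketch, the bash function below only rewrites `image:` lines for the named image, keeps a `.bak` copy, and fails when nothing matched. `set_image_tag` is a hypothetical helper; it assumes unquoted image values and an image name without regular-expression metacharacters.

```shell
# set_image_tag COMPOSE_FILE IMAGE TAG
# Rewrites "image: IMAGE:<old-tag>" lines to "image: IMAGE:TAG" and keeps the
# previous file as COMPOSE_FILE.bak. Returns 1 if no matching line exists.
set_image_tag() {
  local file=$1 image=$2 tag=$3
  if ! grep -Eq "^[[:space:]]*image:[[:space:]]*${image}:" "$file"; then
    echo "set_image_tag: no 'image: ${image}:' line in $file" >&2
    return 1
  fi
  sed -i.bak -E "s|^([[:space:]]*image:[[:space:]]*${image}):.*$|\1:${tag}|" "$file"
}

# Example:
# set_image_tag docker-compose.yml myapp v1.2.3 && docker-compose up -d app
```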
Best Practices:
- Maintain deployment history - Keep previous images, configs, and compose files
- Test rollback procedures - Practice rollbacks in staging regularly
- Automate health checks - Use docker healthchecks and monitoring
- Version everything - Tag images, version configs, track migrations
- Backup before risky changes - Database backups before migrations
- Document dependencies - Track what depends on what for coordinated rollbacks
- Gradual rollouts - Use canary or blue-green deployments when possible
- Monitor post-deployment - Watch metrics closely for 30+ minutes after deploy
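Post-deployment monitoring and automated rollback triggers can be combined in a simple watcher: poll a health check for a fixed period and roll back after several consecutive failures. The bash sketch below is one possible shape, not a production-ready tool. `monitor_and_rollback`, `health_check`, and `do_rollback` are hypothetical names, and the default bodies assume the health URL and `rollback.sh` script used elsewhere in this document.

```shell
# Default check and rollback actions; redefine these to fit the environment.
health_check() { curl -fsS -o /dev/null http://localhost:8080/health; }
do_rollback()  { ./rollback.sh "$PREVIOUS_VERSION"; }

# monitor_and_rollback CHECKS MAX_FAILURES DELAY_SECONDS
# Runs health_check up to CHECKS times, DELAY_SECONDS apart. If it fails
# MAX_FAILURES times in a row, runs do_rollback once and returns 1.
# Returns 0 if the monitoring period ends without triggering a rollback.
monitor_and_rollback() {
  local checks=$1 max_failures=$2 delay=$3 failures=0 i
  for ((i = 1; i <= checks; i++)); do
    if health_check; then
      failures=0
    else
      failures=$((failures + 1))
      if ((failures >= max_failures)); then
        echo "health check failed $failures times in a row; rolling back" >&2
        do_rollback
        return 1
      fi
    fi
    if ((i < checks)); then
      sleep "$delay"
    fi
  done
  return 0
}

# Example: check every 30 seconds for 30 minutes, roll back after 3 straight failures.
# PREVIOUS_VERSION=v1.2.3 monitor_and_rollback 60 3 30
```

Requiring consecutive failures, rather than reacting to a single one, is a design choice that trades a slightly slower reaction for fewer rollbacks caused by a transient blip.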
Quick Reference
Rollback Decision Matrix
| Scenario | Strategy | Estimated Time |
|---|---|---|
| App code bug, no DB changes | Redeploy previous image | 1-2 minutes |
| Config error | Restore previous config | 1-2 minutes |
| Failed DB migration | Reverse migration + app rollback | 5-10 minutes |
| Infrastructure change | Revert compose file | 3-5 minutes |
| Multiple component failure | Sequential rollback (DB → App → Infra) | 10-15 minutes |
Common Rollback Commands
# Quick app rollback
docker-compose stop <service>
sed -i 's/v1\.2\.4/v1.2.3/g' docker-compose.yml
docker-compose up -d <service>
# Config rollback
git checkout HEAD~1 -- config/
docker-compose restart <service>
# Database migration rollback (run where Alembic and the migration scripts are installed)
docker exec <app-container> alembic downgrade -1
# Full infrastructure rollback
git checkout HEAD~1 -- docker-compose.yml
docker-compose down && docker-compose up -d
Resources
- references/database_rollback_patterns.md - Detailed database migration rollback strategies and data preservation techniques
- references/platform_guides.md - Docker and Docker Compose specific rollback procedures and best practices
Best Practices
- Always backup before rollback - Especially for database changes
- Test rollback in staging first - If time permits
- Stop traffic during risky rollbacks - Prevent inconsistent state
- Rollback in reverse order - Undo changes in opposite sequence of deployment
- Validate each step - Don't proceed if validation fails
- Document everything - Create audit trail for compliance and learning
- Communicate clearly - Keep stakeholders informed of status
- Practice rollbacks regularly - Ensure procedures work when needed
- Automate common rollbacks - Reduce human error and recovery time
- Learn from failures - Use incidents to improve deployment process