DevOps & Deployment Skill
Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.
When to Use
- •Setting up CI/CD pipelines
- •Containerizing applications
- •Deploying to Kubernetes or cloud platforms
- •Implementing GitOps workflows
- •Managing infrastructure as code
- •Planning release strategies
Pipeline Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Code │──▶│ Build │──▶│ Test │──▶│ Deploy │
│ Commit │ │ & Lint │ │ & Scan │ │ & Release │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Triggers Artifacts Reports Monitoring
Key Concepts
CI/CD Pipeline Stages
- •Lint & Type Check - Code quality gates
- •Unit Tests - Test coverage with reporting
- •Security Scan - npm audit + Trivy vulnerability scanner
- •Build & Push - Docker image to container registry
- •Deploy Staging - Environment-gated deployment
- •Deploy Production - Manual approval or automated
See
templates/github-actions-pipeline.ymlfor complete GitHub Actions workflow
Container Best Practices
Multi-stage builds minimize image size:
- •Stage 1: Install production dependencies only
- •Stage 2: Build application with dev dependencies
- •Stage 3: Production runtime with minimal footprint
Security hardening:
- •Non-root user (uid 1001)
- •Read-only filesystem where possible
- •Health checks for orchestrator integration
See
templates/Dockerfileandtemplates/docker-compose.yml
Kubernetes Deployment
Essential manifests:
- •Deployment with rolling update strategy
- •Service for internal routing
- •Ingress for external access with TLS
- •HorizontalPodAutoscaler for scaling
Security context:
- •
runAsNonRoot: true - •
allowPrivilegeEscalation: false - •
readOnlyRootFilesystem: true - •Drop all capabilities
Resource management:
- •Always set requests and limits
- •Use
requestsfor scheduling,limitsfor throttling
See
templates/k8s-manifests.yamlandtemplates/helm-values.yaml
Deployment Strategies
| Strategy | Use Case | Risk |
|---|---|---|
| Rolling | Default, gradual replacement | Low - automatic rollback |
| Blue-Green | Instant switch, easy rollback | Medium - double resources |
| Canary | Progressive traffic shift | Low - gradual exposure |
Rolling Update (Kubernetes default):
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
maxUnavailable: 0 # Zero downtime
Blue-Green: Deploy to standby environment, switch service selector Canary: Use Istio VirtualService for traffic splitting (10% → 50% → 100%)
Infrastructure as Code
Terraform patterns:
- •Remote state in S3 with DynamoDB locking
- •Module-based architecture (VPC, EKS, RDS)
- •Environment-specific tfvars files
See
templates/terraform-aws.tffor AWS VPC + EKS + RDS example
GitOps with ArgoCD
ArgoCD watches Git repository and syncs cluster state:
- •Automated sync with pruning
- •Self-healing (drift detection)
- •Retry policies for transient failures
See
templates/argocd-application.yaml
Secrets Management
Use External Secrets Operator to sync from cloud providers:
- •AWS Secrets Manager
- •HashiCorp Vault
- •Azure Key Vault
- •GCP Secret Manager
See
templates/external-secrets.yaml
Deployment Checklist
Pre-Deployment
- • All tests passing in CI
- • Security scans clean
- • Database migrations ready
- • Rollback plan documented
During Deployment
- • Monitor deployment progress
- • Watch error rates
- • Verify health checks passing
Post-Deployment
- • Verify metrics normal
- • Check logs for errors
- • Update status page
Helm Chart Structure
charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ ├── hpa.yaml
│ └── _helpers.tpl
└── values/
├── staging.yaml
└── production.yaml
CI/CD Pipeline Patterns
Branch Strategy
Recommended: Git Flow with Feature Branches
main (production) ─────●────────●──────▶
┃ ┃
dev (staging) ─────●───●────●───●──────▶
┃ ┃
feature/* ─────────●────────┘
▲
└─ PR required, CI checks, code review
Branch protection rules:
- •
main: Require PR + 2 approvals + all checks pass - •
dev: Require PR + 1 approval + all checks pass - •Feature branches: No direct commits to main/dev
GitHub Actions Caching Strategy
- name: Cache Dependencies
uses: actions/cache@v3
with:
path: |
~/.npm
node_modules
backend/.venv
key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
restore-keys: |
${{ runner.os }}-deps-
Cache hit ratio impact:
- •Without cache: 2-3 min install time
- •With cache: 10-20 sec install time
- •~85% time savings on typical workflows
Artifact Management
# Build and upload artifact
- name: Build Application
run: npm run build
- name: Upload Build Artifact
uses: actions/upload-artifact@v3
with:
name: build-${{ github.sha }}
path: dist/
retention-days: 7
# Download in deployment job
- name: Download Build Artifact
uses: actions/download-artifact@v3
with:
name: build-${{ github.sha }}
path: dist/
Benefits:
- •Avoid rebuilding in deployment job
- •Deploy exact tested artifact (byte-for-byte match)
- •Retention policies prevent storage bloat
Matrix Testing
strategy:
matrix:
node-version: [18, 20, 22]
os: [ubuntu-latest, windows-latest]
jobs:
test:
runs-on: ${{ matrix.os }}
steps:
- uses: actions/setup-node@v3
with:
node-version: ${{ matrix.node-version }}
- run: npm test
Container Optimization Deep Dive
Multi-Stage Build Example
# ============================================================ # Stage 1: Dependencies (builder) # ============================================================ FROM node:20-alpine AS deps WORKDIR /app COPY package*.json ./ RUN npm ci --only=production && npm cache clean --force # ============================================================ # Stage 2: Build (with dev dependencies) # ============================================================ FROM node:20-alpine AS builder WORKDIR /app COPY package*.json ./ RUN npm ci # Include dev dependencies COPY . . RUN npm run build && npm run test # ============================================================ # Stage 3: Production runtime (minimal) # ============================================================ FROM node:20-alpine AS runner WORKDIR /app # Security: Non-root user RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001 # Copy only production dependencies and built artifacts COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist COPY --chown=nodejs:nodejs package*.json ./ USER nodejs EXPOSE 3000 ENV NODE_ENV=production HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1 CMD ["node", "dist/main.js"]
Image size comparison:
- •Single-stage: 850 MB (includes dev dependencies, source files)
- •Multi-stage: 180 MB (only runtime + production deps)
- •78% reduction
Layer Caching Optimization
Order matters for cache efficiency:
# ❌ BAD: Invalidates cache on any code change COPY . . RUN npm install # ✅ GOOD: Cache package.json layer separately COPY package*.json ./ RUN npm ci # Cached unless package.json changes COPY . . # Source changes don't invalidate npm install
Security Scanning with Trivy
- name: Build Docker Image
run: docker build -t myapp:${{ github.sha }} .
- name: Scan for Vulnerabilities
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
- name: Upload Scan Results
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
- name: Fail on Critical Vulnerabilities
run: |
trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}
Kubernetes Production Patterns
Health Probes
Three probe types with distinct purposes:
spec:
containers:
- name: app
# Startup probe (gives slow-starting apps time to boot)
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30 # 30 * 5s = 150s max startup time
# Liveness probe (restarts pod if failing)
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3 # 3 failures = restart
# Readiness probe (removes from service if failing)
readinessProbe:
httpGet:
path: /health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 2 # 2 failures = remove from load balancer
Probe implementation:
@app.get("/health/startup")
async def startup_check():
# Check DB connection established
if not db.is_connected():
raise HTTPException(status_code=503, detail="DB not ready")
return {"status": "ok"}
@app.get("/health/liveness")
async def liveness_check():
# Basic "is process running" check
return {"status": "alive"}
@app.get("/health/readiness")
async def readiness_check():
# Check all dependencies healthy
if not redis.ping() or not db.health_check():
raise HTTPException(status_code=503, detail="Dependencies unhealthy")
return {"status": "ready"}
PodDisruptionBudget
Prevents too many pods from being evicted during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2 # Always keep at least 2 pods running
selector:
matchLabels:
app: myapp
Use cases:
- •Cluster upgrades (node drains)
- •Autoscaler downscaling
- •Manual evictions
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: production
spec:
hard:
requests.cpu: "10" # Total CPU requests
requests.memory: 20Gi # Total memory requests
limits.cpu: "20" # Total CPU limits
limits.memory: 40Gi # Total memory limits
pods: "50" # Max pods
StatefulSets for Databases
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres
replicas: 3
selector:
matchLabels:
app: postgres
template:
# Pod spec here
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
Key differences from Deployment:
- •Stable pod names (
postgres-0,postgres-1,postgres-2) - •Ordered deployment and scaling
- •Persistent storage per pod
Database Migration Strategies
Zero-Downtime Migration Pattern
Problem: Adding a NOT NULL column breaks old application versions
Solution: 3-phase migration
Phase 1: Add nullable column
-- Migration v1 (deploy with old code still running) ALTER TABLE users ADD COLUMN email VARCHAR(255);
Phase 2: Deploy new code + backfill
# New code writes to both old and new schema
def create_user(name: str, email: str):
# Write to new column
db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))
# Backfill existing rows
async def backfill_emails():
users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
for user in users_without_email:
email = generate_email(user.id)
await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))
Phase 3: Add constraint
-- Migration v2 (after backfill complete) ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Backward/Forward Compatibility
Backward compatible changes (safe):
- •✅ Add nullable column
- •✅ Add table
- •✅ Add index
- •✅ Rename column (with view alias)
Backward incompatible changes (requires 3-phase):
- •❌ Remove column
- •❌ Rename column (no alias)
- •❌ Add NOT NULL column
- •❌ Change column type
Rollback Procedures
# Helm rollback to previous revision helm rollback myapp 3 # Kubernetes rollback kubectl rollout undo deployment/myapp # Database migration rollback (Alembic example) alembic downgrade -1
Critical: Test rollback procedures regularly!
Observability & Monitoring
Prometheus Metrics Exposition
from prometheus_client import Counter, Histogram, generate_latest
# Define metrics
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint']
)
@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
duration = time.time() - start_time
# Record metrics
http_requests_total.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
http_request_duration_seconds.labels(
method=request.method,
endpoint=request.url.path
).observe(duration)
return response
@app.get("/metrics")
async def metrics():
return Response(content=generate_latest(), media_type="text/plain")
Grafana Dashboard Queries
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)
Alerting Rules
groups:
- name: app-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High p95 latency detected"
description: "p95 latency is {{ $value }}s"
Real-World Examples
Example 1: E-commerce Platform with Docker Compose
Full-stack e-commerce development environment:
version: '3.8'
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: shopuser
POSTGRES_PASSWORD: dev_password
POSTGRES_DB: shop_dev
ports:
- "5437:5432" # Avoid conflict with host postgres
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U shopuser"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
backend:
build:
context: ./backend
dockerfile: Dockerfile.dev
ports:
- "8000:8000"
environment:
DATABASE_URL: postgresql://shopuser:dev_password@postgres:5432/shop_dev
REDIS_URL: redis://redis:6379
STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
volumes:
- ./backend:/app # Hot reload
- /app/.venv # Persist virtualenv
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.dev
ports:
- "3000:3000"
environment:
VITE_API_URL: http://localhost:8000
VITE_STRIPE_PUBLIC_KEY: ${VITE_STRIPE_PUBLIC_KEY}
volumes:
- ./frontend:/app
- /app/node_modules # Avoid overwriting node_modules
worker:
build:
context: ./backend
dockerfile: Dockerfile.dev
command: celery -A app.worker worker --loglevel=info
environment:
DATABASE_URL: postgresql://shopuser:dev_password@postgres:5432/shop_dev
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
volumes:
- ./backend:/app
volumes:
pgdata:
redisdata:
Key patterns:
- •Port mapping to avoid host conflicts (5437:5432)
- •Health checks before dependent services start
- •Volume mounts for hot reload during development
- •Named volumes for data persistence
Example 2: GitHub Actions CI/CD Pipeline (2025)
Modern FastAPI backend with Ruff and uv:
name: Backend CI/CD
on:
push:
branches: [main, dev]
paths: ['backend/**']
pull_request:
branches: [main, dev]
paths: ['backend/**']
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint-and-test:
runs-on: ubuntu-24.04-latest # Latest LTS runner
defaults:
run:
working-directory: backend
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: test_db
ports: ["5432:5432"]
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
redis:
image: redis:7-alpine
ports: ["6379:6379"]
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
steps:
- uses: actions/checkout@v4
- name: Setup Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install uv
uses: astral-sh/setup-uv@v4
with:
version: "latest"
- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/uv
key: ${{ runner.os }}-uv-${{ hashFiles('backend/uv.lock') }}
- name: Install dependencies
run: uv sync --frozen --no-dev
- name: Lint - Ruff format check
run: uv run ruff format --check app/
- name: Lint - Ruff code check
run: uv run ruff check app/
- name: Type check with mypy
run: uv run mypy app/ --strict --ignore-missing-imports
- name: Run tests with coverage
env:
DATABASE_URL: postgresql://postgres:test@localhost:5432/test_db
REDIS_URL: redis://localhost:6379
run: uv run pytest tests/ --cov=app --cov-report=xml --cov-report=term -v
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
file: ./backend/coverage.xml
flags: backend
token: ${{ secrets.CODECOV_TOKEN }}
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
scan-ref: 'backend/'
severity: 'CRITICAL,HIGH'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: 'trivy-results.sarif'
docker-build:
runs-on: ubuntu-latest
needs: [lint-and-test, security-scan]
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: ./backend
push: true
tags: |
ghcr.io/${{ github.repository }}/backend:latest
ghcr.io/${{ github.repository }}/backend:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
Key features:
- •Latest GitHub Actions runners (ubuntu-24.04)
- •uv for fast Python package management
- •Service containers for integration tests
- •Concurrency control to cancel outdated runs
- •Docker Buildx with GitHub Actions cache
- •SARIF upload for security findings
Example 3: Database Migration with Alembic (2025)
E-commerce order tracking migration:
# backend/alembic/versions/2025_12_29_add_order_tracking.py
"""Add order tracking with shipment provider
Revision ID: a1b2c3d4e5f6
Revises: prev_revision_id
Create Date: 2025-12-29 10:30:00
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# Revision identifiers
revision = 'a1b2c3d4e5f6'
down_revision = 'prev_revision_id'
branch_labels = None
depends_on = None
def upgrade():
# Add tracking column as nullable first (backward compatible)
op.add_column('orders',
sa.Column('tracking_number', sa.String(100), nullable=True)
)
op.add_column('orders',
sa.Column('shipping_provider', sa.String(50), nullable=True)
)
op.add_column('orders',
sa.Column('shipped_at', sa.DateTime(timezone=True), nullable=True)
)
# Add composite index for tracking lookups
op.create_index(
'idx_orders_tracking',
'orders',
['tracking_number', 'shipping_provider'],
unique=False
)
# Add check constraint for valid providers
op.create_check_constraint(
'ck_orders_shipping_provider',
'orders',
"shipping_provider IN ('USPS', 'UPS', 'FedEx', 'DHL')"
)
def downgrade():
op.drop_constraint('ck_orders_shipping_provider', 'orders', type_='check')
op.drop_index('idx_orders_tracking', 'orders')
op.drop_column('orders', 'shipped_at')
op.drop_column('orders', 'shipping_provider')
op.drop_column('orders', 'tracking_number')
Migration workflow with uv (2025):
# Create new migration (auto-generate from models) uv run alembic revision --autogenerate -m "Add order tracking with shipment provider" # ALWAYS review the generated migration! cat alembic/versions/*_add_order_tracking.py # Test migration in development uv run alembic upgrade head # Verify schema changes uv run alembic current psql $DATABASE_URL -c "\d orders" # Rollback if issues found uv run alembic downgrade -1 # Check migration history uv run alembic history --verbose
Production migration checklist:
# 1. Backup database pg_dump $DATABASE_URL > backup_$(date +%Y%m%d_%H%M%S).sql # 2. Test migration on staging with production data snapshot uv run alembic upgrade head # 3. Monitor query performance EXPLAIN ANALYZE SELECT * FROM orders WHERE tracking_number = 'TRACK123'; # 4. Deploy with zero downtime (nullable columns) # Phase 1: Add nullable columns # Phase 2: Backfill data # Phase 3: Add NOT NULL constraint (if needed)
Extended Thinking Triggers
Use Opus 4.5 extended thinking for:
- •Architecture decisions - Kubernetes vs serverless, multi-region setup
- •Migration planning - Moving between cloud providers
- •Incident response - Complex deployment failures
- •Security design - Zero-trust architecture
Templates Reference
| Template | Purpose |
|---|---|
github-actions-pipeline.yml | Full CI/CD workflow with 6 stages |
Dockerfile | Multi-stage Node.js build |
docker-compose.yml | Development environment |
k8s-manifests.yaml | Deployment, Service, Ingress |
helm-values.yaml | Helm chart values |
terraform-aws.tf | VPC, EKS, RDS infrastructure |
argocd-application.yaml | GitOps application |
external-secrets.yaml | Secrets Manager integration |