Cloud & Deployment Skill
Purpose
Guide safe, scalable, and reproducible deployments. This skill ensures that applications can be deployed consistently across environments, secrets are protected, and failures can be recovered from quickly.
Core Principles
What Cloud & Deployment Means
- •Reproducible - Same artifact deploys identically across environments
- •Secure - Secrets never in code, credentials properly managed
- •Observable - Deployment status visible and metrics available
- •Recoverable - Rollback capability for every deployment
- •Scalable - Can handle increased load with configuration
What Cloud & Deployment Is NOT
- •Vendor-specific - Avoid lock-in to single cloud providers
- •Secret-laden - No credentials in code or configuration files
- •Shortcut-prone - Production deployments follow same process as staging
- •Fragile - Deployments must be reliable and repeatable
Environment Configuration
Environment Types
| Environment | Purpose | Data | Access |
|---|---|---|---|
| Local | Development | Mocked/fake | Developer only |
| Development | Integration testing | Synthetic | Development team |
| Staging | Pre-production validation | Production-like | QA, Developers |
| Production | Live users | Real, anonymized | Operations only |
Configuration File Structure
yaml
# config/[environment].yaml
# Application Configuration
app:
name: todo-app
version: 1.0.0
environment: development # overridden by env var
# Service Endpoints (NOT credentials)
services:
api:
base_url: https://api.example.com
timeout: 30000
retry_attempts: 3
database:
host: localhost
port: 5432
name: todos_dev
# Feature Flags
features:
dark_mode: true
new_dashboard: false
# Rate Limiting
limits:
max_todos_per_user: 1000
api_rate_limit: 100
Environment-Specific Overrides
yaml
# config/development.yaml
app:
log_level: debug
environment: development
services:
api:
base_url: https://api-dev.example.com
# config/staging.yaml
app:
log_level: info
environment: staging
services:
api:
base_url: https://api-staging.example.com
# config/production.yaml
app:
log_level: warn
environment: production
services:
api:
base_url: https://api.example.com
Environment Variable Rules
| Category | Variable | Source | Required |
|---|---|---|---|
| App Config | APP_ENV | System | Yes |
| App Config | APP_LOG_LEVEL | System | No (default: info) |
| Database | DATABASE_URL | Secrets | Yes |
| Secrets | API_KEY | Vault | Yes |
| Services | EXTERNAL_SERVICE_URL | Config | Yes |
Twelve-Factor App Configuration
typescript
// All configuration in environment variables
const config = {
// Must be set by environment
databaseUrl: process.env.DATABASE_URL,
apiKey: process.env.API_KEY,
serviceEndpoint: process.env.SERVICE_ENDPOINT,
// Can have defaults, overridden by env
port: parseInt(process.env.PORT || '3000'),
logLevel: process.env.LOG_LEVEL || 'info',
};
Secrets and Credentials Handling
Secrets Definition
Secrets are NOT:
- •Configuration files
- •Hardcoded strings
- •Environment templates (.env.example)
- •Documentation
Secrets ARE:
- •API keys and tokens
- •Database passwords
- •Encryption keys
- •Service account credentials
- •OAuth client secrets
Secrets Storage Hierarchy
code
Most Secure → Least Secure 1. Hardware Security Module (HSM) 2. Cloud Key Management (AWS KMS, GCP Cloud KMS, Azure Key Vault) 3. Secret Management Service (HashiCorp Vault, AWS Secrets Manager) 4. Environment Variables (injected at runtime) 5. Configuration Files (encrypted at rest)
Secrets Handling Rules
| Rule | Enforcement |
|---|---|
| Never commit secrets | Pre-commit hooks scanning |
| Never log secrets | Log filtering rules |
| Never hardcode | Code review required |
| Rotate regularly | Automated rotation policies |
| Least privilege | IAM policies with minimum permissions |
| Audit access | Logging of secret retrieval |
Secrets Injection Pattern
yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: todo-app
spec:
template:
spec:
containers:
- name: app
image: todo-app:latest
envFrom:
- secretRef:
name: todo-app-secrets
- configMapRef:
name: todo-app-config
Environment-Specific Secrets
bash
# .env (NEVER committed) DATABASE_URL=postgresql://user:password@host:5432/db API_KEY=sk-xxxxxxxxxxxxxxx JWT_SECRET=your-jwt-secret-key REDIS_PASSWORD=redis-password
bash
# .env.example (committed, NO secrets) DATABASE_URL=postgresql://user:PASSWORD@host:5432/db API_KEY=YOUR_API_KEY_HERE JWT_SECRET=YOUR_JWT_SECRET_HERE REDIS_PASSWORD=YOUR_REDIS_PASSWORD_HERE
Secrets Rotation Strategy
| Secret Type | Rotation Frequency | Automation |
|---|---|---|
| Database passwords | 90 days | Automated |
| API keys | 180 days | Semi-automated |
| JWT secrets | 30 days | Automated |
| Service accounts | 90 days | Manual approval |
| TLS certificates | 90 days | Automated (Let's Encrypt) |
CI/CD Expectations
Pipeline Stages
mermaid
flowchart TD
A[Commit Code] --> B[Build]
B --> C[Unit Tests]
C --> D[Integration Tests]
D --> E[Security Scan]
E --> F[Build Container]
F --> G[Push to Registry]
G --> H[Deploy to Staging]
H --> E2E[End-to-End Tests]
E2E --> I[Deploy to Production]
I --> J[Health Check]
J --> K[Notify]
Pipeline Definition
yaml
# .github/workflows/deploy.yml
name: CI/CD Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
build:
runs-on: ubuntu-latest
outputs:
image: ${{ steps.build.outputs.image }}
steps:
- uses: actions/checkout@v4
- name: Build and test
run: |
npm ci
npm run build
npm test
- name: Security scan
run: |
npm audit
trivy image ${{ steps.build.outputs.image }}
- name: Build and push image
id: build
run: |
docker build -t $IMAGE_NAME:${{ github.sha }} .
docker push $IMAGE_NAME:${{ github.sha }}
deploy-staging:
needs: build
runs-on: ubuntu-latest
environment: staging
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/todo-app \
todo-app=${{ needs.build.outputs.image }} \
-n staging
kubectl rollout status deployment/todo-app -n staging
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy to production
run: |
kubectl set image deployment/todo-app \
todo-app=${{ needs.build.outputs.image }} \
-n production
kubectl rollout status deployment/todo-app -n production
Quality Gates
| Stage | Gate | Action on Fail |
|---|---|---|
| Build | Compilation | Fail pipeline |
| Unit Tests | > 80% coverage | Fail pipeline |
| Security Scan | No critical CVEs | Fail pipeline |
| Integration | All tests pass | Fail pipeline |
| Staging Deploy | Health check pass | Rollback |
| Production Deploy | Health check pass | Rollback |
Artifact Management
code
Artifacts/
├── images/
│ └── todo-app/
│ ├── sha-abc123 (latest)
│ ├── sha-def456
│ └── sha-ghi789
├── manifests/
│ └── releases/
│ ├── v1.0.0.yaml
│ └── v1.1.0.yaml
└── logs/
└── deployment/
├── 2024-01-15T10-00-00Z.log
└── 2024-01-15T11-00-00Z.log
Rollback and Failure Recovery
Rollback Strategy
mermaid
flowchart TD
A[Deployment Fails] --> B{Health Check Failed?}
B -->|Yes| C[Automatic Rollback]
B -->|No| D{Performance Degraded?}
D -->|Yes| E[Manual Rollback Decision]
D -->|No| F[Continue Deployment]
C --> G[Restore Previous Version]
E --> G
G --> H[Notify Team]
H --> I[Post-Mortem]
Rollback Methods
| Method | Speed | Use Case |
|---|---|---|
| Image rollback | Fast (seconds) | Container deployment |
| Config rollback | Fast (seconds) | Configuration change |
| Database rollback | Slow (minutes) | Schema migration |
| Full restore | Slow (minutes) | Major failure |
Rollback Commands
bash
# Kubernetes rollback kubectl rollout undo deployment/todo-app -n production kubectl rollout status deployment/todo-app -n production # Previous known good version kubectl rollout undo deployment/todo-app --to-revision=5 -n production # Check rollout history kubectl rollout history deployment/todo-app -n production
Recovery Time Objectives
| Scenario | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|---|---|---|
| Complete outage | 15 minutes | 5 minutes |
| Performance degradation | 10 minutes | 5 minutes |
| Data corruption | 30 minutes | 0 (no data loss) |
| Security incident | 15 minutes | Depends on scope |
Failure Detection
yaml
# health-checks.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: health-checks
data:
liveness-probe: |
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readiness-probe: |
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startup-probe: |
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30 # 5 minutes max startup
Post-Mortem Template
markdown
## Incident Post-Mortem ### Summary | Field | Value | |-------|-------| | Incident ID | INC-2024-001 | | Severity | High | | Duration | 23 minutes | | Impact | 15% of users affected | ### Timeline | Time | Event | |------|-------| | 10:00 | Deployment started | | 10:02 | Health checks failed | | 10:05 | Automatic rollback initiated | | 10:23 | Service fully restored | ### Root Cause [Detailed explanation of what went wrong] ### Impact Analysis - Users affected: [Number] - Revenue impact: [Amount] - Reputation impact: [Assessment] ### Resolution [What was done to fix] ### Action Items | Action | Owner | Due Date | |--------|-------|----------| | Add additional health check | @developer | 2024-01-20 | | Update runbook | @sre | 2024-01-18 | | Improve test coverage | @qa | 2024-01-25 | ### Lessons Learned - [What went well] - [What went poorly] - [Process improvements]
Cloud-Native Best Practices
Container Best Practices
dockerfile
# Use specific version, not 'latest' FROM node:20-alpine # Set working directory WORKDIR /app # Copy only necessary files COPY package*.json ./ RUN npm ci --only=production # Copy source last (cache optimization) COPY . . # Non-root user for security RUN addgroup -g 1001 -S nodejs RUN adduser -S nextjs -u 1001 USER nextjs # Expose non-privileged port EXPOSE 8080 # Health check HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 \ CMD wget --quiet --tries=1 --spider http://localhost:8080/health || exit 1 CMD ["node", "dist/index.js"]
Kubernetes Best Practices
yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: todo-app
labels:
app: todo-app
version: v1
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: todo-app
template:
metadata:
labels:
app: todo-app
version: v1
spec:
serviceAccountName: todo-app
securityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
containers:
- name: todo-app
image: todo-app:v1
ports:
- containerPort: 8080
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
envFrom:
- secretRef:
name: todo-secrets
- configMapRef:
name: todo-config
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
Resource Management
| Resource | Request | Limit | Justification |
|---|---|---|---|
| CPU | 100m | 500m | Average load requires 100m, burst to 500m |
| Memory | 128Mi | 256Mi | Application needs 128Mi base, 256Mi peak |
| Storage | 1Gi | N/A | Logs and temp data |
Autoscaling
yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: todo-app
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: todo-app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Multi-Cloud Considerations
Avoiding Vendor Lock-In
| Component | Portable Approach | Avoid |
|---|---|---|
| Compute | Containers, Kubernetes | AWS Lambda specific, Cloud Functions |
| Storage | S3-compatible, PostgreSQL | DynamoDB-specific, Cloud SQL specific |
| Message Queue | AMQP, MQTT | SQS-specific, Pub/Sub-specific |
| Secrets | HashiCorp Vault | AWS Secrets Manager, GCP Secret Manager |
Abstraction Layer
typescript
// Use abstraction, not direct SDK calls
interface StorageClient {
upload(key: string, body: Buffer): Promise<void>;
download(key: string): Promise<Buffer>;
delete(key: string): Promise<void>;
}
// Implement for each provider
class S3StorageClient implements StorageClient { }
class GCSStorageClient implements StorageClient { }
class AzureStorageClient implements StorageClient { }
// Use based on environment
const storageClient = createStorageClient();
Infrastructure as Code
hcl
# main.tf (portable)
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
}
}
# Use variables for provider-specific values
variable "cloud_provider" {
default = "aws"
}
variable "region" {
default = "us-east-1"
}
Deployment Checklist
Pre-Deployment
- • Code reviewed and approved
- • Tests passing in CI
- • Security scan passed (no critical CVEs)
- • Documentation updated
- • Configuration reviewed for environment
- • Secrets rotated if needed
- • Backup completed (if database change)
- • Rollback plan verified
- • Stakeholders notified
During Deployment
- • Deployment started with tracking
- • Health checks passing
- • Logs monitoring for errors
- • Metrics dashboards visible
- • Rollback ready if needed
Post-Deployment
- • Health checks passing
- • Smoke tests passed
- • No increase in error rates
- • Performance within SLA
- • Documentation updated with version
- • Stakeholders notified of success
- • Monitoring configured for new version
Anti-Patterns
Anti-Pattern 1: Hardcoded Secrets
Forbidden:
typescript
// BAD: Secrets in code const apiKey = 'sk-1234567890abcdef'; const dbPassword = 'production_password';
Required:
typescript
// GOOD: Secrets from environment const apiKey = process.env.API_KEY; const dbPassword = process.env.DB_PASSWORD;
Anti-Pattern 2: Manual Configuration
Forbidden:
bash
# BAD: Manual kubectl editing kubectl edit deployment todo-app # ... make changes in editor
Required:
bash
# GOOD: Declarative configuration kubectl apply -f deployment.yaml # OR with Helm helm upgrade todo-app ./chart --values values-prod.yaml
Anti-Pattern 3: Same Process for All Environments
Forbidden:
bash
# BAD: Different commands for different environments npm run dev # local npm run stage # staging npm run prod # production
Required:
bash
# GOOD: Same process, different configuration docker build -t todo-app:$VERSION . docker tag todo-app:$VERSION $REGISTRY/todo-app:$VERSION docker push $REGISTRY/todo-app:$VERSION # Same deploy command for all environments
Anti-Pattern 4: No Rollback Plan
Forbidden:
yaml
# BAD: No previous image reference
spec:
template:
spec:
containers:
- name: app
image: todo-app:latest # No version!
Required:
yaml
# GOOD: Versioned, with rollback capability
spec:
template:
spec:
containers:
- name: app
image: todo-app:v1.2.3
# Previous versions retained in registry
Anti-Pattern 5: Prod Shortcuts
Forbidden:
bash
# BAD: Direct production access for debugging kubectl exec -it pod-xyz -n production -- /bin/bash # Making changes directly in prod
Required:
bash
# GOOD: Same process as development # Use observability tools (logs, metrics) for debugging kubectl logs pod-xyz -n production -f # Make changes through CI/CD pipeline
Security Checklist
Deployment Security
| Check | Requirement | Verification |
|---|---|---|
| Secrets | Never in code | Automated scan |
| TLS | All traffic encrypted | Security scan |
| RBAC | Least privilege | Policy review |
| Network | Restricted access | Firewall rules |
| Images | Scanned for CVEs | CI pipeline |
| Containers | Non-root user | Security scan |
Compliance Considerations
| Standard | Key Requirements | Frequency |
|---|---|---|
| SOC 2 | Access logging, encryption | Annual audit |
| GDPR | Data privacy, deletion | Continuous |
| HIPAA | PHI protection | Annual audit |
| PCI DSS | Payment data security | Quarterly scan |
Summary
Cloud and deployment best practices require:
- •Environment isolation - Separate configs for each environment
- •Secret protection - Never in code, properly injected
- •CI/CD discipline - Automated pipelines with quality gates
- •Rollback capability - Every deployment can be rolled back
- •Observability - Monitoring, logging, and alerting
- •No vendor lock-in - Portable, abstracted infrastructure
- •No shortcuts - Same process for all environments