Operations
Production readiness evaluation focused on resilience, observability, and incident response.
Resilience
Failure Modes
- What can fail? List all external dependencies
- Blast radius: If X fails, what else breaks?
- Graceful degradation: Partial failure ≠ total failure?
Patterns
| Pattern | Purpose | Check |
|---|---|---|
| Timeouts | Prevent hung connections | Every external call has one? |
| Circuit Breaker | Stop cascading failures | On critical paths? |
| Bulkhead | Isolate failures | Separate thread pools? |
| Retry | Handle transient failures | With backoff? Bounded? |
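
The Retry and Circuit Breaker rows can be sketched together. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `retry`, and all default thresholds, are assumptions made for this example. The wrapped call is assumed to carry its own timeout.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then lets one trial
    call through once `reset_after` seconds have passed (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Half-open: fall through and let one trial call proceed.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn up to `attempts` times with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not retry against an open breaker
        except Exception:
            if attempt == attempts - 1:
                raise  # bounded: give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A typical composition would be `retry(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical function that performs the external call. Retries stay bounded, and an open breaker stops them from piling more load onto a failing dependency.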
Observability
The RED Method
| Metric | What | Why |
|---|---|---|
| Rate | Requests per second | Traffic understanding |
| Errors | Failed requests | Problem detection |
| Duration | Latency distribution | Performance tracking |
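
As an illustration of the three RED signals, the sketch below records them in process using only the standard library. The `RedMetrics` class and its field names are hypothetical. A real service would more likely export counters and histograms to its metrics backend than keep raw durations in memory.

```python
import math
import time
from contextlib import contextmanager


class RedMetrics:
    """Tracks Rate, Errors, and Duration for one endpoint (not thread-safe)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.started = clock()
        self.requests = 0
        self.errors = 0
        self.durations = []

    @contextmanager
    def track(self):
        start = self.clock()
        try:
            yield
        except Exception:
            self.errors += 1  # Errors: count failed requests
            raise
        finally:
            self.requests += 1  # Rate: every request counts, failed or not
            self.durations.append(self.clock() - start)  # Duration

    def snapshot(self):
        elapsed = max(self.clock() - self.started, 1e-9)
        ordered = sorted(self.durations)

        def percentile(p):
            if not ordered:
                return None
            rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank
            return ordered[rank - 1]

        return {
            "rate_per_s": self.requests / elapsed,
            "error_ratio": self.errors / self.requests if self.requests else 0.0,
            "p50_s": percentile(50),
            "p99_s": percentile(99),
        }
```

Wrapping a handler body in `with metrics.track():` records all three signals at once. Reporting percentiles rather than a mean keeps the latency distribution visible, which is the point of the Duration row.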
Logging
- Structured (JSON, not free text)
- Correlation IDs across services
- Appropriate levels (not everything is ERROR)
- PII redaction
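
The logging points above can be sketched with the standard library alone. The `JsonFormatter`, `build_logger`, and `correlation_id` names are assumptions for illustration, and the email regex is a deliberately simple stand-in for real PII detection.

```python
import contextvars
import json
import logging
import re

# Per-request correlation ID; request-handling code would set this on entry.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Simplistic email pattern, used here only to demonstrate redaction.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per log record, with emails redacted."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": EMAIL_RE.sub("[REDACTED]", record.getMessage()),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name="app", stream=None):
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

Calling `correlation_id.set(...)` at the start of each request and forwarding the same value to downstream services lets log lines from different services be joined on one ID.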
Tracing
- Distributed tracing enabled?
- Spans for all external calls?
- Context propagation working?
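
To make "context propagation" concrete, the sketch below hand-rolls the W3C Trace Context `traceparent` header (version `00`, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2-hex-digit flags). The helper names are hypothetical, and the code assumes header names are already lowercased. In practice a tracing SDK would normally handle this.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming or "")
    # All-zero trace or span IDs are invalid per the spec.
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()  # missing or invalid header: start a new trace
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Build the headers to attach to a downstream request."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}
```

A quick check that propagation works: the trace ID seen by the downstream service should equal the one the upstream service received, while the span ID differs on every hop.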
Capacity
- Scaling: Horizontal preferred, auto-scaling configured?
- Limits: Memory, CPU, connections all bounded?
- Backpressure: What happens at 2x load? 10x?
- Rate Limiting: Per-tenant/client quotas?
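
One common way to implement per-tenant quotas is a token bucket per tenant. The sketch below is an in-memory, single-process illustration. The `TokenBucket` and `TenantLimiter` names and the rates are assumptions, and a multi-instance service would need shared state instead.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity` (the burst size)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Add tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over quota: caller should reject, e.g. with HTTP 429


class TenantLimiter:
    """Keeps one bucket per tenant so one tenant cannot exhaust another's quota."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, tenant_id):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return bucket.allow()
```

Rejecting early when `allow` returns `False` is also a simple form of backpressure: at 2x or 10x load the excess is shed at the edge instead of queuing without bound.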
Security Posture
- Secrets: In vault, not env vars or code
- Network: Least privilege, mTLS where possible
- Dependencies: Vulnerability scanning in CI
- Access: Audit logging for sensitive operations
Incident Readiness
- Runbooks: Documented recovery procedures
- On-call: Rotation defined, escalation clear
- Rollback: One-click, tested regularly
- Communication: Status page, stakeholder notification
Checklist
- □ All external calls have timeouts
- □ Circuit breakers on critical paths
- □ Structured logging with correlation IDs
- □ RED metrics exposed
- □ Alerts are actionable (not noisy)
- □ Auto-scaling configured with limits
- □ Graceful shutdown implemented
- □ Health checks (liveness + readiness)
- □ Secrets in vault
- □ Runbook exists
- □ Rollback tested
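
Two checklist items, graceful shutdown and liveness/readiness health checks, interact: on shutdown, readiness should fail first while liveness keeps passing, so traffic drains before the process exits. A minimal sketch of that state machine follows. The `Lifecycle` class is hypothetical, and wiring it to real HTTP endpoints is left out.

```python
import signal
import threading


class Lifecycle:
    """Holds the readiness and shutdown state that health endpoints report."""

    def __init__(self):
        self._ready = threading.Event()
        self._stopping = threading.Event()

    def mark_ready(self):
        # Call once startup work (connections, caches, migrations) is done.
        self._ready.set()

    def begin_shutdown(self, signum=None, frame=None):
        # Fail readiness first so the load balancer stops sending new traffic.
        self._ready.clear()
        self._stopping.set()

    def liveness(self):
        return 200  # the process is up and able to answer

    def readiness(self):
        return 200 if self._ready.is_set() else 503

    @property
    def stopping(self):
        return self._stopping.is_set()


def install_signal_handlers(lifecycle):
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, lifecycle.begin_shutdown)
    signal.signal(signal.SIGINT, lifecycle.begin_shutdown)
```

A worker loop would check `lifecycle.stopping`, finish in-flight requests, and then exit. Keeping liveness at 200 during the drain avoids the orchestrator killing the process before it has finished.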
Guild Members for Operations
- Primary: Taleb (resilience), Erlang (capacity), Vector (security)
- Secondary: Lamport (distributed failure), Ixian (metrics/validation)
Additional Resources
- references/chaos-patterns.md: Chaos engineering patterns and failure injection