Chaos Engineering
Principles
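- Build a Hypothesis: State the expected behavior before injecting any failure
- Minimize Blast Radius: Start with the smallest scope that can test the hypothesis
- Run in Production: Real conditions matter, once the experiment has proven safe in non-production
- Automate: Make experiments repeatable
- Minimize Impact: Define abort conditions before you start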
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
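The Observe and Analyze steps above can be sketched as a small comparison against the baseline. This is a minimal illustration, not a prescribed method: the function names `within_tolerance` and `check_hypothesis`, the metric name, and the idea of expressing the hypothesis as a maximum percentage deviation are all assumptions made for the example.

```bash
# Hypothetical helper: is an observed metric still within a tolerance band
# around the steady-state baseline?
# Usage: within_tolerance BASELINE OBSERVED MAX_DEVIATION_PERCENT
# Returns 0 if within the band, 1 if outside, 2 if the baseline is zero.
within_tolerance() {
  local baseline=$1 observed=$2 max_pct=$3
  # awk is used because shell arithmetic cannot handle fractional values.
  awk -v b="$baseline" -v o="$observed" -v p="$max_pct" 'BEGIN {
    if (b == 0) exit 2              # a percentage of zero is meaningless
    dev = (o - b) * 100 / b         # deviation from baseline in percent
    if (dev < 0) dev = -dev
    exit ((dev <= p) ? 0 : 1)
  }'
}

# Hypothetical reporter: prints PASS or FAIL and returns non-zero on FAIL.
# Usage: check_hypothesis METRIC_NAME BASELINE OBSERVED MAX_DEVIATION_PERCENT
check_hypothesis() {
  local name=$1 baseline=$2 observed=$3 max_pct=$4
  if within_tolerance "$baseline" "$observed" "$max_pct"; then
    echo "PASS: $name stayed within ${max_pct}% of baseline ($baseline -> $observed)"
  else
    echo "FAIL: $name left the ${max_pct}% band around baseline ($baseline -> $observed)"
    return 1
  fi
}
```

For example, `check_hypothesis p99_latency_ms 200 210 10` would report a pass, while an observed value of 260 would report a failure. How the baseline and observed values are collected (dashboards, a metrics API, load-test output) is left open here.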
Common Experiments
Network Failures
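```bash
# Add 100 ms of latency to outgoing traffic on eth0 (requires root)
tc qdisc add dev eth0 root netem delay 100ms

# Switch to 10% packet loss ("replace", because a second "add" fails while a root qdisc exists)
tc qdisc replace dev eth0 root netem loss 10%

# Remove the netem qdisc and restore normal networking
tc qdisc del dev eth0 root
```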
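A fault that is injected by hand is easy to forget to remove. One possible safeguard, sketched below, is a wrapper that applies a fault, holds it for a fixed time, and always runs the rollback command, including when the script is interrupted. The function name `with_fault` and its argument order are assumptions for this example, not part of any tool.

```bash
# Hypothetical wrapper: inject a fault, hold it, and always roll it back.
# Usage: with_fault "INJECT COMMAND" "ROLLBACK COMMAND" HOLD_SECONDS
with_fault() {
  local inject=$1 rollback=$2 hold=$3
  # A subshell scopes the traps to this one experiment.
  (
    # The rollback runs whenever the subshell exits, for any reason.
    trap "$rollback" EXIT
    # Turn Ctrl-C or a kill into a normal exit so the EXIT trap fires.
    trap 'exit 130' INT TERM
    eval "$inject" || exit 1
    sleep "$hold"
  )
}

# Example (requires root; interface name eth0 as in the commands above):
#   with_fault "tc qdisc add dev eth0 root netem delay 100ms" \
#              "tc qdisc del dev eth0 root" 60
```

The rollback also runs when the injection itself fails. With `tc qdisc del` that only produces a harmless error, but a rollback command with side effects may need its own guard.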
Resource Exhaustion
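```bash
# CPU stress: 4 workers for 60 seconds
stress --cpu 4 --timeout 60s

# Memory stress: 2 workers allocating 1 GB each for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60s

# Disk fill: write a 1 GiB file of zeros
dd if=/dev/zero of=/tmp/fill bs=1M count=1024

# Clean up the fill file when the experiment ends
rm -f /tmp/fill
```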
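Disk-fill experiments are a case where the blast radius is worth bounding explicitly. The sketch below refuses to run if the fill would leave less than a reserve of free space, and deletes the fill file when it finishes. The function name `fill_disk`, the file name `chaos-fill.<pid>`, and the default reserve of 1024 MB are assumptions chosen for illustration. It assumes GNU `dd` (for `bs=1M`, as in the command above) and a POSIX `df`.

```bash
# Hypothetical helper: fill a directory with a bounded amount of data,
# hold it, then clean up.
# Usage: fill_disk DIR SIZE_MB HOLD_SECONDS [RESERVE_MB]
fill_disk() {
  local dir=$1 size_mb=$2 hold=$3 reserve_mb=${4:-1024}
  local file="$dir/chaos-fill.$$" free_mb
  # df -Pk prints one POSIX-format line per filesystem; column 4 is free KB.
  free_mb=$(df -Pk "$dir" | awk 'NR==2 {print int($4 / 1024)}')
  if [ $((free_mb - size_mb)) -lt "$reserve_mb" ]; then
    echo "refusing: ${free_mb} MB free, fill of ${size_mb} MB would break the ${reserve_mb} MB reserve" >&2
    return 1
  fi
  (
    # Remove the fill file however the subshell ends.
    trap "rm -f '$file'" EXIT
    trap 'exit 130' INT TERM
    dd if=/dev/zero of="$file" bs=1M count="$size_mb" 2>/dev/null || exit 1
    sleep "$hold"
  )
}

# Example: write 1 GiB to /tmp, hold it for 60 seconds, keep 2 GB in reserve:
#   fill_disk /tmp 1024 60 2048
```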
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
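The failures listed above are often applied to a randomly chosen target, in the spirit of Chaos Monkey. A minimal sketch of that selection step follows. The function name `chaos_random` and the container names in the usage comments are hypothetical, and `shuf` is assumed to be available (it ships with GNU coreutils).

```bash
# Hypothetical helper: pick one target at random and apply an action to it.
# Usage: chaos_random "ACTION COMMAND" TARGET...
chaos_random() {
  local action=$1 victim
  shift
  [ "$#" -gt 0 ] || { echo "no targets given" >&2; return 1; }
  # shuf -n 1 prints one randomly chosen input line.
  victim=$(printf '%s\n' "$@" | shuf -n 1)
  echo "selected target: $victim" >&2
  # The action is deliberately left unquoted so "docker restart" splits
  # into a command and its argument.
  $action "$victim"
}

# Examples (target names are placeholders):
#   chaos_random "docker restart" web-1 web-2 web-3   # restart one container
#   chaos_random "kill -KILL" 4242 4243               # kill one process by PID
```

Passing `echo` as the action gives a dry run that shows which target would have been hit.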
Chaos Tools
- Chaos Monkey: Random instance termination (Netflix)
- Gremlin: Commercial chaos engineering platform
- Litmus: Kubernetes chaos engineering
- Chaos Mesh: Chaos engineering platform for Kubernetes
Experiment Template
```markdown
## Experiment: [Name]

### Hypothesis
If [condition], then [expected behavior].

### Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]

### Method
1. [Step 1]
2. [Step 2]
3. [Step 3]

### Abort Conditions
- If [condition], stop immediately

### Results
[What happened]

### Findings
[What we learned]
```
Safety Rules
- Start in non-production
- Have a rollback ready before injecting any failure
- Monitor continuously during the experiment
- Communicate with the team before and after each run
- Document everything