Chaos Engineering Skill

Design and execute controlled failure experiments to validate system resilience.

Trigger Conditions

•Resilience validation is needed before a release
•Fault tolerance needs verification after a deploy
•User invokes with "chaos experiment" or "resilience test"

Input Contract

•Required: System under test
•Required: Steady-state hypothesis (measurable)
•Optional: Blast radius constraints, failure types to inject
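
The input contract above could be captured as a small data structure. This is a minimal sketch, not a prescribed schema: the class names, the field names, and the "<" / ">" comparator encoding are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SteadyStateHypothesis:
    """A measurable claim about normal behavior (hypothetical shape)."""
    metric: str        # e.g. "api_latency_p99_ms" (illustrative name)
    comparator: str    # assumed encoding: "<" or ">"
    threshold: float

    def holds(self, observed: float) -> bool:
        # Compare an observed value against the stated threshold.
        if self.comparator == "<":
            return observed < self.threshold
        if self.comparator == ">":
            return observed > self.threshold
        raise ValueError(f"unsupported comparator: {self.comparator!r}")


@dataclass
class ExperimentInput:
    """Required and optional inputs from the contract."""
    system_under_test: str                      # required
    hypothesis: SteadyStateHypothesis           # required
    blast_radius: Optional[str] = None          # optional
    failure_types: List[str] = field(default_factory=list)  # optional

    def validate(self) -> None:
        # Reject inputs that miss a required field or are not measurable.
        if not self.system_under_test.strip():
            raise ValueError("system under test is required")
        if not self.hypothesis.metric.strip():
            raise ValueError("hypothesis must name a measurable metric")
        if self.hypothesis.comparator not in ("<", ">"):
            raise ValueError("hypothesis comparator must be '<' or '>'")
```

Validating before any failure is injected keeps an unmeasurable hypothesis from reaching the execution phase.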

Output Contract

•Experiment definition with hypothesis
•Results report with a pass/fail verdict against the hypothesis
•Findings and remediation recommendations
•Updated resilience scorecard
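
One possible shape for these outputs is sketched below. The class names, the severity scale, and the pass/fail counters in the scorecard are assumptions; the contract itself does not specify a format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    description: str
    severity: str      # assumed scale: "low", "medium", "high"
    remediation: str


@dataclass
class ExperimentReport:
    hypothesis: str
    passed: bool
    findings: List[Finding] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        # Pass/fail label for the results report.
        return "PASS" if self.passed else "FAIL"


def update_scorecard(
    scorecard: Dict[str, Dict[str, int]],
    system: str,
    report: ExperimentReport,
) -> Dict[str, Dict[str, int]]:
    """Return a new scorecard with this report counted for `system`."""
    # Copy so the caller's scorecard is left untouched.
    updated = {name: dict(counts) for name, counts in scorecard.items()}
    counts = updated.setdefault(system, {"passed": 0, "failed": 0})
    counts["passed" if report.passed else "failed"] += 1
    return updated
```

Returning a fresh scorecard rather than mutating in place makes it easier to diff the scorecard before and after an experiment.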

Tool Permissions

•Read: Service configs, circuit breaker configs, monitoring dashboards
•Write: Experiment logs, findings reports
•Execute: Failure injection tools (network, compute, storage)

Execution Steps

•Define steady-state hypothesis with measurable metrics
•Select a failure injection type (network fault, pod kill, CPU stress, disk fault, dependency outage)
•Constrain the blast radius (start small: a single pod or a single AZ)
•Execute the experiment while monitoring the steady-state metrics
•Observe and record system behavior
•Compare actual behavior against hypothesis
•Document findings and remediation
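
The steps above can be sketched as a small harness. This is an illustration under stated assumptions, not a definitive implementation: `read_metric`, `inject_failure`, and `rollback` are hypothetical callables supplied by the caller, the hypothesis is assumed to be "metric stays below a threshold", and the explicit rollback step is an addition that the list above does not name.

```python
from typing import Callable, Dict, List


def run_experiment(
    read_metric: Callable[[], float],     # hypothetical: returns the steady-state metric
    inject_failure: Callable[[], None],   # hypothetical: starts the failure
    rollback: Callable[[], None],         # hypothetical: removes the failure (assumed step)
    threshold: float,
    samples: int = 5,
) -> Dict[str, object]:
    """Run one experiment and compare observations to the hypothesis."""
    # Confirm the steady state holds before injecting anything.
    baseline = read_metric()
    if baseline >= threshold:
        return {
            "status": "ABORTED",
            "reason": "steady state not met before injection",
            "observations": [baseline],
        }

    observations: List[float] = []
    inject_failure()
    try:
        # Observe and record behavior while the failure is active.
        for _ in range(samples):
            observations.append(read_metric())
    finally:
        # Always remove the failure, even if monitoring raises.
        rollback()

    # Compare actual behavior against the hypothesis.
    worst = max(observations)
    status = "PASS" if worst < threshold else "FAIL"
    return {
        "status": status,
        "baseline": baseline,
        "worst": worst,
        "observations": observations,
    }
```

Passing the metric reader and injectors as callables keeps the harness independent of any particular failure injection tool, and lets it be exercised with fakes before it touches a real system.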

Success Criteria

•Hypothesis clearly defined before the experiment starts
•Blast radius contained as planned
•Monitoring functional throughout the experiment
•Findings documented with severity and remediation

Escalation Rules

•Escalate if experiment causes unexpected customer impact
•Escalate if monitoring fails during the experiment
•Escalate if recovery takes longer than the MTTR target
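
The three rules above map directly to a check like the following. It is a sketch only: the `ExperimentObservation` fields and the reason strings are hypothetical names, and how each signal is measured is left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentObservation:
    """Signals gathered during and after an experiment (hypothetical shape)."""
    unexpected_customer_impact: bool
    monitoring_healthy: bool
    recovery_seconds: float


def escalation_reasons(
    obs: ExperimentObservation,
    mttr_target_seconds: float,
) -> List[str]:
    """Return every escalation rule that the observation triggers."""
    reasons: List[str] = []
    if obs.unexpected_customer_impact:
        reasons.append("unexpected customer impact")
    if not obs.monitoring_healthy:
        reasons.append("monitoring failed during the experiment")
    if obs.recovery_seconds > mttr_target_seconds:
        reasons.append("recovery exceeded the MTTR target")
    return reasons
```

An empty list means no escalation is required. Returning all triggered reasons, rather than stopping at the first, gives the escalation recipient the full picture.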

Example Invocations

Input: "Test what happens when the Redis cache becomes unavailable"

Output:

•Hypothesis: p99 API latency stays below 500 ms, with cache misses falling back to the DB
•Experiment: kill the Redis pod
•Result: FAIL. Latency spiked to 3.2 s and the circuit breaker did not trip (misconfigured threshold)
•Remediation: lower the circuit breaker threshold from a 50% to a 20% error rate, and add cache stampede protection
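
To show why the threshold change in this remediation matters, the sketch below models a circuit breaker's trip decision as a simple error-rate comparison. The function names and the "trip at or above the threshold" rule are assumptions, and the 30-failures-in-100 window in the usage note is a hypothetical figure, not a measurement from the experiment.

```python
def error_rate(failures: int, total: int) -> float:
    """Fraction of failed calls in a window; 0.0 for an empty window."""
    if total <= 0:
        return 0.0
    return failures / total


def breaker_should_trip(failures: int, total: int, threshold: float) -> bool:
    """Assumed rule: trip when the error rate reaches the threshold."""
    return error_rate(failures, total) >= threshold


# Hypothetical window: 30 of 100 calls failed while the cache was down.
old_threshold = 0.50   # threshold before remediation
new_threshold = 0.20   # threshold after remediation
trips_before = breaker_should_trip(30, 100, old_threshold)
trips_after = breaker_should_trip(30, 100, new_threshold)
```

Under that hypothetical window, `trips_before` is `False` and `trips_after` is `True`: a 30% error rate never reaches a 50% threshold, so calls keep flowing to the degraded dependency, while a 20% threshold opens the breaker.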