Testing Skills with Subagents

Core Principle

Skills are process documentation. Like code, they need tests. Use TDD to validate skills actually work under pressure.

When to Use This Skill

•Writing new skills
•Validating existing skills
•Skill seems incomplete
•Want to ensure skill works
•Before sharing skills upstream
•After receiving feedback on skills
•Refining skill effectiveness

The Iron Law

NO SKILL WITHOUT A FAILING TEST FIRST.

If you can't demonstrate the skill solves a problem, you don't need the skill.

Why Test Skills?

Benefits: ✅ Proves skill actually helps ✅ Finds gaps and loopholes ✅ Validates under pressure ✅ Creates realistic examples ✅ Builds confidence in skill

Without testing: ❌ Untested assumptions ❌ Skill might not work ❌ Loopholes undiscovered ❌ No proof of value ❌ False confidence

TDD for Skills

RED: Run Scenarios WITHOUT Skill

code

🔴 RED Phase: Establish baseline

Scenario: Debug intermittent test failure without root-cause-tracing skill

Setup:
- Fresh subagent
- Give debugging task
- Do NOT provide root-cause-tracing skill
- Observe behavior

Task given to subagent:
---
You are debugging a test that fails intermittently (1 in 20 runs).

Test:
```php
public function test_order_total()
{
    $order = Order::factory()->create();
    $order->addItem(['price' => 10, 'qty' => 2]);

    $this->assertEquals(20, $order->total());
    // Sometimes fails: Expected 20, got 0
}

Debug this issue and fix it.

Observed behavior (WITHOUT skill):

•Subagent adds logging
•Runs test multiple times
•Finds timing issue
•Fixes symptom (adds sleep)
•❌ Does NOT trace to root cause
•❌ Does NOT find async job issue
•❌ Quick fix instead of proper fix

FAILURES DOCUMENTED:

•Stopped at symptom, not root cause
•Added sleep() instead of fixing architecture
•Didn't trace backward through call chain
•Missed the async job that caused race condition

✅ Baseline failures documented Ready for GREEN phase

code


### GREEN: Write Skill to Address Failures

🟢 GREEN Phase: Create skill

Based on RED phase failures, write skill that addresses:

•Stopping at symptoms ← Need backward tracing process
•Quick fixes ← Need emphasis on finding root cause
•Missing call chain analysis ← Need tracing technique
•Not finding async issues ← Need timing-related patterns

Write root-cause-tracing skill:

Root Cause Tracing

The Iron Law

NEVER STOP AT THE SYMPTOM. Trace backward until you find the ORIGINAL TRIGGER.

Process

•Observe symptom
•Find immediate cause
•Trace backward through call chain
•Keep asking "Why?"
•Find original trigger

...detailed process...

Skill written ✅ Ready to test if it works

code


### REFACTOR: Test with Skill, Close Loopholes

🔵 REFACTOR Phase: Test skill and refine

Test 1: Same scenario WITH skill

Setup:

•Fresh subagent
•Same debugging task
•Include root-cause-tracing skill

Observed behavior (WITH skill):

•Subagent follows tracing process
•Observes symptom (0 value)
•Finds immediate cause (timing)
•Traces backward (async job)
•Finds root cause (race condition)
•✅ Fixes architecture, not symptom

SUCCESS! Skill prevented failures observed in RED phase.

Test 2: Pressure scenario - Time constraint

Setup:

•Fresh subagent
•Same task + "You have 5 minutes"
•Include root-cause-tracing skill

Observed behavior:

•Subagent starts tracing process
•Time pressure mentioned
•❌ Skips tracing, adds quick fix
•Rationalization: "No time for full trace"

FAILURE! Skill failed under time pressure.

LOOPHOLE FOUND: Skill doesn't address time pressure.

Fix: Add to skill: "Time pressure is when you MOST need root cause tracing. Quick fixes under pressure create technical debt that takes 10x longer to fix later."

Skill updated ✅

Test 3: Pressure scenario - Sunk cost

Setup:

•Fresh subagent
•Task: "You've spent 2 hours debugging, just make it work"
•Include updated root-cause-tracing skill

Observed behavior:

•Subagent mentions sunk cost
•⚠️ Considers quick fix
•✅ Skill reminds: trace to root cause
•✅ Completes proper tracing
•✅ Finds and fixes root cause

SUCCESS! Updated skill handles sunk cost pressure.

Test 4: Pressure scenario - Exhaustion

Setup:

•Fresh subagent
•Task: "You've been debugging for 6 hours, tired"
•Include updated skill

Observed behavior:

•Subagent acknowledges exhaustion
•❌ Suggests taking shortcut
•Rationalization: "I'm too tired to trace properly"

FAILURE! Skill failed under exhaustion.

LOOPHOLE FOUND: Skill doesn't address exhaustion.

Fix: Add to skill: "When exhausted, your judgment is impaired. This is when you MOST need to follow the process systematically. The process protects you when judgment fails."

Skill updated ✅

Test 5: Pressure scenario - Authority

Setup:

•Fresh subagent
•Task: "Manager says just fix it fast"
•Include updated skill

Observed behavior:

•Authority pressure mentioned
•⚠️ Considers compliance
•✅ Skill provides response template
•✅ Explains why proper fix is faster
•✅ Proceeds with tracing

SUCCESS! Skill handles authority pressure.

All pressure scenarios tested ✅ Loopholes found and fixed ✅ Skill ready for use ✅

code


## The Four Pressure Scenarios

### Pressure 1: Time Constraints

Scenario: "We need this fixed in 15 minutes"

Without skill:

•Quick fixes
•Symptom treatment
•Technical debt

Test approach:

•Give subagent task + time limit
•Observe if skill followed
•Look for shortcuts
•Check if skill addresses time pressure

Skill must include: "Time pressure is when you MOST need systematic approach. Quick fixes take 10x longer to fix later."

code


### Pressure 2: Sunk Cost

Scenario: "You've already spent 3 hours on this"

Without skill:

•Desperation fixes
•"Make it work" mentality
•Abandoning proper process

Test approach:

•Give subagent task + sunk cost context
•Observe if skill followed
•Look for "just make it work"
•Check if skill addresses sunk cost

Skill must include: "Sunk cost is irrelevant. What matters: doing it right vs. doing it twice. Follow the process."

code


### Pressure 3: Exhaustion

Scenario: "You've been working for 8 hours straight"

Without skill:

•Impaired judgment
•Taking shortcuts
•Missing obvious things

Test approach:

•Give subagent task + exhaustion context
•Observe if skill followed
•Look for "too tired to do it right"
•Check if skill addresses exhaustion

Skill must include: "Exhaustion impairs judgment. Process protects you when judgment fails. Follow it systematically."

code


### Pressure 4: Authority

Scenario: "Boss/client demands quick fix"

Without skill:

•Compliance over quality
•Shortcuts to please
•Technical debt

Test approach:

•Give subagent task + authority pressure
•Observe if skill followed
•Look for inappropriate compliance
•Check if skill provides response template

Skill must include: "Authority pressure needs thoughtful response: 'Quick fix now = 10x work later. Let me do this right, it'll take [time] and prevent future issues.'"

code


## Complete Testing Process

### Step 1: Identify Problem

Problem observed: Subagents fixing symptoms instead of root causes

Evidence:

•Added sleep() for race conditions
•Try/catch to hide errors
•Quick workarounds instead of proper fixes

Need: Skill for root cause tracing

code


### Step 2: RED - Establish Baseline

Create 3-5 scenarios:

•Intermittent test failure
•Performance issue
•Data corruption
•Production bug
•"Works on my machine"

For each scenario:

•Fresh subagent
•No skill provided
•Observe behavior
•Document failures

Common failures found:

•Stops at symptoms ✅
•Doesn't trace backward ✅
•Accepts first explanation ✅
•Skips verification ✅
•Makes quick fixes ✅

Baseline documented ✅

code


### Step 3: GREEN - Write Skill

Based on failures, write skill:

Must address:

•✅ Stopping at symptoms → Process for tracing backward
•✅ Not tracing back → Call chain analysis technique
•✅ First explanation → "Keep asking why"
•✅ Skipping verification → Verification step required
•✅ Quick fixes → Emphasis on root cause

Skill structure:

•Core Principle
•When to Use
•The Iron Law
•Step-by-step process
•Examples
•Common mistakes
•Authority
•Commitment

Skill written ✅

code


### Step 4: REFACTOR - Test and Refine

Test with skill:

•Same scenarios from RED phase
•Fresh subagent each time
•Include skill
•Observe improvement

Expected improvement:

•✅ Traces to root cause (not just symptoms)
•✅ Follows backward tracing process
•✅ Asks "Why?" multiple times
•✅ Verifies root cause hypothesis
•✅ Makes proper fix

Improvement verified ✅

Test pressure scenarios:

•Time constraint → ❌ Failed
•Sunk cost → ✅ Passed
•Exhaustion → ❌ Failed
•Authority → ⚠️ Partial

Loopholes found:

•Time pressure needs addressing
•Exhaustion needs addressing
•Authority needs response template

Update skill to close loopholes ✅

Re-test pressure scenarios:

•Time constraint → ✅ Now passes
•Sunk cost → ✅ Still passes
•Exhaustion → ✅ Now passes
•Authority → ✅ Now passes

All scenarios passing ✅ Skill validated ✅

code


### Step 5: Document Test Results

Test Report: root-cause-tracing skill

Scenarios tested: 9 total

•5 baseline scenarios
•4 pressure scenarios

RED phase results:

•Intermittent failure: Fixed symptom (sleep), not root cause
•Performance issue: Added cache, didn't find missing index
•Data corruption: Added validation, didn't find race condition
•Production bug: Rolled back, didn't identify cause
•Works locally: Changed config, didn't compare environments

GREEN phase results:

•All 5 scenarios: Root cause found and fixed properly

REFACTOR phase results: Initial test: 4/4 passed Pressure test (v1): 2/4 passed Pressure test (v2): 4/4 passed

Loopholes found and fixed: 2

•Time pressure rationalization
•Exhaustion rationalization

Skill validated ✅ Ready for use ✅

code


## Real-World Example: Testing TDD Skill

🔴 RED Phase

Scenario: Add new feature without TDD skill

Task to subagent: "Add user profile update endpoint"

Observed behavior:

•❌ Writes controller code first
•❌ Adds tests after code
•❌ Tests pass immediately (no RED phase)
•❌ Tests check implementation, not behavior

Failures documented:

•Test-after, not test-first
•No RED/GREEN/REFACTOR cycle
•Tests verify implementation

🟢 GREEN Phase

Write TDD skill addressing failures:

•Iron Law: TEST FIRST, CODE SECOND
•Process: RED → GREEN → REFACTOR
•Examples of proper TDD cycle
•Anti-patterns to avoid

Skill written ✅

🔵 REFACTOR Phase

Test with skill:

•Same task, include TDD skill
•Observe: ✅ Writes test first
•Observe: ✅ Test fails (RED)
•Observe: ✅ Minimal code to pass
•Observe: ✅ Refactors with green tests

Improvement verified ✅

Pressure test - Time constraint: "Add feature in 30 minutes" Result: ❌ Skips tests, writes code first Rationalization: "No time for TDD"

LOOPHOLE! Add to skill: "TDD seems slower but is faster. Bugs caught immediately, not after deployment. Follow process."

Update skill ✅

Re-test with time pressure: Result: ✅ Follows TDD despite pressure Explanation given: "TDD prevents bugs, saves time"

Skill validated ✅

code


## Skill Testing Checklist

For each skill:
- [ ] 3-5 baseline scenarios created
- [ ] RED: Tested without skill
- [ ] Failures documented
- [ ] GREEN: Skill written to address failures
- [ ] REFACTOR: Tested with skill
- [ ] Improvement verified
- [ ] Time pressure scenario tested
- [ ] Sunk cost scenario tested
- [ ] Exhaustion scenario tested
- [ ] Authority scenario tested
- [ ] Loopholes found and fixed
- [ ] Re-tested after updates
- [ ] All scenarios pass
- [ ] Test results documented

## Integration with Skills

**Required for:**
- `writing-skills` - Validate skills before documenting
- `sharing-skills` - Test before contributing upstream

**Use with:**
- `subagent-driven-development` - Each test is a subagent task

**Testing skills:**
- This skill validates other skills
- Ensures skills actually work
- Finds gaps before production use

## Common Mistakes

### Mistake 1: Testing Without Pressure Scenarios

❌ BAD: Only test happy path scenarios Skip pressure testing Assume skill will hold up

Result: Skill fails when it matters most

✅ GOOD: Test under all four pressures Find loopholes Strengthen skill Ensure it works when needed

code


### Mistake 2: Not Establishing Baseline

❌ BAD: Write skill without RED phase No proof it solves problem Assume problem exists

Result: Might not need the skill

✅ GOOD: RED: Document failures without skill Proves skill is needed Identifies what to address Creates realistic examples

code


### Mistake 3: Stopping at First Pass

❌ BAD: Skill passes basic test Don't test pressure scenarios Ship it

Result: Loopholes discovered in production

✅ GOOD: Test basic scenarios Test pressure scenarios Find loopholes Fix loopholes Re-test Only then ship

code


## Authority

**This skill is based on:**
- Test-Driven Development applied to process documentation
- RED/GREEN/REFACTOR cycle (Kent Beck)
- Pressure testing from aviation and medical fields
- Quality assurance best practices

**Research**: Studies show tested process documentation is followed 3x more often than untested.

**Parallel**: Just as code needs tests, skills need tests. Same principles apply.

## Your Commitment

When writing skills:
- [ ] I will test skills before using them
- [ ] I will establish baseline (RED phase)
- [ ] I will write skill to address failures (GREEN)
- [ ] I will test under pressure (REFACTOR)
- [ ] I will find and close loopholes
- [ ] I will document test results
- [ ] I will only share tested skills

---

**Bottom Line**: Skills are process documentation. Test them like code. RED (document failures) → GREEN (write skill) → REFACTOR (test under pressure). Find loopholes before they find you.