Testing Guide

What to test, how to test it, and what NOT to test — for a plugin made of prompt files, Python glue, and configuration.

Philosophy: GenAI-First Testing

Traditional unit tests work for deterministic logic. But most bugs in this project are drift — docs diverge from code, agents contradict commands, component counts go stale. GenAI congruence tests catch these. Unit tests don't.

Decision rule: Can you write assert x == y and it won't break next week? → Unit test. Otherwise → GenAI test or structural test.

Three Test Patterns

1. Judge Pattern (single artifact evaluation)

An LLM evaluates one artifact against criteria. Use for: doc completeness, security posture, architectural intent.

python

pytestmark = [pytest.mark.genai]

def test_agents_documented_in_claude_md(self, genai):
    agents_on_disk = list_agents()
    claude_md = Path("CLAUDE.md").read_text()
    result = genai.judge(
        question="Does CLAUDE.md document all active agents?",
        context=f"Agents on disk: {agents_on_disk}\nCLAUDE.md:\n{claude_md[:3000]}",
        criteria="All active agents should be referenced. Score by coverage %."
    )
    assert result["score"] >= 5, f"Gap: {result['reasoning']}"

2. Congruence Pattern (two-source cross-reference)

The most valuable pattern. An LLM checks two files that should agree. Use for: command↔agent alignment, FORBIDDEN lists, config↔reality.

python

def test_implement_and_implementer_share_forbidden_list(self, genai):
    implement = Path("commands/implement.md").read_text()
    implementer = Path("agents/implementer.md").read_text()
    result = genai.judge(
        question="Do these files have matching FORBIDDEN behavior lists?",
        context=f"implement.md:\n{implement[:5000]}\nimplementer.md:\n{implementer[:5000]}",
        criteria="Both should define same enforcement gates. Score 10=identical, 0=contradictory."
    )
    assert result["score"] >= 5

Analytic Rubric Pattern (decomposed per-criterion evaluation)

More reliable than holistic scoring. Each criterion is evaluated independently with a binary MET/UNMET judgment. Use for: security posture, enforcement quality, multi-faceted assessments.

python

def test_security_posture_analytic(self, genai):
    result = genai.judge_analytic(
        question="Evaluate the security posture of this codebase",
        context=f"Hook samples:\n{hook_content[:5000]}",
        criteria=[
            {"name": "No hardcoded secrets", "description": "No real API keys or tokens in source", "max_points": 1},
            {"name": "Named exit codes", "description": "Hooks use named constants, not bare numbers", "max_points": 1},
            {"name": "Path validation", "description": "File operations validate paths", "max_points": 1},
        ],
    )
    assert result["total_score"] >= 2, f"{result['total_score']}/{result['max_score']}: {result['reasoning']}"

Return value: {"criteria_results": [...], "total_score": N, "max_score": N, "pass": bool, "band": str, "reasoning": str}

When to use: Multi-faceted evaluations where you need to know which specific criteria passed or failed. Each criterion gets its own LLM call for independent judgment.

Consistency Check Pattern (multi-round agreement)

For high-stakes judgments where a single LLM evaluation might be unreliable. Runs multiple rounds and checks for agreement. Uses median score as the final result.

python

def test_pipeline_completeness_consistent(self, genai):
    result = genai.judge_consistent(
        question="Does implement.md define a complete SDLC pipeline?",
        context=f"implement.md:\n{content[:6000]}",
        criteria="Pipeline should have research, plan, test, implement, review, security, docs steps.",
        rounds=3,
    )
    assert result["final_score"] >= 7, f"median={result['final_score']}, agreement={result['agreement']}"

Return value: {"rounds": [...], "agreement": bool, "scores": [...], "final_score": median, "pass": bool, "band": str, "reasoning": str}

When to use: Critical assessments where false positives/negatives are costly. Agreement=False signals the evaluation needs human review.

Temperature Guidance

All ask() calls default to temperature=0 for deterministic, reproducible judging. Override only when you need creative/diverse outputs:

python

# Default: temperature=0 (deterministic judging)
response = genai.ask("Evaluate this code", temperature=0)

# Override for creative tasks like edge case generation
response = genai.ask("Generate unusual test inputs", temperature=0.7)

3. Cross-Validation Pattern (two sources that must match)

No LLM needed. When two configs/files must stay in sync, read both and compare directly. Catches the #1 recurring bug class: adding something to one place but not the other.

python

def test_policy_and_hook_in_sync(self):
    """Policy always_allowed and hook NATIVE_TOOLS must be identical."""
    policy_tools = set(json.load(open(POLICY_FILE))["tools"]["always_allowed"])
    hook_tools = hook.NATIVE_TOOLS
    # Check BOTH directions
    assert policy_tools - hook_tools == set(), f"In policy not hook: {policy_tools - hook_tools}"
    assert hook_tools - policy_tools == set(), f"In hook not policy: {hook_tools - policy_tools}"

When to use: Any time two files define overlapping data — permissions↔hook, manifest↔disk, config↔worktree copy, command frontmatter↔policy. Key principle: Read both sources dynamically. Never hardcode expected values in the test itself.

4. Structural Pattern (dynamic filesystem discovery)

No LLM needed. Discover components dynamically and assert structural properties. Use for: component existence, manifest sync, skill loading.

python

def test_all_active_skills_have_content(self):
    skills_dir = Path("plugins/autonomous-dev/skills")
    for skill in skills_dir.iterdir():
        if skill.name == "archived" or not skill.is_dir():
            continue
        skill_md = skill / "SKILL.md"
        assert skill_md.exists(), f"Skill {skill.name} missing SKILL.md"
        assert len(skill_md.read_text()) > 100, f"Skill {skill.name} is a hollow shell"

5. Property-Based Pattern (hypothesis invariants)

Define properties that must always hold, instead of testing specific examples. Catches 23-37% more bugs than example-based tests. Use for: pure functions, serialization, data transformations, parsers.

python

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_preserves_elements(arr):
    """Invariant: sorting never loses or adds elements."""
    result = sorted(arr)
    assert set(result) == set(arr)
    assert len(result) == len(arr)

@given(st.dictionaries(st.text(min_size=1), st.text()))
def test_config_roundtrip(config):
    """Invariant: serialize → deserialize = identity."""
    assert json.loads(json.dumps(config)) == config

When to use: Pure functions, roundtrips, idempotent operations, parsers. When NOT to use: Agent prompts (use GenAI judge), filesystem checks (use structural).

PBT Candidate Selection — When to Use Property-Based Testing

Good candidates (pure functions with testable invariants):

•Input validation functions (validate_agent_name, validate_message, sanitize_*)
•Scoring/normalization functions (normalize_severity, compute_priority)
•Serialize/deserialize roundtrips (settings fix-then-validate, JSON encode/decode)
•Set operations and state machines (circuit breaker threshold, denial counts)
•Mathematical functions with known identities (Fibonacci recurrence)
•Parsers with structural guarantees (acceptance criteria extraction)

Bad candidates (avoid PBT for these):

•Agent prompt behavior — use GenAI judge tests instead; Hypothesis cannot meaningfully generate LLM prompts
•File system operations — use structural tests with tmp_path; filesystem side effects break Hypothesis shrinking
•Config values with fixed schemas — use structural tests; config is a fixed schema, not a property space
•Network/API calls — mock these in unit tests; PBT should test pure computation only

Strategy rules:

•Define all strategies as module-level constants (never inline in test functions)
•Use .filter() instead of assume() for filtering invalid inputs
•Always add @example() decorators with known edge cases alongside @given()

Hypothesis Profile Configuration

Configure profiles via HYPOTHESIS_PROFILE environment variable:

bash

# Default (local development): 50 examples per test
pytest tests/property/ -v

# CI mode: 200 examples per test, no deadline
HYPOTHESIS_PROFILE=ci pytest tests/property/ -v

Profiles are registered in tests/property/conftest.py:

•default: max_examples=50 (fast local iteration)
•ci: max_examples=200, deadline=None (thorough CI runs)

Anti-Patterns (NEVER do these)

Hardcoded counts

python

# BAD — breaks every time a component is added/removed
assert len(agents) == 14
assert hook_count == 17

# GOOD — minimum thresholds + structural checks
assert len(agents) >= 8, "Pipeline needs at least 8 agents"
assert "implementer.md" in agent_names, "Core agent missing"

Hardcoded intermediary lists (the worst anti-pattern)

python

# BAD — test has its OWN copy of expected data, drifts from both real sources
VALID_TOOLS = {"Read", "Write", "Edit"}  # stale copy in test
EXPECTED_COMMANDS = {"implement.md": {"Read", "Write"}}  # another stale copy
assert actual_tools == VALID_TOOLS  # passes even when BOTH sources are wrong

# GOOD — cross-validate real sources directly against each other
policy_tools = set(json.load(open(POLICY_FILE))["tools"]["always_allowed"])
hook_tools = hook.NATIVE_TOOLS
assert policy_tools == hook_tools, f"Drift: policy-only={policy_tools - hook_tools}"

# BEST — add GenAI test to catch gaps in BOTH sources
result = genai.judge(
    question="Are any known tools missing from this list?",
    context=json.dumps(sorted(hook_tools)),
    criteria="Check against known Claude Code native tools..."
)

Rule: When two configs must stay in sync, read both dynamically and compare. Never create a third copy in the test — that's three things that can drift instead of two.

Testing config values

python

# BAD — breaks on every config update
assert settings["version"] == "3.51.0"

# GOOD — test structure, not values
assert "version" in settings
assert re.match(r"\d+\.\d+\.\d+", settings["version"])

Testing file paths that move

python

# BAD — breaks on renames/moves
assert Path("plugins/autonomous-dev/lib/old_name.py").exists()

# GOOD — use glob discovery
assert any(Path("plugins/autonomous-dev/lib").glob("*skill*"))

Rule: If the test itself is the thing that needs updating most often, delete it.

Test Tiers — Diamond Model (auto-categorized by directory)

No manual @pytest.mark needed — directory location determines tier. Source of truth: plugins/autonomous-dev/lib/tier_registry.py.

Tier	Lifecycle	Directory	Markers	Max Duration
T0	permanent	`tests/genai/`	genai, acceptance	-
T0	permanent	`tests/regression/smoke/`	smoke	5s
T1	stable	`tests/e2e/`	e2e, slow	5min
T1	stable	`tests/integration/`	integration	30s
T2	semi-stable	`tests/regression/regression/`	regression	30s
T2	semi-stable	`tests/regression/extended/`	extended, slow	5min
T2	semi-stable	`tests/property/`	property, slow	5min
T3	ephemeral	`tests/regression/progression/`	progression, tdd_red	-
T3	ephemeral	`tests/unit/`	unit	1s
T3	ephemeral	`tests/hooks/`	hooks, unit	1s
T3	ephemeral	`tests/security/`	unit	1s

Lifecycle definitions:

•permanent: Never delete. Critical path validation and acceptance criteria.
•stable: Delete only if the feature being tested is removed from the product.
•semi-stable: Prune after 90 days unused. Feature regression protection.
•ephemeral: Prune freely. Implementation-coupled tests that change with the code.

Where to put a new test:

•Protecting a released critical path? -> regression/smoke/
•Protecting a released feature? -> regression/regression/
•Testing a pure function? -> unit/
•Testing component interaction? -> integration/
•Full workflow end-to-end? -> e2e/
•Checking doc-to-code drift? -> genai/

Run commands:

bash

pytest -m smoke                    # CI gate (T0)
pytest -m "smoke or regression"    # Feature protection (T0+T2)
pytest -m "not slow"               # Fast tests only
pytest tests/genai/ --genai        # GenAI validation (opt-in, T0)

GenAI Test Infrastructure

python

# tests/genai/conftest.py provides two fixtures:
# - genai: Gemini Flash via OpenRouter (cheap, fast)
# - genai_smart: Haiku 4.5 via OpenRouter (complex reasoning)
# Requires: OPENROUTER_API_KEY env var + --genai pytest flag
# Cost: ~$0.02 per full run with 24h response caching

Scaffold for any repo: /scaffold-genai-uat generates the full tests/genai/ setup with portable client, universal tests, and project-specific congruence tests auto-discovered by GenAI.

What to Test vs What Not To

Test This	With This	Not This
Pure Python functions	Unit tests	—
Component interactions	Integration tests	—
Doc ↔ code alignment	GenAI congruence	Hardcoded string matching
Two configs in sync	Cross-validation	Hardcoded intermediary list
Component existence	Structural (glob)	Hardcoded counts
FORBIDDEN list sync	GenAI congruence	Manual comparison
Security posture	GenAI judge	Regex scanning
Config structure	Structural	Config values
Agent output quality	GenAI judge	Output string matching

Test-to-Issue Tracing Convention (Issue #675)

Link tests to GitHub issues for traceability. The TestIssueTracer library (plugins/autonomous-dev/lib/test_issue_tracer.py) scans for these patterns automatically.

Supported Reference Patterns

Pattern	Example	Type
Class name	`class TestIssue656:`	class_name
Function name	`def test_issue_589_regression():`	function_name
Docstring	`"""Regression for #656"""`	docstring
Comment	`# Issue: #656`	comment
GH shorthand	`GH-42`	gh_shorthand
Pytest marker	`@pytest.mark.issue(656)`	marker

Convention Rules

•Regression tests MUST reference the issue they protect (e.g., class TestIssue656 or # Fixes #656)
•Feature tests SHOULD reference the implementing issue (e.g., docstring """Implements #675""")
•Unit tests MAY reference issues when the test covers a specific bug or feature

Usage

python

# Quick check: does an issue have a test?
from test_issue_tracer import TestIssueTracer
tracer = TestIssueTracer(Path('.'))
tracer.check_issue_has_test(675)  # True/False

# Full analysis report
report = tracer.analyze()
print(report.format_table())

Run via /audit --test-tracing for a full tracing report.

Spec-Blind Validation Pattern

An independent agent writes behavioral tests from the spec/acceptance criteria ONLY, without seeing the implementation code or implementer output. This catches cases where the implementation satisfies its own tests but drifts from the original specification.

Isolation rules:

•The spec-validator receives ONLY: acceptance criteria, feature description, changed file paths, PROJECT.md scope
•The spec-validator MUST NOT receive: implementer output, code diffs, reviewer feedback, research findings, planner rationale
•Tests are placed in tests/spec_validation/ (separate from unit tests in tests/unit/)

Test placement: tests/spec_validation/test_spec_{feature_name}.py

Complementarity with other test types:

•Unit tests (implementer): Test internal logic, edge cases, error paths
•Spec-validation tests (spec-validator): Test observable behavior against spec criteria
•Mutation testing: Tests whether test suite catches code mutations (code quality)
•GenAI tests: Semantic evaluation using LLM-as-Judge

The spec-validator adds value because it is structurally blind to implementation details. Even if the implementer writes comprehensive unit tests, those tests are influenced by HOW the code was written. The spec-validator tests WHAT the spec requires.

Verdict: Binary only. SPEC-VALIDATOR-VERDICT: PASS or SPEC-VALIDATOR-VERDICT: FAIL. No partial credit.

Mutation Testing

Mutation testing validates that your tests actually catch real bugs, not just exercise code paths. Coverage metrics give false confidence; mutation testing proves test quality.

What It Is

mutmut introduces small code changes (mutants) — flipping < to <=, True to False, + to - — and checks if your tests detect them. If a test suite still passes after a mutation, that mutant "survived" and your tests have a gap.

When to Use

•After reaching 80%+ coverage on a module, to verify test quality
•When reviewing critical security or state-management code
•As a complement to the diamond test model (mutation testing measures test effectiveness, not code coverage)

How to Run

bash

# Run against three critical files (default)
bash scripts/run_mutation_tests.sh

# Run against a single file
bash scripts/run_mutation_tests.sh --file plugins/autonomous-dev/lib/pipeline_state.py

# Run in CI mode (summary output, non-blocking)
bash scripts/run_mutation_tests.sh --ci

# Run against all of lib/
bash scripts/run_mutation_tests.sh --all

Score Targets

•70%+ mutation score on critical files (pipeline_state.py, tool_validator.py, settings_generator.py)
•Focus on killing conditional, arithmetic, and boolean mutants
•Do NOT chase equivalent mutants (see below)

Equivalent Mutant Triage

Not all surviving mutants indicate test gaps. Some mutations produce functionally equivalent code:

Low-value (skip these):

•String literal changes ("error" to "XXerrorXX") — rarely affects behavior
•Magic number changes (unless they are thresholds)
•Return value mutations on void-like functions

High-value (kill these):

•Conditional mutations (< to <=, == to !=) — missing boundary tests
•Arithmetic mutations (+ to -, * to /) — missing calculation tests
•Boolean mutations (True to False, and to or) — missing logic tests

Integration with Diamond Test Model

Mutation testing is orthogonal to the test tier system. It measures test quality (do tests catch bugs?) rather than test coverage (does code run?). Use it as a quality-of-tests metric:

•Unit tests (T3): Primary targets for mutation testing — pure functions with clear boundaries
•Integration tests (T1): Less useful for mutation testing — too slow per mutant
•GenAI tests (T0): Not applicable — mutation testing targets deterministic logic only

Hard Rules

•100% pass rate required — ALL tests must pass, 0 failures. Coverage targets are separate.
•Specification-driven — tests define the contract; implementation satisfies it.
•0 new skips — @pytest.mark.skip is forbidden for new code. Fix it or adjust expectations.
•Regression test for every bug fix — named test_regression_issue_NNN_description.
•No test is better than a flaky test — if it fails randomly, fix or delete it.
•GenAI tests are opt-in — --genai flag required, no surprise API costs.
•Property over example — prefer hypothesis invariants over hardcoded input/output pairs where applicable.