Test-Driven Autonomous Development

Critical Rules

•NEVER skip baseline measurement — always establish what "correct" looks like before agents start changing things
•NEVER let agents make changes without a way to verify correctness — if there's no test, there's no autonomy
•ALWAYS structure test output so agents can interpret failures without human help
•WHEN designing tests, every failure message MUST include: what was expected, what happened, and where to look
•WHEN no reference implementation exists, HELP the user create one before proceeding
•NEVER optimize test speed at the cost of correctness: fast tests that miss bugs give agents false confidence, which is worse than having no tests

Instructions

Step 1: Identify the Oracle

An oracle is a source of truth that tells you what "correct" looks like. Find or create one:

Option A: Reference Implementation (Best)

If a known-correct implementation exists (e.g., GCC for a compiler, a legacy system being replaced, a spec with reference outputs):

•Set up the reference so it can be invoked programmatically
•Create a harness that runs both the reference and the system under test on the same input
•Diff the outputs — any difference is a failure
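
The harness described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: it assumes both the reference and the system under test are command-line programs that read stdin and write stdout, and the function names `run` and `compare_with_reference` are hypothetical.

```python
import difflib
import subprocess
import sys


def run(cmd, stdin_text):
    """Run a command with the given stdin and return its stdout as text."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return result.stdout


def compare_with_reference(reference_cmd, candidate_cmd, stdin_text):
    """Return unified-diff lines for the two outputs; an empty list is a match."""
    expected = run(reference_cmd, stdin_text).splitlines()
    actual = run(candidate_cmd, stdin_text).splitlines()
    return list(
        difflib.unified_diff(
            expected, actual, fromfile="reference", tofile="candidate", lineterm=""
        )
    )


if __name__ == "__main__":
    # Stand-in commands (assumptions): the "candidate" has a deliberate bug.
    reference = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c", "import sys; print(sys.stdin.read().lower())"]
    diff = compare_with_reference(reference, candidate, "Hello")
    print("\n".join(diff) if diff else "outputs match")
```

In a real setup the two command lists would point at the pinned reference and the build under test, and any non-empty diff would be reported as a failure.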

Option B: Snapshot/Golden File Testing

If no reference exists but you have known-correct outputs:

•Run the current (correct) system and capture outputs as golden files
•After agents make changes, compare new outputs against golden files
•Differences require manual review or explicit approval
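
One possible shape for the golden-file comparison is shown here. It is a sketch under assumptions: the `.golden` file suffix, the `check_golden` name, and the `update` flag (standing in for "explicit approval") are all hypothetical.

```python
import difflib
from pathlib import Path


def check_golden(name, actual, golden_dir, update=False):
    """Compare `actual` with the stored golden file for `name`.

    Returns an empty list on a match, otherwise a list of diff lines.
    With update=True (explicit approval) the golden file is overwritten.
    """
    golden_path = Path(golden_dir) / f"{name}.golden"
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return []
    if not golden_path.exists():
        # A missing golden file is a failure, never a silent pass.
        return [f"missing golden file: {golden_path}"]
    expected = golden_path.read_text(encoding="utf-8")
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile=str(golden_path),
            tofile="actual",
            lineterm="",
        )
    )
```

Treating a missing golden file as a failure is a deliberate choice in this sketch: it prevents an agent from "passing" a test simply by deleting its expected output.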

Option C: Property-Based Testing

If correct outputs aren't known but invariants are:

•Define properties that must always hold (e.g., "output is valid JSON", "response time < 500ms", "no data loss")
•Generate random inputs and verify properties hold
•Agents can work autonomously as long as no property violations occur
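
To illustrate the idea, here is a hand-rolled sketch that uses only the standard library; a dedicated property-testing library such as Hypothesis would add input shrinking and smarter generation. The `serialize` function is a hypothetical stand-in for the system under test, and the two properties checked are "output is valid JSON" and "no data loss".

```python
import json
import random
import string


def serialize(record):
    """Hypothetical stand-in for the system under test."""
    return json.dumps(record, sort_keys=True)


def random_record(rng):
    """Generate a small random dict of lowercase string keys to int values."""
    keys = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8)))
        for _ in range(rng.randint(0, 5))
    ]
    return {key: rng.randint(-1000, 1000) for key in keys}


def check_properties(system=serialize, trials=200, seed=0):
    """Return a list of property violations; empty means every property held."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    violations = []
    for _ in range(trials):
        record = random_record(rng)
        output = system(record)
        try:
            decoded = json.loads(output)  # property 1: output is valid JSON
        except json.JSONDecodeError as exc:
            violations.append(f"invalid JSON for input {record!r}: {exc}")
            continue
        if decoded != record:  # property 2: no data loss on a round trip
            violations.append(f"data loss: input {record!r}, decoded {decoded!r}")
    return violations
```

Each violation message includes the generating input, so an agent can replay the failing case directly.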

Option D: Human-Verified Seed Tests

If nothing above works:

•Create a small set of manually verified input/output pairs
•These serve as regression anchors
•Agents can work but must not break seed tests
•Expand the seed set as confidence grows

Present options to the user. IF no oracle can be identified → warn that full autonomy is risky and recommend shorter agent sessions with human checkpoints.

Step 2: Design Test Tiers

Organize tests from fast/narrow to slow/comprehensive:

Tier 1: Unit Tests (seconds)

•Test individual functions or components in isolation
•Run after every change
•Must complete in <30 seconds total
•Purpose: catch obvious breakage immediately

Tier 2: Integration Tests (minutes)

•Test interactions between components
•Run after a batch of related changes
•Must complete in <5 minutes
•Purpose: catch interface mismatches and data flow issues

Tier 3: System Tests (minutes to hours)

•Test the full system end-to-end against the oracle
•Run after a milestone or before merging
•May take longer but must be comprehensive
•Purpose: final validation that the whole system works

For each tier, define:

•What inputs to use
•What outputs to check
•Pass/fail criteria
•How to run (command or script)
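
The per-tier definition could be captured as data, for example as below. This is only an illustrative layout: the `TierSpec` fields mirror the four items above plus the time limits stated for each tier, while the `make` targets and the descriptive strings are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierSpec:
    name: str
    inputs: str                      # what inputs to use
    outputs_checked: str             # what outputs to check
    pass_criteria: str               # pass/fail criteria
    command: tuple                   # how to run (hypothetical make targets)
    budget_seconds: Optional[float]  # None means no hard time limit


TIERS = (
    TierSpec("unit", "hand-written cases per function", "return values",
             "all assertions hold", ("make", "test-unit"), 30),
    TierSpec("integration", "multi-component scenarios",
             "data passed between components", "interfaces agree",
             ("make", "test-integration"), 300),
    TierSpec("system", "full end-to-end inputs",
             "output compared with the oracle", "no difference from the oracle",
             ("make", "test-system"), None),
)


def within_budget(tier, elapsed_seconds):
    """True if the tier finished inside its time budget (or has none)."""
    return tier.budget_seconds is None or elapsed_seconds < tier.budget_seconds
```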

Step 3: Create a Fast Subset for Iteration

Agents need rapid feedback. Create a "smoke test" subset:

•From each tier, select the most representative tests
•Aim for <60 seconds total runtime
•Cover the critical paths — if these pass, the system is probably working
•Include at least one test per major component

Deterministic Sampling Strategy

WHEN the full test suite has N tests:

•IF N < 50 → run all of them as the fast subset
•IF N is 50-500 → select ~10%, covering each component and prioritizing tests that have caught bugs before
•IF N > 500 → select a fixed set of ~50 covering all components, plus any test that failed in the last 5 runs

The fast subset should be runnable via a single command (e.g., make test-fast or npm test -- --suite=smoke).
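
The sampling rules above can be sketched as a single selection function. The details here are assumptions rather than part of the strategy itself: tests are given as (name, component) pairs, "covering each component" is implemented as a round-robin over components, and ties are broken alphabetically so the result is identical on every run.

```python
from collections import defaultdict
from itertools import zip_longest


def select_fast_subset(tests, recent_failures=(), bug_catchers=()):
    """Pick a deterministic fast subset from a list of (name, component) pairs.

    recent_failures: names of tests that failed in the last 5 runs.
    bug_catchers: names of tests that have caught bugs before.
    """
    n = len(tests)
    if n < 50:
        return sorted(name for name, _ in tests)

    target = 50 if n > 500 else n // 10  # fixed ~50, or roughly 10%
    catchers = set(bug_catchers)

    by_component = defaultdict(list)
    for name, component in tests:
        by_component[component].append(name)

    # Within each component: proven bug-catchers first, then by name.
    queues = [
        sorted(names, key=lambda name: (name not in catchers, name))
        for _, names in sorted(by_component.items())
    ]

    # Round-robin so every component gets at least one test.
    limit = max(target, len(queues))
    chosen = []
    for round_names in zip_longest(*queues):
        for name in round_names:
            if name is not None and len(chosen) < limit:
                chosen.append(name)

    if n > 500:
        known = {name for name, _ in tests}
        extras = (set(recent_failures) & known) - set(chosen)
        chosen.extend(sorted(extras))
    return chosen
```

Because nothing in the function is random, two agents given the same test list and failure history will run exactly the same subset.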

Step 4: Structure Error Output for Agent Consumption

Test failures must be machine-readable AND agent-actionable. Every failure should include: test name, component, expected vs actual, diff, investigation path, and context.

See references/error-output-format.md for the full structured format template and test runner configuration guidance.
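
As a rough illustration only, a failure record carrying the fields listed above might look like this. The authoritative template is the one in references/error-output-format.md; the `FailureReport` class, its field names, and the JSON layout below are hypothetical.

```python
import difflib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FailureReport:
    test_name: str
    component: str
    expected: str
    actual: str
    investigate: list = field(default_factory=list)  # files to look at first
    context: str = ""

    def to_json(self):
        """Serialize the failure, adding a unified diff of expected vs actual."""
        record = asdict(self)
        record["diff"] = list(
            difflib.unified_diff(
                self.expected.splitlines(),
                self.actual.splitlines(),
                fromfile="expected",
                tofile="actual",
                lineterm="",
            )
        )
        return json.dumps(record, indent=2)
```

A runner would print one such record per failing test, for example `print(FailureReport("test_add", "codegen", "3", "4", ["src/codegen.py"]).to_json())`, where the test name and path are made up for the example.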

Step 5: Wire It All Together

Create a test runner script that agents will use:

•test-fast — runs the fast subset, returns structured output
•test-full — runs all tiers sequentially, returns structured output
•test-against-oracle — runs oracle comparison (if applicable)

Each command should:

•Exit 0 on all pass, non-zero on any failure
•Output structured failure information (Step 4 format)
•Report summary: PASS: N, FAIL: M, SKIP: K

Configure agents to run test-fast after every change and test-full before committing.
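
The exit-code and summary contract can be sketched as below. This assumes tests are plain Python callables, that a skip is signalled with `unittest.SkipTest`, and that skipped tests do not count as failures; `run_suite` and `summarize` are hypothetical names.

```python
import unittest
from collections import Counter


def run_suite(tests):
    """Run each callable in {name: callable}; return {name: outcome}."""
    results = {}
    for name, test in tests.items():
        try:
            test()
        except unittest.SkipTest:
            results[name] = "skip"
        except Exception:
            results[name] = "fail"
        else:
            results[name] = "pass"
    return results


def summarize(results):
    """Return (summary_line, exit_code) for a {name: outcome} mapping."""
    counts = Counter(results.values())
    summary = (
        f"PASS: {counts['pass']}, FAIL: {counts['fail']}, SKIP: {counts['skip']}"
    )
    exit_code = 0 if counts["fail"] == 0 else 1
    return summary, exit_code
```

A script entry point would print the summary line and pass the exit code to `sys.exit`, so that agents and CI can branch on the process status alone.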

Examples

Example: Designing Tests for a Compiler Project

User says: "I'm building a C compiler and want agents to work on it autonomously"

Result: GCC used as oracle, 3 test tiers (unit/integration/system), fast subset of 25 tests running in 20 seconds, structured error output pointing agents to specific codegen functions.

Example: Tests for an API Migration

User says: "Migrate REST API from v1 to v2, agents should handle each endpoint"

Result: v1 API used as oracle on staging, schema validation per endpoint as fast subset (10 seconds), behavioral parity tests for full validation.

See references/examples.md for detailed walkthroughs of both scenarios.

Troubleshooting

Flaky tests undermine agent confidence

Cause: Non-deterministic tests (timing, ordering, external dependencies) that sometimes pass and sometimes fail.

Solution: Quarantine flaky tests out of the fast subset. Fix them separately. Agents should only run deterministic tests autonomously, because a flaky failure wastes an entire agent cycle investigating a non-bug.
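
One simple way to find quarantine candidates is to rerun each test several times and flag any whose outcome changes. The sketch below assumes tests are Python callables; note that a test which is flaky only rarely can still slip through a handful of reruns.

```python
def classify(test, runs=5):
    """Run a test callable several times and label it by outcome stability."""
    outcomes = set()
    for _ in range(runs):
        try:
            test()
        except Exception:
            outcomes.add("fail")
        else:
            outcomes.add("pass")
    if len(outcomes) == 2:
        return "flaky"
    return "stable-" + outcomes.pop()


def quarantine_flaky(tests, runs=5):
    """Split {name: callable} into (deterministic_names, quarantined_names)."""
    deterministic, quarantined = [], []
    for name, test in sorted(tests.items()):
        bucket = quarantined if classify(test, runs) == "flaky" else deterministic
        bucket.append(name)
    return deterministic, quarantined
```

A consistently failing test stays in the deterministic list on purpose: it is a real failure, not a flaky one.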

Oracle drift

Cause: The reference implementation was updated but test expectations weren't.

Solution: Pin the oracle version. When updating it, regenerate the expected outputs from the new version and re-run all tests. Keep the oracle version in a config file that agents can check.
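
A pin check along these lines could run before any oracle comparison. The config layout (a JSON file with an `oracle_version` key) and the function name are assumptions made for illustration; how the installed version is obtained depends on the oracle.

```python
import json
from pathlib import Path


def check_oracle_pin(config_path, installed_version):
    """Return None if the installed oracle matches the pin, else a message."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    pinned = config["oracle_version"]  # hypothetical key name
    if pinned == installed_version:
        return None
    return (
        f"oracle drift: pinned {pinned}, installed {installed_version}; "
        "regenerate expected outputs before trusting failures"
    )
```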

Slow feedback loops killing agent productivity

Cause: Fast subset is too large or includes slow tests.

Solution: Profile test runtime. Move any test that takes over 5 seconds out of the fast subset. Consider running slow tests in a separate background agent that validates while the main agent continues working.
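
Profiling can be as simple as timing each test and splitting on the 5-second threshold. The helper names below are hypothetical, and a single timing run is assumed to be representative.

```python
import time


def measure(test):
    """Return the wall-clock seconds a single test callable takes."""
    start = time.perf_counter()
    test()
    return time.perf_counter() - start


def split_by_runtime(durations, limit_seconds=5.0):
    """Split {test_name: seconds} into (keep_in_fast_subset, move_out)."""
    keep = sorted(name for name, secs in durations.items() if secs <= limit_seconds)
    move = sorted(name for name, secs in durations.items() if secs > limit_seconds)
    return keep, move
```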

Agent can't interpret test failures

Cause: Test output is a raw stack trace with no actionable context.

Solution: Wrap the test runner with a harness that parses failures and adds the structured format from Step 4. Even a simple shell script that greps for FAIL lines and adds component/file information helps.
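
The same idea as the shell-script approach, written in Python for illustration. It assumes the runner prints lines of the form `FAIL: <test_name>` or `FAIL <test_name>`, and that test names share a prefix with the component they exercise; both are assumptions, as are the example paths.

```python
import re

# Matches "FAIL: name" or "FAIL name" at the start of a line (assumed format).
FAIL_LINE = re.compile(r"^FAIL[:\s]+(\S+)")


def annotate_failures(raw_output, locations):
    """Turn raw runner output into one structured record per FAIL line.

    locations maps a test-name prefix to (component, file_to_inspect).
    """
    records = []
    for line in raw_output.splitlines():
        match = FAIL_LINE.match(line)
        if not match:
            continue
        test_name = match.group(1)
        component, path = next(
            (loc for prefix, loc in locations.items()
             if test_name.startswith(prefix)),
            ("unknown", "unknown"),
        )
        records.append(
            {"test": test_name, "component": component, "look_at": path, "raw": line}
        )
    return records
```

Failures with no matching prefix are kept and marked "unknown" rather than dropped, so an agent still sees them.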