Test-Driven Autonomous Development
Critical Rules
- NEVER skip baseline measurement: always establish what "correct" looks like before agents start changing things
- NEVER let agents make changes without a way to verify correctness: if there's no test, there's no autonomy
- ALWAYS structure test output so agents can interpret failures without human help
- WHEN designing tests, every failure message MUST include: what was expected, what happened, and where to look
- WHEN no reference implementation exists, HELP the user create one before proceeding
- NEVER optimize test speed at the cost of correctness: fast tests that miss bugs are worse than no tests
Instructions
Step 1: Identify the Oracle
An oracle is a source of truth that tells you what "correct" looks like. Find or create one:
Option A: Reference Implementation (Best)
If a known-correct implementation exists (e.g., GCC for a compiler, a legacy system being replaced, a spec with reference outputs):
- Set up the reference so it can be invoked programmatically
- Create a harness that runs both the reference and the system under test on the same input
- Diff the outputs; any difference is a failure
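The harness described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes both the reference and the system under test are command-line programs that read stdin and write stdout, and the names `run` and `compare_with_reference` are hypothetical.

```python
import difflib
import subprocess


def run(cmd, stdin_text):
    """Run a command on the given stdin; return stdout lines plus the exit code."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=60
    )
    # Appending the exit code makes a crash show up as a difference too.
    return result.stdout.splitlines() + [f"[exit code {result.returncode}]"]


def compare_with_reference(ref_cmd, sut_cmd, stdin_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    expected = run(ref_cmd, stdin_text)
    actual = run(sut_cmd, stdin_text)
    return list(difflib.unified_diff(
        expected, actual, fromfile="reference", tofile="under_test", lineterm=""
    ))
```

A caller would pass two argument lists (for example, the pinned reference binary and the freshly built one) and treat any non-empty result as a failure.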
Option B: Snapshot/Golden File Testing
If no reference exists but you have known-correct outputs:
- Run the current (correct) system and capture outputs as golden files
- After agents make changes, compare new outputs against golden files
- Differences require manual review or explicit approval
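A golden-file check along these lines might look like the sketch below. The `.golden` suffix, the `check_golden` name, and the `update` flag are assumptions made for illustration; the key point is that overwriting a golden file is an explicit, human-approved action and never a side effect of a normal run.

```python
from pathlib import Path


def check_golden(name, actual, golden_dir, update=False):
    """Compare `actual` with the stored golden output for `name`.

    update=True rewrites the golden file and should only be used after a
    human has reviewed and approved the new output.
    """
    golden_path = Path(golden_dir) / f"{name}.golden"
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual, encoding="utf-8")
        return True
    if not golden_path.exists():
        return False  # no approved baseline yet: treat as a failure
    return golden_path.read_text(encoding="utf-8") == actual
```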
Option C: Property-Based Testing
If correct outputs aren't known but invariants are:
- Define properties that must always hold (e.g., "output is valid JSON", "response time < 500ms", "no data loss")
- Generate random inputs and verify properties hold
- Agents can work autonomously as long as no property violations occur
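As a rough illustration of the property-based approach, the sketch below checks two of the example properties ("output is valid JSON" and "no data loss") using only the standard library. The `serialize` function is a stand-in for the system under test, and all names here are hypothetical; a dedicated library such as Hypothesis would add input shrinking and richer generators.

```python
import json
import random


def serialize(record):
    """Stand-in for the system under test (hypothetical)."""
    return json.dumps(record, sort_keys=True)


def random_record(rng):
    """Generate a small random input; a fixed seed keeps runs reproducible."""
    keys = [f"k{i}" for i in range(rng.randint(0, 5))]
    return {k: rng.choice([rng.randint(-1000, 1000), "text", None, True]) for k in keys}


def check_properties(fn, trials=200, seed=0):
    """Return violation messages; an empty list means every property held."""
    rng = random.Random(seed)
    violations = []
    for i in range(trials):
        record = random_record(rng)
        output = fn(record)
        try:
            decoded = json.loads(output)  # property: output is valid JSON
        except json.JSONDecodeError as exc:
            violations.append(f"trial {i}: invalid JSON for input {record!r}: {exc}")
            continue
        if decoded != record:  # property: no data loss on a round trip
            violations.append(f"trial {i}: expected {record!r}, got {decoded!r}")
    return violations
```

Each violation message carries the failing input, so an agent can reproduce the case without rerunning the whole search.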
Option D: Human-Verified Seed Tests
If nothing above works:
- Create a small set of manually verified input/output pairs
- These serve as regression anchors
- Agents can work but must not break seed tests
- Expand the seed set as confidence grows
Present options to the user. IF no oracle can be identified → warn that full autonomy is risky and recommend shorter agent sessions with human checkpoints.
Step 2: Design Test Tiers
Organize tests from fast/narrow to slow/comprehensive:
Tier 1: Unit Tests (seconds)
- Test individual functions or components in isolation
- Run after every change
- Must complete in <30 seconds total
- Purpose: catch obvious breakage immediately
Tier 2: Integration Tests (minutes)
- Test interactions between components
- Run after a batch of related changes
- Must complete in <5 minutes
- Purpose: catch interface mismatches and data flow issues
Tier 3: System Tests (minutes to hours)
- Test the full system end-to-end against the oracle
- Run after a milestone or before merging
- May take longer but must be comprehensive
- Purpose: final validation that the whole system works
For each tier, define:
- What inputs to use
- What outputs to check
- Pass/fail criteria
- How to run (command or script)
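One possible way to record these per-tier definitions is a small data structure that the test runner can read. The sketch below is an assumption-laden example: the `Tier` class, the `make` targets, and the descriptive strings are all hypothetical, while the time budgets mirror the limits stated for each tier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    command: list                    # how to run (hypothetical make targets)
    inputs: str                      # what inputs to use
    checks: str                      # what outputs to check
    pass_criteria: str               # pass/fail criteria
    budget_seconds: Optional[float]  # None means no hard limit


TIERS = [
    Tier("unit", ["make", "test-unit"], "per-function fixtures",
         "return values", "every assertion passes", 30),
    Tier("integration", ["make", "test-integration"], "multi-component scenarios",
         "messages exchanged between components", "every assertion passes", 300),
    Tier("system", ["make", "test-system"], "full end-to-end programs",
         "output compared against the oracle", "no difference from the oracle", None),
]


def over_budget(tier, elapsed_seconds):
    """True when a tier with a hard limit ran for at least that long."""
    return tier.budget_seconds is not None and elapsed_seconds >= tier.budget_seconds
```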
Step 3: Create a Fast Subset for Iteration
Agents need rapid feedback. Create a "smoke test" subset:
- From each tier, select the most representative tests
- Aim for <60 seconds total runtime
- Cover the critical paths; if these pass, the system is probably working
- Include at least one test per major component
Deterministic Sampling Strategy
WHEN the full test suite has N tests:
- IF N < 50 → run all of them as the fast subset
- IF N is 50-500 → select ~10% covering each component, prioritize tests that have caught bugs before
- IF N > 500 → select a fixed set of ~50 covering all components, plus any test that failed in the last 5 runs
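The sampling rules above could be implemented along the following lines. This is one interpretation, not the only one: it assumes tests are given as unique `(name, component)` pairs, that the sets of bug-catching and recently failing tests are tracked elsewhere, and that ties are broken by name so the selection stays deterministic. The function name and parameters are hypothetical.

```python
def select_fast_subset(tests, caught_bugs=frozenset(), recent_failures=frozenset()):
    """Pick a deterministic fast subset from (test_name, component) pairs."""
    names = {name for name, _ in tests}
    n = len(tests)
    if n < 50:
        return sorted(names)  # small suite: run everything
    target = n // 10 if n <= 500 else 50
    # Tests that caught bugs before sort first; names break ties deterministically.
    ordered = sorted(tests, key=lambda t: (t[0] not in caught_bugs, t[0]))
    selected, covered = set(), set()
    # First pass: guarantee at least one test per component.
    for name, component in ordered:
        if component not in covered:
            covered.add(component)
            selected.add(name)
    # Second pass: fill up to the target size in priority order.
    for name, _ in ordered:
        if len(selected) >= target:
            break
        selected.add(name)
    if n > 500:
        # Large suite: also include anything that failed in recent runs.
        selected |= names & set(recent_failures)
    return sorted(selected)
```

If the suite has more components than the target size, the first pass alone exceeds the target; this sketch favors component coverage over the size limit in that case.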
The fast subset should be runnable via a single command (e.g., make test-fast or npm test -- --suite=smoke).
Step 4: Structure Error Output for Agent Consumption
Test failures must be machine-readable AND agent-actionable. Every failure should include: test name, component, expected vs actual, diff, investigation path, and context.
See references/error-output-format.md for the full structured format template and test runner configuration guidance.
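For orientation only, a failure record carrying the fields listed above might be built as in the sketch below. The field names and the JSON layout are assumptions for illustration and are not taken from references/error-output-format.md, which remains the authoritative template.

```python
import difflib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FailureReport:
    test_name: str
    component: str
    expected: str
    actual: str
    investigation_path: list  # files or functions to inspect first
    context: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the report, adding a unified diff of expected vs actual."""
        record = asdict(self)
        record["diff"] = list(difflib.unified_diff(
            self.expected.splitlines(), self.actual.splitlines(),
            fromfile="expected", tofile="actual", lineterm="",
        ))
        return json.dumps(record, indent=2)
```

Emitting one such record per failure gives an agent a fixed set of keys to read instead of a free-form stack trace.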
Step 5: Wire It All Together
Create a test runner script that agents will use:
- test-fast: runs the fast subset, returns structured output
- test-full: runs all tiers sequentially, returns structured output
- test-against-oracle: runs oracle comparison (if applicable)
Each command should:
- Exit 0 on all pass, non-zero on any failure
- Output structured failure information (Step 4 format)
- Report a summary line: PASS: N, FAIL: M, SKIP: K
Configure agents to run test-fast after every change and test-full before committing.
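The exit-code and summary contract can be illustrated with a small runner core. This sketch assumes each test is a zero-argument callable that returns True (pass), False (fail), or None (skip); that convention and the `run_suite` name are hypothetical, and the FAIL lines here are simplified compared with the Step 4 format.

```python
def run_suite(tests):
    """Run name -> callable tests, print the summary line, return the exit code."""
    passed = failed = skipped = 0
    for name, test in tests.items():
        try:
            outcome = test()
        except Exception as exc:  # a crashing test counts as a failure
            print(f"FAIL {name}: raised {exc!r}")
            failed += 1
            continue
        if outcome is None:
            skipped += 1
        elif outcome:
            passed += 1
        else:
            print(f"FAIL {name}")
            failed += 1
    print(f"PASS: {passed}, FAIL: {failed}, SKIP: {skipped}")
    return 0 if failed == 0 else 1
```

A command such as test-fast would then end with `sys.exit(run_suite(...))` so that the process exit status reflects the result.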
Examples
Example: Designing Tests for a Compiler Project
User says: "I'm building a C compiler and want agents to work on it autonomously"
Result: GCC used as oracle, 3 test tiers (unit/integration/system), fast subset of 25 tests running in 20 seconds, structured error output pointing agents to specific codegen functions.
Example: Tests for an API Migration
User says: "Migrate REST API from v1 to v2, agents should handle each endpoint"
Result: v1 API used as oracle on staging, schema validation per endpoint as fast subset (10 seconds), behavioral parity tests for full validation.
See references/examples.md for detailed walkthroughs of both scenarios.
Troubleshooting
Flaky tests undermine agent confidence
Cause: Non-deterministic tests (timing, ordering, external dependencies) that sometimes pass and sometimes fail.
Solution: Quarantine flaky tests out of the fast subset and fix them separately. Agents should only run deterministic tests autonomously, because a flaky failure wastes an entire agent cycle investigating a non-bug.
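One simple way to find quarantine candidates is to run each test several times and flag any whose outcome changes. The sketch below assumes tests are zero-argument callables and uses a hypothetical `classify_stability` helper. Repeated runs can expose flakiness but cannot prove a test is deterministic, so a "stable" label is only provisional.

```python
def classify_stability(test, runs=5):
    """Run a test repeatedly; return 'flaky', 'stable-pass', or 'stable-fail'."""
    outcomes = set()
    for _ in range(runs):
        try:
            outcomes.add(bool(test()))
        except Exception:
            outcomes.add(False)  # an exception counts as a failing run
    if len(outcomes) > 1:
        return "flaky"
    return "stable-pass" if outcomes == {True} else "stable-fail"
```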
Oracle drift
Cause: The reference implementation was updated but test expectations weren't.
Solution: Pin the oracle version. When updating, re-run all tests to regenerate expected outputs. Keep the oracle version in a config file that agents can check.
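A pin check of this kind might look like the following sketch. It assumes a JSON config file with an `oracle_version` key and a caller that already knows the installed version string; the file layout, the key name, and `check_oracle_pin` are all hypothetical.

```python
import json
from pathlib import Path


def check_oracle_pin(config_path, installed_version):
    """Return None when the installed oracle matches the pin, else a message."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    pinned = config["oracle_version"]  # assumed key name
    if installed_version != pinned:
        return (
            f"oracle drift: pinned {pinned!r}, installed {installed_version!r}; "
            "regenerate expected outputs before trusting test results"
        )
    return None
```

Running this check at the start of test-against-oracle lets an agent distinguish a drifted oracle from a real regression.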
Slow feedback loops killing agent productivity
Cause: Fast subset is too large or includes slow tests.
Solution: Profile test runtime. Move anything over 5 seconds out of the fast subset. Consider running slow tests in a separate background agent that validates while the main agent continues working.
Agent can't interpret test failures
Cause: Test output is a raw stack trace with no actionable context.
Solution: Wrap the test runner with a harness that parses failures and adds the structured format from Step 4. Even a simple shell script that greps for FAIL lines and adds component/file information helps.
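The wrapper idea can be sketched in a few lines. The example below assumes the raw runner prints failures as `FAIL <test_name> (<file>:<line>)` and that components can be inferred from source path prefixes; both the line format and the prefix table are hypothetical, and real runners will need their own pattern.

```python
import re

# Assumed raw format: "FAIL <test_name> (<file>:<line>)"; real runners differ.
FAIL_LINE = re.compile(r"^FAIL\s+(?P<test>\S+)\s+\((?P<file>[^:()]+):(?P<line>\d+)\)")

# Hypothetical mapping from source path prefixes to component names.
COMPONENT_BY_PREFIX = {"src/parser/": "parser", "src/codegen/": "codegen"}


def annotate_failures(raw_output):
    """Extract FAIL lines and attach component and file information."""
    records = []
    for line in raw_output.splitlines():
        match = FAIL_LINE.match(line)
        if not match:
            continue  # passing tests and unrelated log lines are ignored
        path = match["file"]
        component = next(
            (name for prefix, name in COMPONENT_BY_PREFIX.items() if path.startswith(prefix)),
            "unknown",
        )
        records.append({
            "test": match["test"],
            "component": component,
            "file": path,
            "line": int(match["line"]),
        })
    return records
```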