Test Strategy (Foundational Skill)

You are the foundational test strategy authority. All other skills—architect, coder, reviewer—MUST consult you before making decisions about testing.

The Overarching Question

What evidence do I need to convince the user that my code correctly implements the specification?

Every test exists to answer this question. Tests are not bureaucracy—they are evidence that your code works. If a test doesn't provide evidence, delete it.

Tests Serve Three Purposes

Purpose	What It Means	Example
Code quickly	Fast feedback loop while building	Run Level 1 tests in <1 second to verify logic as you code
Evidence	Prove the spec is implemented	Show user that acceptance criteria are met
Debug	Find bugs when regressions occur	Named test cases point directly to the broken behavior

The Cardinal Rule: No Mocking

Mocking is always wrong. There is no exception.

Mocking gives you a test that passes while your production code fails. This is worse than no test at all.

If you feel you need to mock:

•Redesign using dependency injection with real in-memory implementations, OR
•Test at a different level—push to Level 2 or 3 where real dependencies are available

The Three Levels

┌─────────────────────┐
│      LEVEL 3        │  "Does it work in the real world?"
│   System / E2E      │  Real credentials, real services
│                     │  Full user workflows
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│      LEVEL 2        │  "Does it work with real infrastructure?"
│    Integration      │  Real binaries, real databases
│                     │  Test harnesses required
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│      LEVEL 1        │  "Is our logic correct?"
│    Unit / Pure      │  Standard dev environment only
│                     │  DI with real in-memory implementations
└─────────────────────┘

Level 1: Standard Dev Environment

The question: Is our logic correct, independent of any external system?

Allowed:

Resource	Examples	Why
Test runner	pytest, vitest, jest, go test	Dev environment
Temp directories	`tempfile.mkdtemp()`, `os.tmpdir()`	OS-provided, isolated
Environment vars	Set/read env vars within test	Language runtime
Standard dev tools	git, node, npm (tool), python, curl	CI without setup
DI implementations	In-memory repositories, stub notifiers	Real code, test-focused
Factories/builders	Generate test data programmatically	Reproducible

Forbidden:

Resource	Why
Real databases	Level 2
Real HTTP APIs	Level 2 or 3
Project-specific binaries	Level 2 (ffmpeg, hugo, custom tools)
Installing dependencies	`npm install`, `pip install` are Level 2
Mocking external systems	Never. Redesign with DI instead.

Critical filesystem rule: All Level 1 tests MUST use OS-provided temporary directories exclusively. Never write outside temp directories.
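
A minimal sketch of a Level 1 file I/O test that follows this rule. `export_csv` is a hypothetical function standing in for your own code; only the Python standard library is assumed.

```python
import csv
import tempfile
from pathlib import Path


def export_csv(rows, path):
    """Write rows (a non-empty list of dicts sharing the same keys) to path as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def test_export_writes_header_and_rows():
    rows = [{"name": "alice", "total": 3}, {"name": "bob", "total": 5}]
    # TemporaryDirectory creates an OS-provided directory and removes it on exit,
    # so the test never writes outside a temp location.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "export.csv"
        export_csv(rows, target)
        with open(target, newline="", encoding="utf-8") as handle:
            assert list(csv.reader(handle)) == [
                ["name", "total"],
                ["alice", "3"],
                ["bob", "5"],
            ]
```

The test reads the file back rather than inspecting internal state, so a pass is evidence about the bytes that actually land on disk.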

Level 2: Project-Specific Dependencies

The question: Does our code correctly interact with real external dependencies?

Covers:

•Project-specific tools: Hugo, Caddy, FFmpeg, custom binaries
•Project-specific build tools: Make, Gradle, Maven
•Containerized services: Docker databases, message queues

Required before writing: Document the test harness for each dependency.

If you don't know the harness, STOP and ask:

I need to write integration tests for [dependency].

To proceed, I need to know:

•What test harness exists or should I build?

•How do I start/stop/reset it?

•Where are fixture files or seed data?

•What environment variables configure it?

Level 3: Real Environment

The question: Does the complete system work the way users will actually use it?

Covers:

•Real credentials against real (test) environments
•Third-party services in production or staging
•Browser-based testing for web applications
•Complete user workflows end-to-end

Required before writing: Document credentials and test accounts.

If you don't know the credentials, STOP and ask:

I need to write end-to-end tests that use [external service].

To proceed, I need to know:

•Where are the test credentials stored?

•What test accounts/environments exist?

•Are there rate limits or quotas?

•How do I reset test data between runs?
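
Once the answers are known, one possible way to keep credentials out of the test code is to read them from environment variables and fail loudly when they are absent. This is a sketch under assumptions: `load_credentials`, `MissingCredentials`, and the variable name `PAYMENTS_TEST_API_KEY` are hypothetical, not part of any real service.

```python
import os


class MissingCredentials(RuntimeError):
    """Raised when a Level 3 test is started without its credentials."""


def load_credentials(required, env=None):
    """Return {name: value} for every required variable, or raise naming the missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingCredentials("Missing Level 3 credentials: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Usage in a Level 3 test (hypothetical variable name):
# creds = load_credentials(["PAYMENTS_TEST_API_KEY"])
```

Raising instead of silently skipping keeps a misconfigured run from looking like a passing one.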

The Three Dimensions

Every test level offers different tradeoffs across three orthogonal dimensions:

1. Detection: What Bugs Can This Level Find?

Each level has a different lens—not progressively broader, but genuinely different:

Level	Detects	Cannot Detect
L1	Algorithmic bugs, edge cases, invariant violations	Race conditions, integration mismatches
L2	Race conditions, integration mismatches, contract violations	Pure logic bugs buried deep in the stack, graphical UX problems
L3	Workflow breaks, UX failures, real-world edge cases	Algorithmic bugs, intermittent races

Key insight: L1 property-based tests catch algorithmic bugs that L3 almost never detects. L2 randomized harnesses catch race conditions that L1 can't see.

2. Validity: What Does a Passing Test Prove?

Level	A Passing Test Proves	False Confidence Risk
L1	The isolated logic is correct for tested inputs	Low—tests exactly what it claims
L2	Components integrate correctly with the harness	Medium—depends on harness fidelity to production
L3	The workflow succeeds in the test environment	High—can pass while feature is broken in production

Key insight: "E2E tests pass" does NOT mean "it works for users." L3 has the highest false confidence risk.

3. Cost: What Investment Does This Level Require?

Aspect	L1	L2	L3
Upfront	Low (no setup, just code)	Medium (harnesses, fixtures, containers)	High (real env, credentials, test accounts)
Per-run	Milliseconds	Seconds to minutes	Minutes to hours
Maintenance	Low (stable, deterministic)	Medium (harness evolution, dependency updates)	High (brittle to UI changes, flaky, env drift)

Key insight: A test that is cheap to write can be expensive to maintain. L3 tests often become "tests you're afraid to touch."

Where Evidence Lives

Some outcomes can only be proven at specific levels:

Outcome	Minimum Level	Best Combination
Algorithm correctness	1	L1 with property-based testing
Parser handles grammar	1	L1 typical + edges + properties
User can export data as CSV	1	L1 with temp directory (file I/O is Level 1)
Database query returns correct data	2	L2 integration; L1 only if query logic is complex
CLI binary behaves correctly	2	L2 with fixture files
API contract is honored	2 or 3	L2 against test server; L3 against staging
Clipboard works in browsers	3 only	L3 in real browser; L1/L2 prove nothing
Payment flow completes	3 only	L3 with test credentials

The Clipboard Example

A "copy to clipboard" React component:

•L1 test (component renders): Proves nothing about clipboard functionality
•L3 test (actually copies in browser): Proves the feature works

Best combination: Skip L1/L2 entirely, write L3 tests in target browsers.

The Pricing Engine Example

A complex pricing calculation with discounts, taxes, promotions:

•L2 only (integration test checks total): Tells you THAT the total is wrong, not WHERE the bug is
•L1 + L2: L1 isolates which rule is broken; L2 confirms end-to-end

Best combination: L1 with full 4-part progression + L2 for integration.
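
A sketch of what the L1 side might look like, assuming a hypothetical engine with two rules (percentage discount, then tax). The function names and the rule order are illustrative assumptions, not a prescribed design.

```python
from decimal import Decimal


def apply_discount(subtotal, percent):
    """Rule 1 (assumed): percentage discount on the subtotal."""
    return subtotal * (Decimal(100) - percent) / Decimal(100)


def apply_tax(amount, rate):
    """Rule 2 (assumed): tax is applied after discounts."""
    return amount * (Decimal(1) + rate)


def total_price(subtotal, discount_percent, tax_rate):
    """Compose the rules and round to cents."""
    discounted = apply_discount(subtotal, discount_percent)
    return apply_tax(discounted, tax_rate).quantize(Decimal("0.01"))


# One named test per rule: a failure names the broken rule directly.
def test_discount_rule():
    assert apply_discount(Decimal("200"), Decimal("10")) == Decimal("180")


def test_tax_rule():
    assert apply_tax(Decimal("180"), Decimal("0.25")) == Decimal("225")


# The composed total is what an L2 test would observe through the real stack.
def test_total_composes_rules():
    assert total_price(Decimal("200"), Decimal("10"), Decimal("0.25")) == Decimal("225.00")
```

If only `test_total_composes_rules` existed, a wrong total would not say whether the discount or the tax rule was at fault.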

Add Lower Levels for Debuggability

When Level 2 or 3 is required for evidence, add Level 1 tests ONLY if:

•The code is complex—your logic (algorithms, parsers, rules), not library wiring
•Debugging will be hard—when the higher-level test fails, will you know where to look?
•Property-based testing adds value—would generated inputs find edge cases?

Scenario	Add Level 1?	Reason
Integration test fails on a complex algorithm	YES	Level 1 isolates the algorithm
Integration test fails on argparse flag parsing	NO	Trust argparse; check your usage
E2E test fails on payment flow	MAYBE	If payment calculation logic is complex, yes
E2E test fails on clipboard	NO	It's a browser API call, nothing to unit test

Trust the Library

Libraries like argparse, Zod, pydantic, and js-yaml are battle-tested. Don't test that they work; test YOUR logic.

Don't test library behavior:

python

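# BAD: Tests argparse, not your code
import argparse

def test_verbose_flag_is_parsed():
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args(["--verbose"])
    assert args.verbose is True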

Do test your behavior:

python

# GOOD: Tests your logic that uses the parsed result
# (run_command is your own code under test)
def test_verbose_mode_produces_detailed_output():
    output = run_command(verbose=True)
    assert "DEBUG:" in output

Randomized Test Harnesses

Always ask: What is the data structure that describes the fixture?

Example: When testing directory tree operations:

•The underlying structure is a DAG (directed acyclic graph)
•Generate a DAG data structure first
•Test your logic against the DAG at Level 1
•Convert to actual directories in temp directory for Level 2
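
The steps above can be sketched as follows. For brevity this uses a plain tree (the simplest DAG) encoded as nested dicts; `generate_tree`, `count_files`, and `materialize` are hypothetical names, and `count_files` stands in for your own logic.

```python
import random
import tempfile
from pathlib import Path


def generate_tree(rng, depth=3, max_children=3):
    """Return a nested dict: directory names map to dicts, file names map to None."""
    tree = {}
    for index in range(rng.randint(1, max_children)):
        if depth > 0 and rng.random() < 0.5:
            tree[f"dir{index}"] = generate_tree(rng, depth - 1, max_children)
        else:
            tree[f"file{index}.txt"] = None
    return tree


def count_files(tree):
    """Logic under test at Level 1: operates on the data structure only."""
    return sum(1 if child is None else count_files(child) for child in tree.values())


def materialize(tree, root):
    """Write the same structure to disk (inside a temp directory) for Level 2."""
    for name, child in tree.items():
        path = Path(root) / name
        if child is None:
            path.write_text("", encoding="utf-8")
        else:
            path.mkdir()
            materialize(child, path)


def test_disk_matches_structure():
    tree = generate_tree(random.Random(1234))
    with tempfile.TemporaryDirectory() as tmp:
        materialize(tree, tmp)
        on_disk = sum(1 for p in Path(tmp).rglob("*") if p.is_file())
        assert on_disk == count_files(tree)
```

Because the fixture is data first, the same generated structure can drive both levels, and a failing case can be printed and inspected without touching the filesystem.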

Seeding

•Seeds should derive from system time (different every run)
•Print the seed on failure so the failing run can be reproduced
•This maximizes variety while enabling reproduction
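
One way to implement these seeding rules with the standard library alone. `run_seeded` is a hypothetical helper, not an existing API.

```python
import random
import time


def run_seeded(check, seed=None):
    """Run check(rng); on assertion failure, re-raise with the seed in the message."""
    if seed is None:
        seed = time.time_ns()  # derived from system time, so it differs per run
    rng = random.Random(seed)
    try:
        check(rng)
    except AssertionError as exc:
        raise AssertionError(f"{exc} (reproduce with seed={seed})") from exc
    return seed


def check_sort_is_idempotent(rng):
    values = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
    assert sorted(sorted(values)) == sorted(values)


def test_sort_is_idempotent():
    run_seeded(check_sort_is_idempotent)
```

To replay a failure, pass the reported value back in: `run_seeded(check_sort_is_idempotent, seed=<reported seed>)`.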

The 4-Part Progression

Organize tests at any level to serve all three purposes (code/evidence/debug). Part 0 is shared setup; Parts 1-4 are the progression itself:

Part 0: Shared Test Values

Create a test values file with named, typed data:

typescript

export const TYPICAL = {
  BASIC: { input: "simple", expected: 42 },
  COMPLEX: { input: "with-flags", expected: 100 },
} as const;

export const EDGES = {
  EMPTY: { input: "", expected: 0 },
  MAX: { input: "x".repeat(1000), expected: "ERROR" },
} as const;

Part 1: Named Typical Cases

One it() per category. When a test fails, you know EXACTLY which case failed.

Part 2: Named Edge Cases

One it() per boundary condition. Each boundary is independently debuggable.

Part 3: Systematic Coverage Loop

Loop over all known cases. This loop should fail ONLY if Parts 1-2 missed a category.

Part 4: Generated/Property-Based

Reproducible via seed. Escalate from debuggable loops to comprehensive properties.
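
Parts 0 through 4 can be sketched end to end in Python with plain pytest-style functions. `word_count` is a hypothetical function under test, and the property check uses the standard `random` module rather than a property-testing library to stay self-contained.

```python
import random


def word_count(text):
    """Function under test (hypothetical): number of whitespace-separated words."""
    return len(text.split())


# Part 0: shared, named test values
TYPICAL = {
    "SINGLE": {"input": "hello", "expected": 1},
    "SENTENCE": {"input": "the quick brown fox", "expected": 4},
}
EDGES = {
    "EMPTY": {"input": "", "expected": 0},
    "ONLY_SPACES": {"input": "   ", "expected": 0},
    "MIXED_WHITESPACE": {"input": "a\tb\nc", "expected": 3},
}


# Part 1: one named test per typical category
def test_single_word():
    case = TYPICAL["SINGLE"]
    assert word_count(case["input"]) == case["expected"]


def test_sentence():
    case = TYPICAL["SENTENCE"]
    assert word_count(case["input"]) == case["expected"]


# Part 2: one named test per boundary condition
def test_empty_string():
    case = EDGES["EMPTY"]
    assert word_count(case["input"]) == case["expected"]


def test_only_spaces():
    case = EDGES["ONLY_SPACES"]
    assert word_count(case["input"]) == case["expected"]


# Part 3: systematic loop over every known case, labeled by name
def test_all_known_cases():
    for name, case in {**TYPICAL, **EDGES}.items():
        assert word_count(case["input"]) == case["expected"], name


# Part 4: generated inputs, reproducible via the seed in the failure message
def check_join_property(seed):
    rng = random.Random(seed)
    for _ in range(100):
        words = ["w" * rng.randint(1, 8) for _ in range(rng.randint(0, 20))]
        text = " ".join(words)
        assert word_count(text) == len(words), f"seed={seed} text={text!r}"


def test_joining_n_words_counts_n():
    check_join_property(random.randrange(2**32))
```

Note that `MIXED_WHITESPACE` has no named test in Part 2, so only the Part 3 loop covers it. That is exactly the kind of missed category the loop exists to catch.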

Level Breadth

Level	Typical Parts	Why
Level 1	All 4 parts	Cheapest—can afford full breadth
Level 2	Parts 1-2, maybe 3	More expensive—focus on key scenarios
Level 3	Part 1 only	Most expensive—critical flows only

Test Location in CODE Framework

Tests are co-located with specs in the spx/ tree:

Location	State	May Fail?	Purpose
`spx/{container}/tests/` (not in outcomes.yaml)	In progress	YES	TDD red-green during development
`spx/{container}/tests/` (in outcomes.yaml)	Validated	NO	Protect working functionality

The invariant: Tests listed in outcomes.yaml MUST ALWAYS PASS (precommit validates this).

No graduation: Tests stay where they are. The outcomes.yaml file tracks which tests have passed. Test level is indicated by filename suffix:

•*.unit.test.{ts,py} - Level 1 (Vitest/pytest)
•*.integration.test.{ts,py} - Level 2 (Vitest/pytest)
•*.e2e.test.{ts,py} - Level 3, non-browser (Vitest/pytest)
•*.e2e.spec.{ts,py} - Level 3, browser-based (Playwright)

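Runner separation: Vitest/pytest run the *.test.* files, Playwright runs the *.spec.* files. This needs explicit configuration, because the defaults overlap: Vitest and Playwright each match both *.test.* and *.spec.* files out of the box, and pytest collects only test_*.py and *_test.py unless python_files is extended.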

Stories persist as containers in the tree. Completion is tracked by outcomes.yaml, not by moving files.

Test Infrastructure

Keep test infrastructure separate from tests. Categories:

1. Test Environment Context Managers

Shared utilities (like withTestEnv) that handle:

•Seeding and reproducibility
•Temp directory lifecycle
•Environment variable isolation
•Shared setup/teardown
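
A possible Python counterpart to withTestEnv, covering the four responsibilities above. This is a sketch under assumptions: the name `with_test_env` and its signature are invented for illustration and do not describe an existing utility.

```python
import contextlib
import os
import random
import tempfile
import time
from pathlib import Path


@contextlib.contextmanager
def with_test_env(seed=None, env=None):
    """Yield (rng, tmp_path); restore os.environ and delete the temp dir on exit."""
    seed = time.time_ns() if seed is None else seed
    saved_env = dict(os.environ)  # snapshot for environment variable isolation
    try:
        with tempfile.TemporaryDirectory() as tmp:  # temp directory lifecycle
            os.environ.update(env or {})
            try:
                yield random.Random(seed), Path(tmp)
            except Exception:
                # Report the seed so the failing run can be reproduced.
                print(f"with_test_env failed; reproduce with seed={seed}")
                raise
    finally:
        os.environ.clear()
        os.environ.update(saved_env)


def test_uses_isolated_environment():
    with with_test_env(seed=1, env={"APP_MODE": "test"}) as (rng, tmp):
        (tmp / "scratch.txt").write_text(str(rng.random()), encoding="utf-8")
        assert os.environ["APP_MODE"] == "test"
```

Keeping this in one shared utility means individual tests never manage seeds, temp directories, or environment cleanup themselves.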

2. Containerized Services

Local databases, dev servers, message queues. Managed via docker-compose.

3. Fixtures

Named test values (TYPICAL, EDGES)—static data collections.

4. Generators

Randomized data generation with seeding for reproducibility.

Dependency Injection Pattern

python

from dataclasses import dataclass


@dataclass
class Order:
    customer: str


# BAD: Hardcoded dependency, requires mocking
class OrderProcessor:
    def process(self, order):
        db = PostgresDatabase()  # Hardcoded! Needs a live database to run
        db.save(order)


# GOOD: Injected dependencies, testable without mocks
class OrderProcessor:
    def __init__(self, repository):
        self.repository = repository  # anything with a save(order) method

    def process(self, order):
        self.repository.save(order)


# Level 1 test: Real in-memory implementation
def test_order_processing_saves():
    saved = []

    class InMemoryRepo:
        def save(self, order):
            saved.append(order)

    processor = OrderProcessor(InMemoryRepo())
    processor.process(Order(customer="alice"))

    # Check the saved content, not just the count
    assert saved == [Order(customer="alice")]

Quick Reference: Level Selection

Evidence needed for...	Level
Business logic	1
Parsing/validation	1
Algorithm output	1
File I/O with temp dirs	1
Database queries	2
HTTP calls	2
CLI binary behavior	2
Full user workflow	3
Real credentials	3
Browser behavior	3
Third-party services	3

Checklist Before Declaring Tests Complete

• Evidence exists at the level where it can be proven
• No mocking anywhere—DI with real implementations
• Level 2 harnesses documented
• Level 3 credentials documented (not hardcoded)
• Tests verify behavior, not implementation
• Regression tests all pass

When You're Stuck

For Level 1:

Can I verify this behavior using only the test runner, language primitives, temp dirs, and DI?

If no → move to Level 2

For Level 2:

What test harness do I need?

If you don't know → STOP AND ASK THE USER

For Level 3:

Where are the credentials?

If you don't know → STOP AND ASK THE USER

The goal is not "passing tests" or "high coverage"—it's justified confidence that your code works in the real world.