Code Testing Quality

Principles and practices for writing tests that are reliable, maintainable, and actually catch bugs.

Quick Reference

Testing Quality Checklist

Use during code review and the taster agent's RED phase:

The Three Properties of Good Tests

Property	Meaning	Violation Signal
Isolated	No test affects another; runs alone or in any order	Tests pass individually but fail together
Repeatable	Same result every time, on any machine	"Works on my machine," intermittent failures
Clear	Failure message pinpoints the problem	Must read test code to understand what broke

Test Strategy

Planning Your Testing Approach

Before writing tests, decide what to test and at what level. A test strategy answers three questions:

•Where does the risk live? — Focus testing effort on code where failures cost the most (business logic, data integrity, security boundaries)
•What level of test gives the best signal? — Match test type to what you're verifying (see Test Pyramid below)
•What's the test double strategy? — Real objects internally, fakes/mocks at architectural boundaries

Applying the Test Pyramid

Layer	What to Test Here	Signal It Provides
Unit	Pure logic, calculations, transformations, domain rules	Fast, precise — "this function is correct"
Integration	Database queries, API clients, service wiring, middleware	Realistic — "these components work together"
E2E	Critical user journeys (login, checkout, data export)	Confidence — "the system works end-to-end"

Strategy heuristic: Start with unit tests for business logic. Add integration tests at boundaries where bugs actually happen. Add E2E tests only for the critical paths that must never break.

TDD as a Design Tool

Test-driven development isn't just a testing technique — it's a design feedback loop:

•RED — Write a failing test that describes the next behavior
•GREEN — Write the minimum code to pass
•REFACTOR — Clean up while tests protect you

When TDD helps most: New features with clear behavior, complex algorithms, code you need to understand before changing.

When to skip strict TDD: Exploratory prototypes, UI layout, glue code with obvious implementations.

Choosing Test Types by Situation

Situation	Test Type	Why
Pure function with clear inputs/outputs	Unit test	Fast, precise feedback
Database query or ORM logic	Integration test with real/fake DB	Verifies query correctness
API endpoint handler	Integration test	Tests routing, serialization, middleware
Complex business rule with many cases	Parameterized unit test	Covers combinations efficiently
Critical user journey (login, checkout)	E2E test	Validates full stack works together
Service-to-service communication	Contract test	Catches interface drift early
Function with mathematical invariants	Property-based test	Finds edge cases humans miss

Unit Testing Principles

Primary Function: Detect Broken Code

Tests exist to catch regressions. A test suite that still passes after production code has been broken is worse than no tests: it provides false confidence.

The litmus test: If you introduce a bug, does a test fail? If not, your tests aren't testing the right things.

What to Test vs. What Not to Test

Test Thoroughly	Skip or Test Lightly
Business logic and domain rules	Simple getters/setters with no logic
Edge cases and boundary conditions	Framework-generated boilerplate
Error handling and recovery paths	Third-party library internals
State transitions and workflows	Private methods (test through public API)
Input validation at system boundaries	Configuration constants
Code with high cyclomatic complexity	One-line delegation methods

Rule: Test behavior at the public API boundary. Private methods are implementation details — testing them couples tests to structure, not behavior.
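
As a minimal sketch of this rule, assume a hypothetical `PriceCalculator` class whose private helper `_apply_tax` is an implementation detail (both names are invented for illustration). The test only calls `total()`, so it should keep passing if the helper is renamed or inlined:

```python
class PriceCalculator:
    """Hypothetical class, used only for illustration."""

    def __init__(self, tax_rate: float) -> None:
        self._tax_rate = tax_rate

    def total(self, net_prices: list[float]) -> float:
        # Public API: the behavior callers depend on.
        return round(self._apply_tax(sum(net_prices)), 2)

    def _apply_tax(self, amount: float) -> float:
        # Private helper: free to be renamed, inlined, or split later.
        return amount * (1 + self._tax_rate)


def test_total_includes_tax_on_sum_of_items() -> None:
    # Arrange
    calculator = PriceCalculator(tax_rate=0.20)

    # Act
    total = calculator.total([10.00, 5.00])

    # Assert on the public result, never on _apply_tax directly.
    assert total == 18.00
```

A test that called `_apply_tax` directly would break on a pure refactoring even though `total()` still behaved correctly.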

The Test Pyramid

Level	Scope	Speed	When to Use
Unit	Single function/class	Milliseconds	Business logic, algorithms, data transformations
Integration	Component interactions	Seconds	Database queries, API clients, service boundaries
E2E	Full user journeys	Minutes	Critical paths, smoke tests, deployment verification

Many unit tests, fewer integration tests, fewest E2E tests. Each level trades speed for integration confidence.

<details> <summary>Test Pyramid vs. Test Trophy</summary>

The Test Trophy (Kent C. Dodds, frontend-oriented) emphasizes integration tests as the largest layer, with static analysis as the foundation. The rationale: for UI-heavy applications, integration tests provide the best return on investment because they test realistic user interactions.

Shape	Largest Layer	Best For
Pyramid	Unit tests	Backend, algorithmic, library code
Trophy	Integration tests	Frontend, full-stack applications

Both are heuristics. Choose based on where your business logic lives and what gives your team the most confidence per test-minute invested.

</details>

Test Structure

Arrange-Act-Assert (AAA)

The standard unit test structure. Each section should be visually distinct:

```kotlin
// Arrange — set up preconditions
val account = Account(balance = 100)
val withdrawal = Withdrawal(amount = 30)

// Act — execute the behavior under test
val result = account.process(withdrawal)

// Assert — verify the outcome
assertEquals(70, result.balance)
```

Rules:

•One Act per test — multiple acts mean multiple tests hiding in one
•No logic in Arrange — use builders or factories for complex setup
•Assert on behavior — verify what happened, not how it happened

Given-When-Then (GWT)

Semantically identical to AAA but emphasizes behavior from the user's perspective. Preferred for acceptance tests and BDD:

```text
// Given a customer with a valid discount code
// When they apply the code at checkout
// Then the total is reduced by the discount percentage
```

When to use which: AAA for technical unit tests; GWT for behavior-focused and acceptance tests. Consistency within a codebase matters more than the choice itself.
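
The Given-When-Then outline above could be fleshed out roughly as follows. This is a sketch only: the `Cart` class, the `SAVE10` code, and the cent-based amounts are assumptions made up for the example.

```python
class Cart:
    """Hypothetical checkout cart, for illustration only."""

    # Assumed discount codes, mapped to percentage reductions.
    DISCOUNT_CODES = {"SAVE10": 10}

    def __init__(self, subtotal_cents: int) -> None:
        self.total_cents = subtotal_cents

    def apply_discount_code(self, code: str) -> None:
        percent = self.DISCOUNT_CODES[code]
        self.total_cents -= self.total_cents * percent // 100


def test_should_reduce_total_by_discount_percentage_when_code_is_valid() -> None:
    # Given a customer with a valid discount code
    cart = Cart(subtotal_cents=20_000)

    # When they apply the code at checkout
    cart.apply_discount_code("SAVE10")

    # Then the total is reduced by the discount percentage
    assert cart.total_cents == 18_000
```

The three comments map one-to-one onto Arrange, Act, and Assert; only the vocabulary changes.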

Parameterized / Table-Driven Tests

When multiple inputs exercise the same logic, use parameterized tests instead of copy-pasting:

```text
// Table-driven: define logic once, vary the data
testCases = [
    { input: 0,   expected: "zero" },
    { input: 1,   expected: "one"  },
    { input: -1,  expected: "negative" },
    { input: 999, expected: "large" },
]
for each case in testCases:
    assertEquals(case.expected, classify(case.input))
```

Benefits: Easy to add cases (add a row, not a test), less duplication, better boundary coverage. Name each case for clear failure output.
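
One possible runnable form of the table-driven pattern, using `subTest` from Python's standard `unittest` module so that every row is named and reported separately. The `classify` function and its thresholds are assumptions invented for this sketch; only the four input/output pairs come from the table above.

```python
import unittest


def classify(n: int) -> str:
    """Hypothetical function under test; the thresholds are assumptions."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    if n == 1:
        return "one"
    return "large" if n >= 100 else "small"


class ClassifyTest(unittest.TestCase):
    # Each row is (case name, input, expected); a new case is a new row.
    CASES = [
        ("zero", 0, "zero"),
        ("one", 1, "one"),
        ("negative number", -1, "negative"),
        ("large number", 999, "large"),
    ]

    def test_classify(self) -> None:
        for name, value, expected in self.CASES:
            # subTest reports a failing row by name and keeps running the rest.
            with self.subTest(name):
                self.assertEqual(expected, classify(value))
```

Without `subTest`, a plain loop would stop at the first failing row and hide any later failures.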

Test Naming

Convention: Describe Scenario and Outcome

Good test names are sentences that describe what's being tested:

Pattern	Example	Best For
`method_scenario_expectedResult`	`withdraw_insufficientFunds_throwsException`	Unit tests with clear method focus
`should [outcome] when [condition]`	`should reject withdrawal when funds insufficient`	Behavior-focused tests
`given_when_then`	`givenInsufficientFunds_whenWithdraw_thenThrows`	BDD-style tests

Antipattern: `test1`, `testWithdraw`, `testHappyPath`. These names tell you nothing when they fail.

The failure message test: When a test fails, can you understand the bug from the test name alone, without reading the test body? If yes, the name is good.

Test Doubles

Types and When to Use Each

Double	What It Does	Use When
Stub	Returns canned responses	Replacing slow or non-deterministic dependencies
Mock	Verifies interactions occurred	Checking side effects (email sent, event published)
Fake	Working shortcut implementation	Need realistic behavior without production overhead
Spy	Records calls for later inspection	Observing behavior in legacy code
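
The four kinds of double in the table can be sketched side by side. Everything here is hypothetical (the `convert` and `register` functions, the class names, the canned rate); only `Mock` and `assert_called_once_with` are real, from Python's standard `unittest.mock` module.

```python
from unittest.mock import Mock


class StubRateProvider:
    """Stub: returns a canned response instead of calling a rates service."""

    def rate(self, currency: str) -> float:
        return 2.0


class FakeUserStore:
    """Fake: a working in-memory shortcut for a real user database."""

    def __init__(self) -> None:
        self._users: dict[str, str] = {}

    def save(self, user_id: str, email: str) -> None:
        self._users[user_id] = email

    def email_of(self, user_id: str) -> str:
        return self._users[user_id]


class SpyLogger:
    """Spy: records calls so the test can inspect them afterwards."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def info(self, message: str) -> None:
        self.messages.append(message)


def convert(amount: float, currency: str, rates) -> float:
    """Hypothetical code under test that depends on a rates service."""
    return amount * rates.rate(currency)


def register(user_id: str, email: str, store, mailer, logger) -> None:
    """Hypothetical code under test with a store, a mailer, and a logger."""
    store.save(user_id, email)
    mailer.send_welcome(email)
    logger.info(f"registered {user_id}")


def test_convert_multiplies_amount_by_rate() -> None:
    assert convert(10.0, "EUR", StubRateProvider()) == 20.0


def test_register_saves_user_and_sends_welcome_email() -> None:
    store, mailer, logger = FakeUserStore(), Mock(), SpyLogger()

    register("u1", "a@example.com", store, mailer, logger)

    # Fake: assert on resulting state.
    assert store.email_of("u1") == "a@example.com"
    # Mock: assert that the interaction happened.
    mailer.send_welcome.assert_called_once_with("a@example.com")
    # Spy: inspect what was recorded.
    assert logger.messages == ["registered u1"]
```

In practice a single test would rarely need all of these at once; they are combined here only to show how the assertions differ.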

Decision Table: Choosing a Test Double

Situation	Recommended Double	Why
External API or network call	Stub	Isolation from external non-determinism
Database in integration test	Fake (in-memory DB)	Realistic behavior, fast execution
Verifying a notification was sent	Mock	The side effect is the behavior
Complex collaborator with state	Fake	Maintains realistic internal consistency
Legacy code you can't refactor yet	Spy	Observe without modifying production code
Internal collaborator (same module)	Real object	Prefer real dependencies for internal code

The Mock Overuse Problem

Over-mocking creates tests coupled to implementation, not behavior:

Signal	Problem	Fix
Refactoring breaks tests but behavior unchanged	Tests verify how, not what	Test through public API with real collaborators
Mock setup is longer than the test	Too many seams mocked	Use fakes or test larger units
Tests pass but bugs ship	Mocks don't behave like real dependencies	Use fakes or integration tests at boundaries

Rule of thumb: Mock at architectural boundaries (network, database, filesystem, external services). Use real objects for internal collaborators.
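
A small sketch of this rule of thumb, under invented names: `TaxPolicy` is an internal collaborator and is used as a real object, while `FakePaymentGateway` stands in for an external payment service at the boundary. The 20% rate is an assumption for the example.

```python
class TaxPolicy:
    """Internal collaborator: used as a real object in tests."""

    def tax_cents(self, net_cents: int) -> int:
        return net_cents * 20 // 100


class FakePaymentGateway:
    """Stands in for the external payment service at the boundary."""

    def __init__(self) -> None:
        self.charged_cents: list[int] = []

    def charge(self, amount_cents: int) -> None:
        self.charged_cents.append(amount_cents)


class Checkout:
    """Hypothetical unit under test."""

    def __init__(self, tax_policy: TaxPolicy, gateway) -> None:
        self._tax_policy = tax_policy
        self._gateway = gateway

    def pay(self, net_cents: int) -> int:
        gross = net_cents + self._tax_policy.tax_cents(net_cents)
        self._gateway.charge(gross)
        return gross


def test_pay_charges_gross_amount() -> None:
    gateway = FakePaymentGateway()
    checkout = Checkout(TaxPolicy(), gateway)  # real internal collaborator

    gross = checkout.pay(net_cents=1_000)

    assert gross == 1_200
    assert gateway.charged_cents == [1_200]
```

Because the test never states how `Checkout` talks to `TaxPolicy`, the tax calculation can be restructured without touching the test, as long as the charged amount stays the same.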

<details> <summary>London vs. Detroit school</summary>

Two schools of thought on test isolation:

School	Approach	Tradeoff
London (Mockist)	Mock all collaborators; test classes in pure isolation	Strong design feedback (forces interfaces); brittle to refactoring
Detroit (Classicist)	Use real objects; only mock external boundaries	Resilient to refactoring; less design pressure; wider blast radius on failure

Current practice (2025): Many teams lean toward the Detroit school: mock less, use fakes at boundaries, test behavior not structure. Pure London-style mocking tends to produce test suites that break on refactoring even when behavior has not changed.

The pragmatic approach: use real objects for internal collaborators, fakes for infrastructure boundaries, and mocks only when verifying side effects is the point of the test.

</details>

Testing Antipatterns

Antipattern Catalog

Antipattern	Symptom	Severity	Fix
Flaky tests	Pass/fail without code changes	Critical	Remove non-determinism: no sleeps, no real clocks, no shared state
Test interdependence	Tests pass alone, fail together	Critical	Each test creates/destroys its own data; no execution-order assumptions
Testing implementation	Refactoring breaks tests	Major	Assert on outputs and side effects, not internal method calls
Logic in tests	Conditionals, loops, try-catch in test code	Major	Tests should be straight-line code; extract complexity to helpers
Slow tests	Suite takes minutes, developers skip it	Major	Isolate from I/O; use fakes; parallelize; move slow tests to CI
Obscure tests	Must read full test to understand intent	Moderate	Descriptive names, visible AAA structure, minimal setup
Fragile assertions	Tests break on irrelevant output changes	Moderate	Assert on relevant fields only, not entire serialized objects
Liar tests	Tests that always pass regardless of behavior	Critical	Verify tests fail when behavior is wrong (mutation testing)
Giant arrange	50 lines of setup, 1 line of act	Moderate	Use builders/factories; consider whether unit is too large
Snapshot abuse	Blindly approving snapshot updates	Moderate	Use snapshots only for presentational output; prefer explicit assertions

The Flaky Test Problem

Flaky tests are the most damaging antipattern — they erode trust in the entire suite. Common causes:

Cause	Example	Solution
Time dependency	Test uses `Date.now()`	Inject clock; use deterministic time
Shared state	Tests share a database row	Unique test data per test; cleanup in teardown
Race conditions	Async operation not awaited	Use proper async/await; avoid sleep-based waits
Environment dependency	Assumes specific OS, timezone, locale	Explicit environment setup in test; containerize
Order dependency	Test B needs Test A to run first	Each test is self-contained with own setup
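
For the time-dependency row, a minimal sketch of clock injection. The `Session` and `FixedClock` classes are hypothetical; the idea is that production code would pass a real clock while tests pass one they control.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class Session:
    """Hypothetical session that expires after a fixed lifetime."""

    def __init__(self, lifetime: timedelta, now: Callable[[], datetime]) -> None:
        self._now = now  # injected clock instead of a direct system-time call
        self._expires_at = now() + lifetime

    def is_expired(self) -> bool:
        return self._now() >= self._expires_at


class FixedClock:
    """Deterministic clock that only moves when the test advances it."""

    def __init__(self, start: datetime) -> None:
        self.current = start

    def __call__(self) -> datetime:
        return self.current

    def advance(self, delta: timedelta) -> None:
        self.current += delta


START = datetime(2024, 1, 1, tzinfo=timezone.utc)


def test_session_is_valid_before_lifetime_elapses() -> None:
    clock = FixedClock(START)
    session = Session(lifetime=timedelta(minutes=30), now=clock)

    clock.advance(timedelta(minutes=29))

    assert not session.is_expired()


def test_session_expires_once_lifetime_elapses() -> None:
    clock = FixedClock(START)
    session = Session(lifetime=timedelta(minutes=30), now=clock)

    clock.advance(timedelta(minutes=30))

    assert session.is_expired()
```

Neither test sleeps or reads the system clock, so both give the same result on any machine and at any time of day.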

Coverage

Meaningful Coverage vs. Metric Gaming

Coverage measures which code runs during tests, not whether the tests verify correct behavior. A test with no assertions can execute every line, reaching 100% coverage, yet it catches nothing short of an outright crash.

Coverage Target	Context	Rationale
>80% of critical paths	Most projects	Balances thoroughness with diminishing returns
>90%	Safety-critical, financial	Higher stakes justify investment
60-70%	Early-stage, prototypes	Focus on core behavior first

Goodhart's Law applies: When coverage becomes a target, developers write trivial tests to hit the number. 100% coverage of getters is worse than 80% coverage of business logic.

Beyond Line Coverage

Metric	What It Measures	Value
Branch coverage	Decision paths exercised	More meaningful than line coverage
Mutation testing	Whether tests catch injected bugs	Reveals "liar tests" that run code but don't verify it
Critical path coverage	High-risk code paths tested	Focus effort where failures cost the most

Mutation testing introduces small bugs (mutants) into production code and checks whether tests fail. A high mutation score (>80%) indicates tests that actually catch defects, not just execute lines.
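
The idea can be sketched by hand with a single mutant. This is not how a mutation testing tool is invoked; it is a hand-rolled illustration, and every name in it (`is_adult`, the two suites, `mutant_killed`) is hypothetical.

```python
def is_adult(age: int) -> bool:
    """Original production code (hypothetical)."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: the boundary operator >= has been changed to >."""
    return age > 18


def weak_suite(fn) -> bool:
    """Checks only values far from the boundary, so the mutant survives."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case, so the mutant is killed."""
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite) -> bool:
    # A mutant is killed when the suite passes on the original code
    # but fails on the mutated code.
    return suite(is_adult) and not suite(is_adult_mutant)
```

Here `mutant_killed(weak_suite)` is `False` and `mutant_killed(strong_suite)` is `True`: the surviving mutant points directly at the missing boundary test.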

<details> <summary>Property-based testing</summary>

Property-based testing (PBT) complements example-based tests by generating random inputs and checking that invariant properties hold:

Example: Instead of testing `sort([3,1,2]) == [1,2,3]`, define properties:

•Output length equals input length
•Output is a permutation of the input (same elements, same counts)
•Each element is less than or equal to the next

The framework generates hundreds of random inputs and finds the smallest failing case (shrinking).
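
The sort properties above can be approximated with the standard library alone. This sketch is a stand-in for a real framework: it generates random lists from a fixed seed but does no shrinking, and the `check_sort_properties` helper is a name invented here. The `Counter` comparison checks that the output holds the same elements with the same counts as the input.

```python
import random
from collections import Counter


def check_sort_properties(sort_fn, runs: int = 200, seed: int = 0) -> None:
    """Check sort invariants on randomly generated integer lists."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    for _ in range(runs):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        result = sort_fn(list(data))

        # Property 1: output length equals input length.
        assert len(result) == len(data), data
        # Property 2: same elements with the same multiplicities.
        assert Counter(result) == Counter(data), data
        # Property 3: each element is less than or equal to the next.
        assert all(a <= b for a, b in zip(result, result[1:])), data
```

Calling `check_sort_properties(sorted)` passes silently, while an implementation that drops or reorders elements raises an `AssertionError` whose message is the offending input.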

When to use PBT:

•Functions with clear invariants (sorting, encoding/decoding, serialization)
•Complex domain logic where edge cases are hard to enumerate
•Security-sensitive code (parsing, validation, protocol handling)

When to skip PBT:

•Simple CRUD operations
•UI behavior tests
•When properties are harder to express than example tests

Frameworks: Hypothesis (Python), fast-check (TypeScript/JS), jqwik (JVM).

</details>

<details> <summary>Contract testing for service boundaries</summary>

Contract testing verifies that services agree on their communication interface without running full integration tests:

•Consumer-driven: The consumer defines expected requests/responses; the provider verifies it meets them
•Independent deployment: Teams ship without waiting for full integration environments
•Catches integration drift: Detects breaking changes before they hit production

Use contract testing when: microservices, API-first products, multiple teams consuming the same service. Tools: Pact (most mature), Spring Cloud Contract (JVM).

Contract tests complement integration and E2E tests; they do not replace them.
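
A hand-rolled sketch of the consumer-driven idea, not the Pact or Spring Cloud Contract API. The consumer publishes the fields and types it relies on, and the provider's test suite checks its own response against them. The contract shape, `provider_get_user`, and `contract_violations` are all assumptions for illustration.

```python
# Consumer side: the fields and types this consumer relies on (assumed shape).
USER_CONTRACT = {"id": int, "email": str}


def provider_get_user(user_id: int) -> dict:
    """Hypothetical provider handler; fields beyond the contract are allowed."""
    return {"id": user_id, "email": "a@example.com", "created_at": "2024-01-01"}


def contract_violations(response: dict, contract: dict) -> list[str]:
    """Return one message per missing or wrongly typed field."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def test_provider_honours_user_contract() -> None:
    # Provider side: run against the real handler, with no consumer deployed.
    assert contract_violations(provider_get_user(1), USER_CONTRACT) == []
```

If the provider renamed `email` or changed `id` to a string, this test would fail in the provider's own pipeline, before any consumer is affected.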

</details>

Decision Tables

"Why Is This Test Bad?"

Symptom	Likely Antipattern	Action
Test breaks after refactoring (behavior unchanged)	Testing implementation	Assert on outputs, not method calls
Test fails intermittently	Flaky test	Remove non-determinism (time, state, I/O)
Can't understand failure from test name	Obscure test	Rename to describe scenario + expected outcome
50+ lines of setup	Giant arrange	Use builders; consider splitting the unit
Tests pass but bugs ship	Liar test / weak assertions	Add mutation testing; verify tests fail correctly
Test suite takes >5 minutes	Slow tests	Replace I/O with fakes; parallelize; split suites

Checklists

Before Writing a Test

• Identify the specific behavior to verify (not the implementation)
• Determine test level: unit, integration, or E2E
• Choose test double strategy (prefer real objects; mock at boundaries)
• Plan for edge cases and error paths

Test Quality Self-Review

• Each test has one clear reason to fail
• Test names describe scenario and expected outcome
• No shared mutable state between tests
• No non-determinism (time, randomness, network, filesystem)
• Assertions verify behavior, not implementation
• Setup is minimal and relevant (no "just in case" data)
• Test fails when the behavior it guards is broken
