
code-testing-quality

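Unit testing principles, test structure patterns, test double strategy, naming conventions, testing antipatterns, and coverage guidelines. Use this skill when writing tests, reviewing test quality, evaluating test coverage, choosing test doubles, fixing flaky tests, or assessing test structure during the cook and serve phases. Strengthens the taster agent during the TDD cycle.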

SKILL.md
---
name: code-testing-quality
description: Unit testing principles, test structure patterns, test doubles strategy, naming conventions, testing antipatterns, and coverage guidance. Use when writing tests, reviewing test quality, evaluating test coverage, choosing test doubles, fixing flaky tests, or assessing test structure during cook and serve phases. Enhances taster agent during TDD cycle.
---

Code Testing Quality

Principles and practices for writing tests that are reliable, maintainable, and actually catch bugs.

Quick Reference

Testing Quality Checklist

Use during code review and the taster agent's RED phase:

  • Each test verifies one behavior (single assertion theme)
  • Tests are independent: no shared mutable state, and they run in any order
  • Test names describe the scenario and expected outcome
  • Arrange/Act/Assert sections are visually distinct
  • No logic in tests (no conditionals, loops, or try-catch)
  • Test doubles used only at architectural boundaries
  • Flaky indicators absent (sleeps, real clocks, network calls, file I/O)
  • Edge cases covered (nulls, empty collections, boundaries, error paths)
  • Tests verify behavior, not implementation details
  • Refactoring production code doesn't break tests (unless behavior changes)

The Three Properties of Good Tests

| Property | Meaning | Violation Signal |
|---|---|---|
| Isolated | No test affects another; runs alone or in any order | Tests pass individually but fail together |
| Repeatable | Same result every time, on any machine | "Works on my machine," intermittent failures |
| Clear | Failure message pinpoints the problem | Must read test code to understand what broke |

Test Strategy

Planning Your Testing Approach

Before writing tests, decide what to test and at what level. A test strategy answers three questions:

  1. Where does the risk live? — Focus testing effort on code where failures cost the most (business logic, data integrity, security boundaries)
  2. What level of test gives the best signal? — Match test type to what you're verifying (see Test Pyramid below)
  3. What's the test double strategy? — Real objects internally, fakes/mocks at architectural boundaries

Applying the Test Pyramid

| Layer | What to Test Here | Signal It Provides |
|---|---|---|
| Unit | Pure logic, calculations, transformations, domain rules | Fast, precise: "this function is correct" |
| Integration | Database queries, API clients, service wiring, middleware | Realistic: "these components work together" |
| E2E | Critical user journeys (login, checkout, data export) | Confidence: "the system works end-to-end" |

Strategy heuristic: Start with unit tests for business logic. Add integration tests at boundaries where bugs actually happen. Add E2E tests only for the critical paths that must never break.

TDD as a Design Tool

Test-driven development isn't just a testing technique — it's a design feedback loop:

  1. RED — Write a failing test that describes the next behavior
  2. GREEN — Write the minimum code to pass
  3. REFACTOR — Clean up while tests protect you

When TDD helps most: New features with clear behavior, complex algorithms, code you need to understand before changing.

When to skip strict TDD: Exploratory prototypes, UI layout, glue code with obvious implementations.

Choosing Test Types by Situation

| Situation | Test Type | Why |
|---|---|---|
| Pure function with clear inputs/outputs | Unit test | Fast, precise feedback |
| Database query or ORM logic | Integration test with real/fake DB | Verifies query correctness |
| API endpoint handler | Integration test | Tests routing, serialization, middleware |
| Complex business rule with many cases | Parameterized unit test | Covers combinations efficiently |
| Critical user journey (login, checkout) | E2E test | Validates full stack works together |
| Service-to-service communication | Contract test | Catches interface drift early |
| Function with mathematical invariants | Property-based test | Finds edge cases humans miss |

Unit Testing Principles

Primary Function: Detect Broken Code

Tests exist to catch regressions. A test suite that passes after breaking production code is worse than no tests — it provides false confidence.

The litmus test: If you introduce a bug, does a test fail? If not, your tests aren't testing the right things.

What to Test vs. What Not to Test

| Test Thoroughly | Skip or Test Lightly |
|---|---|
| Business logic and domain rules | Simple getters/setters with no logic |
| Edge cases and boundary conditions | Framework-generated boilerplate |
| Error handling and recovery paths | Third-party library internals |
| State transitions and workflows | Private methods (test through public API) |
| Input validation at system boundaries | Configuration constants |
| Code with high cyclomatic complexity | One-line delegation methods |

Rule: Test behavior at the public API boundary. Private methods are implementation details — testing them couples tests to structure, not behavior.
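
As a rough illustration of this rule, the Python sketch below assumes a hypothetical `Cart` class (not part of the original skill). The test calls only the public `total()` method, so the private `_tax` helper can be renamed or inlined without touching the test.

```python
class Cart:
    """Hypothetical example class; all names are illustrative assumptions."""

    TAX_RATE_PERCENT = 10

    def __init__(self):
        self._prices_cents = []

    def add(self, price_cents):
        self._prices_cents.append(price_cents)

    def total(self):
        """Public API: subtotal plus tax, in cents."""
        subtotal = sum(self._prices_cents)
        return subtotal + self._tax(subtotal)

    def _tax(self, amount_cents):
        # Private detail: free to change as long as total() behaves the same.
        return amount_cents * self.TAX_RATE_PERCENT // 100


def test_total_includes_tax_on_all_items():
    cart = Cart()
    cart.add(1000)
    cart.add(500)

    result = cart.total()

    assert result == 1650  # 1500 subtotal + 150 tax


test_total_includes_tax_on_all_items()
```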

The Test Pyramid

| Level | Scope | Speed | When to Use |
|---|---|---|---|
| Unit | Single function/class | Milliseconds | Business logic, algorithms, data transformations |
| Integration | Component interactions | Seconds | Database queries, API clients, service boundaries |
| E2E | Full user journeys | Minutes | Critical paths, smoke tests, deployment verification |

Many unit tests, fewer integration tests, fewest E2E tests. Each level trades speed for integration confidence.

<details> <summary>Test Pyramid vs. Test Trophy</summary>

The Test Trophy (Kent C. Dodds, frontend-oriented) emphasizes integration tests as the largest layer, with static analysis as the foundation. The rationale: for UI-heavy applications, integration tests provide the best return on investment because they test realistic user interactions.

| Shape | Largest Layer | Best For |
|---|---|---|
| Pyramid | Unit tests | Backend, algorithmic, library code |
| Trophy | Integration tests | Frontend, full-stack applications |

Both are heuristics. Choose based on where your business logic lives and what gives your team the most confidence per test-minute invested.

</details>

Test Structure

Arrange-Act-Assert (AAA)

The standard unit test structure. Each section should be visually distinct:

code
// Arrange — set up preconditions
val account = Account(balance = 100)
val withdrawal = Withdrawal(amount = 30)

// Act — execute the behavior under test
val result = account.process(withdrawal)

// Assert — verify the outcome
assertEquals(70, result.balance)

Rules:

  • One Act per test — multiple acts mean multiple tests hiding in one
  • No logic in Arrange — use builders or factories for complex setup
  • Assert on behavior — verify what happened, not how it happened
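
A runnable Python take on the same structure, offered as a sketch: the `Account` class, its `withdraw` method, and the `an_account` factory are assumptions made for illustration. The factory shows one way to keep logic out of the Arrange step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """Hypothetical immutable account used only for this example."""
    balance: int

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        return Account(balance=self.balance - amount)


def an_account(balance=100):
    """Test factory: keeps Arrange to one line and hides irrelevant defaults."""
    return Account(balance=balance)


def test_withdraw_sufficient_funds_reduces_balance():
    # Arrange
    account = an_account(balance=100)

    # Act
    result = account.withdraw(30)

    # Assert
    assert result.balance == 70


test_withdraw_sufficient_funds_reduces_balance()
```

There is exactly one Act line, and the assertion checks the resulting balance rather than how `withdraw` computed it.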

Given-When-Then (GWT)

Semantically identical to AAA but emphasizes behavior from the user's perspective. Preferred for acceptance tests and BDD:

code
// Given a customer with a valid discount code
// When they apply the code at checkout
// Then the total is reduced by the discount percentage

When to use which: AAA for technical unit tests; GWT for behavior-focused and acceptance tests. Consistency within a codebase matters more than the choice itself.
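
The Given/When/Then comments can be attached to executable steps. The sketch below is in Python and assumes a hypothetical `apply_discount` function and a made-up `SAVE20` code; neither comes from the original.

```python
def apply_discount(total_cents, percent):
    """Hypothetical checkout rule: reduce the total by a whole-number percentage."""
    return total_cents - total_cents * percent // 100


def test_valid_discount_code_reduces_total_by_percentage():
    # Given a customer with a valid discount code
    discount_codes = {"SAVE20": 20}
    cart_total_cents = 5000

    # When they apply the code at checkout
    total = apply_discount(cart_total_cents, discount_codes["SAVE20"])

    # Then the total is reduced by the discount percentage
    assert total == 4000


test_valid_discount_code_reduces_total_by_percentage()
```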

Parameterized / Table-Driven Tests

When multiple inputs exercise the same logic, use parameterized tests instead of copy-pasting:

code
// Table-driven: define logic once, vary the data
testCases = [
    { input: 0,   expected: "zero" },
    { input: 1,   expected: "one"  },
    { input: -1,  expected: "negative" },
    { input: 999, expected: "large" },
]
for each case in testCases:
    assertEquals(case.expected, classify(case.input))

Benefits: Easy to add cases (add a row, not a test), less duplication, better boundary coverage. Name each case for clear failure output.
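
One way to get named cases in Python is `unittest`'s `subTest`. In the sketch below, `classify` is a hypothetical sign-based function written for this example, so its labels differ from the pseudocode table.

```python
import unittest


def classify(n):
    """Hypothetical function under test: label an integer by sign."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


class ClassifyTest(unittest.TestCase):
    CASES = [
        # (case name, input, expected)
        ("zero", 0, "zero"),
        ("smallest positive", 1, "positive"),
        ("smallest negative", -1, "negative"),
        ("large positive", 999, "positive"),
    ]

    def test_classify(self):
        for name, value, expected in self.CASES:
            # subTest reports each failing row by name and keeps running the rest.
            with self.subTest(name):
                self.assertEqual(expected, classify(value))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClassifyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The loop is the one accepted exception to "no logic in tests": it carries no decisions of its own, and every row is reported separately on failure.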

Test Naming

Convention: Describe Scenario and Outcome

Good test names are sentences that describe what's being tested:

| Pattern | Example | Best For |
|---|---|---|
| method_scenario_expectedResult | withdraw_insufficientFunds_throwsException | Unit tests with clear method focus |
| should [outcome] when [condition] | should reject withdrawal when funds insufficient | Behavior-focused tests |
| given_when_then | givenInsufficientFunds_whenWithdraw_thenThrows | BDD-style tests |

Antipattern: test1, testWithdraw, testHappyPath — these names tell you nothing when they fail.

The failure message test: When a test fails, can you understand the bug from the test name alone, without reading the test body? If yes, the name is good.

Test Doubles

Types and When to Use Each

| Double | What It Does | Use When |
|---|---|---|
| Stub | Returns canned responses | Replacing slow or non-deterministic dependencies |
| Mock | Verifies interactions occurred | Checking side effects (email sent, event published) |
| Fake | Working shortcut implementation | Need realistic behavior without production overhead |
| Spy | Records calls for later inspection | Observing behavior in legacy code |
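
The four kinds of double can be told apart in a few lines of Python. This is a sketch only: `register`, `convert`, and the collaborator classes are hypothetical, and the mock comes from the standard library's `unittest.mock`.

```python
from unittest.mock import Mock


class StubRates:
    """Stub: returns a canned response, no logic."""
    def rate(self, currency):
        return 2.0


class FakeUserStore:
    """Fake: a working in-memory implementation of a storage interface."""
    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def find(self, user_id):
        return self._users.get(user_id)


class SpyLogger:
    """Spy: records calls so the test can inspect them afterwards."""
    def __init__(self):
        self.messages = []

    def info(self, message):
        self.messages.append(message)


def register(user_id, name, store, mailer, logger):
    """Hypothetical code under test."""
    store.save(user_id, name)
    mailer.send_welcome(name)
    logger.info(f"registered {user_id}")


def convert(amount, currency, rates):
    """Hypothetical code under test that depends on an exchange-rate source."""
    return amount * rates.rate(currency)


# Stub: the test only needs a deterministic value back.
assert convert(10, "EUR", StubRates()) == 20.0

# Fake, mock, and spy in one scenario.
store, mailer, logger = FakeUserStore(), Mock(), SpyLogger()
register(1, "Ada", store, mailer, logger)

assert store.find(1) == "Ada"                        # fake: state is observable
mailer.send_welcome.assert_called_once_with("Ada")   # mock: interaction verified
assert logger.messages == ["registered 1"]           # spy: recorded call inspected
```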

Decision Table: Choosing a Test Double

| Situation | Recommended Double | Why |
|---|---|---|
| External API or network call | Stub | Isolation from external non-determinism |
| Database in integration test | Fake (in-memory DB) | Realistic behavior, fast execution |
| Verifying a notification was sent | Mock | The side effect is the behavior |
| Complex collaborator with state | Fake | Maintains realistic internal consistency |
| Legacy code you can't refactor yet | Spy | Observe without modifying production code |
| Internal collaborator (same module) | Real object | Prefer real dependencies for internal code |

The Mock Overuse Problem

Over-mocking creates tests coupled to implementation, not behavior:

| Signal | Problem | Fix |
|---|---|---|
| Refactoring breaks tests but behavior unchanged | Tests verify how, not what | Test through public API with real collaborators |
| Mock setup is longer than the test | Too many seams mocked | Use fakes or test larger units |
| Tests pass but bugs ship | Mocks don't behave like real dependencies | Use fakes or integration tests at boundaries |

Rule of thumb: Mock at architectural boundaries (network, database, filesystem, external services). Use real objects for internal collaborators.
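
A small Python sketch of that rule of thumb, under the assumption of a hypothetical `Checkout` that uses an internal `DiscountPolicy` and an external payment service. Only the external boundary is replaced; the internal collaborator runs for real.

```python
class DiscountPolicy:
    """Internal collaborator: used for real in tests."""
    def price_after_discount(self, price_cents, is_member):
        return price_cents * 90 // 100 if is_member else price_cents


class FakePaymentGateway:
    """Boundary double: stands in for a hypothetical external payment service."""
    def __init__(self):
        self.charges = []

    def charge(self, customer_id, amount_cents):
        self.charges.append((customer_id, amount_cents))
        return True


class Checkout:
    def __init__(self, policy, gateway):
        self._policy = policy
        self._gateway = gateway

    def pay(self, customer_id, price_cents, is_member):
        amount = self._policy.price_after_discount(price_cents, is_member)
        return self._gateway.charge(customer_id, amount)


def test_member_is_charged_discounted_price():
    gateway = FakePaymentGateway()
    checkout = Checkout(DiscountPolicy(), gateway)  # real policy, fake boundary

    paid = checkout.pay("c-1", 2000, is_member=True)

    assert paid is True
    assert gateway.charges == [("c-1", 1800)]


test_member_is_charged_discounted_price()
```

Because the test never asserts that `price_after_discount` was called, the discount logic can move into `Checkout` or another class without breaking it.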

<details> <summary>London vs. Detroit school</summary>

Two schools of thought on test isolation:

| School | Approach | Tradeoff |
|---|---|---|
| London (Mockist) | Mock all collaborators; test classes in pure isolation | Strong design feedback (forces interfaces); brittle to refactoring |
| Detroit (Classicist) | Use real objects; only mock external boundaries | Resilient to refactoring; less design pressure; wider blast radius on failure |

Modern consensus (2025): The industry leans toward the Detroit school: mock less, use fakes at boundaries, test behavior not structure. Pure London-style mocking tends to produce test suites that break on refactoring even when behavior has not changed.

The pragmatic approach: use real objects for internal collaborators, fakes for infrastructure boundaries, and mocks only when verifying side effects is the point of the test.

</details>

Testing Antipatterns

Antipattern Catalog

| Antipattern | Symptom | Severity | Fix |
|---|---|---|---|
| Flaky tests | Pass/fail without code changes | Critical | Remove non-determinism: no sleeps, no real clocks, no shared state |
| Test interdependence | Tests pass alone, fail together | Critical | Each test creates/destroys its own data; no execution-order assumptions |
| Testing implementation | Refactoring breaks tests | Major | Assert on outputs and side effects, not internal method calls |
| Logic in tests | Conditionals, loops, try-catch in test code | Major | Tests should be straight-line code; extract complexity to helpers |
| Slow tests | Suite takes minutes, developers skip it | Major | Isolate from I/O; use fakes; parallelize; move slow tests to CI |
| Obscure tests | Must read full test to understand intent | Moderate | Descriptive names, visible AAA structure, minimal setup |
| Fragile assertions | Tests break on irrelevant output changes | Moderate | Assert on relevant fields only, not entire serialized objects |
| Liar tests | Tests that always pass regardless of behavior | Critical | Verify tests fail when behavior is wrong (mutation testing) |
| Giant arrange | 50 lines of setup, 1 line of act | Moderate | Use builders/factories; consider whether unit is too large |
| Snapshot abuse | Blindly approving snapshot updates | Moderate | Use snapshots only for presentational output; prefer explicit assertions |

The Flaky Test Problem

Flaky tests are the most damaging antipattern — they erode trust in the entire suite. Common causes:

| Cause | Example | Solution |
|---|---|---|
| Time dependency | Test uses Date.now() | Inject clock; use deterministic time |
| Shared state | Tests share a database row | Unique test data per test; cleanup in teardown |
| Race conditions | Async operation not awaited | Use proper async/await; avoid sleep-based waits |
| Environment dependency | Assumes specific OS, timezone, locale | Explicit environment setup in test; containerize |
| Order dependency | Test B needs Test A to run first | Each test is self-contained with own setup |
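
For the time-dependency row, "inject clock" can be as simple as passing the current time in as an argument. The Python sketch below assumes a hypothetical `Token` class; the test supplies fixed datetimes, so its result does not depend on when it runs.

```python
from datetime import datetime, timedelta, timezone


class Token:
    """Hypothetical token whose expiry check takes the current time as input."""
    def __init__(self, issued_at, ttl):
        self._expires_at = issued_at + ttl

    def is_expired(self, now):
        return now >= self._expires_at


def system_clock():
    """Production clock; tests pass a fixed datetime instead of calling this."""
    return datetime.now(timezone.utc)


def test_token_expires_exactly_at_ttl():
    issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    token = Token(issued_at=issued, ttl=timedelta(minutes=30))

    assert not token.is_expired(now=issued + timedelta(minutes=29))
    assert token.is_expired(now=issued + timedelta(minutes=30))


test_token_expires_exactly_at_ttl()
```

Production code would call `token.is_expired(system_clock())`; the boundary case at exactly 30 minutes is testable without any sleep.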

Coverage

Meaningful Coverage vs. Metric Gaming

Coverage measures which code runs during tests, not whether tests verify correct behavior. A test with no assertions can reach 100% coverage while catching nothing beyond crashes and unhandled exceptions.

| Coverage Target | Context | Rationale |
|---|---|---|
| >80% of critical paths | Most projects | Balances thoroughness with diminishing returns |
| >90% | Safety-critical, financial | Higher stakes justify investment |
| 60-70% | Early-stage, prototypes | Focus on core behavior first |

Goodhart's Law applies: When coverage becomes a target, developers write trivial tests to hit the number. 100% coverage of getters is worse than 80% coverage of business logic.

Beyond Line Coverage

| Metric | What It Measures | Value |
|---|---|---|
| Branch coverage | Decision paths exercised | More meaningful than line coverage |
| Mutation testing | Whether tests catch injected bugs | Reveals "liar tests" that run code but don't verify it |
| Critical path coverage | High-risk code paths tested | Focus effort where failures cost the most |

Mutation testing introduces small bugs (mutants) into production code and checks whether tests fail. A high mutation score (>80%) indicates tests that actually catch defects, not just execute lines.
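
The idea can be shown by hand, without a mutation tool. In this Python sketch, everything is hypothetical: `is_adult` is the production function, `is_adult_mutant` is a manually written mutant, and each "suite" is a function that returns whether its checks pass. Real tools generate and run the mutants automatically.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # Mutant: ">=" replaced with ">", the kind of change a mutation tool makes.
    return age > 18


def weak_suite(fn):
    """Covers the function but skips the boundary; True if all checks pass."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary value 18."""
    return weak_suite(fn) and fn(18) is True


def killed(suite, mutant):
    """A mutant is killed when the suite fails against it."""
    return not suite(mutant)


mutants = [is_adult_mutant]
weak_score = sum(killed(weak_suite, m) for m in mutants) / len(mutants)
strong_score = sum(killed(strong_suite, m) for m in mutants) / len(mutants)
```

Both suites pass against the real `is_adult` and give identical line coverage, but only the suite with the boundary case kills the mutant.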

<details> <summary>Property-based testing</summary>

Property-based testing (PBT) complements example-based tests by generating random inputs and checking that invariant properties hold:

Example: Instead of testing sort([3,1,2]) == [1,2,3], define properties:

  • Output length equals input length
  • Every element in output was in input
  • Each element is less than or equal to the next

The framework generates hundreds of random inputs and finds the smallest failing case (shrinking).
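
A stdlib-only Python sketch of the three sort properties follows. It is a hand-rolled stand-in, not a framework API: `check_sort_properties` is a hypothetical helper, it uses a seeded `random.Random` for repeatability, and it does no shrinking. The element check uses `Counter` equality, which is slightly stronger than the second property because it also compares duplicate counts.

```python
import random
from collections import Counter


def check_sort_properties(sort_fn, trials=200, seed=0):
    """Run sort_fn on random integer lists and assert the three properties."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        result = sort_fn(list(data))

        assert len(result) == len(data)
        assert Counter(result) == Counter(data)  # same elements, same counts
        assert all(a <= b for a, b in zip(result, result[1:]))


check_sort_properties(sorted)
```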

When to use PBT:

  • Functions with clear invariants (sorting, encoding/decoding, serialization)
  • Complex domain logic where edge cases are hard to enumerate
  • Security-sensitive code (parsing, validation, protocol handling)

When to skip PBT:

  • Simple CRUD operations
  • UI behavior tests
  • When properties are harder to express than example tests

Frameworks: Hypothesis (Python), fast-check (TypeScript/JS), jqwik (JVM).

</details>

<details> <summary>Contract testing for service boundaries</summary>

Contract testing verifies that services agree on their communication interface without running full integration tests:

  • Consumer-driven: The consumer defines expected requests/responses; the provider verifies it meets them
  • Independent deployment: Teams ship without waiting for full integration environments
  • Catches integration drift: Detects breaking changes before they hit production

Use contract testing when: microservices, API-first products, multiple teams consuming the same service. Tools: Pact (most mature), Spring Cloud Contract (JVM).

Contract tests complement integration and E2E tests; they do not replace them.
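
The consumer-driven idea can be sketched in plain Python. This is a hand-rolled illustration, not Pact's API: the contract format, `provider_handle`, and `contract_violations` are all assumptions, and the provider is called in-process rather than over HTTP.

```python
# Contract written by the consumer: the fields it reads and their types.
USER_CONTRACT = {
    "request": {"method": "GET", "path": "/users/42"},
    "response": {"status": 200, "body": {"id": int, "name": str}},
}


def provider_handle(method, path):
    """Hypothetical provider handler, called in-process instead of over HTTP."""
    if method == "GET" and path == "/users/42":
        return 200, {"id": 42, "name": "Ada", "email": "ada@example.com"}
    return 404, {}


def contract_violations(contract, handler):
    """Return a list of mismatches; an empty list means the provider conforms."""
    request, expected = contract["request"], contract["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"status {status} != {expected['status']}")
    for field, field_type in expected["body"].items():
        if field not in body:
            problems.append(f"missing field {field!r}")
        elif not isinstance(body[field], field_type):
            problems.append(f"field {field!r} is not {field_type.__name__}")
    return problems  # extra provider fields (such as email) are allowed


violations = contract_violations(USER_CONTRACT, provider_handle)
```

The provider may add fields freely, but renaming or removing a field the consumer reads shows up as a violation in the provider's own build.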

</details>

Decision Tables

"Why Is This Test Bad?"

| Symptom | Likely Antipattern | Action |
|---|---|---|
| Test breaks after refactoring (behavior unchanged) | Testing implementation | Assert on outputs, not method calls |
| Test fails intermittently | Flaky test | Remove non-determinism (time, state, I/O) |
| Can't understand failure from test name | Obscure test | Rename to describe scenario + expected outcome |
| 50+ lines of setup | Giant arrange | Use builders; consider splitting the unit |
| Tests pass but bugs ship | Liar test / weak assertions | Add mutation testing; verify tests fail correctly |
| Test suite takes >5 minutes | Slow tests | Replace I/O with fakes; parallelize; split suites |

Checklists

Before Writing a Test

  • Identify the specific behavior to verify (not the implementation)
  • Determine test level: unit, integration, or E2E
  • Choose test double strategy (prefer real objects; mock at boundaries)
  • Plan for edge cases and error paths

Test Quality Self-Review

  • Each test has one clear reason to fail
  • Test names describe scenario and expected outcome
  • No shared mutable state between tests
  • No non-determinism (time, randomness, network, filesystem)
  • Assertions verify behavior, not implementation
  • Setup is minimal and relevant (no "just in case" data)
  • Test fails when the behavior it guards is broken

See Also

  • code-quality-foundations — Testability as a design pillar (see code-quality-foundations -> Make Code Testable)
  • code-review — Evaluating test quality during review (see code-review -> Reviewer Checklist)
  • code-antipatterns — Quality failures that make code untestable (see code-antipatterns -> Pattern Recognition)
  • refactoring-patterns — Making untestable code testable (see refactoring-patterns -> When to Refactor)