AgentSkillsCN

functional-testing

功能测试

SKILL.md
--- frontmatter
name: functional-testing
triggers:
  keywords:
    - functional test
    - pipeline test
    - llm test
    - prompt test
    - evidence
dependencies:
  tools:
    - Bash
    - Read

Functional Testing Skill

Validate LLM-dependent code with real outputs before merge (evidence required for PRs).

Purpose

PRs that change LLM-dependent code require functional test evidence BEFORE merge. Unit tests mock LLM responses; functional tests run real outputs.

The Rule

PRs that change LLM-dependent code require functional test evidence BEFORE merge.

Why This Gate Exists

Unit TestsFunctional Tests
Mock LLM responsesReal LLM responses
Fast, deterministicSlower, variable
Catch logic errorsCatch integration errors
Can't validate prompt effectivenessCan validate actual output quality

Unit tests tell you the code works. Functional tests tell you the system works.

When Required

PR TypeWhy Functional Test
New pipeline or phaseSchema failures, token limits, threshold calibration
Major agent changesLLM behavior can't be predicted from code alone
Multi-agent coordinationIntegration issues only surface at runtime
Prompt restructuringOutput format changes may break downstream parsing
Detection regex changesPatterns depend on actual generated content
Evaluator criteria updatesPass/fail thresholds need real content validation

Not Required For

  • Pure frontend changes (no LLM interaction)
  • Infrastructure/config changes (no behavior change)
  • Documentation updates
  • Test-only changes (no production code)

Workflow

Phase 1: Determine if Required

  1. Check PR Type

    • Does this modify prompts?
    • Does this change classification logic?
    • Does this alter pipeline flow?
    • Does this modify LLM-dependent code?
  2. If YES: Functional test required

Phase 2: Run Functional Test

General Pattern

bash
# 1. Trigger the operation that uses LLM
python -m src.pipeline --days 1 --max 10

# 2. Monitor until complete
# (watch logs or check status)

# 3. Verify results meet expectations
python src/cli.py themes  # Check output

What to Verify

ComponentWhat to Check
PromptsOutput format matches expected structure
ParsersRegex/parsing handles real variations
EvaluatorsScores reflect actual quality
PipelinesAll phases complete without errors
ThresholdsPass/fail rates are reasonable

Phase 3: Document Evidence

Include this block in PR description:

markdown
## Functional Test Evidence

**Test Type**: [Full pipeline run / Single phase / Regression test]
**Test Date**: YYYY-MM-DD

### Execution Log

- Task/Run ID: `[id]`
- Steps completed: [list]
- Total runtime: X minutes

### Key Metrics

- [Relevant metrics for this PR, e.g., "0 errors", "all phases passed"]

### Issues Discovered During Testing

- [Any issues found and how they were addressed, or "None"]

**Result**: PASS/FAIL - [Brief explanation]

PRs without this evidence block for pipeline changes will be blocked.

Phase 4: Regression Testing (for Prompt Changes)

When modifying prompts or evaluator logic:

  1. Baseline: Run with current code, record metrics
  2. Change: Apply modifications
  3. Compare: Run same task, compare metrics

Acceptable vs Concerning Changes

MetricAcceptableRed Flag
Output qualitySame or betterSignificant degradation
Pass rateSimilar or higherDrop >10%
Processing timeWithin 20%>2x increase
Error rateSame or lowerNew errors appear

Success Criteria

  • Functional test executed with real LLM calls
  • Evidence formatted per template
  • Key metrics captured
  • Result is PASS (or issues addressed)
  • Evidence attached to PR before merge

Integration with Review

Quinn (Quality Champion reviewer) can flag PRs requiring functional testing:

markdown
## FUNCTIONAL_TEST_REQUIRED

This PR modifies [component] which affects LLM output processing.
Please run a functional test and attach evidence before merge.

When flagged, PR is blocked until evidence is attached.

For Tech Lead

Pre-Merge Checklist

Before merging any PR that touches LLM-dependent code:

code
[ ] Is this PR type in the "required" list?
    If YES:
    [ ] Functional test evidence attached?
    [ ] Evidence shows PASS result?
    [ ] Any issues discovered were addressed?

If Evidence Missing

  1. Comment: Ask for functional test evidence with specific guidance
  2. Block: Do not approve until evidence is provided
  3. Help: Guide developer on how to run the test if needed

Common Pitfalls

  • Skipping for "small" changes: Small prompt tweaks can have big impacts
  • Trusting unit tests alone: Mocks hide real LLM behavior
  • Not documenting evidence: "I tested it" without evidence doesn't count
  • Ignoring Quinn's flag: When Quinn flags, evidence is mandatory

Summary

CheckpointAction
PR touches prompts/pipelinesCheck if functional test required
Test required, no evidenceBlock PR, request evidence
Evidence attachedVerify PASS result before approve
Reviewer flags FUNCTIONAL_TEST_REQUIREDEvidence becomes mandatory
Running functional testUse evidence format template