
testing-prompt-changes

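Generates a validated URL manifest from kb/SOURCES.yaml for a given DATE_RANGE. Use it at the start of Phase 1A of the newsletter pipeline. Reads the canonical URLs and feed endpoints in SOURCES.yaml, generates candidate URLs for each source, verifies that they exist, and outputs a manifest file. Keywords: URL manifest, Phase 1A, URL generation, source discovery.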

SKILL.md
---
name: testing-prompt-changes
description: "Tests prompt and skill changes using Hypothesis-Implement-Grade-Rework cycles. Use when modifying newsletter prompts, skills, or agent instructions. Compares output against benchmark known-good intermediates. Keywords: test prompt, HIGR testing, prompt validation, benchmark testing."
license: MIT
metadata:
  author: briancl2
  version: "1.0"
  category: system
---

HIGR Prompt Testing

Test prompt and skill changes systematically with Hypothesis-Implement-Grade-Rework cycles. Every modification requires testable predictions and evidence-based grading against benchmark data.

Quick Start

  1. Write testable hypotheses before editing any prompt or skill
  2. Implement the change (delete superseded content)
  3. Run against benchmark inputs, compare to known-good output
  4. Grade each hypothesis: Good / Bad / Ugly
  5. Rework if needed; record results in workspace/higr_log.md

When to Use

  • Modifying phase prompts (.github/prompts/phase_*.prompt.md)
  • Updating skill SKILL.md or references/
  • Changing agent definition (.github/agents/customer_newsletter.agent.md)
  • Changing copilot-instructions.md
  • Any change that affects newsletter output quality

HIGR Cycle

```mermaid
flowchart LR
    H[Hypothesis] --> I[Implement]
    I --> G[Grade]
    G --> R{Rework?}
    R -->|Bad/Ugly| H
    R -->|Good| D[Done]
```

Step 1: Hypothesis (Pre-Implementation)

Before editing any prompt or skill, write testable predictions:

| Hypothesis ID | Change | Expected Effect | Testable Criteria |
|---|---|---|---|
| H1 | [specific change] | [expected output difference] | [how to verify] |

Hypothesis Requirements:

  • Specific — Not "newsletter will be better"
  • Observable — Can see the difference in output
  • Falsifiable — Possible to determine if wrong

Example Good Hypotheses:

| ID | Change | Expected Effect | Criteria |
|---|---|---|---|
| H1 | Add SOURCES.yaml reading to url-manifest | URL count increases, covers all kb/ sources | URL count within 20% of benchmark, 0 false URLs |
| H2 | Extract selection criteria to skill reference | Curation quality matches Dec 2025 benchmark | ≥70% item overlap with benchmark curated sections |
| H3 | Add IDE parity grouping to content-format-spec | IDE updates grouped correctly | VS Code/Visual Studio/JetBrains in single section |
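
The hypothesis rows can also be held as small records so that an empty or vague entry is caught before any prompt is edited. This is a minimal sketch, not part of the skill: the `Hypothesis` class, the `is_testable` check, and the list of vague words are all hypothetical, and a keyword check is only a rough proxy for "specific, observable, falsifiable".

```python
from dataclasses import dataclass

# Words that usually signal a non-specific criterion (illustrative assumption).
VAGUE_PHRASES = ("better", "improved", "nicer")


@dataclass(frozen=True)
class Hypothesis:
    """One row of the hypothesis table (hypothetical structure)."""
    id: str
    change: str
    expected_effect: str
    criteria: str

    def is_testable(self) -> bool:
        """True if every field is filled and the criteria avoid vague wording."""
        fields = (self.id, self.change, self.expected_effect, self.criteria)
        if any(not f.strip() for f in fields):
            return False
        lowered = self.criteria.lower()
        return not any(phrase in lowered for phrase in VAGUE_PHRASES)


h1 = Hypothesis(
    id="H1",
    change="Add SOURCES.yaml reading to url-manifest",
    expected_effect="URL count increases, covers all kb/ sources",
    criteria="URL count within 20% of benchmark, 0 false URLs",
)
vague = Hypothesis("H9", "Tweak tone", "Reads nicer", "newsletter will be better")
```

Here `h1.is_testable()` passes, while `vague.is_testable()` fails because its criteria cannot be checked against a benchmark.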

Step 2: Implement

Make the change. Delete superseded content immediately — no parallel implementations.

Rules:

  • No _v2, _new, _enhanced variants
  • No backward compatibility code
  • Replace existing instructions directly

Step 3: Grade (Post-Implementation)

Run against benchmark data and grade each hypothesis:

| Hypothesis ID | Outcome | Grade | Evidence |
|---|---|---|---|
| H1 | Met | Good | "URL count: 72 (benchmark: 68)" |
| H2 | Partially Met | Bad | "Only 50% overlap vs target 70%" |
| H3 | Unexpected | Ugly | "Grouping present but wrong order" |

Grade Definitions:

| Grade | Meaning | Action |
|---|---|---|
| Good | Hypothesis confirmed, output matches benchmark | Accept change, proceed to next skill |
| Bad | Hypothesis not met, output doesn't match | Rework: adjust skill/prompt, re-test |
| Ugly | Major unexpected deviation | Evaluate: may need skill redesign |

Step 4: Rework (If Needed)

For each Bad or Ugly outcome:

| Outcome | Action |
|---|---|
| Bad: output degraded | Fix and re-test |
| Ugly: harmful surprise | Revert or fix |
| Ugly: beneficial surprise | Document and accept |
| Ugly: ambiguous | Halt and reassess |
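
The rework rules above amount to a lookup from a graded outcome to a next action. A minimal sketch of that lookup follows; the function name `rework_action` and the `(grade, kind)` key shape are hypothetical, chosen only for illustration.

```python
# Rework actions keyed by (grade, outcome kind), copied from the rework table.
REWORK_ACTIONS = {
    ("Bad", "output degraded"): "Fix and re-test",
    ("Ugly", "harmful surprise"): "Revert or fix",
    ("Ugly", "beneficial surprise"): "Document and accept",
    ("Ugly", "ambiguous"): "Halt and reassess",
}


def rework_action(grade: str, kind: str = "") -> str:
    """Return the next action for a graded hypothesis; Good needs no rework."""
    if grade == "Good":
        return "Accept change"
    try:
        return REWORK_ACTIONS[(grade, kind)]
    except KeyError:
        # Refuse to guess when the outcome is not covered by the table.
        raise ValueError(f"no rework rule for grade={grade!r}, kind={kind!r}") from None
```

Raising on an unknown combination, rather than defaulting to an action, keeps an unclassified outcome from being silently accepted.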

Benchmark Testing

Primary Benchmark: December 2025 Cycle

Known-good intermediates for each pipeline phase:

| Phase | Skill | Known-Good Benchmark File |
|---|---|---|
| 1A | url-manifest | `workspace/archived/newsletter_phase1a_url_manifest_2025-10-06_to_2025-12-02.md` |
| 1B | content-retrieval | `workspace/archived/newsletter_phase1b_interim_*_2025-10-06_to_2025-12-02.md` (5 files) |
| 1C | content-consolidation | `workspace/archived/newsletter_phase1a_discoveries_2025-10-06_to_2025-12-02.md` |
| 2 | events-extraction | `workspace/archived/newsletter_phase2_events_2025-12-02.md` |
| 3 | content-curation | `workspace/archived/newsletter_phase3_curated_sections_2025-12-02.md` |
| 4 | newsletter-assembly | `archive/2025/December.md` (gold standard) |

Testing Procedure

  1. Invoke the skill with its benchmark input (DATE_RANGE or known-good upstream output)
  2. Save output to workspace/
  3. Compare against the known-good benchmark file listed above
  4. Grade using the comparison dimensions for that skill (see TRAINING_RUNBOOK.md)
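
Step 3 can be partly automated for criteria such as item overlap. The sketch below assumes that items are identified by the URLs of their markdown links, which is an assumption about the file format rather than something the skill specifies; `linked_urls` and `overlap_ratio` are hypothetical helper names, and the sample strings are made up.

```python
import re

# Matches inline markdown links and captures the http(s) target.
LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def linked_urls(markdown_text: str) -> set[str]:
    """Collect the distinct URLs used as markdown link targets."""
    return set(LINK_RE.findall(markdown_text))


def overlap_ratio(output_text: str, benchmark_text: str) -> float:
    """Fraction of benchmark URLs that also appear in the new output."""
    benchmark = linked_urls(benchmark_text)
    if not benchmark:
        raise ValueError("benchmark contains no links to compare against")
    return len(benchmark & linked_urls(output_text)) / len(benchmark)


# Illustrative inputs; real runs would read the files from workspace/.
benchmark_md = "- [A](https://example.com/a)\n- [B](https://example.com/b)\n"
output_md = "- [A](https://example.com/a)\n- [C](https://example.com/c)\n"
ratio = overlap_ratio(output_md, benchmark_md)
```

With these inputs one of the two benchmark links is present, so `ratio` is 0.5 and a 70% overlap target would not be met.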

Comparison Dimensions by Phase

| Phase | Key Dimensions |
|---|---|
| 1A | URL count (within 20%), source coverage (all 5), false URLs (0) |
| 1B | File count (5), extraction format, item counts |
| 1C | Item count (30-50), dedup (0 duplicates), enterprise filter |
| 2 | Table format, canonical categories, date-only for virtual |
| 3 | Item count (15-20), GA/PREVIEW labels, section structure |
| 4 | Mandatory sections present, section order, tone, link format |
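
As an illustration, the three Phase 1A dimensions reduce to simple numeric checks. This sketch reads "within 20%" as a relative deviation from the benchmark count with the boundary included, which is an interpretation; `check_phase_1a` is a hypothetical helper, and the counts are assumed to have been gathered already.

```python
def check_phase_1a(url_count: int, benchmark_count: int,
                   sources_covered: int, false_urls: int) -> dict[str, bool]:
    """Evaluate the three Phase 1A dimensions; True means the dimension passes."""
    if benchmark_count <= 0:
        raise ValueError("benchmark_count must be positive")
    # Relative deviation of the new URL count from the benchmark count.
    deviation = abs(url_count - benchmark_count) / benchmark_count
    return {
        "url_count_within_20pct": deviation <= 0.20,
        "all_5_sources_covered": sources_covered == 5,
        "no_false_urls": false_urls == 0,
    }


# Counts taken from the grading example: 72 URLs against a benchmark of 68.
result = check_phase_1a(url_count=72, benchmark_count=68,
                        sources_covered=5, false_urls=0)
```

Returning one boolean per dimension lets the grade cite the specific dimension that failed instead of a single pass/fail.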

Secondary Benchmarks (Regression Testing)

After changes stabilize, test against additional cycles:

| Cycle | Gold Standard | Best For |
|---|---|---|
| 2025-08 August | `archive/2025/August.md` | Moderate complexity |
| 2025-06 June | `archive/2025/June.md` | Different content period |

HIGR Log

Record all HIGR cycles in workspace/higr_log.md:

```markdown
### HIGR: [Skill] — [Issue] — [Date]

**Hypothesis:** [What should change and why]
**Implement:** [What was changed, file + section]
**Grade:** [Good/Bad/Ugly] — [Evidence]
**Rework:** [If Bad: what was tried next]
```
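
Entries in this layout can be rendered by a small helper so that every cycle is logged consistently. The following is a sketch under stated assumptions: `format_higr_entry` is a hypothetical name, the ISO date format is an assumption (the template only says `[Date]`), and the sample field values are illustrative.

```python
from datetime import date

EM_DASH = "\u2014"  # separator used in the template's heading and Grade line


def format_higr_entry(skill: str, issue: str, day: date, hypothesis: str,
                      implement: str, grade: str, evidence: str,
                      rework: str = "") -> str:
    """Render one HIGR log entry in the template's layout."""
    if grade not in ("Good", "Bad", "Ugly"):
        raise ValueError(f"unknown grade: {grade!r}")
    lines = [
        f"### HIGR: {skill} {EM_DASH} {issue} {EM_DASH} {day.isoformat()}",
        "",
        f"**Hypothesis:** {hypothesis}",
        f"**Implement:** {implement}",
        f"**Grade:** {grade} {EM_DASH} {evidence}",
    ]
    if rework:  # the Rework line is only written when a rework happened
        lines.append(f"**Rework:** {rework}")
    return "\n".join(lines) + "\n"


entry = format_higr_entry(
    skill="url-manifest", issue="SOURCES.yaml reading", day=date(2025, 12, 2),
    hypothesis="URL count within 20% of benchmark, 0 false URLs",
    implement="SKILL.md, source-reading step",
    grade="Good", evidence="URL count: 72 (benchmark: 68)",
)
```

The returned string can then be appended to workspace/higr_log.md, for example with `open(path, "a", encoding="utf-8")`.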

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Skipping hypotheses | Can't verify improvement | Always write H1+ before editing |
| Testing without benchmark | No quality baseline | Always compare to known-good |
| "Looks better" grading | Not evidence-based | Cite specific dimension + metric |
| Keeping old code | Parallel implementations | Delete superseded content |
| Batch-testing all skills | Can't isolate failures | Test one skill at a time |

Done When

  • Hypotheses written (testable, specific, observable)
  • Change implemented (old content deleted)
  • Benchmark test executed against known-good intermediate
  • Each hypothesis graded: Good / Bad / Ugly
  • HIGR log updated in workspace/higr_log.md