Prompt Tester Skill
Test LLM classification prompts against sample data and measure accuracy.
Purpose
Measure classification prompt accuracy using fixtures to validate prompt changes before deployment.
Workflow
Phase 1: Load Prompt and Schema
- •Load Current Prompt
  - •Read from docs/prompts.md
  - •Identify prompt version being tested
  - •Extract classification schema
- •Understand Classification Schema
  - •Issue types (categories)
  - •Priority levels
  - •Sentiment scale
  - •Churn risk levels
  - •Output format expected
Phase 2: Prepare Test Data
- •Find Sample Data
  - •Look for fixtures in tests/fixtures/
  - •Check data/samples/ directory
  - •Look for labeled conversations in data/labeled_fixtures.json
- •Verify Sample Quality
  - •Are ground truth labels provided?
  - •Is sample size sufficient (minimum 10-20)?
  - •Do samples cover diverse scenarios?
- •If No Samples: Request them or note the gap
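The sample quality checks above can be sketched as a small validator. This is a minimal sketch, not the skill's actual implementation: the field names in `LABEL_FIELDS`, the `expected` key, and the helper names are all assumptions made for illustration, and the real fixture layout may differ.

```python
from collections import Counter

# Field names are assumptions based on the schema items listed in Phase 1;
# the real keys in the fixture files may differ.
LABEL_FIELDS = ("issue_type", "priority", "sentiment", "churn_risk")
MIN_SAMPLES = 10  # lower end of the 10-20 minimum


def get_label(sample, field):
    """Return the ground truth label for a field, or None if it is absent."""
    return (sample.get("expected") or {}).get(field)


def check_sample_quality(samples, label_fields=LABEL_FIELDS, min_samples=MIN_SAMPLES):
    """Return a list of human-readable problems found in the fixture set."""
    problems = []
    if len(samples) < min_samples:
        problems.append(f"only {len(samples)} samples; need at least {min_samples}")
    for field in label_fields:
        labels = [get_label(s, field) for s in samples]
        missing = labels.count(None)
        if missing:
            problems.append(f"{missing} samples lack a ground truth label for {field}")
        # A field with a single label value cannot reveal confusion patterns.
        distinct = Counter(label for label in labels if label is not None)
        if len(distinct) < 2:
            problems.append(f"{field} has fewer than 2 distinct labels")
    return problems
```

The `samples` list could come from `json.load` on data/labeled_fixtures.json, assuming that file holds a list of objects with an `expected` mapping. An empty return value means none of the three checks found a gap.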
Phase 3: Run Classifications
- •For Each Sample
  - •Format prompt with conversation data
  - •Run classification (call LLM or use test runner)
  - •Record predicted vs expected labels
- •Capture Results
  - •Store predictions
  - •Note any errors or failures
  - •Track processing time if relevant
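One way to structure this phase is a loop that never aborts on a single failure. The sketch below assumes each sample is a dict with `text` and `expected` keys and that `classify` is a caller-supplied function wrapping the LLM call or test runner; both are hypothetical names, not part of this skill's codebase.

```python
import time


def run_classifications(samples, classify):
    """Run `classify` on every sample and record predicted vs expected labels."""
    results = []
    for index, sample in enumerate(samples):
        record = {
            "index": index,
            "expected": sample["expected"],
            "predicted": None,
            "error": None,
        }
        start = time.perf_counter()
        try:
            record["predicted"] = classify(sample["text"])
        except Exception as exc:
            # Keep going so one failure does not hide the remaining samples.
            record["error"] = f"{type(exc).__name__}: {exc}"
        record["seconds"] = time.perf_counter() - start
        results.append(record)
    return results
```

Failed samples stay in the results with `predicted` set to `None`, so they count against accuracy later instead of silently shrinking the denominator.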
Phase 4: Calculate Metrics
- •Overall Accuracy
  - •Percentage of correct classifications
  - •Total correct / total samples
- •Per-Category Accuracy
  - •Issue type accuracy
  - •Priority accuracy
  - •Sentiment accuracy
  - •Churn risk accuracy
- •Confusion Patterns
  - •What gets misclassified as what?
  - •Are there systematic errors?
  - •Which categories are problematic?
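The metrics above reduce to a few counting functions. This sketch assumes each result is a dict with an `expected` mapping and a `predicted` mapping (or `None` when the run failed); the function names are illustrative. It also assumes a sample counts as correct overall only when every field matches, which is one possible reading of "correct classifications".

```python
from collections import Counter


def predicted_label(result, field):
    """Predicted label for a field; None when the run failed or omitted it."""
    return (result.get("predicted") or {}).get(field)


def field_accuracy(results, field):
    """Correct predictions for one field divided by total samples."""
    if not results:
        raise ValueError("no results to score")
    correct = sum(
        1 for r in results if predicted_label(r, field) == r["expected"][field]
    )
    return correct / len(results)


def overall_accuracy(results, fields):
    """Share of samples where every field matches (an assumed definition)."""
    if not results:
        raise ValueError("no results to score")
    correct = sum(
        1 for r in results
        if all(predicted_label(r, f) == r["expected"][f] for f in fields)
    )
    return correct / len(results)


def confusion_pairs(results, field):
    """Count (expected, predicted) pairs for misclassified samples only."""
    return Counter(
        (r["expected"][field], predicted_label(r, field))
        for r in results
        if predicted_label(r, field) != r["expected"][field]
    )
```

Calling `confusion_pairs(results, "priority").most_common(3)` would list the three most frequent mix-ups for that field, which answers "what gets misclassified as what" directly.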
Phase 5: Analyze Failures
- •Identify Common Patterns
  - •What types of conversations are misclassified?
  - •Are there edge cases not handled?
  - •Is there ambiguity in ground truth?
- •Note Ambiguous Cases
  - •Where ground truth may be wrong
  - •Where multiple labels could be valid
- •Suggest Improvements
  - •Specific prompt modifications
  - •Additional examples needed
  - •Edge cases to address
Output Format
markdown
## Test Results

**Prompt Version**: [version from docs/prompts.md]
**Sample Size**: [N] conversations
**Overall Accuracy**: [X]%

### Per-Category Accuracy

| Category | Accuracy | Notes |
| ---------- | -------- | --------------- |
| Issue Type | X% | [common errors] |
| Priority | X% | [common errors] |
| Sentiment | X% | [common errors] |
| Churn Risk | X% | [common errors] |

### Failure Analysis

1. [Pattern 1]: [X] cases - [explanation]
2. [Pattern 2]: [X] cases - [explanation]

### Recommendations

- [Specific prompt improvement]
- [Edge case to handle]
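The header and table of this template could be filled from measured values rather than typed by hand. The following is a rough sketch; `render_report` and its parameters are hypothetical, and the Failure Analysis and Recommendations sections are left for manual writing.

```python
def render_report(version, sample_size, overall, per_category):
    """Fill the report header and table; accuracies are fractions from 0 to 1."""
    lines = [
        "## Test Results",
        "",
        f"**Prompt Version**: {version}",
        f"**Sample Size**: {sample_size} conversations",
        f"**Overall Accuracy**: {overall:.1%}",
        "",
        "### Per-Category Accuracy",
        "",
        "| Category | Accuracy | Notes |",
        "| --- | --- | --- |",
    ]
    # per_category maps a category name to an (accuracy, notes) pair.
    for category, (accuracy, notes) in per_category.items():
        lines.append(f"| {category} | {accuracy:.1%} | {notes} |")
    return "\n".join(lines)
```

Generating the numbers from code keeps the report consistent with the "metrics calculated, not estimated" requirement.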
Success Criteria
- • Prompt loaded from docs/prompts.md
- • Test data found or gap documented
- • Classifications run for all samples
- • Metrics calculated (not estimated)
- • Failure patterns identified
- • Recommendations provided
Constraints
- •Report actual numbers, not estimates
- •Flag when the sample size is too small for the results to be statistically meaningful
- •Note ambiguous ground truth labels
- •Don't modify prompts directly - only recommend changes
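One way to decide when to flag a small sample is to report a confidence interval alongside the measured accuracy. The sketch below uses the Wilson score interval; that choice, the function names, and the 0.20 width threshold are assumptions for illustration, not requirements of this skill.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an observed accuracy (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def sample_size_warning(correct, total, max_width=0.20):
    """Return a warning string when the interval is wider than max_width."""
    low, high = wilson_interval(correct, total)
    if high - low > max_width:
        return (
            f"n={total}: accuracy {correct / total:.0%} is uncertain "
            f"(95% interval {low:.0%} to {high:.0%})"
        )
    return None
```

With 8 of 10 correct, the interval spans roughly 49% to 94%, so an "80% accuracy" headline on ten samples says little about how the prompt behaves on real traffic.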
Key Files
| File | Purpose |
|---|---|
| docs/prompts.md | Current prompt versions |
| data/labeled_fixtures.json | Labeled test data |
| data/theme_fixtures.json | Theme extraction fixtures |
| tests/fixtures/ | Test fixture directory |
Common Pitfalls
- •Estimating accuracy: Always measure with real data
- •Small sample size: Need minimum 10-20 samples for validity
- •Ignoring ambiguity: Some labels are genuinely ambiguous
- •Not documenting version: Always note which prompt version was tested
Integration with Kai
This skill is typically invoked by Kai (prompt engineering skill) when:
- •Testing new prompt versions
- •Validating prompt modifications
- •Measuring baseline before changes
- •Comparing before/after accuracy
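For before/after comparisons, running both prompt versions over the same fixtures makes it possible to count which samples were fixed and which regressed, not only the change in accuracy. A minimal sketch, assuming each run has been reduced to a list of per-sample correctness flags in the same fixture order (`compare_runs` is a hypothetical helper):

```python
def compare_runs(before_correct, after_correct):
    """Compare two runs over the same fixtures using per-sample correctness flags."""
    if len(before_correct) != len(after_correct):
        raise ValueError("both runs must cover the same fixtures")
    if not before_correct:
        raise ValueError("no samples to compare")
    n = len(before_correct)
    pairs = list(zip(before_correct, after_correct))
    return {
        "before_accuracy": sum(before_correct) / n,
        "after_accuracy": sum(after_correct) / n,
        # Wrong before, right after.
        "fixed": sum(1 for b, a in pairs if not b and a),
        # Right before, wrong after.
        "regressed": sum(1 for b, a in pairs if b and not a),
    }
```

A prompt change that raises accuracy while regressing several previously correct samples deserves a closer look before it is recommended.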
If Blocked
If you cannot proceed:
- •State what's missing (fixtures, prompt, schema)
- •Explain what you've searched for
- •Request specific data needed
- •Provide partial results if available