Prompt Testing
Skill for testing, comparing, and measuring prompt performance.
Documentation
- •metrics.md - Performance metrics definition
- •methodology.md - A/B testing protocol
Testing Workflow
text
1. DEFINE └── Test objective └── Metrics to measure └── Success criteria 2. PREPARE └── Variants A and B └── Test dataset └── Baseline (if existing) 3. EXECUTE └── Run on dataset └── Collect results └── Document observations 4. ANALYZE └── Calculate metrics └── Compare variants └── Identify patterns 5. DECIDE └── Recommendation └── Statistical confidence └── Next iterations
Performance Metrics
Quality
| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |
Efficiency
| Metric | Description | Calculation |
|---|---|---|
| Tokens Input | Prompt size | Token count |
| Tokens Output | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |
Robustness
| Metric | Description | Calculation |
|---|---|---|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resist | Bypass resistance | Blocked / Attempts |
| Error Recovery | Error recovery | Recovered / Errors |
Test Format
Test Dataset
json
{
"name": "Test Dataset v1",
"description": "Dataset for testing prompt XYZ",
"cases": [
{
"id": "case_001",
"type": "standard",
"input": "Test input",
"expected": "Expected output",
"tags": ["basic", "format"]
},
{
"id": "case_002",
"type": "edge_case",
"input": "Edge input",
"expected": "Expected behavior",
"tags": ["edge", "error"]
}
]
}
Test Report
markdown
# A/B Test Report: {{TEST_NAME}}
## Configuration
| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |
## Tested Variants
### Variant A (Baseline)
[Description or link to prompt A]
### Variant B (Challenger)
[Description or link to prompt B]
## Results
### Overall Scores
| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |
### Detail by Case Type
| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |
### Problematic Cases
| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |
## Analysis
### B's Strengths
- [Improvement 1]
- [Improvement 2]
### B's Weaknesses
- [Regression 1]
### Observations
[Qualitative insights]
## Recommendation
**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]
## Next Steps
1. [Action 1]
2. [Action 2]
Commands
bash
# Create a test /prompt test create --name "Test v1" --dataset tests.json # Run an A/B test /prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json # View results /prompt test results --id test_001 # Compare two tests /prompt test compare --tests test_001,test_002
Decision Criteria
When to adopt variant B?
text
IF: - Accuracy B >= Accuracy A AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%) AND no regression on edge cases THEN: → Adopt B ELSE IF: - Accuracy improvement > 10% AND token regression < 20% THEN: → Consider B (acceptable trade-off) ELSE: → Keep A or iterate
Best Practices
- •Minimum 20 test cases for significance
- •Include edge cases (15-20% of dataset)
- •Test multiple runs for consistency
- •Document hypotheses before testing
- •Version the prompts being tested