Prompt Testing

Name: prompt-testing
Rating: 92
Author: fusengine

Skill for testing, comparing, and measuring prompt performance.

Documentation

•metrics.md - Performance metrics definition
•methodology.md - A/B testing protocol

Testing Workflow

text

1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if existing)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations

Performance Metrics

Quality

Metric	Description	Calculation
Accuracy	Correct responses	Correct / Total
Compliance	Format adherence	Compliant / Total
Consistency	Response stability	1 - Variance
Relevance	Meeting the need	Average score (1-5)

Efficiency

Metric	Description	Calculation
Tokens Input	Prompt size	Token count
Tokens Output	Response size	Token count
Latency	Response time	ms
Cost	Price per request	Tokens × Price

Robustness

Metric	Description	Calculation
Edge Cases	Edge case handling	Passed / Total
Jailbreak Resist	Bypass resistance	Blocked / Attempts
Error Recovery	Error recovery	Recovered / Errors

Test Format

Test Dataset

json

{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}

Test Report

markdown

# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence**: High / Medium / Low

**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]

Commands

bash

# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002

Decision Criteria

When to adopt variant B?

text

IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate

Best Practices

•Minimum 20 test cases for significance
•Include edge cases (15-20% of dataset)
•Test multiple runs for consistency
•Document hypotheses before testing
•Version the prompts being tested