# Prompt Iteration Skill
Iteratively improve teaching agent prompts using automated evaluation with local Qwen 2.5-7B as judge.
## When to Use
- After identifying functional issues with the agent
- When prompt changes need validation before deployment
- For systematic prompt engineering with measurable improvement
## Pre-flight Checks
Before starting, verify:
- vLLM is running:

  ```bash
  curl http://localhost:8004/v1/models
  ```

- Golden dataset exists:

  ```bash
  ls tests/golden/
  ```

- Create iteration branch:

  ```bash
  git checkout -b eval/prompt-iteration-$(date +%Y%m%d-%H%M)
  ```
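The three checks can also be run in one pass. A minimal sketch; the failure messages are illustrative:

```bash
# Pre-flight in one pass: judge reachable, golden data present, fresh branch
curl -sf http://localhost:8004/v1/models > /dev/null \
  || { echo "vLLM judge not reachable on :8004"; exit 1; }
[ -d tests/golden ] || { echo "golden dataset missing"; exit 1; }
git checkout -b "eval/prompt-iteration-$(date +%Y%m%d-%H%M)"
```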
## Guardrails (USER-DEFINED)
| Guardrail | Value |
|---|---|
| Max iterations | 5 |
| Success gate | 80% pass rate |
| Circuit breaker 1 | Score drops >20% on any dimension |
| Circuit breaker 2 | 3 consecutive iterations with no improvement |
| Rollback mechanism | Git commit after each iteration |
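If any part of the loop is scripted, the guardrail values can be carried as shell variables. A minimal sketch; the variable names are illustrative and not part of the evaluation CLI:

```bash
MAX_ITERATIONS=5     # hard stop after this many iterations
SUCCESS_GATE=80      # minimum pass rate (%) required on every dimension
REGRESSION_LIMIT=20  # circuit breaker 1: max allowed drop on any dimension
STALL_LIMIT=3        # circuit breaker 2: consecutive iterations with no improvement
```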
## Iteration Loop
### Step 1: Baseline Evaluation
Run evaluation on all dimensions:
```bash
python -m tests.evaluation.cli --dimension all --output baseline.json
```
Record the baseline pass rate for each dimension in a TodoWrite checklist.
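To pull the per-dimension numbers out of the report, something like this works, assuming `baseline.json` exposes a `pass_rate` per dimension (adjust the jq path to the actual report structure):

```bash
# Print "dimension: pass_rate%" for each dimension in the baseline report
jq -r 'to_entries[] | "\(.key): \(.value.pass_rate)%"' baseline.json
```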
### Step 2: Identify Worst Dimension
From baseline.json, find the dimension with the lowest pass rate. Focus on one dimension at a time.
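Under the same assumed report structure as above, the target can be picked mechanically:

```bash
# Dimension with the lowest pass rate becomes this iteration's target
jq -r 'to_entries | min_by(.value.pass_rate) | .key' baseline.json
```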
### Step 3: Analyze Failures
Run verbose evaluation on the failing dimension:
```bash
python -m tests.evaluation.cli --dimension <failing_dimension> --verbose
```
Review failed test cases to understand the failure pattern:
- What is the agent doing wrong?
- Which prompt file controls this behavior?
- What specific language could fix it?
### Step 4: Edit Prompt
Modify the relevant prompt file based on your analysis:
| Dimension | Primary Prompt File |
|---|---|
| language_choice | src/backend/app/prompts/agent/mode_practice.md |
| tool_usage | src/backend/app/prompts/agent/tools.md |
| output_cleanliness | src/backend/app/prompts/agent/tools.md |
| confusion_recovery | src/backend/app/prompts/agent/base.md |
| persona_consistency | src/backend/app/prompts/agent/mode_practice.md |
| curriculum_alignment | src/backend/app/prompts/agent/base.md |
Make targeted, minimal changes. Do not rewrite the entire prompt.
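One way to keep a change honest is to check its footprint before re-evaluating. A sketch, using `mode_practice.md` as the example file; the actual file comes from the table above:

```bash
# Edit only the file mapped to the failing dimension...
$EDITOR src/backend/app/prompts/agent/mode_practice.md
# ...then confirm the diff is small and confined to that file
git diff --stat src/backend/app/prompts/
```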
### Step 5: Re-evaluate
Run evaluation on the dimension you're fixing:
```bash
python -m tests.evaluation.cli --dimension <dimension> --output iteration-N.json
```
### Step 6: Check Circuit Breakers
Compare iteration-N.json to baseline.json (a jq sketch for the regression check follows this checklist):
- Did any dimension drop >20%?
  - YES: `git checkout src/backend/app/prompts/` and STOP
  - NO: Continue
- Did the target dimension improve?
  - YES: Commit and continue
  - NO: Increment no-improvement counter
- 3 consecutive no-improvement iterations?
  - YES: STOP and report
  - NO: Continue
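Circuit breaker 1 can be checked mechanically. A jq sketch, assuming both reports use the per-dimension `pass_rate` structure from Step 1 and cover the same dimensions; it reads the guardrail as a drop of more than 20 percentage points:

```bash
# Flag any dimension that dropped more than 20 points vs. baseline
jq -rs '
  .[0] as $base | .[1] as $new
  | $new | to_entries[]
  | select(($base[.key].pass_rate - .value.pass_rate) > 20)
  | "REGRESSION: \(.key) dropped \($base[.key].pass_rate - .value.pass_rate) points"
' baseline.json iteration-N.json
```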
### Step 7: Commit if Improved
If pass rate improved:
```bash
git add src/backend/app/prompts/
git commit -m "prompt: improve <dimension> pass rate N% -> M%

- [Description of what changed]
- [Why this should help]

Tested with: python -m tests.evaluation.cli --dimension <dimension>
"
```
### Step 8: Repeat or Stop
Continue the loop until one of these conditions is met (a loop-control sketch follows the list):
- SUCCESS: All dimensions >= 80% pass rate
- MAX ITERATIONS: 5 iterations completed
- REGRESSION: Any dimension dropped >20% from baseline
- STALLED: 3 consecutive iterations with no improvement
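If you prefer the stop conditions as explicit loop control rather than a checklist, a rough skeleton (variable names match the guardrail sketch above; the body is Steps 2-7):

```bash
iteration=0; stalled=0; status=RUNNING
while (( iteration < MAX_ITERATIONS )) && (( stalled < STALL_LIMIT )) && [ "$status" = RUNNING ]; do
  iteration=$((iteration + 1))
  # ... Steps 2-7: pick the worst dimension, edit, re-evaluate, check circuit breakers ...
  # set status=SUCCESS once every dimension >= SUCCESS_GATE
  # set status=REGRESSION (and revert) if any dimension drops > REGRESSION_LIMIT
  # bump stalled when the target dimension does not improve, else reset it to 0
done
echo "Stopped after $iteration iteration(s): $status"
```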
## Output Format
After the loop completes, produce this summary:
```markdown
## Evaluation Summary

**Baseline**: 45% overall
**Final**: 82% overall
**Iterations**: 4
**Status**: SUCCESS / STOPPED (reason)

### Dimension Results

| Dimension | Baseline | Final | Change |
|-----------|----------|-------|--------|
| language_choice | 40% | 85% | +45% |
| tool_usage | 60% | 95% | +35% |
| output_cleanliness | 50% | 90% | +40% |
| confusion_recovery | 30% | 70% | +40% |
| persona_consistency | 40% | 80% | +40% |
| curriculum_alignment | 50% | 75% | +25% |

### Commits Made

- `abc123` prompt: improve language_choice 40% -> 65%
- `def456` prompt: improve tool_usage 60% -> 95%
- `ghi789` prompt: improve output_cleanliness 50% -> 90%

### Next Steps

- [If success] Ready to merge to main
- [If stopped] Manual review needed for dimension X
```
## Quick Reference
```bash
# Full evaluation
python -m tests.evaluation.cli --dimension all

# Single dimension
python -m tests.evaluation.cli --dimension language_choice --verbose

# Check judge availability
python -m tests.evaluation.cli --check

# List test cases
python -m tests.evaluation.cli --list

# A/B comparison (if testing prompt variants)
python -m tests.evaluation.cli --compare prompts/v1.md prompts/v2.md
```
## Prompt Files Reference
```
src/backend/app/prompts/agent/
├── base.md           # Core rules, language handling, confusion recovery
├── mode_help.md      # Vocabulary helper mode behavior
├── mode_practice.md  # Conversation practice mode behavior
└── tools.md          # Tool definitions (speak, render_vocabulary, etc.)
```