Prompt Evolution
Test-driven, iterative optimization of the polish prompt in `core/src/polish.rs`. A regression test suite guards every prompt change, so each accepted change leaves quality the same or better.
Key Paths
- •Prompt source: `core/src/polish.rs` (the `prompt` variable in `polish_text()`)
- •Test suite: `tests/polish-prompt/suite.json`
- •Test cases: `tests/polish-prompt/cases/NNN-slug/`
- •Evolution log: `tests/polish-prompt/evolution-log.md`
- •Evaluation criteria: Load `references/evaluation-criteria.md` before scoring
Workflow 1: Add Test Case
When the user wants to add a new regression test case.
Steps
- •Read `tests/polish-prompt/suite.json` to determine the next case number
- •Ask the user for:
  - •The raw speech transcript (input)
  - •The expected polished output
  - •Optional: context parameter
- •Create the case directory `tests/polish-prompt/cases/NNN-slug/` with:
  - •`input.txt`: raw transcript
  - •`expected.txt`: expected output
  - •`context.txt`: context parameter (only if provided)
- •Add the case entry to `suite.json` (set `has_context` to `false` when the case has no `context.txt`):
```json
{
  "id": "NNN-slug",
  "description": "Brief description of what this case tests",
  "has_context": true,
  "scores": []
}
```
- •Run the new case through the CLI to establish a baseline score:
```bash
cat tests/polish-prompt/cases/NNN-slug/input.txt | cargo run -p diy_typeless_cli -- polish
```
If the case has context:
```bash
cat tests/polish-prompt/cases/NNN-slug/input.txt | cargo run -p diy_typeless_cli -- polish --context "$(cat tests/polish-prompt/cases/NNN-slug/context.txt)"
```
- •Score the output against `expected.txt` using `references/evaluation-criteria.md`
- •Record the baseline score in `suite.json`
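The "next case number" step can be sketched as a small helper. This is a minimal sketch, not code from the repository: `next_case_id` is a hypothetical name, and it assumes every id in `suite.json` starts with a zero-padded three-digit number followed by `-`, as the `NNN-slug` pattern suggests.

```rust
/// Returns the next case id in `NNN-slug` form, given the ids already in the suite.
/// Assumption: each existing id begins with a numeric prefix followed by '-'.
/// Ids without a parseable numeric prefix are ignored.
fn next_case_id(existing_ids: &[&str], slug: &str) -> String {
    let max_seen = existing_ids
        .iter()
        .filter_map(|id| id.split('-').next()?.parse::<u32>().ok())
        .max()
        .unwrap_or(0);
    // Zero-pad to three digits to match the NNN convention.
    format!("{:03}-{}", max_seen + 1, slug)
}
```

Taking the maximum prefix rather than the entry count keeps numbering stable if a case is ever removed from the suite.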
Workflow 2: Run Regression Tests
Run all test cases against the current prompt and report results.
Steps
- •Read `tests/polish-prompt/suite.json` to get all cases
- •Load `references/evaluation-criteria.md` for scoring criteria
- •For each case:
  a. Read `input.txt` (and `context.txt` if `has_context` is true)
  b. Run through CLI:
```bash
cat tests/polish-prompt/cases/{id}/input.txt | cargo run -p diy_typeless_cli -- polish
```
  Or with context:
```bash
cat tests/polish-prompt/cases/{id}/input.txt | cargo run -p diy_typeless_cli -- polish --context "$(cat tests/polish-prompt/cases/{id}/context.txt)"
```
  c. Read `expected.txt`
  d. Score the actual output against expected using evaluation criteria
  e. Compare with the last score in the case's `scores` array
- •Report results as a table:
```
| Case | Description | Score | Previous | Delta |
|------|-------------|-------|----------|-------|
| 001  | ...         | 8/10  | 7/10     | +1    |
```
- •Flag any regressions (score dropped by 2+ or human-unacceptable quality change)
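The comparison and reporting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: scores are treated as plain integers out of 10 (the element format of the `scores` array is not specified here), and `is_regression` and `report_row` are hypothetical names. The "human-unacceptable" check is a judgment call and is deliberately left out of the code.

```rust
/// True when the score dropped by 2 or more points versus the previous run.
/// A case with no previous score (fresh baseline) is never a regression.
/// Note: this covers only the numeric rule; human-unacceptable quality
/// changes still need a manual review.
fn is_regression(previous: Option<i32>, current: i32) -> bool {
    match previous {
        Some(prev) => current - prev <= -2,
        None => false,
    }
}

/// Formats one row of the results table.
/// Assumption: scores are integers on a 10-point scale.
fn report_row(case: &str, description: &str, previous: Option<i32>, current: i32) -> String {
    let (prev_cell, delta_cell) = match previous {
        // `{:+}` always prints the sign, e.g. +1, +0, -2.
        Some(prev) => (format!("{}/10", prev), format!("{:+}", current - prev)),
        None => ("-".to_string(), "-".to_string()),
    };
    format!(
        "| {} | {} | {}/10 | {} | {} |",
        case, description, current, prev_cell, delta_cell
    )
}
```

A one-point drop returns `false` here, which matches the variance tolerance described under Important Notes.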
Workflow 3: Optimize Prompt
The core iterative loop. Modify the polish prompt to improve quality.
Steps
- •Identify target: Ask the user what to improve, or run Workflow 2 to find weak spots
- •Diagnose: Read the current prompt in `core/src/polish.rs` and analyze why it produces suboptimal results for the target cases
- •Propose changes: Draft specific prompt modifications and explain the rationale. Present them to the user for approval; do NOT modify the prompt without confirmation
- •Apply: After user approval, edit `core/src/polish.rs` with the prompt changes
- •Verify: Run Workflow 2 (full regression suite)
- •Evaluate results:
  - •If any case regresses significantly (score drop of 2+ or human-unacceptable): roll back the change via `git checkout core/src/polish.rs` and report failure
  - •If target cases improve without regression: proceed
- •Record: Append to `tests/polish-prompt/evolution-log.md`:
```markdown
## [Date] — Brief title
**Target**: What we tried to improve
**Change**: What was modified in the prompt
**Result**: Score changes summary
**Constraint discovered**: Any new insight about what the prompt must preserve
```
- •Update `suite.json` with new scores for all cases
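The Record step can be sketched as a formatter plus an append-only write. This is a minimal sketch, not the project's tooling: `LogEntry`, `render_entry`, and `append_entry` are hypothetical names, and placing each field on its own line is an assumption about the entry template's layout.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Fields of one evolution-log entry, mirroring the entry template.
struct LogEntry<'a> {
    date: &'a str,
    title: &'a str,
    target: &'a str,
    change: &'a str,
    result: &'a str,
    constraint: &'a str,
}

/// Renders an entry as Markdown. The heading separator copies the template;
/// the one-field-per-line layout is an assumption.
fn render_entry(e: &LogEntry) -> String {
    format!(
        "## {} — {}\n**Target**: {}\n**Change**: {}\n**Result**: {}\n**Constraint discovered**: {}\n",
        e.date, e.title, e.target, e.change, e.result, e.constraint
    )
}

/// Appends the rendered entry to the log file, creating the file if needed.
/// Opening in append mode leaves earlier entries untouched.
fn append_entry(path: &str, e: &LogEntry) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // A leading newline separates this entry from the previous one.
    write!(file, "\n{}", render_entry(e))
}
```

Appending rather than rewriting keeps the log a chronological record, which Workflow 4 relies on when counting iterations.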
Rollback Protocol
If regression is detected:
- •Immediately roll back: `git checkout core/src/polish.rs`
- •Report which cases regressed and by how much
- •Analyze why the change caused regression
- •Suggest alternative approaches that might avoid the regression
Workflow 4: View History
Review the prompt evolution trajectory.
Steps
- •Read and display `tests/polish-prompt/evolution-log.md`
- •Optionally read `suite.json` to show score trends per case
- •Summarize:
  - •Total iterations
  - •Overall score trajectory
  - •Key constraints discovered
  - •Current weak spots
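The score-trend part of the summary can be sketched from the per-case `scores` arrays. This is a sketch under assumptions: scores are integers stored oldest-first, and `CaseHistory`, `net_change`, `weak_spots`, and the `threshold` parameter are hypothetical; this document does not define a cutoff for what counts as a weak spot.

```rust
/// Score history for one case.
/// Assumption: `scores` holds integers, oldest first.
struct CaseHistory<'a> {
    id: &'a str,
    scores: &'a [i32],
}

/// Net change from the baseline (first score) to the latest score.
/// Returns None when the case has no recorded scores.
fn net_change(scores: &[i32]) -> Option<i32> {
    Some(*scores.last()? - *scores.first()?)
}

/// Ids of cases whose latest score is below `threshold`, lowest score first.
/// Cases without any score are skipped.
fn weak_spots<'a>(cases: &[CaseHistory<'a>], threshold: i32) -> Vec<&'a str> {
    let mut weak: Vec<(i32, &'a str)> = cases
        .iter()
        .filter_map(|c| c.scores.last().map(|&s| (s, c.id)))
        .filter(|&(s, _)| s < threshold)
        .collect();
    // Tuples sort by score first, then by id for ties.
    weak.sort();
    weak.into_iter().map(|(_, id)| id).collect()
}
```

The weak-spot list is a natural input for the "Identify target" step of Workflow 3.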
Important Notes
- •LLM output variance: Score fluctuations of 1 point that remain human-acceptable are NOT regressions. Only flag drops of 2+ or quality changes a human would reject.
- •Human in the loop: Always get user confirmation before modifying `core/src/polish.rs`.
- •Close the loop: Always verify via the CLI; never assume a prompt change works based on reasoning alone.