Run Improvement Loop
Analyze failed queries from an evaluation run and generate improvements.
Usage
```
/improve [run_id]
```
Process
- Load Failures: Read failed evaluations from `evals/results/<run_id>/`
- For Each Failure:
  - Spawn `improver` agent with failure context
  - Agent diagnoses root cause
  - Agent generates appropriate improvement
  - Agent writes to `knowledge/`
- Verify: Check that improvements were written correctly
- Report: Summarize what was learned
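The "Load Failures" step could look roughly like the sketch below. The result-file schema is not specified here, so the sketch assumes each `*.json` file holds one JSON object with a boolean `passed` field; both the `passed` field and the `load_failures` helper are hypothetical names used for illustration.

```python
import json
from pathlib import Path


def load_failures(run_id: str, results_root: str = "evals/results") -> list[dict]:
    """Return the parsed result files that did not pass, sorted by file name."""
    failures = []
    for path in sorted(Path(results_root, run_id).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: a result counts as failed unless it says "passed": true.
        if not record.get("passed", False):
            failures.append(record)
    return failures
```

Each returned record would then be handed to the `improver` agent as its failure context.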
Workflow
```
/improve run_001
│
├─> Load failures from evals/results/run_001/
│
├─> For each failure:
│   ├─> Analyze: What went wrong?
│   ├─> Categorize: Example? Function? Doc?
│   └─> Write: Update knowledge/
│
└─> Report: "Added 3 examples, 1 function, 2 doc entries"
```
Output
```markdown
# Improvement Report: [run_id]

## Summary
- Failures analyzed: N
- Improvements generated: M
  - Examples: X
  - Functions: Y
  - Documentation: Z

## Changes Made

### knowledge/examples.md
- Added: [pattern name] for query [q_id]

### knowledge/functions.py
- Added: `function_name()` - [description]

### knowledge/schema.md
- Added: [data fact or edge case discovery]

## Next Steps
- Re-run evaluation to verify improvements
- Command: `/run-baseline test`
```
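As an illustration of how the Summary counts in this template might be filled in, here is a minimal sketch. It assumes each improvement is tagged with one of three hypothetical kind labels (`"example"`, `"function"`, `"doc"`); the `render_summary` helper and those labels are not part of the command itself.

```python
from collections import Counter


def render_summary(run_id: str, failures_analyzed: int, kinds: list[str]) -> str:
    """Build the report title and Summary section from a list of improvement kinds."""
    counts = Counter(kinds)  # missing kinds count as 0
    lines = [
        f"# Improvement Report: {run_id}",
        "",
        "## Summary",
        f"- Failures analyzed: {failures_analyzed}",
        f"- Improvements generated: {len(kinds)}",
        f"  - Examples: {counts['example']}",
        f"  - Functions: {counts['function']}",
        f"  - Documentation: {counts['doc']}",
    ]
    return "\n".join(lines)
```

For instance, three examples, one function, and two doc entries would give "Improvements generated: 6" with the per-kind counts listed underneath.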
Files Read
- `evals/results/<run_id>/*.json` - Failed evaluation results
- `knowledge/*` - Existing knowledge (to avoid duplicates)
Files Written
- `knowledge/examples.md` - Worked examples
- `knowledge/functions.py` - Helper functions
- `knowledge/schema.md` - Schema/data discoveries
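One simple way to honor the "avoid duplicates" requirement when writing to these files is to check the existing content before appending. The sketch below is an assumption about how that could be done, not the command's actual mechanism; `append_if_new` is a hypothetical helper, and its exact-substring check is deliberately crude (it will not catch reworded duplicates).

```python
from pathlib import Path


def append_if_new(path: str, entry: str) -> bool:
    """Append entry to the file unless identical text is already present."""
    target = Path(path)
    existing = target.read_text(encoding="utf-8") if target.exists() else ""
    if entry.strip() in existing:
        return False  # already recorded, skip the write
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        # Blank line after each entry keeps entries visually separated.
        handle.write(entry.strip() + "\n\n")
    return True
```

The boolean return value lets the caller count only the entries that were actually written when building the report.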