Run an evaluation pipeline. The user will specify which cells and how many runs.
Steps
- Parse the request: Identify cell profiles, run count, and ego model (if specified).
  - Cell format: `cell_1_base_single_unified`, `cell_5_recog_single_unified`, etc.
  - Model format: dot notation like `openrouter.nemotron` or `openrouter.kimi-k2.5`
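The parsing step can be sketched as a small validator. This is only an illustration: `parseRequest`, the comma-separated input, and both patterns are assumptions inferred from the example names in this step, not taken from the CLI source.

```javascript
// Hypothetical helper. The patterns are guessed from the two example cell
// names and the two example model aliases above; adjust them to the real rules.
const CELL_PATTERN = /^cell_\d+_[a-z0-9_]+$/;
const MODEL_PATTERN = /^[a-z0-9-]+\.[a-z0-9.-]+$/; // provider.model, no slash

function parseRequest({ cells, runs, model }) {
  const errors = [];

  // Assumption: cells arrive as one comma-separated string.
  const cellList = cells.split(',').map((c) => c.trim()).filter(Boolean);
  if (cellList.length === 0) errors.push('no cell profiles given');
  for (const cell of cellList) {
    if (!CELL_PATTERN.test(cell)) errors.push(`unexpected cell name: ${cell}`);
  }

  const runCount = Number(runs);
  if (!Number.isInteger(runCount) || runCount < 1) {
    errors.push(`run count must be a positive integer: ${runs}`);
  }

  // The ego model is optional; validate it only when one was given.
  if (model !== undefined && !MODEL_PATTERN.test(model)) {
    errors.push(`model must use dot notation: ${model}`);
  }

  return { cells: cellList, runs: runCount, model: model ?? null, errors };
}
```

An empty `errors` array means the request looks well-formed; it says nothing about whether the cells or the model actually exist, which is what the pre-flight checks are for.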
- Pre-flight checks:
  - Verify cells exist: `grep "$CELL_NAME" config/tutor-agents.yaml`
  - Check model availability: `node scripts/test-rate-limit.js <model-alias>`
  - Confirm with user before starting (runs cost API credits)
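When several cells are requested, the existence check can be done in one pass. The sketch below mirrors the substring matching that `grep` performs; `findMissingCells` is a hypothetical helper, and it makes no assumption about the YAML layout beyond each cell name appearing somewhere in the file.

```javascript
// Hypothetical helper: returns the cell names that appear on no line of the
// config text. Like plain grep, this is a substring match, so a short prefix
// such as "cell_1_base" also matches "cell_1_base_single_unified".
function findMissingCells(configText, cellNames) {
  const lines = configText.split(/\r?\n/);
  return cellNames.filter(
    (name) => !lines.some((line) => line.includes(name))
  );
}
```

In practice `configText` would be the contents of `config/tutor-agents.yaml`, for example read with `fs.readFileSync`. A non-empty result is a reason to stop and ask the user before spending API credits.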
- Run generation (skip rubric for speed):
  `node scripts/eval-cli.js run --profiles <cells> --runs N --skip-rubric`
- Note the run ID from output, then start judging:
  `node scripts/eval-cli.js evaluate <runId> --force --follow`
- Report results when complete:
  `sqlite3 data/evaluations.db "SELECT tutor_profile, COUNT(*), ROUND(AVG(overall_score),1), ROUND(STDEV(overall_score),1) FROM evaluation_results WHERE run_id = '<runId>' AND overall_score IS NOT NULL GROUP BY tutor_profile"`
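The report query relies on a `STDEV` aggregate, which is not among SQLite's built-in aggregate functions as far as I know and usually comes from a loaded extension. If the local `sqlite3` rejects it, the same per-profile summary can be computed from raw rows. The sketch below is one way to do that; `summarizeScores` is a hypothetical helper, the row shape simply reuses the column names from the query, and the choice of sample standard deviation (dividing by n - 1) is an assumption.

```javascript
// Round to one decimal place, as ROUND(x, 1) does for non-negative scores.
function round1(x) {
  return Math.round(x * 10) / 10;
}

// Hypothetical helper: rows look like { tutor_profile, overall_score }.
function summarizeScores(rows) {
  const byProfile = new Map();
  for (const { tutor_profile, overall_score } of rows) {
    // Same filter as "overall_score IS NOT NULL" in the query.
    if (overall_score === null || overall_score === undefined) continue;
    if (!byProfile.has(tutor_profile)) byProfile.set(tutor_profile, []);
    byProfile.get(tutor_profile).push(overall_score);
  }

  return [...byProfile].map(([profile, scores]) => {
    const n = scores.length;
    const mean = scores.reduce((a, b) => a + b, 0) / n;
    // Assumption: sample standard deviation; undefined for a single score.
    const variance =
      n > 1 ? scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1) : null;
    return {
      profile,
      n,
      mean: round1(mean),
      stdev: variance === null ? null : round1(Math.sqrt(variance)),
    };
  });
}
```

Rows could be exported with a plain `SELECT tutor_profile, overall_score ...` using the same `WHERE run_id` filter and then passed to this function.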
Critical notes
- CLI model format is dot notation: `openrouter.nemotron`, NOT `openrouter/nemotron`
- CLI uses `--runs` NOT `--repeats`
- Always confirm cell names and run count with user before executing
- For incomplete runs, use `resume` not `run`: `node scripts/eval-cli.js resume <runId>`
- New cells need registration in `EVAL_ONLY_PROFILES` array in `services/evaluationRunner.js`
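The first two notes describe mistakes that are easy to catch mechanically before a command is run. The following is a rough sketch of such a check; `lintCliArgs` is a hypothetical helper, not part of `eval-cli.js`. Because this document does not name the flag that carries the model alias, the check scans every argument for a `provider/model` shape.

```javascript
// Hypothetical pre-run lint. `args` is the argument list that follows
// "scripts/eval-cli.js" (the script path itself must not be included,
// since it would look like a slash-style alias).
function lintCliArgs(args) {
  const warnings = [];

  if (args.some((a) => a === '--repeats' || a.startsWith('--repeats='))) {
    warnings.push('use --runs, not --repeats');
  }

  for (const arg of args) {
    // Matches values shaped like "openrouter/nemotron".
    if (/^[a-z0-9-]+\/[a-z0-9.-]+$/i.test(arg)) {
      warnings.push(
        `"${arg}" uses a slash; dot notation would be ${arg.replace('/', '.')}`
      );
    }
  }

  return warnings;
}
```

An empty result does not replace confirming cell names and run count with the user; it only filters out the two formatting errors listed above.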