Analyze Eval
When to use
- •User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
- •User asks "why did this eval fail?" or "what went wrong with this eval?"
- •User references a specific eval ID
Step 1: Extract the eval ID from the URL
The visualizer URL pattern is:
/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps
- •$runId: the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
- •$evalId: the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)
You need the evalId to query.
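As a rough illustration of the extraction, the sketch below pulls the IDs out of a URL that follows the pattern above. The helper name `parseEvalUrl` and the `EvalUrlParts` type are hypothetical and not part of the project; the code assumes only the path layout shown in this step.

```typescript
// Hypothetical helper (not part of the project): splits a visualizer URL or
// path into its IDs, assuming the layout
// /experiment/$experimentId/run/$runId/$category/$evalId
interface EvalUrlParts {
  experimentId: string;
  runId: string;
  category: string;
  evalId: string;
}

function parseEvalUrl(url: string): EvalUrlParts | null {
  // Drop the query string or fragment (e.g. "?tab=steps") before matching.
  const queryStart = url.search(/[?#]/);
  const path = queryStart === -1 ? url : url.slice(0, queryStart);

  const match = path.match(
    /\/experiment\/([^/]+)\/run\/([^/]+)\/([^/]+)\/([^/]+)\/?$/
  );
  if (match === null) return null;

  const [, experimentId, runId, category, evalId] = match;
  return { experimentId, runId, category, evalId };
}
```

A `null` result means the input did not follow the visualizer pattern, in which case the eval ID has to come from the user directly.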
Step 2: Query the debug action
Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):
npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'
This returns a JSON object with:
| Field | Contents |
|---|---|
| eval | Name, category, evalPath, status (pass/fail + failure reason), task text |
| run | Model name, provider, experiment name, run status |
| steps | Array of step results: filesystem, install, deploy, tsc, eslint, tests, each with pass/fail/skipped and failure reason |
| outputFiles | Map of file path -> file content from the model's generated output (unzipped) |
| evalSourceFiles | Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.) |
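If the JSON is saved and inspected programmatically, a loose type plus a sanity check can catch a truncated or mistyped response early. This is only a sketch: the five top-level field names come from the table above, while the inner shapes are left as `unknown` because they are not specified here, and `missingFields` is a hypothetical helper.

```typescript
// Assumed top-level shape; only the five field names come from the table.
type FileMap = Record<string, string>; // file path -> file content

interface EvalDebugInfo {
  eval: unknown; // name, category, evalPath, status, task text
  run: unknown; // model name, provider, experiment name, run status
  steps: unknown[]; // per-step results
  outputFiles: FileMap;
  evalSourceFiles: FileMap;
}

const REQUIRED_FIELDS: Array<keyof EvalDebugInfo> = [
  "eval",
  "run",
  "steps",
  "outputFiles",
  "evalSourceFiles",
];

// Hypothetical helper: returns the names of any top-level fields that are
// absent from a parsed response.
function missingFields(value: unknown): Array<keyof EvalDebugInfo> {
  if (typeof value !== "object" || value === null) return [...REQUIRED_FIELDS];
  const record = value as Record<string, unknown>;
  return REQUIRED_FIELDS.filter((field) => !(field in record));
}
```

An empty result means all five fields are present; it says nothing about whether their contents are well-formed.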
Step 3: Analyze the failure
With the data returned, compare:
- •Which step failed? Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
- •What did the model generate? Look at outputFiles for the model's code.
- •What was expected? Look at evalSourceFiles for the answer directory and grader test files.
- •What was the task? Check eval.task for the TASK.txt content.
Common failure patterns:
- •eslint fail: Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
- •tsc fail: TypeScript compilation error. Check the failure reason for the specific type error.
- •convex dev fail: Schema or function definition issues that prevent Convex from deploying.
- •tests fail: The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.
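For the comparison against the answer, a quick file-level pass can narrow down where to look before reading any code. The sketch below assumes that reference files appear in evalSourceFiles under an `answer/` prefix and that outputFiles uses the same relative paths; both are assumptions about the key layout, and `compareWithAnswer` is a hypothetical helper.

```typescript
type FileMap = Record<string, string>; // file path -> file content

interface FileComparison {
  missing: string[]; // present in the answer, absent from the model output
  changed: string[]; // present in both, with different content
}

// Hypothetical helper. Assumes answer files are keyed as "answer/<path>" and
// model output files as "<path>"; adjust the prefix to the real key layout.
function compareWithAnswer(
  outputFiles: FileMap,
  evalSourceFiles: FileMap,
  answerPrefix: string = "answer/"
): FileComparison {
  const result: FileComparison = { missing: [], changed: [] };
  for (const sourcePath of Object.keys(evalSourceFiles).sort()) {
    if (!sourcePath.startsWith(answerPrefix)) continue; // skip grader, TASK.txt, etc.
    const relativePath = sourcePath.slice(answerPrefix.length);
    const generated = outputFiles[relativePath];
    if (generated === undefined) {
      result.missing.push(relativePath);
    } else if (generated !== evalSourceFiles[sourcePath]) {
      result.changed.push(relativePath);
    }
  }
  return result;
}
```

A file that differs from the answer is not a failure in itself, since many correct solutions differ from the reference; the lists only point at candidates to read alongside the failure reason.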
Step 4: Classify and report findings
Classify the failure as one of:
- •MODEL_FAULT: The model genuinely got it wrong
- •OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
- •AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
- •KNOWN_GAP: Check evalSourceFiles for a GAPS.txt that documents this issue
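The KNOWN_GAP check is the only one of the four that can be partly mechanized. A minimal sketch, assuming GAPS.txt may sit at the root of evalSourceFiles or in a subdirectory (the exact location is not stated here, and `findGapsFiles` is a hypothetical name):

```typescript
// Hypothetical helper: returns the paths of any GAPS.txt files among the
// eval source files, whether at the root or nested in a directory.
function findGapsFiles(evalSourceFiles: Record<string, string>): string[] {
  return Object.keys(evalSourceFiles).filter(
    (path) => path === "GAPS.txt" || path.endsWith("/GAPS.txt")
  );
}
```

Finding the file is not enough to classify the failure as KNOWN_GAP; its content still has to describe the issue that actually caused the failure.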
Summarize:
- •The eval name, model, and experiment
- •Which step failed and the exact error
- •The classification and reasoning
- •The relevant code from the model output that caused the failure
- •What the correct code should look like (from the answer/eval source)
- •Whether any action is recommended (config change, task clarification, etc.)
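One way to keep reports consistent is to fill a fixed structure before writing the summary. The sketch below mirrors the list above; the `FailureReport` field names and the plain-text layout are illustrative choices, not a format the project prescribes.

```typescript
type Classification =
  | "MODEL_FAULT"
  | "OVERLY_STRICT"
  | "AMBIGUOUS_TASK"
  | "KNOWN_GAP";

// Illustrative structure; the field names are assumptions that follow the
// summary list, not an existing project type.
interface FailureReport {
  evalName: string;
  model: string;
  experiment: string;
  failedStep: string;
  error: string; // the exact error text from the failed step
  classification: Classification;
  reasoning: string;
  modelCode: string; // the relevant excerpt from the model output
  expectedCode: string; // the matching excerpt from the answer/eval source
  recommendedAction?: string; // e.g. config change or task clarification
}

// Renders the report as plain text, one item per line.
function formatReport(report: FailureReport): string {
  const lines = [
    `Eval: ${report.evalName} (model: ${report.model}, experiment: ${report.experiment})`,
    `Failed step: ${report.failedStep}`,
    `Error: ${report.error}`,
    `Classification: ${report.classification} (${report.reasoning})`,
    "Model output:",
    report.modelCode,
    "Expected:",
    report.expectedCode,
    `Recommended action: ${report.recommendedAction ?? "none"}`,
  ];
  return lines.join("\n");
}
```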