Analyze Eval

When to use

•User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
•User asks "why did this eval fail?" or "what went wrong with this eval?"
•User references a specific eval ID

The visualizer URL pattern is:

code

/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps

•$runId — the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
•$evalId — the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need the evalId to query.

Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):

bash

npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'

This returns a JSON object with:

Field	Contents
`eval`	Name, category, evalPath, status (pass/fail + failure reason), task text
`run`	Model name, provider, experiment name, run status
`steps`	Array of step results: filesystem, install, deploy, tsc, eslint, tests — each with pass/fail/skipped and failure reason
`outputFiles`	Map of file path -> file content from the model's generated output (unzipped)
`evalSourceFiles`	Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.)

With the data returned, compare:

•Which step failed? — Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
•What did the model generate? — Look at outputFiles for the model's code.
•What was expected? — Look at evalSourceFiles for the answer directory and grader test files.
•What was the task? — Check eval.task for the TASK.txt content.

Common failure patterns:

•eslint fail — Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
•tsc fail — TypeScript compilation error. Check the failure reason for the specific type error.
•convex dev fail — Schema or function definition issues that prevent Convex from deploying.
•tests fail — The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.

Classify the failure as one of:

•MODEL_FAULT: The model genuinely got it wrong
•OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
•AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
•KNOWN_GAP: Check evalSourceFiles for a GAPS.txt that documents this issue

Summarize: