Dual-Judge Evaluation
Run two LLM judges against a task's output to verify quality before advancing.
Arguments
The user invoked this command with: $ARGUMENTS
Arguments can be:
- A commit SHA: `/judge a5f7dce`
- A commit SHA + spec path: `/judge a5f7dce specs/my-task.md`
- Nothing (uses HEAD): `/judge`
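The three invocation forms above can be handled by a small parser. This is a minimal sketch, assuming the raw argument string arrives as a single parameter; `parse_judge_args`, `JUDGE_COMMIT`, and `JUDGE_SPEC` are hypothetical names used for illustration and are not part of the command itself.

```shell
# Hypothetical helper: split the raw argument string into a commit and an
# optional spec path. Results land in JUDGE_COMMIT and JUDGE_SPEC.
parse_judge_args() {
  # Unquoted on purpose: word-split the single string into positional
  # parameters. Assumes the arguments contain no glob characters.
  set -- $1
  JUDGE_COMMIT="${1:-HEAD}"   # no SHA given: fall back to HEAD
  JUDGE_SPEC="${2:-}"         # empty when no spec path was passed
}

parse_judge_args "a5f7dce specs/my-task.md"
echo "commit=$JUDGE_COMMIT spec=$JUDGE_SPEC"
# prints: commit=a5f7dce spec=specs/my-task.md
```

An empty `JUDGE_SPEC` signals that the spec still has to be located by other means.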
What To Do
- Resolve the commit. If `$ARGUMENTS` contains a SHA, use it. Otherwise use HEAD:
  ```bash
  git -C /home/clawd log --oneline -1 HEAD
  ```
- Find the spec. If a spec path was given, use it. Otherwise, check:
  - Is there a spec file mentioned in recent context?
  - Check `systems/orchestrator/logs/` for the latest build's task entries
  - If no spec is found, ask the human: "Which spec file should I judge this commit against?"
- Find the taskboard. Look for the most relevant one:
  - Check for `IMPLEMENTATION_PLAN.md` or `taskboard.md` in the working directory
  - If none found, create a minimal one from the commit message
- Get the auth token:
  ```bash
  OPENCLAW_TOKEN=$(python3 -c "import json; print(json.load(open('/home/dcarmitage/.openclaw/openclaw.json'))['gateway']['auth']['token'])")
  ```
- Run both judges in parallel:
  ```bash
  # Save diff to temp file (process substitution doesn't work reliably)
  git -C /home/clawd show <commit> --format="" > /tmp/judge_diff.txt

  OPENCLAW_TOKEN=$OPENCLAW_TOKEN bash /home/clawd/evals/logic_judge.sh \
    --spec <spec_path> \
    --diff /tmp/judge_diff.txt \
    --test-output "<any test output from the build, or describe what was verified>"

  OPENCLAW_TOKEN=$OPENCLAW_TOKEN bash /home/clawd/evals/consistency_judge.sh \
    --spec <spec_path> \
    --taskboard <taskboard_path> \
    --diff /tmp/judge_diff.txt \
    --changed-files "$(git -C /home/clawd show <commit> --name-only --format="" | tr '\n' ',')"
  ```
- Present results clearly. Use a formatted summary like:
  ```
  DUAL-JUDGE EVALUATION — Commit <sha> "<commit message>"

  LOGIC JUDGE: X.X/10 (TIER)
    ✓/✗ [severity] claim text
    ...

  CONSISTENCY JUDGE: X.X/10 (TIER)
    ✓/✗ check name: X/10
    ...

  VERDICT: PASS/FAIL (threshold: >= 8.0 both judges)
  ```
- If either judge fails (< 8.0): Tell the human what failed and why. Reference E6: failed judges require root cause analysis, not retries.
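The parallel run and the pass/fail rule from the steps above can be sketched in shell. This is a minimal sketch under stated assumptions, not the judge scripts' actual behavior: `run_in_parallel` and `judge_verdict` are hypothetical helper names, the two scores are assumed to have already been extracted from each judge's output as plain decimals, and the 8.0 default is taken from the verdict line of the summary format.

```shell
# Hypothetical helper: run two command strings concurrently, capturing each
# command's output in its own file. Returns non-zero if either command fails.
# $1/$2 are the command strings, $3/$4 the matching output paths.
run_in_parallel() {
  sh -c "$1" > "$3" 2>&1 &
  pid_a=$!
  sh -c "$2" > "$4" 2>&1 &
  pid_b=$!
  status=0
  wait "$pid_a" || status=1
  wait "$pid_b" || status=1
  return "$status"
}

# Hypothetical helper: print PASS only when both scores meet the threshold.
# awk does the decimal comparison, which `[ ... ]` cannot do on its own.
# $1 = logic score, $2 = consistency score, $3 = optional threshold.
judge_verdict() {
  if awk -v l="$1" -v c="$2" -v t="${3:-8.0}" \
      'BEGIN { exit !((l + 0 >= t + 0) && (c + 0 >= t + 0)) }'; then
    echo PASS
  else
    echo FAIL
  fi
}
```

For example, `judge_verdict 8.4 7.9` prints `FAIL` because the second score is below the threshold. Writing each judge's output to a separate file keeps the two streams from interleaving while both run.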
Important
- Always save the diff to a temp file. Do NOT use bash process substitution (`<(...)`); it doesn't reliably pass content to the scripts.
- Always set `OPENCLAW_TOKEN` from the config file.
- Present results in human-readable format, not raw JSON.
- If either judge scores below 8.0, the task is blocked. Don't sugarcoat it.
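The temp-file rule can also be met with a uniquely named file, which may help if two evaluations ever run at the same time. This is a minimal sketch of that variation: `save_diff_to_temp` is a hypothetical helper, and the fixed `/tmp/judge_diff.txt` path works equally well for a single run.

```shell
# Hypothetical helper: copy stdin into a uniquely named temp file and print
# the file's path, so the path can be passed to a script's --diff option.
save_diff_to_temp() {
  tmp=$(mktemp "${TMPDIR:-/tmp}/judge_diff.XXXXXX") || return 1
  cat > "$tmp"
  printf '%s\n' "$tmp"
}

# Usage with placeholder content standing in for real `git show` output.
diff_path=$(printf 'diff --git a/file b/file\n' | save_diff_to_temp)
echo "diff saved to $diff_path"
rm -f "$diff_path"   # clean up once both judges have finished
```

In practice the input would be piped from the `git show` command used to produce the diff.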