Evaluate AI outputs using structured LLM-as-judge methodology.
Core Principles
- Binary pass/fail only: never use 1-5 scales. They create false precision. If you need granularity, use multiple independent binary checks on specific failure modes.
- Every judgment needs a critique: explain WHY something passed or failed with specific evidence. A score without reasoning is worthless.
- Narrow scope per judge: each judge targets ONE specific failure mode. "Is this good?" is not a valid criterion. "Does this hallucinate facts not in the source material?" is.
- Code checks first: if you can express the check as an if/else (JSON validity, regex match, schema conformance), use code. Reserve LLM judges for subjective qualities only.
- Criteria drift is normal: you cannot fully define evaluation criteria before reviewing outputs. Draft criteria, review outputs, revise criteria. Iterate.
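As a minimal sketch of the "code checks first" principle, the deterministic checks below run before any LLM judge is involved. Each check returns a binary result plus a short critique, mirroring the pass/fail-with-reasoning rule. The function names and the tuple shape are assumptions made for illustration, not part of any existing skill.

```python
import json
import re


def check_valid_json(output: str) -> tuple[bool, str]:
    """Deterministic check: does the output parse as JSON?"""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg} at position {exc.pos}"
    return True, "output parses as JSON"


def check_matches(output: str, pattern: str) -> tuple[bool, str]:
    """Deterministic check: does the output contain a match for the regex?"""
    if re.search(pattern, output):
        return True, f"found a match for {pattern!r}"
    return False, f"no match for {pattern!r}"


def run_code_checks(output: str, checks) -> list[tuple[str, bool, str]]:
    """Run each named check and return (name, passed, critique) tuples."""
    return [(name, *check(output)) for name, check in checks]


# Hypothetical checks for an output that should be JSON with an order_id field.
CHECKS = [
    ("valid_json", check_valid_json),
    ("has_order_id", lambda out: check_matches(out, r'"order_id"\s*:\s*"\w+"')),
]

if __name__ == "__main__":
    for name, passed, critique in run_code_checks('{"order_id": "A123"}', CHECKS):
        print(name, "PASS" if passed else "FAIL", "-", critique)
```

Only outputs that clear checks like these would need to go on to an LLM judge for the subjective criteria.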
Building a Judge
1. Define the criterion
One narrow, specific failure mode:
- Bad: "Is the response helpful?"
- Good: "Does the response answer the user's specific question without introducing information not present in the provided context?"
2. Write the judge prompt
Use the discussion_partners skill to send the judge prompt to an external model:
# SKILL_DIR below refers to the discussion_partners skill directory
uv run --directory SKILL_DIR python scripts/ask_model.py "$(cat <<'EOF'
You are evaluating an AI response for [CRITERION].

Context: [what the AI was asked to do]
Input: [the user's request]
Output: [the AI's response]

Evaluate ONLY whether [specific criterion]. Ignore all other quality dimensions.

Respond in this exact format:
CRITIQUE: [2-3 sentences explaining your reasoning with specific evidence from the output]
RESULT: PASS or FAIL
EOF
)"
3. Calibrate with examples
Include 2-3 examples (both pass and fail) in the judge prompt. Draw from real outputs, not synthetic ones.
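One way to keep calibration examples consistent is to assemble the judge prompt programmatically. The sketch below fills the template from step 2 and prepends labeled examples. `JudgeExample` and `build_judge_prompt` are hypothetical names, and the placement of the examples ahead of the case under evaluation is an assumption rather than something this skill prescribes.

```python
from dataclasses import dataclass


@dataclass
class JudgeExample:
    """One labeled example drawn from real outputs (hypothetical structure)."""
    user_input: str
    ai_output: str
    critique: str
    result: str  # "PASS" or "FAIL"


def build_judge_prompt(criterion_name, criterion_check, context,
                       user_input, ai_output, examples):
    """Fill the judge template and include labeled calibration examples."""
    # Calibration needs both outcomes, so reject one-sided example sets.
    labels = {ex.result for ex in examples}
    if labels != {"PASS", "FAIL"}:
        raise ValueError("examples must include at least one PASS and one FAIL")

    example_text = "\n\n".join(
        f"Example {i}:\n"
        f"Input: {ex.user_input}\n"
        f"Output: {ex.ai_output}\n"
        f"CRITIQUE: {ex.critique}\n"
        f"RESULT: {ex.result}"
        for i, ex in enumerate(examples, start=1)
    )
    return (
        f"You are evaluating an AI response for {criterion_name}.\n\n"
        f"{example_text}\n\n"
        "Now evaluate this case.\n"
        f"Context: {context}\n"
        f"Input: {user_input}\n"
        f"Output: {ai_output}\n\n"
        f"Evaluate ONLY whether {criterion_check}. "
        "Ignore all other quality dimensions.\n\n"
        "Respond in this exact format:\n"
        "CRITIQUE: [2-3 sentences explaining your reasoning with specific "
        "evidence from the output]\n"
        "RESULT: PASS or FAIL"
    )
```

The returned string can then be passed to the judge model in place of the hand-written heredoc body.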
4. Validate alignment
Run the judge on a sample where you know the correct answer. Check:
- True positive rate (catches real failures)
- True negative rate (doesn't flag good outputs)
Target >90% on both. If not, refine the criterion or add examples.
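The two rates can be computed from a labeled sample along these lines. This is a sketch that treats FAIL as the positive class, matching the definitions above. The `alignment_rates` name and the list-of-strings label format are assumptions.

```python
def alignment_rates(human_labels, judge_labels):
    """Return (true_positive_rate, true_negative_rate).

    A "positive" is a FAIL: the true positive rate is the share of real
    failures the judge caught, and the true negative rate is the share of
    good outputs the judge left unflagged.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    tp = fn = tn = fp = 0
    for human, judge in zip(human_labels, judge_labels):
        if human == "FAIL":
            if judge == "FAIL":
                tp += 1
            else:
                fn += 1
        else:
            if judge == "PASS":
                tn += 1
            else:
                fp += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("sample must contain both real failures and real passes")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    human = ["FAIL", "FAIL", "PASS", "PASS", "PASS"]
    judge = ["FAIL", "PASS", "PASS", "PASS", "FAIL"]
    tpr, tnr = alignment_rates(human, judge)
    print(f"TPR={tpr:.2f} TNR={tnr:.2f} aligned={tpr > 0.9 and tnr > 0.9}")
```

Reporting the two rates separately, rather than a single accuracy figure, keeps a judge that passes everything from looking well aligned on a sample where most outputs are good.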
When Another Skill Calls This
Other skills (like prompt_evolution) can use this pattern by:
- Defining their specific criterion
- Launching a sub-agent that sends the judge prompt via discussion_partners
- Parsing the PASS/FAIL result from the response
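The parsing step might look like the following sketch, which assumes the judge followed the CRITIQUE/RESULT format from step 2. `parse_judgment` is a hypothetical helper. It raises an error rather than guessing when the verdict line is missing or ambiguous.

```python
import re

# A verdict must sit on its own line, so an echoed "RESULT: PASS or FAIL"
# template line is not mistaken for a real verdict.
RESULT_RE = re.compile(r"^RESULT:\s*(PASS|FAIL)\s*$", re.MULTILINE)
CRITIQUE_RE = re.compile(
    r"^CRITIQUE:\s*(.+?)(?=^RESULT:|\Z)", re.MULTILINE | re.DOTALL
)


def parse_judgment(response: str) -> tuple[bool, str]:
    """Return (passed, critique) parsed from a judge response."""
    results = RESULT_RE.findall(response)
    if len(results) != 1:
        raise ValueError(f"expected exactly one RESULT line, found {len(results)}")
    critique_match = CRITIQUE_RE.search(response)
    critique = critique_match.group(1).strip() if critique_match else ""
    return results[0] == "PASS", critique


if __name__ == "__main__":
    reply = "CRITIQUE: The answer cites a date not in the source.\nRESULT: FAIL"
    print(parse_judgment(reply))
```

Treating a malformed response as an error, instead of defaulting to PASS or FAIL, lets the calling skill retry or surface the problem.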