AgentSkillsCN

LLM Judge Patterns

将大语言模型作为自动化评估的“裁判”,探索提示词模式、模型校准、偏见消减与多评委集成等创新实践

SKILL.md
--- frontmatter
name: LLM Judge Patterns
description: Comprehensive guide to using LLMs as judges for automated evaluation including prompt patterns, calibration, bias reduction, and multi-judge ensembles

LLM Judge Patterns

What is LLM-as-Judge?

Definition: Using LLMs (GPT-4, Claude) to evaluate other LLM outputs automatically.

Model

code
Input: Question + Answer (to evaluate)
Judge LLM: GPT-4 or Claude
Output: Score + Reasoning

Example:
Question: "What is the capital of France?"
Answer: "Paris is the capital of France."
Judge: "Score: 5/5 - Correct, concise, directly answers question"

Why LLM-as-Judge?

Human Eval is Slow and Expensive

Comparison:

code
Human evaluation:
- 100 answers × 5 min each = 500 min = 8.3 hours
- Cost: $20/hour × 8.3 = $166

LLM-as-judge:
- 100 answers × 2 sec each = 200 sec = 3.3 min
- Cost: 100 × $0.01 = $1

Need to Evaluate Thousands of Outputs

Scale:

code
Development: Test 1000+ variations
Production: Evaluate millions of responses
Human eval: Impossible at this scale
LLM-judge: Feasible

Research Shows High Correlation with Human Judgment

Studies:

  • GPT-4 as judge correlates 0.8+ with human ratings
  • Works well for subjective quality (fluency, helpfulness)
  • Less reliable for factual correctness

Enables Continuous Evaluation

Workflow:

code
Every response → LLM judge → Score logged → Dashboard
Detect regressions in real-time

When to Use LLM-as-Judge

Subjective Quality (Fluency, Relevance, Helpfulness)

Good Use Cases:

code
- Is this answer helpful?
- Is this text fluent and natural?
- Is this response relevant to the question?
- Is this summary coherent?

Complex Rubrics (Multi-Criteria)

Example:

code
Evaluate on:
1. Accuracy (1-5)
2. Completeness (1-5)
3. Clarity (1-5)
4. Tone (1-5)

LLM can handle multi-dimensional evaluation

Large-Scale Evaluation

When:

code
Need to evaluate 1000+ examples
Human eval too slow/expensive

Rapid Iteration

Development:

code
Test 10 prompt variations
Evaluate each on 100 examples
LLM-judge: Minutes
Human eval: Days

When NOT to Use LLM-as-Judge

Objective Correctness (Factual Answers)

Problem:

code
Question: "What is 2+2?"
Answer: "5"
LLM judge might say: "The answer is clear and confident" (wrong!)

Better: Exact match or computation

Mathematical Reasoning (Verify with Computation)

Better Approach:

code
Execute code to verify answer
Not: Ask LLM if math is correct

Code Correctness (Run Tests)

Better Approach:

code
Run unit tests
Check if code compiles
Not: Ask LLM if code is correct

Safety-Critical (Use Human Evaluation)

Examples:

code
Medical advice
Legal guidance
Financial recommendations

→ Always use human experts

Judge Model Selection

GPT-4 (Most Commonly Used)

Pros:

  • High quality judgments
  • Good correlation with humans
  • Widely tested

Cons:

  • Expensive ($0.03/1K tokens)
  • Can be slow

Claude Sonnet 4 (Excellent Reasoning)

Pros:

  • Excellent reasoning
  • Good for complex evaluations
  • Fast

Cons:

  • Expensive
  • Less tested than GPT-4

GPT-3.5 (Cheaper, Less Accurate)

Pros:

  • Cheap ($0.001/1K tokens)
  • Fast

Cons:

  • Less accurate
  • More biased

Open-Source (Llama, Mixtral)

Pros:

  • Free (if self-hosted)
  • Privacy (on-prem)

Cons:

  • Lower quality
  • Requires infrastructure

Judge Prompt Patterns

Single-Answer Grading

Pattern:

code
You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor
5 = Excellent

Consider:
- Accuracy
- Relevance
- Completeness

Score:

Example:

python
def single_answer_grading(question, answer):
    prompt = f"""
    You are evaluating an AI assistant's response.
    
    Question: {question}
    Answer: {answer}
    
    Rate the answer on a scale of 1-5:
    1 = Poor (incorrect, irrelevant, or incomplete)
    5 = Excellent (correct, relevant, and complete)
    
    Provide:
    - Score (1-5)
    - Brief reasoning
    
    Format:
    Score: [number]
    Reasoning: [explanation]
    """
    
    response = llm.generate(prompt)
    score = extract_score(response)
    return score

Pairwise Comparison (A vs B)

Pattern:

code
Which answer is better?

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Which is better? A or B?
Explain why.

More Reliable:

code
Pairwise comparison reduces absolute scoring bias
Humans also find comparisons easier than absolute ratings

Example:

python
def pairwise_comparison(question, answer_a, answer_b):
    prompt = f"""
    Question: {question}
    
    Answer A: {answer_a}
    Answer B: {answer_b}
    
    Which answer is better? A or B?
    
    Consider:
    - Accuracy
    - Relevance
    - Clarity
    
    Respond with:
    - Winner: A or B
    - Reasoning: Why is it better?
    
    Format:
    Winner: [A or B]
    Reasoning: [explanation]
    """
    
    response = llm.generate(prompt)
    winner = extract_winner(response)
    return winner

Aggregate via Elo Ratings:

python
# After many pairwise comparisons
# Calculate Elo rating for each model
# Higher Elo = better model

Multi-Aspect Evaluation (Rubric)

Pattern:

code
Evaluate on multiple criteria:
1. Accuracy (1-5)
2. Relevance (1-5)
3. Completeness (1-5)
4. Clarity (1-5)

Score each separately

Example:

python
def multi_aspect_evaluation(question, answer):
    prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Evaluate on these criteria (1-5 scale):
    
    1. Accuracy: Is the information correct?
       1 = Incorrect, 5 = Perfectly accurate
    
    2. Relevance: Does it answer the question?
       1 = Irrelevant, 5 = Highly relevant
    
    3. Completeness: Does it cover all aspects?
       1 = Incomplete, 5 = Comprehensive
    
    4. Clarity: Is it clear and well-written?
       1 = Confusing, 5 = Very clear
    
    Provide scores and brief reasoning for each.
    
    Format:
    Accuracy: [score] - [reasoning]
    Relevance: [score] - [reasoning]
    Completeness: [score] - [reasoning]
    Clarity: [score] - [reasoning]
    Overall: [average score]
    """
    
    response = llm.generate(prompt)
    scores = extract_scores(response)
    return scores

Chain-of-Thought Judging

Pattern:

code
First, explain your reasoning
Then, provide score

This increases reliability

Example:

python
def cot_judging(question, answer):
    prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Evaluate this answer step by step:
    
    Step 1: Is the answer factually correct?
    Step 2: Does it fully address the question?
    Step 3: Is it clear and well-written?
    
    Based on your analysis, rate the answer (1-5).
    
    Format:
    Step 1: [analysis]
    Step 2: [analysis]
    Step 3: [analysis]
    Final Score: [number]
    """
    
    response = llm.generate(prompt)
    return response

Judge Prompt Template

Comprehensive Template:

code
You are an expert evaluator assessing AI assistant responses.

Question: {question}
Answer: {answer}
{optional: Ground Truth: {ground_truth}}
{optional: Context: {context}}

Evaluate the answer on these criteria:

1. **Accuracy** (1-5): Is the information factually correct?
   - 1 = Completely incorrect
   - 3 = Partially correct
   - 5 = Fully correct

2. **Relevance** (1-5): Does it address the question?
   - 1 = Completely irrelevant
   - 3 = Partially relevant
   - 5 = Directly addresses question

3. **Completeness** (1-5): Does it cover all aspects?
   - 1 = Missing most information
   - 3 = Covers some aspects
   - 5 = Comprehensive

4. **Clarity** (1-5): Is it clear and well-written?
   - 1 = Confusing or poorly written
   - 3 = Acceptable clarity
   - 5 = Very clear and well-written

Provide:
- Score for each criterion (1-5)
- Brief reasoning for each score
- Overall score (average of all criteria)

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]

Judge Calibration

Compare Judge Scores to Human Scores

Process:

code
1. Get 100 examples
2. Human annotators rate each (1-5)
3. LLM judge rates each (1-5)
4. Calculate correlation

Correlation:

python
from scipy.stats import pearsonr

human_scores = [4, 5, 3, 4, 2, ...]
judge_scores = [4.2, 4.8, 3.1, 4.5, 2.3, ...]

correlation, p_value = pearsonr(human_scores, judge_scores)
print(f"Correlation: {correlation:.2f}")

# Target: >0.7 (good correlation)
# If <0.7: Adjust prompt or use different judge

Calculate Correlation

See above

Adjust Prompt if Low Correlation

If correlation <0.7:

code
1. Analyze disagreements (where judge differs from human)
2. Update prompt to address issues
3. Re-test correlation
4. Iterate until >0.7

Test on Multiple Examples

Validation Set:

code
Use 100-500 examples with human ratings
Ensure diverse (easy, hard, edge cases)

Reducing Judge Bias

Position Bias (Favors First Option in A/B)

Problem:

code
Judge tends to prefer Answer A over Answer B
Even when B is better

Mitigation:

python
# Randomize order
import random

if random.random() < 0.5:
    winner = compare(question, answer_a, answer_b)
else:
    winner = compare(question, answer_b, answer_a)
    winner = "A" if winner == "B" else "B"  # Flip

Length Bias (Favors Longer Answers)

Problem:

code
Judge tends to prefer longer answers
Even if shorter answer is better

Mitigation:

code
Prompt: "Do not favor longer answers. Concise answers can be better."
Or: Normalize scores by length

Self-Preference Bias (Favors Own Outputs)

Problem:

code
GPT-4 as judge tends to prefer GPT-4 outputs
Over Claude outputs

Mitigation:

code
Use external judge (Claude to judge GPT-4)
Or: Blind evaluation (don't reveal which model)

Multi-Judge Ensemble

Use Multiple Judges (GPT-4 + Claude)

Approach:

python
def multi_judge_ensemble(question, answer):
    # Judge 1: GPT-4
    score_gpt4 = gpt4_judge(question, answer)
    
    # Judge 2: Claude
    score_claude = claude_judge(question, answer)
    
    # Judge 3: GPT-3.5 (cheaper, as tiebreaker)
    score_gpt35 = gpt35_judge(question, answer)
    
    return {
        "gpt4": score_gpt4,
        "claude": score_claude,
        "gpt35": score_gpt35
    }

Aggregate Scores (Majority Vote, Average)

Majority Vote:

python
scores = [4, 5, 4]  # Three judges
majority = max(set(scores), key=scores.count)  # 4

Average:

python
scores = [4.2, 4.8, 4.5]
average = sum(scores) / len(scores)  # 4.5

Weighted Average:

python
scores = {"gpt4": 4.8, "claude": 4.5, "gpt35": 4.0}
weights = {"gpt4": 0.5, "claude": 0.4, "gpt35": 0.1}

weighted_avg = sum(scores[j] * weights[j] for j in scores)  # 4.58

Increases Reliability

Why:

code
Single judge can be wrong
Multiple judges reduce variance
Ensemble is more robust

Cost Optimization

Use Cheaper Judge for Initial Filtering

Two-Stage:

code
Stage 1: GPT-3.5 judge (cheap, fast)
  - Filter out clearly bad answers (score <3)
  
Stage 2: GPT-4 judge (expensive, accurate)
  - Evaluate borderline cases (score 3-4)

Use Expensive Judge for Borderline Cases

See above

Cache Judge Results

Caching:

python
import hashlib
import json

cache = {}

def cached_judge(question, answer):
    # Create cache key
    key = hashlib.md5(f"{question}{answer}".encode()).hexdigest()
    
    # Check cache
    if key in cache:
        return cache[key]
    
    # Call judge
    score = llm_judge(question, answer)
    
    # Cache result
    cache[key] = score
    
    return score

Judge Evaluation Frameworks

G-Eval (Using GPT-4)

Paper: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

Approach:

code
Use GPT-4 to generate evaluation criteria
Then use GPT-4 to evaluate based on those criteria

Prometheus (Using Llama)

Open-Source Judge:

code
Fine-tuned Llama model for evaluation
Free to use
Lower quality than GPT-4 but no API costs

Custom Implementation

See examples throughout this document


Metrics to Track

Judge-Human Correlation

Target: >0.7

Calculation:

python
from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, judge_scores)

Inter-Judge Agreement (If Multiple Judges)

Kappa Score:

python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(judge1_scores, judge2_scores)
# >0.7 = good agreement

Judge Consistency (Same Input → Same Output)

Test:

python
# Evaluate same example 10 times
scores = [judge(question, answer) for _ in range(10)]

# Calculate variance
variance = np.var(scores)
# Low variance = consistent judge

Real-World Judge Use Cases

RAG Answer Evaluation

See RAG Evaluation skill

Chatbot Response Quality

Criteria:

  • Helpfulness
  • Relevance
  • Safety
  • Tone

Content Moderation

Criteria:

  • Toxicity
  • Hate speech
  • Misinformation
  • Spam

Translation Quality

Criteria:

  • Accuracy
  • Fluency
  • Preserves meaning

Summarization Quality

Criteria:

  • Completeness
  • Conciseness
  • Accuracy

Limitations

Judge Can Be Wrong (Validate with Humans)

Always:

code
Spot-check judge results with human evaluation
Don't blindly trust judge

Expensive (API Costs)

Cost:

code
1000 evaluations × $0.01 = $10
10,000 evaluations × $0.01 = $100

Can add up quickly

Judge Bias (Needs Careful Prompting)

See "Reducing Judge Bias" section

Not Suitable for All Tasks

See "When NOT to Use" section


Implementation

Judge Prompt Templates

See "Judge Prompt Template" section

Multi-Judge Aggregation

See "Multi-Judge Ensemble" section

Calibration Scripts

python
def calibrate_judge(judge_fn, test_set):
    """
    test_set: List of (question, answer, human_score)
    """
    judge_scores = []
    human_scores = []
    
    for question, answer, human_score in test_set:
        judge_score = judge_fn(question, answer)
        judge_scores.append(judge_score)
        human_scores.append(human_score)
    
    correlation, p_value = pearsonr(human_scores, judge_scores)
    
    return {
        "correlation": correlation,
        "p_value": p_value,
        "judge_scores": judge_scores,
        "human_scores": human_scores
    }

Summary

Quick Reference

LLM-as-Judge: Use LLMs to evaluate other LLM outputs

Why:

  • Fast and cheap vs human eval
  • Scales to thousands of examples
  • High correlation with humans (>0.8)

When to Use:

  • Subjective quality
  • Complex rubrics
  • Large-scale evaluation

When NOT:

  • Objective correctness
  • Math/code (use computation)
  • Safety-critical (use humans)

Judge Models:

  • GPT-4 (best quality)
  • Claude (excellent reasoning)
  • GPT-3.5 (cheaper)
  • Open-source (free but lower quality)

Prompt Patterns:

  • Single-answer grading
  • Pairwise comparison (more reliable)
  • Multi-aspect (rubric)
  • Chain-of-thought (increases reliability)

Bias Reduction:

  • Position bias: Randomize order
  • Length bias: Normalize or prompt
  • Self-preference: External judge

Multi-Judge:

  • Use multiple judges
  • Aggregate (majority vote, average)
  • Increases reliability

Cost Optimization:

  • Cheap judge for filtering
  • Expensive judge for borderline
  • Cache results

Calibration:

  • Compare to human scores
  • Target correlation >0.7
  • Adjust prompt if low

Limitations:

  • Can be wrong (validate with humans)
  • Expensive (API costs)
  • Biased (careful prompting)