LLM Judge Patterns

What is LLM-as-Judge?

Definition: Using LLMs (GPT-4, Claude) to evaluate other LLM outputs automatically.

Model

code

Input: Question + Answer (to evaluate)
Judge LLM: GPT-4 or Claude
Output: Score + Reasoning

Example:
Question: "What is the capital of France?"
Answer: "Paris is the capital of France."
Judge: "Score: 5/5 - Correct, concise, directly answers question"

Why LLM-as-Judge?

Human Eval is Slow and Expensive

Comparison:

code

Human evaluation:
- 100 answers × 5 min each = 500 min ≈ 8.33 hours
- Cost: $20/hour × 8.33 hours ≈ $167

LLM-as-judge:
- 100 answers × 2 sec each = 200 sec = 3.3 min
- Cost: 100 × $0.01 = $1
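The arithmetic above can be reproduced with a small helper, which makes it easy to substitute your own numbers. This is a minimal sketch: the function name and parameters are illustrative, and the rates are simply the figures quoted above, not measured prices.

```python
def eval_cost(n_items, seconds_per_item, cost_per_hour=None, cost_per_item=None):
    """Return (total_minutes, total_cost) for an evaluation run.

    Pass cost_per_hour for human raters or cost_per_item for per-call API pricing.
    """
    if (cost_per_hour is None) == (cost_per_item is None):
        raise ValueError("pass exactly one of cost_per_hour or cost_per_item")
    total_seconds = n_items * seconds_per_item
    if cost_per_hour is not None:
        total_cost = total_seconds / 3600 * cost_per_hour
    else:
        total_cost = n_items * cost_per_item
    return total_seconds / 60, total_cost

# Figures from the comparison above
human_minutes, human_cost = eval_cost(100, 5 * 60, cost_per_hour=20)
judge_minutes, judge_cost = eval_cost(100, 2, cost_per_item=0.01)
print(f"Human: {human_minutes:.0f} min, ${human_cost:.2f}")
print(f"Judge: {judge_minutes:.1f} min, ${judge_cost:.2f}")
```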

Need to Evaluate Thousands of Outputs

Scale:

code

Development: Test 1000+ variations
Production: Evaluate millions of responses
Human eval: Impossible at this scale
LLM-judge: Feasible

Research Shows High Correlation with Human Judgment

Studies:

•GPT-4 as a judge is reported to reach roughly 0.8+ agreement with human ratings on some benchmarks; results vary by task and metric
•Works well for subjective quality (fluency, helpfulness)
•Less reliable for factual correctness

Enables Continuous Evaluation

Workflow:

code

Every response → LLM judge → Score logged → Dashboard
Detect regressions in real-time
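One way to turn logged scores into a regression signal is to compare a rolling mean against a known-good baseline. The sketch below is illustrative only: the class name, window size, and tolerance are assumptions, not part of any particular monitoring tool.

```python
from collections import deque

class RegressionMonitor:
    """Track a rolling mean of judge scores and flag drops below a baseline."""

    def __init__(self, baseline, window=50, tolerance=0.3):
        self.baseline = baseline      # expected mean score, e.g. from a known-good release
        self.tolerance = tolerance    # allowed drop before alerting (illustrative value)
        self.scores = deque(maxlen=window)

    def log(self, score):
        """Record one judge score; return True if the rolling mean has regressed."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance

# Scores start healthy and then degrade
monitor = RegressionMonitor(baseline=4.2, window=5)
alerts = [monitor.log(s) for s in [4, 5, 4, 4, 4, 3, 3, 3, 3, 3]]
print(alerts)
```

In a real pipeline the return value of `log` would feed an alerting system or dashboard rather than a list.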

When to Use LLM-as-Judge

Subjective Quality (Fluency, Relevance, Helpfulness)

Good Use Cases:

code

- Is this answer helpful?
- Is this text fluent and natural?
- Is this response relevant to the question?
- Is this summary coherent?

Complex Rubrics (Multi-Criteria)

Example:

code

Evaluate on:
1. Accuracy (1-5)
2. Completeness (1-5)
3. Clarity (1-5)
4. Tone (1-5)

LLM can handle multi-dimensional evaluation

Large-Scale Evaluation

When:

code

Need to evaluate 1000+ examples
Human eval too slow/expensive

Rapid Iteration

Development:

code

Test 10 prompt variations
Evaluate each on 100 examples
LLM-judge: Minutes
Human eval: Days

When NOT to Use LLM-as-Judge

Objective Correctness (Factual Answers)

Problem:

code

Question: "What is 2+2?"
Answer: "5"
LLM judge might say: "The answer is clear and confident" (wrong!)

Better: Exact match or computation
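For questions with a single known answer, a deterministic check is cheaper and more trustworthy than a judge call. A minimal sketch, assuming a reference answer is available; the helper names and the normalization rules are illustrative choices.

```python
import re

def normalize(text):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    text = text.strip().lower().rstrip(".")
    return re.sub(r"\s+", " ", text)

def exact_match(answer, expected):
    """True if the answer equals the reference after light normalization."""
    return normalize(answer) == normalize(expected)

def check_sum(a, b, answer):
    """Verify an addition answer by computing it instead of asking a judge."""
    try:
        return float(answer) == a + b
    except ValueError:
        return False  # answer was not a number at all

print(exact_match("Paris.", " paris"))
print(check_sum(2, 2, "5"))
```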

Mathematical Reasoning (Verify with Computation)

Better Approach:

code

Execute code to verify answer
Not: Ask LLM if math is correct

Code Correctness (Run Tests)

Better Approach:

code

Run unit tests
Check if code compiles
Not: Ask LLM if code is correct

Safety-Critical (Use Human Evaluation)

Examples:

code

Medical advice
Legal guidance
Financial recommendations

→ Always use human experts

Judge Model Selection

GPT-4 (Most Commonly Used)

Pros:

•High quality judgments
•Good correlation with humans
•Widely tested

Cons:

•Expensive (about $0.03/1K input tokens at original list pricing; check current rates)
•Can be slow

Claude Sonnet 4 (Excellent Reasoning)

Pros:

•Excellent reasoning
•Good for complex evaluations
•Fast

Cons:

•Expensive
•Less tested than GPT-4

GPT-3.5 (Cheaper, Less Accurate)

Pros:

•Cheap (roughly $0.001/1K tokens; check current rates)
•Fast

Cons:

•Less accurate
•More biased

Open-Source (Llama, Mixtral)

Pros:

•Free (if self-hosted)
•Privacy (on-prem)

Cons:

•Lower quality
•Requires infrastructure

Judge Prompt Patterns

Single-Answer Grading

Pattern:

code

You are evaluating an AI assistant's response.

Question: {question}
Answer: {answer}

Rate the answer on a scale of 1-5:
1 = Poor
5 = Excellent

Consider:
- Accuracy
- Relevance
- Completeness

Score:

Example:

python

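import re

def extract_score(response):
    """Pull the integer after 'Score:' from the judge's reply; None if absent."""
    match = re.search(r"Score:\s*\[?([1-5])\b", response)
    return int(match.group(1)) if match else None

def single_answer_grading(question, answer):
    # Both ends of the scale are anchored so scores are comparable across calls.
    prompt = f"""
    You are evaluating an AI assistant's response.

    Question: {question}
    Answer: {answer}

    Rate the answer on a scale of 1-5:
    1 = Poor (incorrect, irrelevant, or incomplete)
    5 = Excellent (correct, relevant, and complete)

    Provide:
    - Score (1-5)
    - Brief reasoning

    Format:
    Score: [number]
    Reasoning: [explanation]
    """

    # `llm` is assumed to be a client object whose generate(prompt) returns the reply text.
    response = llm.generate(prompt)
    return extract_score(response)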

Pairwise Comparison (A vs B)

Pattern:

code

Which answer is better?

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Which is better? A or B?
Explain why.

More Reliable:

code

Pairwise comparison reduces absolute scoring bias
Humans also find comparisons easier than absolute ratings

Example:

python

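import re

def extract_winner(response):
    """Return 'A' or 'B' from the 'Winner:' line; None if the reply is malformed."""
    match = re.search(r"Winner:\s*\[?([AB])\b", response)
    return match.group(1) if match else None

def pairwise_comparison(question, answer_a, answer_b):
    prompt = f"""
    Question: {question}

    Answer A: {answer_a}
    Answer B: {answer_b}

    Which answer is better? A or B?

    Consider:
    - Accuracy
    - Relevance
    - Clarity

    Respond with:
    - Winner: A or B
    - Reasoning: Why is it better?

    Format:
    Winner: [A or B]
    Reasoning: [explanation]
    """

    # `llm` is assumed to be a client object whose generate(prompt) returns the reply text.
    response = llm.generate(prompt)
    return extract_winner(response)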

Aggregate via Elo Ratings:

python

# After many pairwise comparisons
# Calculate Elo rating for each model
# Higher Elo = better model
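The comments above can be made concrete with the standard Elo update rule. This is a minimal sketch: the model names are placeholders, and the starting rating of 1000 and K-factor of 32 are conventional defaults chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, model_a, model_b, winner, k=32):
    """Update ratings in place after one comparison; winner is 'A' or 'B'."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == "A" else 0.0
    ratings[model_a] += k * (score_a - exp_a)
    ratings[model_b] += k * ((1 - score_a) - (1 - exp_a))

# Hypothetical models and judge verdicts: (model shown as A, model shown as B, winner)
ratings = {"model_x": 1000.0, "model_y": 1000.0}
comparisons = [
    ("model_x", "model_y", "A"),
    ("model_x", "model_y", "A"),
    ("model_y", "model_x", "A"),
]
for a, b, w in comparisons:
    update_elo(ratings, a, b, w)
print(ratings)
```

Elo updates depend on the order in which comparisons are processed, so it can help to shuffle the comparison list and average ratings over several passes.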

Multi-Aspect Evaluation (Rubric)

Pattern:

code

Evaluate on multiple criteria:
1. Accuracy (1-5)
2. Relevance (1-5)
3. Completeness (1-5)
4. Clarity (1-5)

Score each separately

Example:

python

def multi_aspect_evaluation(question, answer):
    prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Evaluate on these criteria (1-5 scale):
    
    1. Accuracy: Is the information correct?
       1 = Incorrect, 5 = Perfectly accurate
    
    2. Relevance: Does it answer the question?
       1 = Irrelevant, 5 = Highly relevant
    
    3. Completeness: Does it cover all aspects?
       1 = Incomplete, 5 = Comprehensive
    
    4. Clarity: Is it clear and well-written?
       1 = Confusing, 5 = Very clear
    
    Provide scores and brief reasoning for each.
    
    Format:
    Accuracy: [score] - [reasoning]
    Relevance: [score] - [reasoning]
    Completeness: [score] - [reasoning]
    Clarity: [score] - [reasoning]
    Overall: [average score]
    """
    
    response = llm.generate(prompt)
    scores = extract_scores(response)
    return scores
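The function above calls `extract_scores` without defining it. One possible implementation, assuming the judge follows the requested `Criterion: [score] - [reasoning]` format, is sketched below. It recomputes the overall average locally instead of trusting the judge's arithmetic, which is a design choice rather than a requirement.

```python
import re

CRITERIA = ("Accuracy", "Relevance", "Completeness", "Clarity")

def extract_scores(response):
    """Parse 'Criterion: <score> - <reasoning>' lines into a dict of floats."""
    scores = {}
    for name in CRITERIA:
        match = re.search(
            rf"^\s*{name}:\s*\[?(\d+(?:\.\d+)?)\]?", response, re.MULTILINE
        )
        if match:
            value = float(match.group(1))
            if 1 <= value <= 5:  # ignore scores outside the rubric's scale
                scores[name.lower()] = value
    if scores:
        # Average only the criteria that were successfully parsed.
        scores["overall"] = sum(scores.values()) / len(scores)
    return scores

sample = (
    "Accuracy: 5 - correct\n"
    "Relevance: 4 - on topic\n"
    "Completeness: 3 - misses one detail\n"
    "Clarity: 4 - easy to read\n"
    "Overall: 4"
)
print(extract_scores(sample))
```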

Chain-of-Thought Judging

Pattern:

code

First, explain your reasoning
Then, provide score

This increases reliability

Example:

python

def cot_judging(question, answer):
    prompt = f"""
    Question: {question}
    Answer: {answer}
    
    Evaluate this answer step by step:
    
    Step 1: Is the answer factually correct?
    Step 2: Does it fully address the question?
    Step 3: Is it clear and well-written?
    
    Based on your analysis, rate the answer (1-5).
    
    Format:
    Step 1: [analysis]
    Step 2: [analysis]
    Step 3: [analysis]
    Final Score: [number]
    """
    
    response = llm.generate(prompt)
    return response
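`cot_judging` returns the raw response, so the caller still has to pull out the score. A minimal parsing sketch, assuming the judge follows the `Step N:` / `Final Score:` format requested in the prompt; the function name is illustrative.

```python
import re

def parse_cot_response(response):
    """Split a chain-of-thought judgment into its step analyses and final score."""
    steps = re.findall(r"^\s*Step \d+:\s*(.+)$", response, re.MULTILINE)
    match = re.search(r"Final Score:\s*\[?([1-5])\b", response)
    score = int(match.group(1)) if match else None  # None signals a malformed reply
    return {"steps": [s.strip() for s in steps], "score": score}

sample = (
    "Step 1: The answer is factually correct.\n"
    "Step 2: It fully addresses the question.\n"
    "Step 3: It is clear and concise.\n"
    "Final Score: 5"
)
print(parse_cot_response(sample))
```

Keeping the step text alongside the score makes it easier to audit why the judge gave a particular rating.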

Judge Prompt Template

Comprehensive Template:

code

You are an expert evaluator assessing AI assistant responses.

Question: {question}
Answer: {answer}
{optional: Ground Truth: {ground_truth}}
{optional: Context: {context}}

Evaluate the answer on these criteria:

1. **Accuracy** (1-5): Is the information factually correct?
   - 1 = Completely incorrect
   - 3 = Partially correct
   - 5 = Fully correct

2. **Relevance** (1-5): Does it address the question?
   - 1 = Completely irrelevant
   - 3 = Partially relevant
   - 5 = Directly addresses question

3. **Completeness** (1-5): Does it cover all aspects?
   - 1 = Missing most information
   - 3 = Covers some aspects
   - 5 = Comprehensive

4. **Clarity** (1-5): Is it clear and well-written?
   - 1 = Confusing or poorly written
   - 3 = Acceptable clarity
   - 5 = Very clear and well-written

Provide:
- Score for each criterion (1-5)
- Brief reasoning for each score
- Overall score (average of all criteria)

Format:
Accuracy: [score] - [reasoning]
Relevance: [score] - [reasoning]
Completeness: [score] - [reasoning]
Clarity: [score] - [reasoning]
Overall: [average score]
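The optional Ground Truth and Context fields in the template can be handled by assembling the prompt programmatically. The sketch below is one way to do it; `build_judge_prompt` and the `RUBRIC_AND_FORMAT` placeholder are hypothetical names, and the placeholder stands in for the criteria and format sections of the template.

```python
# Placeholder for the criteria and output-format sections of the template.
RUBRIC_AND_FORMAT = "Evaluate the answer on these criteria:\n(criteria and format go here)"

def build_judge_prompt(question, answer, ground_truth=None, context=None):
    """Assemble the judge prompt, skipping optional fields that are absent."""
    lines = [
        "You are an expert evaluator assessing AI assistant responses.",
        "",
        f"Question: {question}",
        f"Answer: {answer}",
    ]
    if ground_truth is not None:
        lines.append(f"Ground Truth: {ground_truth}")
    if context is not None:
        lines.append(f"Context: {context}")
    lines += ["", RUBRIC_AND_FORMAT]
    return "\n".join(lines)

prompt = build_judge_prompt(
    "What is the capital of France?",
    "Paris is the capital of France.",
    ground_truth="Paris",
)
print(prompt)
```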

Judge Calibration

Compare Judge Scores to Human Scores

Process:

code

1. Get 100 examples
2. Human annotators rate each (1-5)
3. LLM judge rates each (1-5)
4. Calculate correlation

Correlation:

python

from scipy.stats import pearsonr

# Illustrative values; in practice use the full set of paired ratings.
human_scores = [4, 5, 3, 4, 2]
judge_scores = [4.2, 4.8, 3.1, 4.5, 2.3]

correlation, p_value = pearsonr(human_scores, judge_scores)
print(f"Correlation: {correlation:.2f}")

# Target: >0.7 (good correlation)
# If <0.7: Adjust prompt or use different judge

Calculate Correlation

See above

Adjust Prompt if Low Correlation

If correlation <0.7:

code

1. Analyze disagreements (where judge differs from human)
2. Update prompt to address issues
3. Re-test correlation
4. Iterate until >0.7
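Step 1 of the loop above can be sketched as a filter over the paired scores. The function name and the gap threshold of 2 points are assumptions chosen for illustration.

```python
def find_disagreements(examples, judge_scores, human_scores, min_gap=2):
    """Return examples where judge and human differ by at least min_gap, largest gap first."""
    rows = []
    for example, judge, human in zip(examples, judge_scores, human_scores):
        gap = judge - human  # positive means the judge was more generous
        if abs(gap) >= min_gap:
            rows.append({"example": example, "judge": judge, "human": human, "gap": gap})
    return sorted(rows, key=lambda r: abs(r["gap"]), reverse=True)

rows = find_disagreements(["q1", "q2", "q3"], [5, 3, 4], [2, 3, 2])
for row in rows:
    print(row)
```

Reading the largest disagreements first tends to surface systematic issues, such as the judge over-rewarding confident but wrong answers, that a prompt change can address.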

Test on Multiple Examples

Validation Set:

code

Use 100-500 examples with human ratings
Ensure diverse (easy, hard, edge cases)

Reducing Judge Bias

Position Bias (Favors First Option in A/B)

Problem:

code

Judge tends to prefer Answer A over Answer B
Even when B is better

Mitigation:

python

# Randomize order
import random

if random.random() < 0.5:
    winner = compare(question, answer_a, answer_b)
else:
    winner = compare(question, answer_b, answer_a)
    winner = "A" if winner == "B" else "B"  # Flip
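Randomizing the order averages the bias out over many comparisons but does not remove it from any single verdict. A common stricter alternative is to run the comparison in both orders and only accept a winner when the two verdicts agree. A minimal sketch, assuming `compare(question, first, second)` returns "A" when the first-shown answer wins and "B" otherwise; the stub judges are for demonstration only.

```python
def compare_both_orders(compare, question, answer_a, answer_b):
    """Run the comparison in both orders; return 'A', 'B', or 'tie' if the verdicts conflict."""
    first = compare(question, answer_a, answer_b)    # answer_a shown first
    second = compare(question, answer_b, answer_a)   # answer_b shown first
    second = "A" if second == "B" else "B"           # map back to the original labels
    return first if first == second else "tie"

# Stub judges for demonstration
always_first = lambda q, a, b: "A"                            # pure position bias
prefers_longer = lambda q, a, b: "A" if len(a) > len(b) else "B"

print(compare_both_orders(always_first, "q", "long answer", "x"))
print(compare_both_orders(prefers_longer, "q", "long answer", "x"))
```

This doubles the number of judge calls, so it trades cost for reliability.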

Length Bias (Favors Longer Answers)

Problem:

code

Judge tends to prefer longer answers
Even if shorter answer is better

Mitigation:

code

Prompt: "Do not favor longer answers. Concise answers can be better."
Or: Normalize scores by length

Self-Preference Bias (Favors Own Outputs)

Problem:

code

GPT-4 as judge tends to prefer GPT-4 outputs
Over Claude outputs

Mitigation:

code

Use external judge (Claude to judge GPT-4)
Or: Blind evaluation (don't reveal which model)

Multi-Judge Ensemble

Use Multiple Judges (GPT-4 + Claude)

Approach:

python

def multi_judge_ensemble(question, answer):
    # Judge 1: GPT-4
    score_gpt4 = gpt4_judge(question, answer)
    
    # Judge 2: Claude
    score_claude = claude_judge(question, answer)
    
    # Judge 3: GPT-3.5 (cheaper, as tiebreaker)
    score_gpt35 = gpt35_judge(question, answer)
    
    return {
        "gpt4": score_gpt4,
        "claude": score_claude,
        "gpt35": score_gpt35
    }

Aggregate Scores (Majority Vote, Average)

Majority Vote:

python

scores = [4, 5, 4]  # Three judges
majority = max(set(scores), key=scores.count)  # 4
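The one-liner above picks an arbitrary score when judges split evenly (for example `[3, 4, 5]`). A tie-aware variant is sketched below; returning `None` on a tie is an illustrative choice, and the caller might fall back to the average or escalate to a human.

```python
from collections import Counter

def majority_vote(scores):
    """Return the most common score, or None when there is no clear majority."""
    if not scores:
        return None
    counts = Counter(scores).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two scores are equally common
    return counts[0][0]

print(majority_vote([4, 5, 4]))
print(majority_vote([3, 4, 5]))
```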

Average:

python

scores = [4.2, 4.8, 4.5]
average = sum(scores) / len(scores)  # 4.5

Weighted Average:

python

scores = {"gpt4": 4.8, "claude": 4.5, "gpt35": 4.0}
weights = {"gpt4": 0.5, "claude": 0.4, "gpt35": 0.1}

weighted_avg = sum(scores[j] * weights[j] for j in scores)  # 2.4 + 1.8 + 0.4 = 4.6

Increases Reliability

Why:

code

Single judge can be wrong
Multiple judges reduce variance
Ensemble is more robust

Cost Optimization

Use Cheaper Judge for Initial Filtering

Two-Stage:

code

Stage 1: GPT-3.5 judge (cheap, fast)
  - Filter out clearly bad answers (score <3)
  
Stage 2: GPT-4 judge (expensive, accurate)
  - Evaluate borderline cases (score 3-4)
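The two-stage flow can be sketched as a small routing function. The judges are passed in as callables, and one assumption goes beyond the text above: scores above the borderline range are accepted from the cheap judge without escalation.

```python
def two_stage_judge(question, answer, cheap_judge, strong_judge, low=3, high=4):
    """Return (score, stage). Only scores in [low, high] are escalated to the strong judge."""
    score = cheap_judge(question, answer)
    if low <= score <= high:
        # Borderline: pay for the more accurate judge.
        return strong_judge(question, answer), "strong"
    # Clearly bad (< low) or clearly good (> high): keep the cheap score.
    return score, "cheap"

# Stub judges for demonstration
cheap = lambda q, a: 2
strong = lambda q, a: 5
print(two_stage_judge("q", "a", cheap, strong))
```

Logging which stage produced each score makes it possible to measure how much of the traffic the cheap judge actually absorbs.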

Use Expensive Judge for Borderline Cases

See above

Cache Judge Results

Caching:

python

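import hashlib
import json

cache = {}

def cached_judge(question, answer):
    # Create cache key. Serializing the pair as JSON keeps ("ab", "c") and
    # ("a", "bc") distinct, which plain string concatenation would not.
    key = hashlib.md5(json.dumps([question, answer]).encode()).hexdigest()

    # Check cache
    if key in cache:
        return cache[key]

    # Call judge (`llm_judge` is assumed to be defined elsewhere)
    score = llm_judge(question, answer)

    # Cache result
    cache[key] = score

    return score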

Judge Evaluation Frameworks

G-Eval (Using GPT-4)

Paper: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

Approach:

code

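Give GPT-4 the task description and evaluation criteria
Have it generate detailed evaluation steps (chain-of-thought)
Score each output against those steps with a form-filling prompt
Weight the candidate scores by their output token probabilities for a finer-grained result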

Prometheus (Using Llama)

Open-Source Judge:

code

Fine-tuned Llama model for evaluation
Free to use
Lower quality than GPT-4 but no API costs

Custom Implementation

See examples throughout this document

Metrics to Track

Judge-Human Correlation

Target: >0.7

Calculation:

python

from scipy.stats import pearsonr

correlation, p_value = pearsonr(human_scores, judge_scores)

Inter-Judge Agreement (If Multiple Judges)

Kappa Score:

python

from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(judge1_scores, judge2_scores)
# >0.7 = good agreement
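To show what the kappa score measures, here is a dependency-free sketch of unweighted Cohen's kappa: observed agreement corrected for the agreement expected by chance. It assumes both judges emit discrete labels such as integer scores, and the handling of the degenerate single-label case is an illustrative choice. For ordinal 1-5 scores, scikit-learn's `cohen_kappa_score` also accepts `weights="linear"` or `weights="quadratic"`, which penalize near misses less than distant ones.

```python
from collections import Counter

def cohen_kappa(labels_1, labels_2):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement: probability both judges independently pick the same label.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)

print(cohen_kappa([1, 2, 3, 4], [1, 2, 3, 4]))
print(cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2]))
```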

Judge Consistency (Same Input → Same Output)

Test:

python

import numpy as np

# Evaluate same example 10 times
# (`judge` is assumed to return a numeric score)
scores = [judge(question, answer) for _ in range(10)]

# Calculate variance
variance = np.var(scores)
# Low variance = consistent judge

Real-World Judge Use Cases

RAG Answer Evaluation

See RAG Evaluation skill

Chatbot Response Quality

Criteria:

•Helpfulness
•Relevance
•Safety
•Tone

Content Moderation

Criteria:

•Toxicity
•Hate speech
•Misinformation
•Spam

Translation Quality

Criteria:

•Accuracy
•Fluency
•Preserves meaning

Summarization Quality

Criteria:

•Completeness
•Conciseness
•Accuracy

Limitations

Judge Can Be Wrong (Validate with Humans)

Always:

code

Spot-check judge results with human evaluation
Don't blindly trust judge

Expensive (API Costs)

Cost:

code

1000 evaluations × $0.01 = $10
10,000 evaluations × $0.01 = $100

Can add up quickly

Judge Bias (Needs Careful Prompting)

See "Reducing Judge Bias" section

Not Suitable for All Tasks

See "When NOT to Use" section

Implementation

Judge Prompt Templates

See "Judge Prompt Template" section

Multi-Judge Aggregation

See "Multi-Judge Ensemble" section

Calibration Scripts

python

from scipy.stats import pearsonr

def calibrate_judge(judge_fn, test_set):
    """
    test_set: List of (question, answer, human_score)
    """
    judge_scores = []
    human_scores = []
    
    for question, answer, human_score in test_set:
        judge_score = judge_fn(question, answer)
        judge_scores.append(judge_score)
        human_scores.append(human_score)
    
    correlation, p_value = pearsonr(human_scores, judge_scores)
    
    return {
        "correlation": correlation,
        "p_value": p_value,
        "judge_scores": judge_scores,
        "human_scores": human_scores
    }

Summary

Quick Reference

LLM-as-Judge: Use LLMs to evaluate other LLM outputs

Why:

•Fast and cheap vs human eval
•Scales to thousands of examples
•Reported agreement with humans around 0.8+ (varies by task and metric)

When to Use:

•Subjective quality
•Complex rubrics
•Large-scale evaluation

When NOT:

•Objective correctness
•Math/code (use computation)
•Safety-critical (use humans)

Judge Models:

•GPT-4 (best quality)
•Claude (excellent reasoning)
•GPT-3.5 (cheaper)
•Open-source (free but lower quality)

Prompt Patterns:

•Single-answer grading
•Pairwise comparison (more reliable)
•Multi-aspect (rubric)
•Chain-of-thought (increases reliability)

Bias Reduction:

•Position bias: Randomize order
•Length bias: Normalize or prompt
•Self-preference: External judge

Multi-Judge:

•Use multiple judges
•Aggregate (majority vote, average)
•Increases reliability

Cost Optimization:

•Cheap judge for filtering
•Expensive judge for borderline
•Cache results

Calibration:

•Compare to human scores
•Target correlation >0.7
•Adjust prompt if low

Limitations:

•Can be wrong (validate with humans)
•Expensive (API costs)
•Biased (careful prompting)