Advanced Evaluation
Production-grade techniques for evaluating LLM outputs using LLMs as judges. LLM-as-a-Judge is a family of approaches, each suited to different evaluation contexts.
When to Activate
Activate this skill when:
- •Building automated evaluation pipelines
- •Comparing multiple model responses
- •Establishing consistent quality standards
- •Debugging evaluation systems with inconsistent results
Evaluation Taxonomy
Direct Scoring
Single LLM rates one response on a defined scale.
- •Best for: Objective criteria (factual accuracy, instruction following)
- •Reliability: Moderate to high for well-defined criteria
- •Failure mode: Score calibration drift
Pairwise Comparison
LLM compares two responses and selects the better one.
- •Best for: Subjective preferences (tone, style)
- •Reliability: Higher than direct scoring for preferences
- •Failure mode: Position bias, length bias
The Bias Landscape
| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position responses preferred | Evaluate twice with swapped positions |
| Length | Longer responses rated higher | Explicit prompting to ignore length |
| Self-Enhancement | Models rate own outputs higher | Use different models for generation/evaluation |
| Verbosity | Detailed explanations rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
Direct Scoring Implementation
Prompt Structure:
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
Critical: Require justification BEFORE score. Improves reliability by 15-25%.
Pairwise Comparison Implementation
Position Bias Mitigation Protocol:
- •First pass: A in first position, B in second
- •Second pass: B in first position, A in second
- •If passes disagree: return TIE with reduced confidence
- •If consistent: averaged confidence
Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60%.
Components:
- •Level descriptions: Clear boundaries for each score
- •Characteristics: Observable features for each level
- •Examples: Representative text (optional but valuable)
- •Edge cases: Guidance for ambiguous situations
Strictness Calibration:
- •Lenient: Encouraging iteration
- •Balanced: Production use
- •Strict: Safety-critical evaluation
Decision Framework
Is there objective ground truth?
├── Yes → Direct Scoring
│ └── factual accuracy, format compliance
└── No → Is it preference/quality judgment?
├── Yes → Pairwise Comparison
│ └── tone, style, creativity
└── No → Reference-based evaluation
└── summarization, translation
Scaling Evaluation
Panel of LLMs (PoLL)
Multiple models as judges, aggregate votes. Reduces individual model bias.
Hierarchical Evaluation
Fast cheap model for screening, expensive model for edge cases.
Human-in-the-Loop
Automated for clear cases, human review for low-confidence.
Guidelines
- •Always require justification before scores
- •Always swap positions in pairwise comparison
- •Match scale granularity to rubric specificity
- •Separate objective and subjective criteria
- •Include confidence scores
- •Define edge cases explicitly
- •Use domain-specific rubrics
- •Validate against human judgments
- •Monitor for systematic bias
- •Design for iteration
Created: 2024-12-24 | Version: 1.0.0