Advanced Evaluation

Name: Advanced Evaluation
Rating: 78
Author: mediar-ai

Production-grade techniques for evaluating LLM outputs using LLM-as-judge approaches with bias mitigation.

Prerequisites

•Understanding of evaluation metrics
•Access to LLM APIs for judge models

Instructions

Core Approaches

Direct Scoring: Single LLM rates one response on a defined scale.

•Best for: Objective criteria (factual accuracy, instruction following)
•Requires: Clear criteria, calibrated scale, chain-of-thought justification
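
As a minimal sketch of direct scoring with justification before the score, the Python below builds a judge prompt and validates the reply. The names build_scoring_prompt and parse_scoring_reply, the JSON reply format, and the default 1-5 scale are illustrative assumptions, not part of the original skill; sending the prompt to a judge model is left to whichever LLM client you use.

```python
import json
import re

# Hypothetical prompt template: the judge must reason first, then score.
PROMPT_TEMPLATE = """You are evaluating a response against one criterion.

Criterion: {criterion}
Scale: integer from {low} to {high}

Response to evaluate:
{response}

First write your reasoning, then give the score.
Reply with JSON only: {{"justification": "<reasoning>", "score": <integer>}}"""


def build_scoring_prompt(response, criterion, low=1, high=5):
    """Fill the template for a single response and a single criterion."""
    return PROMPT_TEMPLATE.format(
        criterion=criterion, low=low, high=high, response=response
    )


def parse_scoring_reply(reply, low=1, high=5):
    """Extract and validate the judge's JSON reply.

    Raises ValueError when the justification is missing or the score is
    not an integer on the declared scale.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    justification = str(data.get("justification", "")).strip()
    score = data.get("score")
    if not justification:
        raise ValueError("judge gave no justification")
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score {score!r} is not an integer")
    if not low <= score <= high:
        raise ValueError(f"score {score!r} is outside {low}-{high}")
    return {"justification": justification, "score": score}
```

Rejecting replies without a justification keeps the chain-of-thought requirement enforceable rather than advisory.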

Pairwise Comparison: LLM compares two responses and selects the better one.

•Best for: Subjective preferences (tone, style, persuasiveness)
•Requires: Position bias mitigation (swap positions and check consistency)

Bias Mitigation

Bias	Mitigation
Position Bias	Evaluate twice with swapped positions
Length Bias	Explicit prompting to ignore length
Self-Enhancement	Use different models for generation and evaluation
Verbosity Bias	Criteria-specific rubrics

Pairwise Comparison Protocol

•First pass: Response A first, Response B second
•Second pass: Response B first, Response A second
•Consistency check: If passes disagree, return TIE
•Final verdict: If both passes agree, return that winner with confidence averaged across the two passes
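
The protocol above can be sketched as follows. The judge is passed in as a plain callable so the example stays independent of any particular LLM API; its signature (two responses in, a "first"/"second" pick plus a confidence out) and the choice to report no confidence on a TIE are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple, Union

# Hypothetical judge signature: (first_shown, second_shown) -> (pick, confidence)
# where pick is "first" or "second" and confidence is a float in [0, 1].
Judge = Callable[[str, str], Tuple[str, float]]


def compare_with_swap(
    judge: Judge, response_a: str, response_b: str
) -> Dict[str, Union[str, Optional[float]]]:
    """Run the judge twice with swapped positions and reconcile the verdicts."""
    # First pass: A shown first, B shown second.
    pick_1, conf_1 = judge(response_a, response_b)
    # Second pass: B shown first, A shown second.
    pick_2, conf_2 = judge(response_b, response_a)

    for pick in (pick_1, pick_2):
        if pick not in ("first", "second"):
            raise ValueError(f"unexpected judge pick: {pick!r}")

    # Map positional picks back to the underlying responses.
    winner_1 = "A" if pick_1 == "first" else "B"
    winner_2 = "B" if pick_2 == "first" else "A"

    # Consistency check: disagreement suggests the verdict followed position.
    if winner_1 != winner_2:
        return {"winner": "TIE", "confidence": None}  # assumption: no confidence on TIE

    return {"winner": winner_1, "confidence": (conf_1 + conf_2) / 2}
```

A judge that always prefers whichever response is shown first will pick A in one pass and B in the other, so it yields TIE here instead of a spurious winner.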

Rubric Components

•Level descriptions: Clear boundaries for each score
•Characteristics: Observable features per level
•Examples: Representative text (optional but valuable)
•Edge cases: Guidance for ambiguous situations
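
One way to keep these four components together is a small data structure that renders into prompt text. The class and field names below (RubricLevel, Rubric, render) are hypothetical and only illustrate the shape; the rendered layout is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubricLevel:
    score: int                    # the score this level corresponds to
    description: str              # level description: boundary for this score
    characteristics: List[str]    # observable features at this level
    examples: List[str] = field(default_factory=list)  # optional representative text


@dataclass
class Rubric:
    criterion: str
    levels: List[RubricLevel]
    edge_cases: List[str] = field(default_factory=list)  # guidance for ambiguous cases

    def render(self) -> str:
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Criterion: {self.criterion}"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"Score {level.score}: {level.description}")
            lines.extend(f"  - {c}" for c in level.characteristics)
            lines.extend(f"  Example: {e}" for e in level.examples)
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {e}" for e in self.edge_cases)
        return "\n".join(lines)
```

Rendering from structured data makes it harder to ship a rubric where one score level has no description.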

Decision Framework

Is there objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Is it preference/quality judgment?
    ├── Yes → Pairwise Comparison (tone, creativity)
    └── No → Reference-based evaluation
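
The tree above reduces to two yes/no questions, so it can be written as a small function. The function name, parameter names, and returned strings are illustrative labels, not identifiers from the original skill.

```python
def choose_method(has_ground_truth: bool, is_preference_judgment: bool) -> str:
    """Pick an evaluation approach by walking the decision tree top to bottom."""
    if has_ground_truth:
        # Objective criteria such as factual accuracy or format compliance.
        return "direct_scoring"
    if is_preference_judgment:
        # Subjective criteria such as tone or creativity.
        return "pairwise_comparison"
    # Neither objective ground truth nor a preference judgment.
    return "reference_based"
```

For example, choose_method(False, True) selects pairwise comparison, which matches the tone and creativity branch of the tree.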

Guidelines

•Always require the judge to write its justification before the score (15-25% reliability improvement)
•Always swap positions in pairwise comparison
•Match scale granularity to rubric specificity
•Separate objective and subjective criteria
•Include confidence scores calibrated to evidence strength

Notes

•Chain-of-thought prompting improves evaluation reliability
•Single-pass pairwise comparison is vulnerable to position bias: the judge may prefer a response because of where it appears, not what it says
•Validate automated evaluation against human judgments
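
The original does not say how to validate against human judgments. One common option, shown here purely as an assumption, is Cohen's kappa, which measures agreement between judge labels and human labels after correcting for agreement expected by chance. The function name and the label format are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(
    judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]
) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement if each rater labelled independently at their own rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks human raters closely, while a value near 0 means it agrees no more often than chance would predict.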

Source: muratcankoylan/Agent-Skills-for-Context-Engineering