Skill: Hybrid LLM Judge with Granular Scoring

Overview

Attribute	Value
Date	2026-01-10
Objective	Reduce LLM judge variance while maintaining quality assessment through hybrid evaluation
Outcome	✅ Variance reduced from 14% → 6%, granular scoring working, multi-run reporting added
Primary Innovation	Hybrid 80/20 split: checklist (objective) + subjective (engineering judgment)
Key Metric	Achieved <10% variance target for identical implementations

When to Use This Skill

Use this approach when you need to:

•Reduce judge variance - Identical outputs getting different scores (>10% variance)
•Balance objectivity with judgment - Need both measurable criteria AND quality assessment
•Enable continuous scoring - Want fine-grained scores (0.82, 1.7) not just discrete (0, 0.5, 1)
•Support multi-run analysis - Need grade distribution and consistency metrics
•Maintain transparency - Require category-level score breakdown

Do NOT use when:

•Tasks are purely objective (use pure checklist)
•Variance doesn't matter (simple pass/fail sufficient)
•No engineering judgment needed (automated tests work)

Verified Workflow

1. Design Hybrid Rubric (80% Checklist / 20% Subjective)

File: tests/fixtures/tests/{task}/expected/rubric.yaml

yaml

categories:
  functional:
    weight: 0.35
    scoring_type: "checklist"  # Objective, measurable
    items:
      - id: F1
        check: "File hello.py exists in workspace root"
        points: 1.0

      - id: F2
        check: "Running `python hello.py` produces correct output"
        points: 1.0

  code_quality:
    weight: 0.20
    scoring_type: "checklist"
    items:
      - id: Q1
        check: "Python syntax is valid"
        points: 1.0

      - id: Q2
        check: "Code is idiomatic Python"
        points: 1.0

  overall_quality:
    weight: 0.20  # Subjective component
    scoring_type: "subjective"
    items:
      - id: OQ1
        check: "Overall engineering judgment: appropriateness, maintainability, clarity"
        points: 2.0  # Larger scale for granularity

grading:
  pass_threshold: 0.60
  grade_scale:
    S: 0.95
    A: 0.80
    B: 0.60
    C: 0.40
    D: 0.20
    F: 0.0

Key Principles:

•Checklist categories: Binary/near-binary (80% total weight)
•Subjective categories: Continuous scale with 2.0+ points (20% total weight)
•N/A conditions for environmental factors

2. Update Judge System Prompt

File: config/judge/system_prompt.md

Critical sections:

markdown

## Evaluation Methodology

### Two Scoring Types

1. **Checklist Categories** (`scoring_type: "checklist"`)
   - Award ANY value between 0 and max
   - Proportional to satisfaction: 0.3, 0.7, 0.85, etc.
   - NOT limited to 0, 0.5, 1.0

2. **Subjective Categories** (`scoring_type: "subjective"`)
   - Continuous scale (e.g., 0.0 to 2.0)
   - Anchored examples at multiple levels
   - Engineering judgment

### Anchored Examples (max = 2.0)

- **2.0**: Exceptional - perfectly appropriate
- **1.7**: Excellent - minor improvements possible
- **1.4**: Good - solid with some complexity
- **1.0**: Acceptable - functional with concerns
- **0.6**: Marginal - significant issues
- **0.3**: Poor - barely functional
- **0.0**: Unacceptable - broken

Why anchors matter: Without them, judges default to 0/0.5/1 discrete scoring.

3. Implement Grade Aggregation (Multi-Run)

File: scylla/e2e/models.py

Add to SubTestResult:

python

@dataclass
class SubTestResult:
    # ... existing fields ...
    grade_distribution: dict[str, int] | None = None  # {"A": 8, "B": 2}
    modal_grade: str | None = None
    min_grade: str | None = None
    max_grade: str | None = None

File: scylla/e2e/subtest_executor.py

In _aggregate_results():

python

# Aggregate grades
grades = [r.judge_grade for r in runs if r.judge_grade]
if grades:
    # Distribution
    grade_distribution = {}
    for g in grades:
        grade_distribution[g] = grade_distribution.get(g, 0) + 1

    # Modal (most common)
    modal_grade = max(grade_distribution, key=grade_distribution.get)

    # Range (F=worst, S=best)
    grade_order = ["F", "D", "C", "B", "A", "S"]
    grade_indices = [grade_order.index(g) for g in grades if g in grade_order]
    if grade_indices:
        min_grade = grade_order[min(grade_indices)]
        max_grade = grade_order[max(grade_indices)]

4. Update Reports

File: scylla/e2e/run_report.py

Add after runs table:

python

if result.grade_distribution:
    grade_order = ["S", "A", "B", "C", "D", "F"]
    sorted_dist = sorted(
        result.grade_distribution.items(),
        key=lambda x: grade_order.index(x[0]) if x[0] in grade_order else 99,
    )
    dist_str = ", ".join(f"{g}={c}" for g, c in sorted_dist)
    md_lines.append(f"**Distribution**: {dist_str}")
    md_lines.append(f"**Modal Grade**: {result.modal_grade}")
    md_lines.append(f"**Grade Range**: {result.min_grade} - {result.max_grade}")

5. Test and Calibrate

bash

# Run with multiple attempts to test variance
pixi run python scripts/run_e2e_experiment.py \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 \
  --runs 3 \
  --parallel 3

# Check variance
jq '.summary | {mean: .mean_score, median: .median_score, std_dev: .std_dev, distribution: .grade_distribution}' \
  results/latest/T0/*/report.json

# Target: std_dev < 0.03 (3% variance)

Failed Attempts & Lessons Learned

❌ Pure Checklist Approach

What we tried: 100% checklist with discrete 0/0.5/1 scoring

Why it failed:

•Too rigid - couldn't capture overall code quality
•Environmental noise penalized agents (pre-commit hooks, pycache)
•No engineering judgment capability

Lesson: Need subjective component for holistic quality assessment

❌ Pure Subjective Grading

What we tried: Vague criteria like "code quality: 0-1 score"

Why it failed:

•High variance (14% for identical outputs)
•Non-deterministic weights per criterion
•No reference points for scores

Lesson: Need objective anchors to reduce variance

❌ Discrete Scoring Only (0, 0.5, 1)

What we tried: Limited judges to three score values

Why it failed:

•Lost granularity - "good but not perfect" collapsed to 0.5
•Couldn't distinguish quality levels
•Partial credit impossible

Lesson: Continuous scoring with anchored examples essential

⚠️ Initial Weight Distribution

What we tried: 50/50 checklist/subjective split

Why suboptimal:

•Too much subjective weight increased variance
•80/20 split found to be optimal balance

Lesson: Majority checklist (objective) with minority subjective (judgment)

Results & Key Parameters

Variance Reduction

Metric	Before	After	Target
Variance (identical outputs)	14%	6%	<10%
Score range	0.74-0.88	0.81-0.87	±0.05
Scoring resolution	3 values	Continuous	>10 distinct

Test Results (test-001, 24 subtests)

Fractional scores observed:

•Checklist: 0.8196, 0.8625, 0.867
•Subjective: 1.0, 1.7, 1.8 (on 2.0 scale)

Grade distribution (3 runs):

code

Distribution: A=3
Modal Grade: A
Grade Range: A - A

Critical Configuration Values

Weight Distribution:

yaml

functional: 0.35        # Core functionality
code_quality: 0.20      # Objective quality
proportionality: 0.15   # Scope appropriateness
build_pipeline: 0.10    # CI/CD compliance
overall_quality: 0.20   # Subjective judgment ← KEY

Subjective Scale:

•Use 2.0+ points (not 1.0)
•Provides granularity: 1.7 ≠ 1.8 ≠ 2.0

Grade Ordering:

python

["F", "D", "C", "B", "A", "S"]  # For min/max calculation

Implementation Checklist

When implementing this system:

Files Modified

Core Evaluation:

•config/judge/system_prompt.md - Judge instructions
•tests/fixtures/tests/*/expected/rubric.yaml - Task-specific rubrics

Models & Logic:

•scylla/e2e/models.py - Added grade fields to SubTestResult
•scylla/e2e/subtest_executor.py - Grade aggregation logic
•scylla/e2e/llm_judge.py - Rubric path parameter, parsing

Reporting:

•scylla/e2e/run_report.py - Grade statistics display

References

•Anthropic: Demystifying Evals for AI Agents
•Anthropic recommendation: Combine multiple grader types (code-based + model-based)
•ProjectScylla implementation: docs/eval_hybrid_approach.md
•Test results: results/2026-01-10T15-08-06-test-001/ (24 subtests, 6% variance)

Quick Start

bash

# 1. Design rubric with scoring_type fields
vim tests/fixtures/tests/my-task/expected/rubric.yaml

# 2. Update system prompt if needed (already done in this implementation)
# No action required - config/judge/system_prompt.md already updated

# 3. Run evaluation with multiple attempts
pixi run python scripts/run_e2e_experiment.py \
  --tiers-dir tests/fixtures/tests/my-task \
  --runs 5 \
  --tiers T0

# 4. Check grade statistics
cat results/latest/T0/*/report.md | grep -A 5 "Grade Statistics"

# 5. Verify variance
jq '.summary | {std_dev, distribution: .grade_distribution}' \
  results/latest/T0/*/report.json

Troubleshooting

Problem: Still seeing only 0/0.5/1 scores

Solution: Check system prompt includes "Award ANY value between 0 and max"

Problem: High variance still present (>10%)

Solutions:

•Increase checklist weight (reduce subjective %)
•Add more anchored examples
•Make checklist criteria more objective
•Check for environmental factors not marked N/A

Problem: Grade distribution not showing

Solution: Ensure grade_distribution field added to SubTestResult and passed from _aggregate_results()

Problem: Judges still using discrete 0/0.5/1

Solution: Add explicit instruction: "Do not limit yourself to 0, 0.5, or 1.0" in system prompt

Success Indicators

✅ Variance < 10% for identical outputs ✅ Fractional scores appearing (0.82, 1.7, not just 0.5/1.0) ✅ Grade distribution showing counts ✅ Subjective category uses 2.0 scale with values like 1.7, 1.8 ✅ N/A items excluded from scoring ✅ Backward compatible (old judgments still parse)

Related Skills

•evaluation/llm-judge-calibration - Calibrating judges against human experts
•evaluation/variance-analysis - Analyzing and debugging score variance
•evaluation/rubric-design - Designing effective evaluation rubrics