Skill: Fix Evaluation Framework Bugs
Overview
| Field | Value |
|---|---|
| Date | 2026-01-18 |
| Category | Debugging / Evaluation |
| Objective | Fix three critical E2E evaluation framework bugs causing false negative agent scores |
| Outcome | ✓ Success - All bugs fixed, CI passing, framework now scores agents correctly |
| Session ID | skill/evaluation/fix-judge-file-access |
When to Use This Skill
Use this debugging pattern when:
- •Agents penalized for framework issues - Scores deducted for files/formatting the agent didn't create
- •Intermittent FileNotFoundError in parallel execution contexts
- •Pre-commit failures on framework-generated files - CLAUDE.md, config files
- •Judge sees test configuration modifications - Agent evaluated on framework changes
- •Markdown lint violations in generated files - Framework creates invalid markup
Trigger Conditions
- •Error: FileNotFoundError: 'results/.../T3/best_subtest.json'
- •Judge deducting points for CLAUDE.md formatting (R014: -1.0, R010: -0.5, R008: -0.5)
- •Pre-commit hooks failing: MD040/fenced-code-language, MD022, MD032, MD047
- •Git diff showing CLAUDE.md modifications in agent evaluations
Problem Patterns
Pattern 1: Directory Assignment ≠ Directory Creation
Symptom: Intermittent FileNotFoundError during parallel execution
Root Cause: Path assigned but mkdir() never called
# VULNERABLE
tier_dir = self.experiment_dir / tier_id.value
# ... later ...
save_selection(selection, str(tier_dir / "best_subtest.json")) # FAILS
Pattern 2: Framework Files in Judge Patchfile
Symptom: Agents penalized for CLAUDE.md modifications they didn't make
Root Cause: _get_patchfile() includes all files without filtering test config
# VULNERABLE
subprocess.run(["git", "diff"], ...) # Includes CLAUDE.md
Pattern 3: Invalid Framework-Generated Markdown
Symptom: Pre-commit failures on framework-created files
Root Cause: Missing blank lines, no newline at EOF
# INVALID (missing blank lines)
Use the following sub-agent to solve this task:
- chief-architect

## Cleanup Requirements
- Remove temporary files...
Verified Workflow
Step 1: Identify Framework vs Agent Issues
Analyze judge output to separate concerns:
# Check judge output for CLAUDE.md references
grep -i "claude.md\|r014\|format.*fail" judge_output.log
# Check if CLAUDE.md was in patchfile
grep -A 10 "patchfile" judge_output.log | grep -i claude
Key questions:
- •Did the agent create/modify CLAUDE.md? → NO = framework bug
- •Did formatting fail on framework files? → YES = framework bug
- •Is error intermittent/timing-dependent? → YES = likely framework race condition
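The first two questions can be partly automated by splitting the changed paths in a workspace into framework-managed and agent-authored sets. The sketch below is only an illustration, not the framework's own code: `FRAMEWORK_PATHS`, `is_framework_path`, and `split_changes` are hypothetical names, the managed paths are assumed from the files named in this document (CLAUDE.md and .claude), and the input paths are made-up examples.

```python
from pathlib import PurePosixPath

# Assumption: paths the framework manages, taken from the files named in this document.
FRAMEWORK_PATHS = ("CLAUDE.md", ".claude")


def is_framework_path(path: str) -> bool:
    """Return True if `path` is a managed top-level file or lives under a managed directory."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] in FRAMEWORK_PATHS


def split_changes(changed: list[str]) -> tuple[list[str], list[str]]:
    """Split changed paths into (framework, agent) lists, preserving order."""
    framework = [p for p in changed if is_framework_path(p)]
    agent = [p for p in changed if not is_framework_path(p)]
    return framework, agent


# Example input, e.g. the lines printed by `git diff --name-only`.
framework, agent = split_changes(["CLAUDE.md", "src/app.py", ".claude/agents/x.md"])
print("framework:", framework)
print("agent:", agent)
```

If the judge's deductions refer only to paths in the first list, the problem most likely belongs to the framework rather than the agent.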
Step 2: Fix Directory Creation Race Condition
File: scylla/e2e/runner.py
Location: Immediately after directory path assignment
# Find assignment
tier_dir = self.experiment_dir / tier_id.value
# Add mkdir immediately after
tier_dir.mkdir(parents=True, exist_ok=True)
Why this works:
- •parents=True - Creates parent directories if needed
- •exist_ok=True - Idempotent (no error if already exists)
- •Immediate creation - No gap for race conditions
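To see why the combination of `parents=True` and `exist_ok=True` is safe under parallel execution, the following sketch has several threads prepare the same directory at once. It is a standalone illustration under assumptions: `prepare_tier_dir` is a hypothetical helper, and the temporary directory stands in for the real experiment directory.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def prepare_tier_dir(experiment_dir: Path, tier_name: str) -> Path:
    """Build the tier path and create it on disk in the same step."""
    tier_dir = experiment_dir / tier_name  # assignment alone creates nothing
    tier_dir.mkdir(parents=True, exist_ok=True)  # safe to call from many workers
    return tier_dir


with tempfile.TemporaryDirectory() as tmp:
    experiment_dir = Path(tmp) / "results" / "experiment"  # not created yet
    # Eight workers race to prepare the same directory; none should raise.
    with ThreadPoolExecutor(max_workers=8) as pool:
        dirs = list(pool.map(lambda _: prepare_tier_dir(experiment_dir, "T3"), range(8)))
    target = dirs[0] / "best_subtest.json"
    target.write_text("{}\n")
    created = target.exists()
    print("all workers agree:", len(set(dirs)) == 1, "| file written:", created)
```

Keeping the path construction and the `mkdir` call in one helper makes it hard for a later code path to write into a directory that was only ever assigned.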
Step 3: Filter Framework Files from Patchfile
File: scylla/e2e/llm_judge.py
Location: _get_patchfile() function
# Before
["git", "diff"]
["git", "diff", "--cached"]
# After - Use git pathspec exclusion
["git", "diff", "--", ".", ":(exclude)CLAUDE.md", ":(exclude).claude"]
["git", "diff", "--cached", "--", ".", ":(exclude)CLAUDE.md", ":(exclude).claude"]
Why this works:
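- •Git's :(exclude) pathspec drops matching paths from the diff itself, so the output needs no post-processing
- •The same pathspec is applied to both the unstaged and the staged (--cached) diff
- •Judge never sees framework modifications to CLAUDE.md or .claude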
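One way to keep the two diff invocations consistent is to build both argument lists from a single helper. The sketch below is an assumption about how that could look, not the actual body of `_get_patchfile()`: `EXCLUDED_PATHS`, `build_diff_command`, and `collect_patch` are hypothetical names, and `collect_patch` needs git on the PATH.

```python
import subprocess
from pathlib import Path

# Assumption: framework-managed paths, as named in this document.
EXCLUDED_PATHS = ("CLAUDE.md", ".claude")


def build_diff_command(
    cached: bool = False, excluded: tuple[str, ...] = EXCLUDED_PATHS
) -> list[str]:
    """Return a `git diff` argv that omits the excluded paths via pathspec magic."""
    cmd = ["git", "diff"]
    if cached:
        cmd.append("--cached")  # staged changes instead of unstaged ones
    # "--" ends the options; "." selects everything, then each exclusion is subtracted.
    cmd += ["--", "."] + [f":(exclude){path}" for path in excluded]
    return cmd


def collect_patch(workspace: Path) -> str:
    """Concatenate the unstaged and staged diffs for `workspace` (requires git)."""
    chunks = []
    for cached in (False, True):
        result = subprocess.run(
            build_diff_command(cached),
            cwd=workspace,
            capture_output=True,
            text=True,
            check=False,
        )
        chunks.append(result.stdout)
    return "".join(chunks)


print(build_diff_command())
print(build_diff_command(cached=True))
```

Because both commands come from the same function, a new framework-managed path only has to be added in one place.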
Step 4: Generate Valid Markdown
File: scylla/e2e/tier_manager.py
Location: build_resource_suffix() method
# Before (invalid markdown)
suffixes.append(f"{prefix}\n{bullet_list}")
cleanup = "\n\n## Cleanup Requirements\n..."
# After (valid markdown)
suffixes.append(f"{prefix}\n\n{bullet_list}") # Blank line before bullet list (MD032)
cleanup = "\n\n## Cleanup Requirements\n\n..." # Blank line after heading (MD022)
cleanup += "...\n" # Newline at EOF
Markdown rules fixed:
- •MD022: Blank line after heading before content
- •MD032: Blank line before bullet list
- •MD047: Newline at end of file
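The three rules can be satisfied by construction if the blank lines and the final newline are added in one place. The following is a minimal standalone sketch, not the real `build_resource_suffix()` method: `build_section` and `build_document` are hypothetical helpers, and the cleanup bullet is placeholder text.

```python
def build_section(prefix: str, items: list[str]) -> str:
    """Join a lead-in line and a bullet list with the blank line MD032 expects."""
    bullets = "\n".join(f"- {item}" for item in items)
    return f"{prefix}\n\n{bullets}"


def build_document(prefix: str, items: list[str], cleanup_items: list[str]) -> str:
    """Assemble the body plus a cleanup section, ending with exactly one newline."""
    body = build_section(prefix, items)
    # The heading is the "prefix" here, so it is followed by a blank line (MD022).
    cleanup = build_section("## Cleanup Requirements", cleanup_items)
    # Blank line between sections; rstrip guards against stray trailing newlines (MD047).
    return f"{body}\n\n{cleanup}".rstrip("\n") + "\n"


doc = build_document(
    "Use the following sub-agent to solve this task:",
    ["chief-architect"],
    ["Remove temporary files"],  # placeholder bullet text
)
print(doc, end="")
```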
Step 5: Update Unit Tests
File: tests/unit/e2e/test_tier_manager.py
# Update CLEANUP_INSTRUCTIONS constant
CLEANUP_INSTRUCTIONS = (
    "\n\n## Cleanup Requirements\n\n" # Added \n\n
    "- Remove any temporary files...\n"
    "- Clean up after yourself...\n" # Added \n
)
# Update all test expectations
expected = (
    "Use the following tool:\n\n" # Added \n\n
    "- Read" + CLEANUP_INSTRUCTIONS
)
Step 6: Verify All Fixes
# 1. Run unit tests locally
pixi run pytest tests/unit/e2e/test_tier_manager.py -v
# 2. Verify tier directory creation
pixi run python scripts/run_e2e_experiment.py --tiers T0 T1 T2 T3 --runs 1 --max-subtests 2
# 3. Check CLAUDE.md formatting
pixi run python << 'EOF'
from scylla.e2e.tier_manager import TierManager
from scylla.e2e.models import SubTestConfig
from pathlib import Path
manager = TierManager(Path("."))
subtest = SubTestConfig(
    id="01", name="test", description="test",
    resources={"agents": {"names": ["chief-architect"]}}
)
result = manager.build_resource_suffix(subtest)
# Verify formatting
assert "\n\n-" in result # Blank line before bullets
assert result.endswith("\n") # Newline at EOF
print("✅ CLAUDE.md formatting valid")
EOF
# 4. Verify patchfile exclusion
pixi run python << 'EOF'
import tempfile, subprocess
from pathlib import Path
import sys; sys.path.insert(0, 'src')
from scylla.e2e.llm_judge import _get_patchfile
with tempfile.TemporaryDirectory() as tmpdir:
    workspace = Path(tmpdir)
    subprocess.run(["git", "init"], cwd=workspace, capture_output=True)
    subprocess.run(["git", "config", "user.name", "Test"], cwd=workspace, capture_output=True)
    subprocess.run(["git", "config", "user.email", "test@test.com"], cwd=workspace, capture_output=True)
    (workspace / "file.txt").write_text("test\n")
    (workspace / "CLAUDE.md").write_text("config\n")
    subprocess.run(["git", "add", "."], cwd=workspace, capture_output=True)
    subprocess.run(["git", "commit", "-m", "initial"], cwd=workspace, capture_output=True)
    (workspace / "file.txt").write_text("modified\n")
    (workspace / "CLAUDE.md").write_text("modified config\n")
    patchfile = _get_patchfile(workspace)
    assert "CLAUDE.md" not in patchfile
    assert "file.txt" in patchfile
    print("✅ CLAUDE.md excluded from patchfile")
EOF
Failed Attempts
None - All solutions worked on first try.
Why Solutions Worked Immediately
- •Clear error messages - Stack traces pointed to exact locations
- •Pattern recognition - Similar bugs seen in other contexts
- •Test-driven validation - Unit tests caught issues before CI
- •Double protection strategy - Filter + format ensures no single point of failure
Key Insights
Critical Understandings
- •Framework bugs masquerade as agent failures
  - •Judge sees all changes, including framework-managed files
  - •Agents penalized for things they didn't do
  - •False negatives harm evaluation validity
- •Python pathlib behavior
  - •Path() assignment does NOT create directories
  - •Always call .mkdir(parents=True, exist_ok=True) immediately
  - •Race conditions expose these bugs in parallel execution
- •Markdown formatting rules matter
  - •Framework-generated content must pass same checks as agent code
  - •Blank lines after headings (MD022)
  - •Blank lines before lists (MD032)
  - •Newline at EOF (MD047)
- •Double protection strategy
  - •Filter CLAUDE.md from patchfile (primary)
  - •Generate valid markdown (secondary)
  - •If one fails, the other catches it
Best Practices
# ALWAYS: Create directory immediately after assignment
directory_path = parent_path / "subdir"
directory_path.mkdir(parents=True, exist_ok=True)
# ALWAYS: Exclude framework files from git operations
["git", "diff", "--", ".", ":(exclude)CLAUDE.md", ":(exclude).claude"]
# ALWAYS: Generate valid markdown from framework
content = f"# Heading\n\n- Bullet item\n" # Blank lines + EOF newline
Results & Verification
Fixes Applied
| Bug | File | Line | Fix |
|---|---|---|---|
| Directory not created | runner.py | 625 | tier_dir.mkdir(parents=True, exist_ok=True) |
| CLAUDE.md in patchfile | llm_judge.py | 683, 693 | :(exclude)CLAUDE.md pathspec |
| Invalid markdown | tier_manager.py | 620, 638, 652, 670, 680-683 | Add \n\n and \n |
| Test expectations | test_tier_manager.py | 13-17, 50, 66, 97, 111 | Update to match format |
Commit History
Branch: skill/evaluation/fix-judge-file-access
- •beb8ed7: fix(e2e): create tier directory before writing best_subtest.json
- •8d4a9d0: fix(judge): exclude CLAUDE.md and .claude/ from patchfile
- •bbf8f5b: fix(tier-manager): generate properly formatted CLAUDE.md
- •673467d: test(tier-manager): update tests for properly formatted CLAUDE.md
Impact
Before fixes:
- •Intermittent FileNotFoundError in parallel tier execution
- •Agents incorrectly penalized 2.0 points for framework issues (R014: -1.0, R010: -0.5, R008: -0.5)
- •Pre-commit hooks failed on framework-generated files
- •Judge evaluated agents on framework modifications
After fixes:
- •✅ Tier directories always created before file writes
- •✅ Judge never sees CLAUDE.md changes (double protection)
- •✅ Framework generates valid markdown
- •✅ Agents evaluated only on their actual work
- •✅ All CI tests passing
Verification Commands
# Run the experiment that previously failed
pixi run python scripts/run_e2e_experiment.py \
    --tiers-dir tests/fixtures/tests/test-002 \
    --tiers T0 T1 T2 T3 T4 T5 T6 \
    --runs 1 --max-subtests 2 -v --fresh
# Check for FileNotFoundError
grep -i "filenotfounderror" experiment.log
# Expected: No matches
# Check tier directories created
ls -la results/experiment/T{0,1,2,3,4,5,6}/
# Expected: All directories exist
# Verify CLAUDE.md formatting via the tier manager unit tests
# (patchfile exclusion is checked by the script in Step 6)
pixi run pytest tests/unit/e2e/test_tier_manager.py -v
# Expected: All tests pass
Reusability
This pattern applies to any evaluation framework where:
- •Framework manages test configuration - Files like CLAUDE.md, .claude/, settings.json
- •Parallel execution with checkpoint/resume - Race conditions in directory creation
- •LLM judges evaluate git diffs - Must filter framework-managed files
- •Framework generates content - Must follow same quality standards as agent output
- •Pre-commit hooks run on all changes - Including framework-generated files
Common Locations
- •Result directory setup in parallel executors
- •Judge patchfile generation
- •Test configuration composition
- •Markdown/config file generation
- •Any git diff-based evaluation
Related Issues
- •Parallel execution race conditions
- •Framework vs agent responsibility boundaries
- •Test configuration file handling
- •Git pathspec exclusion patterns
- •Markdown formatting validation
- •Double protection strategies for robustness
Prevention Checklist
Add to framework code review:
- • Every directory path assignment followed by .mkdir(parents=True, exist_ok=True)?
- • All git diff operations exclude framework files with pathspec?
- • All framework-generated content passes pre-commit hooks?
- • Unit tests verify framework-generated content format?
- • Judge only sees agent-created changes, not framework config?
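The checklist item on unit-testing generated content could be backed by a small format check that runs without the pre-commit toolchain. The sketch below is a rough approximation of the MD022, MD032, and MD047 checks, not a replacement for markdownlint: `find_format_problems` is a hypothetical helper, it only looks at `#` headings and `- ` bullets, and it ignores fenced code blocks and nested lists.

```python
def find_format_problems(text: str) -> list[str]:
    """Report simplified MD022/MD032/MD047-style problems found in `text`."""
    problems = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ""
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # MD022 (simplified): a heading must be followed by a blank line.
        if line.startswith("#") and next_line.strip():
            problems.append(f"line {i + 1}: heading not followed by a blank line")
        # MD032 (simplified): the first bullet of a list must follow a blank line.
        if line.startswith("- ") and prev_line.strip() and not prev_line.startswith("- "):
            problems.append(f"line {i + 1}: list not preceded by a blank line")
    # MD047: the text must end with exactly one newline.
    if not text.endswith("\n") or text.endswith("\n\n"):
        problems.append("file must end with exactly one newline")
    return problems


invalid = "Use the following tool:\n- Read\n## Cleanup Requirements\n- Remove files"
valid = "Use the following tool:\n\n- Read\n\n## Cleanup Requirements\n\n- Remove files\n"

for problem in find_format_problems(invalid):
    print(problem)
print("valid text problems:", find_format_problems(valid))
```

A unit test can then assert that the framework's generated text yields an empty problem list, which catches regressions before the pre-commit hooks do.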