Skill: E2E Evaluation Report Fixes
| Attribute | Value |
|---|---|
| Date | 2026-01-10 |
| Objective | Fix critical issues in E2E evaluation reports and add judge model validation |
| Outcome | ✅ Success - All P0/P1 issues resolved |
| Context | ProjectScylla E2E evaluation framework |
When to Use This Skill
Use this skill when encountering these symptoms in E2E evaluation reports:
- •UnboundLocalError crashes during experiment execution
- •Workspace files not showing in reports (empty "Workspace State" section)
- •Judge models failing with invalid model IDs
- •Broken links in reports pointing to non-existent files
- •Missing timing data or timing files being overwritten
- •Inconsistent judge scoring across runs with identical outcomes
Problem Summary
Issues Fixed
| Priority | Issue | Impact |
|---|---|---|
| P0 | UnboundLocalError: cannot access local variable 'json' | Experiment crashes immediately |
| P0 | Workspace detection broken - only shows README | Reports incomplete, missing actual deliverables |
| P0 | Invalid judge model IDs cause fallback scores | Incorrect evaluation results |
| P1 | Judge timing overwritten (parent dir instead of per-judge) | Loss of per-judge performance data |
| P1 | Broken result.json links in reports | User confusion, broken navigation |
Verified Workflow
Fix 1: UnboundLocalError in Timing Persistence
Problem: import json inside conditional block, but json.dump() called outside it
Solution: Move import to method level
# In scylla/e2e/subtest_executor.py
def _execute_single_run(self, ...):
import json # ← Move to top of method
# Track execution timing
subtest_id = f"{tier_id.value}_{subtest.id}"
...
Why it works: Python scoping - imports inside if blocks are only available within that block.
Fix 2: Workspace File Detection
Problem: Using only git diff HEAD~1 or git ls-files missed uncommitted files
Solution: Combine two approaches with status indicators
def _get_workspace_files(workspace_path: Path) -> list[tuple[str, str]]:
files_with_status = []
# 1. Get committed files via git diff
result = subprocess.run(
["git", "diff", "--name-only", "HEAD~1", "HEAD"],
cwd=workspace_path, ...
)
if result.returncode == 0:
for line in result.stdout.strip().split("\n"):
if file_path and not _is_test_config_file(file_path):
files_with_status.append((file_path, "committed"))
# 2. Get untracked/modified files via git status
result = subprocess.run(
["git", "status", "--porcelain"],
cwd=workspace_path, ...
)
if result.returncode == 0:
for line in result.stdout.strip().split("\n"):
file_path = line[3:].strip() # Skip status code
if file_path and not _is_test_config_file(file_path):
# Skip if already added as committed
if not any(f[0] == file_path for f in files_with_status):
files_with_status.append((file_path, "uncommitted"))
return sorted(files_with_status, key=lambda x: x[0])
Display in report:
for file_path, status in workspace_files:
status_indicator = "✓" if status == "committed" else "⚠"
lines.append(f"- [{file_path}](./workspace/{file_path}) {status_indicator} {status}")
Fix 3: Judge Model Validation
Problem: Invalid model IDs (opus-4-0, sonnet-4-0) caused judge failures
Solution: Validate models before experiment + add shortcuts
# In scripts/run_e2e_experiment.py
def validate_model(model_id: str) -> bool:
"""Test model availability with small prompt."""
try:
result = subprocess.run(
["claude", "--model", model_id, "--output-format", "json", "Say 'OK'"],
capture_output=True,
text=True,
timeout=10,
)
return result.returncode == 0
except (subprocess.TimeoutExpired, FileNotFoundError, Exception):
return False
# Add model shortcuts
shortcuts = {
# Claude 4.5 models
"opus-4-5": "claude-opus-4-5-20251101",
"sonnet-4-5": "claude-sonnet-4-5-20250929",
"haiku-4-5": "claude-haiku-4-5", # Correct ID
# Claude 4.0 models
"opus-4-0": "claude-opus-4-20250514",
"sonnet-4-0": "claude-sonnet-4-20250514",
"haiku-4-0": "claude-haiku-4-0-20250514",
}
# Validate before adding to config
for model in args.add_judge:
resolved_model = resolve_judge_model(model)
logger.info(f"Validating judge model: {resolved_model}")
if not validate_model(resolved_model):
logger.warning(f"⚠️ Judge model '{resolved_model}' not available. Skipping.")
continue
config_dict["judge_models"].append(resolved_model)
Fix 4: Per-Judge Timing Files
Problem: Single judge/timing.json overwritten by each judge
Solution: Write timing to each judge's subdirectory
# In scylla/e2e/llm_judge.py
def run_llm_judge(..., judge_run_number: int = 1):
import json
import time
from datetime import UTC, datetime
judge_start = time.time()
# Create judge_{N}/ subdirectory
actual_judge_dir = judge_dir / f"judge_{judge_run_number:02d}"
actual_judge_dir.mkdir(parents=True, exist_ok=True)
# ... run judge ...
# Write per-judge timing
if actual_judge_dir:
judge_duration = time.time() - judge_start
timing_file = actual_judge_dir / "timing.json"
with open(timing_file, "w") as f:
json.dump({
"judge_duration_seconds": judge_duration,
"measured_at": datetime.now(UTC).isoformat(),
}, f, indent=2)
Result: judge/judge_01/timing.json, judge/judge_02/timing.json, etc.
Fix 5: Remove Broken result.json Links
Problem: Reports linked to ./judge/judge_01/result.json but file doesn't exist
Solution: Remove result.json links, keep only judgment.json
# In scylla/e2e/run_report.py
# OLD:
f"- [View judgment](./judge/judge_{judge.judge_number:02d}/judgment.json)",
f"- [View result JSON](./judge/judge_{judge.judge_number:02d}/result.json)",
# NEW:
f"- [View judgment](./judge/judge_{judge.judge_number:02d}/judgment.json)",
# (removed result.json link)
Fix 6: Tiered Deduction Guidelines
Problem: Inconsistent scoring for identical flaws (e.g., pycache)
Solution: Add calibrated deduction tiers to judge system prompt
### Subjective Scoring: Deduction Guidelines #### Tier: Tiny (0.0 - 0.1 points) Issues outside agent's direct control: - `__pycache__` directory (Python runtime artifact) - `.pyc` files generated by interpreter Example: "Agent left __pycache__/" → -0.05 to -0.1 #### Tier: Small (0.1 - 0.2 points) Minor oversights that don't affect functionality: - Missing trailing newline - Verbose but correct solution Example: "Code works but uses 8 lines where 3 would suffice" → -0.15 [... Medium, Large, XLarge, Catastrophic tiers ...] **Important**: Environmental artifacts should be treated as Tiny issues at most.
Failed Attempts
❌ Attempt 1: Using only git diff HEAD~1
What we tried: Compare HEAD against previous commit to find agent-created files
result = subprocess.run(
["git", "diff", "--name-only", "HEAD~1", "HEAD"],
cwd=workspace_path, ...
)
Why it failed: Agent didn't commit files - they remained untracked. git diff only shows committed changes between refs, not untracked files.
Lesson: Need both committed (git diff) AND uncommitted (git status) detection.
❌ Attempt 2: Using only git ls-files
What we tried: List all tracked files as fallback
result = subprocess.run(["git", "ls-files"], cwd=workspace_path, ...)
Why it failed: Only shows tracked files. Untracked files like hello.py were invisible.
Lesson: Must use git status --porcelain to find untracked files.
❌ Attempt 3: Assuming Haiku 4.5 doesn't exist
What we tried: Mapped haiku-4-5 → claude-haiku-4-0-20250514 with warning
"haiku-4-5": "claude-haiku-4-0-20250514", # Note: Haiku 4.5 doesn't exist yet
if model_shorthand == "haiku-4-5":
logger.warning("Haiku 4.5 doesn't exist yet. Using Haiku 4.0 instead.")
Why it failed: Haiku 4.5 DOES exist as claude-haiku-4-5 (released recently)
Lesson: Always verify model IDs from official documentation before hardcoding assumptions.
❌ Attempt 4: Keeping unused status_code variable
What we tried: Parse git status with status code extraction
status_code = line[:2] # XY status codes file_path = line[3:].strip()
Why it failed: Variable assigned but never used - pre-commit hook caught it with ruff
Lesson: Don't extract data you don't need. Status indicators come from separate logic.
Results & Parameters
Model ID Mappings (Copy-Paste Ready)
shortcuts = {
# Claude 4.5 models
"opus-4-5": "claude-opus-4-5-20251101",
"sonnet-4-5": "claude-sonnet-4-5-20250929",
"haiku-4-5": "claude-haiku-4-5",
# Claude 4.0 models
"opus-4-0": "claude-opus-4-20250514",
"sonnet-4-0": "claude-sonnet-4-20250514",
"haiku-4-0": "claude-haiku-4-0-20250514",
}
Validation Timeout
timeout=10 # seconds for model validation test
Test Command
pixi run python scripts/run_e2e_experiment.py \ --tiers-dir tests/fixtures/tests/test-001 \ --tiers T0 --max-subtests 1 \ --add-judge sonnet-4-5 --add-judge haiku-4-5 \ --runs 3 --parallel 7 -v
Expected Outcomes
| Metric | Before | After |
|---|---|---|
| Crashes | UnboundLocalError immediately | ✅ No crashes |
| Workspace detection | Only README shown | ✅ Shows hello.py + status |
| Invalid judges | 3/5 fallback scores (0.7) | ✅ All valid or skipped |
| Timing files | Overwritten parent file | ✅ Per-judge files |
| Broken links | result.json 404s | ✅ No broken links |
| Score variance | 0.90, 0.89, 0.85 (inconsistent) | ✅ Target < 0.03 |
Key Insights
- •Python scoping matters: Imports inside conditional blocks don't leak to outer scope
- •Git has two file states: Committed (tracked) and uncommitted (untracked/modified)
- •Model validation prevents silent failures: Test before experiment, not during
- •Per-resource timing prevents overwrites: Use subdirectories for parallel resources
- •Tiered guidelines reduce variance: Anchored deductions improve judge consistency
Related Files
- •
scylla/e2e/subtest_executor.py- Main executor with timing persistence - •
scylla/e2e/run_report.py- Report generation with workspace detection - •
scylla/e2e/llm_judge.py- Judge runner with per-judge timing - •
scripts/run_e2e_experiment.py- Experiment orchestrator with validation - •
config/judge/system_prompt.md- Judge guidelines with tiered deductions
References
- •PR: #172 on HomericIntelligence/ProjectScylla
- •Commit:
476af08- fix(evaluation): Fix report issues and add judge model validation - •Haiku 4.5 announcement: https://www.anthropic.com/news/claude-haiku-4-5