Evaluate Benchmark Attempt
Overview
This SOP evaluates a completed brazil-bench attempt against the spec.md requirements, capturing metrics for comparison across orchestration patterns. It supports both Python and Swift/iOS implementations.
Parameters
- •attempt_repo (required): Repository name (e.g., attempt-3)
- •output_dir (optional, default: ./results): Where to write evaluation results
Steps
0. Detect Project Language
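Identify the primary language/platform of the implementation. The commands below run inside ./reviews/{attempt_repo}, so the clone from Step 1 must already exist.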
Detection Commands:
cd ./reviews/{attempt_repo}
# Python indicators
ls pyproject.toml setup.py requirements.txt 2>/dev/null
# Swift/iOS indicators
ls Package.swift *.xcodeproj *.xcworkspace 2>/dev/null
# Check file extensions
find . -name "*.py" -not -path "./.venv/*" | head -5
find . -name "*.swift" | head -5
Language Detection Matrix:
| Files Found | Language | Test Framework |
|---|---|---|
| pyproject.toml, *.py | Python | pytest |
| Package.swift, *.swift | Swift Package | swift test |
| *.xcodeproj, *.swift | iOS/Xcode | xcodebuild test |
| Both Python and Swift | Multi-language | Run both |
Constraints:
- •You MUST detect the language before running tests
- •You MUST use appropriate commands for the detected language
- •You SHOULD note the detected language in the report
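The detection matrix above can be sketched in Python as follows. This is a minimal illustration rather than part of the SOP tooling: the function name `detect_language` and the "Unknown" fallback label are assumptions made for this example, and only marker files are checked, not file extensions.

```python
from pathlib import Path


def detect_language(repo_dir: str) -> str:
    """Classify a cloned attempt using the marker files from the detection matrix.

    Hypothetical helper: the name and the "Unknown" fallback are illustrative only.
    """
    root = Path(repo_dir)
    python_markers = ("pyproject.toml", "setup.py", "requirements.txt")
    has_python = any((root / name).exists() for name in python_markers)
    has_spm = (root / "Package.swift").exists()
    has_xcode = any(root.glob("*.xcodeproj")) or any(root.glob("*.xcworkspace"))

    if has_python and (has_spm or has_xcode):
        return "Multi-language"
    if has_python:
        return "Python"
    if has_spm:
        return "Swift Package"
    if has_xcode:
        return "iOS/Xcode"
    return "Unknown"
```

The returned label can be recorded directly in the report and used to pick the test commands in Step 3.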
1. Clone Attempt
Fetch the attempt repository for local analysis.
Constraints:
- •You MUST clone into ./reviews/{attempt_repo}
- •You MUST verify the clone succeeded before proceeding
- •You MUST NOT modify any files in the cloned repo
gh repo clone brazil-bench/{attempt_repo} ./reviews/{attempt_repo}
2. Verify Spec Integrity
Confirm the spec.md was not modified from the template.
Constraints:
- •You MUST compare spec.md against the template version
- •You MUST fail the evaluation if spec.md was modified
- •You SHOULD use a checksum comparison
gh repo clone brazil-bench/benchmark-template ./reviews/_template --depth 1
diff ./reviews/{attempt_repo}/spec.md ./reviews/_template/spec.md
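The constraints recommend a checksum comparison, while the commands above use `diff`. One possible checksum variant is sketched below, assuming SHA-256 is acceptable; the helper names `sha256_of` and `spec_is_unmodified` are hypothetical and not part of the SOP.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def spec_is_unmodified(attempt_spec: str, template_spec: str) -> bool:
    """True when both spec.md files are byte-for-byte identical."""
    return sha256_of(attempt_spec) == sha256_of(template_spec)
```

A `False` result means the evaluation fails; running `diff` afterwards shows which lines changed so they can be noted in the report.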
3. Run Conformance Tests
Execute the test suite defined in the spec against the implementation.
Constraints:
- •You MUST attempt to run all tests specified in spec.md
- •You MUST capture pass/fail counts and output
- •You SHOULD time out tests after 60 seconds each
- •You MAY retry flaky tests once
- •If tests fail due to missing dependencies, follow the dependency resolution steps below
Python Test Commands
cd ./reviews/{attempt_repo}
# Run pytest with verbose output
pytest --tb=short -v 2>&1 | tee test_output.log
# Get summary counts
pytest --tb=no -q 2>&1 | tail -5
Swift/iOS Test Commands
cd ./reviews/{attempt_repo}
# Swift Package Manager
swift test 2>&1 | tee test_output.log
# Xcode project (iOS Simulator)
xcodebuild test \
-project *.xcodeproj \
-scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
2>&1 | tee test_output.log
# Parse xcodebuild results
grep -E "(Test Case|passed|failed)" test_output.log
# Using xcpretty for cleaner output (if available)
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
| xcpretty --report junit
3a. Handle Missing Dependencies (Neo4j, etc.)
If tests fail due to missing external dependencies like Neo4j:
Step 1: Try to start the dependency via Docker
# Check if Docker is available
docker --version
# Check for docker-compose files in the repo
ls ./reviews/{attempt_repo}/docker-compose*.yml
# If Neo4j docker-compose exists, start it
docker-compose -f ./reviews/{attempt_repo}/docker-compose.neo4j.yml up -d
# Or start Neo4j directly
docker run -d --name neo4j-eval -p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:5
# Wait for Neo4j to be ready
sleep 10
docker logs neo4j-eval 2>&1 | tail -5
Step 2: If Docker unavailable or fails, look for evidence of prior test runs
Check these sources for test results:
# Check git history for test-related commits
git -C ./reviews/{attempt_repo} log --oneline --all | grep -iE "(test|pass|100%|fix.*test)"
# Check for CI/CD logs or badges
grep -iE "(pass|badge|ci|test)" ./reviews/{attempt_repo}/README.md
# Check prompts.txt for test execution evidence
grep -iE "(pytest|test|pass|fail|scenario)" ./reviews/{attempt_repo}/prompts.txt 2>/dev/null
# Check for pytest cache with results
ls -la ./reviews/{attempt_repo}/.pytest_cache/ 2>/dev/null
# Check for coverage reports
ls -la ./reviews/{attempt_repo}/htmlcov/ ./reviews/{attempt_repo}/coverage.xml 2>/dev/null
Step 3: Document findings in the report
If tests cannot be run directly, document:
- •Why tests couldn't run (missing Neo4j, etc.)
- •Evidence found of prior test runs (commit messages, prompts.txt entries)
- •Claimed test results from the attempt's documentation
- •Mark as "CANNOT VERIFY" with explanation
Constraints for dependency handling:
- •You MUST try Docker first if available
- •You MUST search for evidence if Docker fails
- •You MUST NOT claim tests pass without verification
- •You SHOULD note the source of any claimed test results
- •You SHOULD clean up Docker containers after evaluation:
docker stop neo4j-eval && docker rm neo4j-eval
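The fixed `sleep 10` in Step 1 can be replaced by polling until the port answers. The sketch below assumes the Bolt port 7687 published by the `docker run` command above; `wait_for_port` is a hypothetical helper name. An open TCP port is only a proxy for readiness, so the first queries may still fail briefly after it returns.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, pause: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=pause):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(pause)
```

For example, `wait_for_port("localhost", 7687)` returns `True` once the container accepts connections, or `False` after 30 seconds.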
3b. Detect Skipped Tests
Skipped tests inflate test counts without providing actual verification. You MUST detect and report them separately.
Python: Detect Skipped Tests
Step 1: Run pytest with verbose output to capture skipped tests
cd ./reviews/{attempt_repo}
# Run pytest and capture skip count
pytest --tb=no -v 2>&1 | grep -E "(PASSED|FAILED|SKIPPED|ERROR)" | head -100
# Get summary counts
pytest --tb=no -q 2>&1 | tail -5
# Look for skip patterns in test files
grep -r "pytest.skip\|@pytest.mark.skip\|skipif\|xfail" tests/ --include="*.py"
Step 2: Analyze test files for skip patterns
# Count files that call pytest.skip() inside a test body (worst pattern); -l lists files, so this is a file count, not a test count
grep -r "pytest.skip(" tests/ --include="*.py" -l | wc -l
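# Count tests with the unconditional @pytest.mark.skip decorator (skipif is filtered out because the plain pattern also matches it)
grep -r "@pytest.mark.skip" tests/ --include="*.py" | grep -v "skipif" | wc -l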
# Count conditional skips (skipif)
grep -r "@pytest.mark.skipif" tests/ --include="*.py" | wc -l
Swift/iOS: Detect Skipped Tests
Step 1: Run swift test or xcodebuild and capture skipped tests
cd ./reviews/{attempt_repo}
# Swift Package Manager - look for skipped in output
swift test 2>&1 | grep -E "(passed|failed|skipped)"
# Xcode - parse test results
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
2>&1 | grep -E "Test Case.*passed|Test Case.*failed|skipped"
Step 2: Analyze test files for skip patterns
# Count XCTSkip usage (explicit skips)
grep -r "XCTSkip\|throw XCTSkip" Tests/ --include="*.swift" | wc -l
# Count disabled tests (func name doesn't start with test)
grep -r "func disabled_test\|// func test" Tests/ --include="*.swift" | wc -l
# Count tests with availability checks that skip
grep -r "@available\|#available" Tests/ --include="*.swift" -A 2 | grep -i skip | wc -l
# Look for conditional test execution
grep -r "guard.*else.*return\|if.*XCTSkip" Tests/ --include="*.swift" | wc -l
Swift Skip Patterns:
| Pattern | Type | Assessment |
|---|---|---|
| throw XCTSkip("reason") | Explicit skip | Acceptable if documented |
| #if !targetEnvironment(simulator) | Conditional | Acceptable for device-only |
| @available(iOS 16, *) | Version skip | Acceptable |
| Renamed to disabled_testFoo | Hidden skip | Should be penalized |
| Empty test body | Stub | Should be penalized |
Step 3: Calculate effective test count
| Metric | How to Calculate |
|---|---|
| Total Tests | Number of test functions defined |
| Passed Tests | Tests that ran and passed |
| Skipped Tests | Tests marked skip or calling pytest.skip() |
| Effective Tests | Total - Skipped (tests that actually run) |
| Skip Ratio | Skipped / Total (percentage of tests that skip) |
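The metrics in this table can be derived from a pytest summary line, as in the sketch below. It is an approximation under stated assumptions: the total is taken as passed + failed + skipped from the run (not a count of defined test functions), other outcomes such as errors or xfail are ignored, and `summarize_pytest` is a hypothetical helper name.

```python
import re


def summarize_pytest(summary_line: str) -> dict:
    """Derive skip metrics from a pytest summary line.

    Counts come from tokens such as "44 passed" or "15 skipped"; outcomes
    that do not appear in the line are treated as zero.
    """
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    for number, outcome in re.findall(r"(\d+) (passed|failed|skipped)", summary_line):
        counts[outcome] = int(number)
    total = counts["passed"] + counts["failed"] + counts["skipped"]
    effective = total - counts["skipped"]  # tests that actually ran
    skip_ratio = counts["skipped"] / total if total else 0.0
    return {**counts, "total": total, "effective": effective, "skip_ratio": skip_ratio}
```

For a line such as `44 passed, 15 skipped in 3.21s`, this yields a total of 59, an effective count of 44, and a skip ratio of roughly 0.25.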
Constraints for skipped test handling:
- •You MUST report skipped tests separately from passed tests
- •You MUST calculate the "effective test count" (passed + failed, excluding skipped)
- •You MUST flag ANY skipped tests for issue filing - zero tolerance for skips
- •You MUST distinguish between skip types for the issue description:
  - •Conditional skips (@pytest.mark.skipif): Document reason in issue
  - •Unconditional skips (pytest.skip() in body): Critical - tests never run
  - •Decorator skips (@pytest.mark.skip): Document reason in issue
- •You MUST NOT count skipped tests toward the test score in rankings
- •You MUST file an issue for ANY skipped test (no acceptable skip threshold)
Example Analysis:
Total tests: 59
Passed: 44
Skipped: 15 (25% skip ratio - HIGH)
Failed: 0
Effective: 44 (use this for scoring, not 59)
Skip breakdown:
- pytest.skip() in body: 15 (integration tests that never run)
- @pytest.mark.skipif: 0
- @pytest.mark.skip: 0
Flag: INFLATED TEST COUNT - 15 tests skip unconditionally
Document in Report:
## Test Results
| Metric | Count |
|--------|-------|
| Total Tests | 59 |
| Passed | 44 |
| **Skipped** | **15** |
| Failed | 0 |
| **Effective Tests** | **44** |
| Skip Ratio | 25% |
⚠️ **Warning:** 15 tests (25%) are skipped and never execute. These are integration tests that call `pytest.skip()` inside the test body. The effective test count for scoring is 44, not 59.
3c. Self-Contained Integration Tests (REQUIRED)
Integration tests MUST be self-contained and actually run. Tests that skip because "Neo4j not available" or similar are not acceptable.
Requirement: Integration tests must start their own data stores as needed.
Detection Commands:
cd ./reviews/{attempt_repo}
# Check for testcontainers usage (Python)
grep -r "testcontainers\|TestContainer\|DockerContainer" tests/ --include="*.py"
# Check for docker-compose in test setup
grep -r "docker-compose\|subprocess.*docker" tests/ --include="*.py"
# Check for pytest-docker fixture
grep -r "pytest-docker\|docker_compose" tests/ --include="*.py" pyproject.toml
# Check for in-memory alternatives (e.g., SQLite instead of Postgres)
grep -r "sqlite.*memory\|:memory:\|MockNeo4j\|FakeNeo4j" tests/ --include="*.py"
# Check for conftest fixtures that start services
grep -A 20 "@pytest.fixture" tests/conftest.py 2>/dev/null | grep -E "docker|container|start|neo4j"
# Swift: Check for test containers
grep -r "Docker\|Container\|TestServer" Tests/ --include="*.swift"
Acceptable Patterns for Self-Contained Tests:
| Pattern | Example | Assessment |
|---|---|---|
| testcontainers | Neo4jContainer() in fixture | ✓ Best - automatic lifecycle |
| pytest-docker | docker_compose_file fixture | ✓ Good - compose-based |
| conftest startup | Fixture runs docker run neo4j | ✓ Acceptable - manual but works |
| In-memory mock | MockNeo4jClient class | ✗ NOT acceptable - not persistent |
| External dependency | pytest.skip("Neo4j not running") | ✗ NOT acceptable |
| CI-only tests | @pytest.mark.skipif(not CI) | ✗ NOT acceptable |
| No integration tests | No tests for data layer | ✗ NOT acceptable |
Example: testcontainers Pattern (Python)
# conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer

# Neo4jClient is the attempt's own client class; import it from the implementation under test.

@pytest.fixture(scope="session")
def neo4j_container():
    """Start Neo4j container for integration tests."""
    with Neo4jContainer("neo4j:5") as neo4j:
        yield neo4j

@pytest.fixture
def neo4j_client(neo4j_container):
    """Get client connected to test container."""
    return Neo4jClient(
        uri=neo4j_container.get_connection_url(),
        auth=("neo4j", "password")
    )
Example: pytest-docker Pattern
# conftest.py
import pytest

# is_neo4j_ready() is a project-defined readiness probe that returns True once Neo4j accepts connections.

@pytest.fixture(scope="session")
def docker_compose_file():
    return "docker-compose.test.yml"

@pytest.fixture(scope="session")
def neo4j_service(docker_services):
    """Wait for Neo4j to be ready."""
    docker_services.wait_until_responsive(
        timeout=30.0,
        pause=0.5,
        check=lambda: is_neo4j_ready()
    )
Scoring Impact:
| Integration Test Quality | Score Modifier |
|---|---|
| Self-contained (testcontainers/docker) | No penalty |
| In-memory mock (not persistent) | -10 points quality |
| Skips due to missing dependency | -10 points quality |
| No integration tests at all | -15 points quality |
Constraints:
- •You MUST check if integration tests are self-contained
- •You MUST flag tests that skip due to external dependencies
- •You MUST NOT accept "works on CI" as justification for skipping locally
- •You SHOULD recommend testcontainers or pytest-docker patterns
- •You SHOULD verify integration tests actually execute (not just exist)
Document in Report:
## Integration Test Quality
| Aspect | Status |
|--------|--------|
| Self-contained | Yes/No |
| Data store management | testcontainers / docker-compose / mock / external |
| Integration tests run | X passed, Y skipped |
⚠️ **Issue:** Integration tests skip when Neo4j is not running. Tests should use testcontainers or pytest-docker to manage dependencies.
3d. Context Header Blocks (REQUIRED)
Every source code file MUST have a context header comment block that documents:
- •Purpose - What the file/module does
- •Interfaces - Key classes, functions, or APIs exposed
- •Change History - Record of modifications (updated on every change)
Detection Commands:
cd ./reviews/{attempt_repo}
# Python: Check for docstrings or header comments in source files
for f in $(find src -name "*.py" -not -name "__init__.py"); do
echo "=== $f ==="
head -50 "$f" | grep -E '""".*|^#.*Purpose|^#.*Context|CONTEXT BLOCK|Change History|Interfaces'
done
# Swift: Check for header comments
for f in $(find Sources -name "*.swift" 2>/dev/null); do
echo "=== $f ==="
head -50 "$f" | grep -E '///|/\*\*|Purpose|Context|History'
done
# Count files with context headers vs total
total=$(find src -name "*.py" -not -name "__init__.py" | wc -l)
with_header=$(find src -name "*.py" -not -name "__init__.py" -exec sh -c 'head -30 "$1" | grep -qE "CONTEXT|Purpose|Module:" && echo "$1"' _ {} \; | wc -l)
echo "Files with headers: $with_header / $total"
Required Header Format (Python):
"""
================================================================================
CONTEXT BLOCK
================================================================================
File: {filename}
Module: {module.path}
Purpose: {one-line description}
Description:
{detailed description of what this module does}
Interfaces:
- {ClassName}: {brief description}
- {function_name}(): {brief description}
Dependencies:
- {module}: {why needed}
Change History:
- {date}: {description of change}
- {date}: Initial creation
================================================================================
"""
Required Header Format (Swift):
//
// {FileName}.swift
// {ProjectName}
//
// Purpose: {one-line description}
//
// Interfaces:
// - {ClassName}: {brief description}
// - {functionName}(): {brief description}
//
// Change History:
// - {date}: {description of change}
// - {date}: Initial creation
//
Assessment Criteria:
| Coverage | Assessment | Score Impact |
|---|---|---|
| 100% files have headers | Excellent | No penalty |
| 75-99% files have headers | Good | -2 quality |
| 50-74% files have headers | Partial | -5 quality |
| <50% files have headers | Poor | -10 quality |
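The thresholds in this table can be expressed as a small lookup, sketched below. The function name `assess_header_coverage` is hypothetical, and the penalty is returned as a plain integer of quality points.

```python
def assess_header_coverage(with_header: int, total: int) -> tuple[str, int]:
    """Map header coverage to the assessment label and quality penalty from the table."""
    if total <= 0:
        raise ValueError("total must be positive")
    coverage = 100 * with_header / total
    if coverage >= 100:
        return ("Excellent", 0)
    if coverage >= 75:
        return ("Good", -2)
    if coverage >= 50:
        return ("Partial", -5)
    return ("Poor", -10)
```

With 8 of 10 source files carrying headers, the result is `("Good", -2)`.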
Constraints:
- •You MUST check all source files for context headers
- •You MUST verify headers include purpose, interfaces, and change history
- •You MUST flag files missing headers for issue filing
- •You SHOULD note which files have incomplete headers (missing sections)
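Incomplete headers can be spotted by checking for the three required section labels near the top of each file. The sketch below assumes the labels appear literally as `Purpose:`, `Interfaces:`, and `Change History:`, as in the required formats above; `missing_header_sections` is a hypothetical helper name.

```python
REQUIRED_SECTIONS = ("Purpose", "Interfaces", "Change History")


def missing_header_sections(source_text: str, max_lines: int = 50) -> list[str]:
    """Return the required section labels absent from the first max_lines lines."""
    head = "\n".join(source_text.splitlines()[:max_lines])
    return [name for name in REQUIRED_SECTIONS if f"{name}:" not in head]
```

An empty list means all three sections are present; any returned names can be listed next to the file under "Files Missing Headers".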
Document in Report:
## Context Header Compliance
| Metric | Count |
|--------|-------|
| Source files | X |
| With headers | Y |
| Coverage | Z% |
### Files Missing Headers
- `src/module.py` - No header
- `src/utils.py` - Missing change history
### Assessment
{Excellent/Good/Partial/Poor} - {X}% coverage
4. Measure Code Metrics
Collect quantitative data about the implementation.
Constraints:
- •You MUST capture: total lines of code, number of files, dependencies
- •You SHOULD capture: cyclomatic complexity, test coverage
- •You MAY capture: documentation coverage, type hint coverage
Python Metrics
# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/src -name "*.py" | xargs wc -l
# Dependencies
cat ./reviews/{attempt_repo}/pyproject.toml | grep dependencies -A 50
# File count
find ./reviews/{attempt_repo}/src -name "*.py" | wc -l
Swift/iOS Metrics
# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/Sources -name "*.swift" | xargs wc -l
# For Xcode projects
find ./reviews/{attempt_repo} -name "*.swift" -not -path "*/Tests/*" -not -path "*Test*" | xargs wc -l
# Dependencies (Swift Package Manager)
cat ./reviews/{attempt_repo}/Package.swift | grep -A 50 "dependencies:"
# Dependencies (CocoaPods)
cat ./reviews/{attempt_repo}/Podfile 2>/dev/null
# Dependencies (Xcode project - SPM)
grep -r "repositoryURL" ./reviews/{attempt_repo}/*.xcodeproj/project.pbxproj 2>/dev/null | head -20
# File count
find ./reviews/{attempt_repo}/Sources -name "*.swift" | wc -l
# Check for SwiftLint configuration
ls ./reviews/{attempt_repo}/.swiftlint.yml 2>/dev/null
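The file and line counts above can also be collected in one pass, which may be convenient when filling in the report table. This is a minimal sketch with a hypothetical helper name, `code_metrics`; it counts text lines, so a final line without a trailing newline is included, unlike `wc -l`.

```python
from pathlib import Path


def code_metrics(src_dir: str, suffix: str = ".py") -> dict:
    """Count source files and total lines under src_dir for one file suffix."""
    files = [p for p in Path(src_dir).rglob(f"*{suffix}") if p.is_file()]
    lines = sum(
        len(p.read_text(encoding="utf-8", errors="replace").splitlines())
        for p in files
    )
    return {"files": len(files), "lines": lines}
```

Passing `suffix=".swift"` and the `Sources` directory gives the equivalent Swift figures.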
5. Extract Git Metrics and Analyze Development Timeline
Analyze the development history to separate agent-driven work from human interactions.
Constraints:
- •You MUST capture: total commits, time from first to last commit
- •You MUST separate commits into agent-driven vs human-driven phases
- •You MUST calculate autonomous duration (agent work only)
- •You SHOULD capture: number of reverts, force pushes (if detectable)
- •You SHOULD extract commit messages mentioning "fix", "revert", "oops"
5a. Gather Raw Git Data
cd ./reviews/{attempt_repo}
# Full commit history with timestamps and messages
git log --format="%ai | %H | %s" --reverse
# Count total commits
git log --oneline | wc -l
# Find fix/revert commits
git log --format="%H %s" | grep -iE "(fix|revert|oops|wrong)"
# Get first and last commit times
git log --format="%ai" --reverse | head -1 # First commit
git log --format="%ai" | head -1 # Last commit
5b. Identify Development Phases
Analyze commit timestamps and messages to identify distinct phases:
Phase 1: Setup (Human)
- •Initial commit, repo setup, file uploads
- •Typically first 1-3 commits before implementation starts
- •Look for: "Initial commit", "Add files", "upload", "setup"
Phase 2: Agent Implementation
- •Bulk implementation work by the agent
- •Characterized by:
- •Rapid succession of commits (minutes apart)
- •Large code changes
- •Messages like "Implement", "Add", "Create"
- •Consistent commit patterns (same author, similar timing)
Phase 3: Agent Test Iteration
- •Test fixing and iteration by the agent
- •Characterized by:
- •Commits mentioning "fix", "test", "pass"
- •Still rapid succession
- •Often shows progression: "Fix X" → "Fix Y" → "100% pass"
Phase 4: Human Intervention (Post-Completion)
- •Human-driven changes after agent work completes
- •Characterized by:
- •Time gaps (hours/days after previous commits)
- •Different commit patterns or author info
- •Messages about data, documentation, cleanup
- •Changes not required by the spec
5c. Heuristics for Identifying Agent vs Human Commits
Agent commits typically show:
- •Timestamps within minutes of each other
- •Consistent formatting in commit messages
- •Co-authored-by lines mentioning Claude/AI
- •Large, comprehensive changes
- •Focus on implementation and tests
Human commits typically show:
- •Time gaps of hours or days from previous work
- •Different commit message style
- •Focus on data, docs, or polish
- •Smaller, targeted changes
- •Work done after "100% tests pass" milestone
# List commit timestamps oldest-first and scan for gaps > 1 hour (potential phase boundaries)
git log --format="%ai" --reverse
# Check for co-author lines indicating AI
git log --format="%b" | grep -i "co-authored"
# Check prompts.txt for session boundaries
grep -E "^(Done|Session|Agent)" prompts.txt 2>/dev/null
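Gaps between commits can be found programmatically from the `%ai` timestamps. The sketch below assumes the timestamps are supplied oldest-first in the form git prints for `%ai` (for example `2025-10-01 09:00:00 -0300`); `find_gaps` is a hypothetical helper name and the one-hour threshold follows the heuristic above.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps: list[str], threshold: timedelta = timedelta(hours=1)) -> list[tuple[str, str]]:
    """Return (previous, next) timestamp pairs separated by more than threshold.

    Expects strings in `git log --format="%ai"` form, oldest first.
    """
    parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S %z") for ts in timestamps]
    return [
        (timestamps[i - 1], timestamps[i])
        for i in range(1, len(parsed))
        if parsed[i] - parsed[i - 1] > threshold
    ]
```

Each returned pair marks a candidate phase boundary that still needs to be confirmed against commit messages and author information.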
5d. Calculate Duration Metrics
| Metric | How to Calculate |
|---|---|
| Total Duration | Last commit - First commit |
| Agent Duration | Sum of time during agent phases only |
| Human Duration | Sum of time during human phases |
| Autonomous Duration | Phase 2 + Phase 3 (implementation + test fixing) |
Example Timeline Analysis:
09:00:00 - Initial commit (Human Setup)
09:05:00 - Add spec file (Human Setup)
--- Agent work begins ---
09:15:00 - Implement Phase 1 (Agent)
09:45:00 - Implement Phase 2 (Agent)
10:10:00 - Implement Phase 3 (Agent)
10:25:00 - Fix test issues (Agent)
10:40:00 - 100% tests pass (Agent)
--- Agent work ends ---
--- 2 day gap ---
Oct 3 - Add real data (Human)
Oct 3 - Update docs (Human)
Agent Duration: ~1h 25m (09:15 → 10:40)
Human Duration: ~5m setup + later changes
Autonomous Duration: ~1h 25m
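The arithmetic in this example can be checked with a short script. It is a sketch under simplifying assumptions: all agent commits fall on one day, each commit has already been labeled "agent" or "human" by the evaluator, and `autonomous_duration` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def autonomous_duration(commits: list[tuple[str, str]]) -> timedelta:
    """Span from the first to the last commit labeled "agent".

    `commits` holds (HH:MM:SS, label) pairs from a single day.
    """
    agent_times = [
        datetime.strptime(t, "%H:%M:%S") for t, label in commits if label == "agent"
    ]
    if not agent_times:
        return timedelta(0)
    return max(agent_times) - min(agent_times)


# Day 1 of the example timeline above.
timeline = [
    ("09:00:00", "human"),
    ("09:05:00", "human"),
    ("09:15:00", "agent"),
    ("09:45:00", "agent"),
    ("10:10:00", "agent"),
    ("10:25:00", "agent"),
    ("10:40:00", "agent"),
]
print(autonomous_duration(timeline))  # 1:25:00
```

The result matches the ~1h 25m reported for the agent phases (09:15 to 10:40).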
5e. Document in Report
Include a Development Timeline section in the report:
## Development Duration Breakdown
| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | ~5 min | Initial commit, file upload |
| **Agent Implementation** | ~55 min | Agent implements all phases |
| **Agent Test Iteration** | ~30 min | Agent iterates to 100% pass |
| **Total Autonomous** | **~1h 25m** | Agent work only |
| **Human Intervention** | 2 days later | Data and docs added |
### Commit Analysis
- Total commits: 15
- Agent commits: 10 (09:15 - 10:40 on Day 1)
- Human commits: 5 (setup + Day 3 changes)
- Fix commits: 3 (normal iteration, not rework)
6. Analyze Against Spec
Review implementation completeness against spec.md requirements.
Constraints:
- •You MUST evaluate against ALL 16 canonical requirements listed below
- •You MUST assess each as: implemented, partial, missing
- •You SHOULD note implementation approach for each
- •You MUST NOT make subjective quality judgments beyond spec compliance
- •You MUST use the exact requirement numbering for cross-attempt consistency
6.0 Canonical Requirements Checklist (16 Requirements)
All evaluations MUST use this exact checklist to ensure consistency across attempts.
Functional Requirements (6):
- •[FR-1] Search and return match data from all CSV files
- •[FR-2] Search and return player data
- •[FR-3] Calculate basic statistics (wins, losses, goals)
- •[FR-4] Compare teams head-to-head
- •[FR-5] Handle team name variations correctly
- •[FR-6] Return properly formatted responses
Query Performance (3):
7. [QP-1] Simple lookups respond in < 2 seconds
8. [QP-2] Aggregate queries respond in < 5 seconds
9. [QP-3] No timeout errors
Data Coverage (3):
10. [DC-1] All 6 CSV files are loadable and queryable
11. [DC-2] At least 20 sample questions can be answered
12. [DC-3] Cross-file queries work (player + match data)
Technical Requirements (4):
13. [TR-1] MCP server implementation with callable tools
14. [TR-2] BDD testing with Given-When-Then structure
15. [TR-3] UTF-8 encoding support (Portuguese characters: ã, ç, é, etc.)
16. [TR-4] Multiple date format handling (ISO, Brazilian DD/MM/YYYY, with time)
Report Format for Requirements:
## Requirements Checklist
### Functional Requirements (X/6)
- [x] [FR-1] Search and return match data from all CSV files
- [x] [FR-2] Search and return player data
- [ ] [FR-3] Calculate basic statistics (partial: missing draws)
...
### Query Performance (X/3)
- [x] [QP-1] Simple lookups respond in < 2 seconds
...
### Data Coverage (X/3)
- [x] [DC-1] All 6 CSV files are loadable and queryable
...
### Technical Requirements (X/4)
- [x] [TR-1] MCP server implementation with callable tools
- [x] [TR-2] BDD testing with Given-When-Then structure
...
**Total: X/16 requirements implemented**
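The per-category X/N figures in this report format can be tallied from a status map. The sketch below assumes that only "implemented" counts toward X, so "partial" and "missing" are both left unchecked as in the FR-3 example; the names `CATEGORIES` and `tally` are hypothetical.

```python
# Requirement IDs per category, as listed in the canonical checklist.
CATEGORIES = {
    "Functional Requirements": ["FR-1", "FR-2", "FR-3", "FR-4", "FR-5", "FR-6"],
    "Query Performance": ["QP-1", "QP-2", "QP-3"],
    "Data Coverage": ["DC-1", "DC-2", "DC-3"],
    "Technical Requirements": ["TR-1", "TR-2", "TR-3", "TR-4"],
}


def tally(statuses: dict[str, str]) -> dict[str, str]:
    """Return "implemented/total" per category plus an overall total.

    `statuses` maps a requirement ID to "implemented", "partial", or
    "missing"; IDs that are absent are treated as "missing".
    """
    result = {}
    done_overall = 0
    total_overall = 0
    for category, ids in CATEGORIES.items():
        done = sum(1 for rid in ids if statuses.get(rid) == "implemented")
        done_overall += done
        total_overall += len(ids)
        result[category] = f"{done}/{len(ids)}"
    result["Total"] = f"{done_overall}/{total_overall}"
    return result
```

With every requirement implemented except a partial FR-3, the tally reports 5/6 functional requirements and 15/16 overall.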
6a. Real Data vs Simulated Data Assessment
Determine whether the implementation uses real external data or simulated/mock data.
Real Data Indicators:
- •Data loaders for external sources (Kaggle, APIs, etc.)
- •CSV/JSON files in data directory
- •API client code with authentication
- •Data normalization/mapping logic for external schemas
Simulated Data Indicators:
- •Hardcoded test fixtures
- •Factory/faker-generated data
- •Mock data in test files only
- •No external data loading code
Constraints for Real Data Implementations:
- •You MUST note which external data source is used
- •You MUST assess schema mapping quality (how well does the implementation adapt external schema to spec schema)
- •You MUST distinguish between:
- •Schema Implemented: The code defines models matching spec entities
- •Data Populated: The data loader can populate those fields from external source
- •Not Available in Source: Spec field cannot be populated because external data doesn't include it
- •You SHOULD credit implementations that adapt to real-world data constraints
- •You SHOULD note any enhancements beyond spec (e.g., additional fields from richer data sources)
Adjusted Compliance Scoring:
- •If real data is used and a spec field is "Not Available in Source", count it as:
- •Implemented if the model/schema supports the field
- •Note the data limitation separately
- •Example: If spec requires "attendance" but Kaggle data has no attendance:
- •Check if Match model has attendance field (schema compliance)
- •Note that field would be null with Kaggle data (data limitation)
- •This is NOT a failure - it's a data source constraint
6b. Documentation Quality Assessment
Evaluate the README.md for essential user documentation.
Required Elements:
- •Setup Instructions: Prerequisites, installation steps, environment configuration
- •MCP Server Setup: How to start the server, how to connect Claude
- •Example Q&A: Sample questions and expected responses/output
Extraction Commands:
# Check README content
head -100 ./reviews/{attempt_repo}/README.md
# Look for key documentation sections
grep -E "Quick Start|Installation|Setup|MCP|Example|Usage" ./reviews/{attempt_repo}/README.md
Documentation Quality Levels:
| Level | Criteria | In Report |
|---|---|---|
| Excellent | All 3 elements + extras (architecture, API ref, troubleshooting) | "Comprehensive README" |
| Good | All 3 required elements present | "Good documentation" |
| Acceptable | 2 of 3 elements | "Partial documentation" |
| Poor | 0-1 elements | "Missing documentation" |
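The quality levels above reduce to a count of the three required elements plus a flag for extras. A minimal sketch follows; `documentation_level` is a hypothetical helper name, and deciding whether each element is present remains a manual judgment.

```python
def documentation_level(setup: bool, mcp_setup: bool, examples: bool, extras: bool = False) -> str:
    """Map the presence of README elements to the quality levels in the table."""
    present = sum([setup, mcp_setup, examples])
    if present == 3:
        # Extras (architecture, API reference, troubleshooting) only matter
        # once all three required elements are present.
        return "Excellent" if extras else "Good"
    if present == 2:
        return "Acceptable"
    return "Poor"
```

For example, a README with setup instructions and example Q&A but no MCP server setup is rated "Acceptable".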
Best Practice Reference:
- •2025-10-30-python-hive: Excellent (Quick Start, MCP config, 15+ demo questions, architecture, troubleshooting)
- •2025-12-15-python-claude-ruvector: Excellent (detailed setup, claude mcp add example, Q&A with output)
Include in Report:
## Documentation Quality
| Element | Present | Notes |
|---------|---------|-------|
| Setup Instructions | Yes/No | {details} |
| MCP Server Setup | Yes/No | {details} |
| Example Q&A | Yes/No | {details} |
**Assessment:** {Excellent/Good/Acceptable/Poor}
7. Generate Codebase Documentation
Generate comprehensive documentation for the implementation using the codebase-summary SOP.
Constraints:
- •You MUST run the codebase-summary skill on the cloned repository
- •You MUST output documentation to {output_dir}/{attempt_repo}-summary/
- •You SHOULD use the generated documentation to inform the final report
- •The documentation provides architecture, components, interfaces, and workflow analysis
summarize codebase reviews/{attempt_repo} to {output_dir}/{attempt_repo}-summary/
8. Generate Report
Produce structured evaluation output.
Constraints:
- •You MUST write results to {output_dir}/{attempt_repo}.md
- •You MUST include: attempt name, orchestration pattern, all metrics
- •You MUST use consistent format for cross-attempt comparison
- •You SHOULD include raw data as appendix
Output Format
# Evaluation: {attempt_repo}
## Summary
- **Pattern:** [swarm|hive|solo|...]
- **Spec Compliance:** X/Y requirements
- **Tests:** X passed, Y skipped, Z failed (N effective)
- **Autonomous Duration:** Xh Ym
- **Documentation:** See `{attempt_repo}-summary/`
## Metrics
| Metric | Value |
|--------|-------|
| Lines of Code | |
| Files | |
| Dependencies | |
| Commits (Total) | |
| Commits (Agent) | |
| Commits (Human) | |
| Fix Commits | |
| Tests (Total) | |
| Tests (Passed) | |
| Tests (Skipped) | |
| Tests (Effective) | |
| Skip Ratio | |
## Development Duration Breakdown
| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | | Initial commit, file upload |
| **Agent Implementation** | | Core implementation work |
| **Agent Test Iteration** | | Test fixing to 100% pass |
| **Total Autonomous** | | Agent work only |
| **Human Intervention** | | Post-completion changes |
### Timeline
{timestamp} - {commit message} ({phase})
...
### Commit Analysis
- Total commits: X
- Agent commits: X (timespan)
- Human commits: X (description)
- Fix commits: X (context: normal iteration vs rework)
## Requirements Checklist
- [x] Requirement 1
- [ ] Requirement 2 (partial: notes)
- [ ] Requirement 3 (missing)
## Architecture Summary
(Key insights from generated codebase documentation)
## Raw Data
...
Troubleshooting
Clone fails
- •Verify repo exists: gh repo view brazil-bench/{attempt_repo}
- •Check permissions: repo must be public or you need access
Tests won't run due to missing dependencies
- •Try starting Neo4j via Docker (see Step 3a above)
- •If Docker unavailable, search for evidence of prior test runs
- •Check git commits for "100% pass" or similar messages
- •Check prompts.txt for pytest output
- •Document as "CANNOT VERIFY" with evidence found
Neo4j connection errors
- •Verify Neo4j is running: docker ps | grep neo4j
- •Check credentials match: NEO4J_AUTH=neo4j/password
- •Wait for startup: Neo4j needs ~10-15 seconds to initialize
- •Check logs: docker logs neo4j-eval
Spec diff shows changes
- •Fail the evaluation
- •Note the changes in the report
- •This invalidates the benchmark comparison
Codebase documentation fails
- •Verify the codebase-summary skill is available
- •Check that the codebase-path exists and contains code
- •Ensure the output directory is writable
- •Try running the skill standalone first to debug