Evaluate Benchmark Attempt

Name: evaluate-attempt
Rating: 76
Author: brazil-bench

Overview

This SOP evaluates a completed brazil-bench attempt against the spec.md requirements, capturing metrics for comparison across orchestration patterns. Supports both Python and Swift/iOS implementations.

Parameters

•attempt_repo (required): Repository name (e.g., attempt-3)
•output_dir (optional, default: ./results): Where to write evaluation results

Steps

0. Detect Project Language

Identify the primary language/platform of the implementation.

Detection Commands:

bash

cd ./reviews/{attempt_repo}

# Python indicators
ls pyproject.toml setup.py requirements.txt 2>/dev/null

# Swift/iOS indicators
ls Package.swift *.xcodeproj *.xcworkspace 2>/dev/null

# Check file extensions
find . -name "*.py" -not -path "./.venv/*" | head -5
find . -name "*.swift" | head -5

Language Detection Matrix:

Files Found	Language	Test Framework
`pyproject.toml`, `*.py`	Python	pytest
`Package.swift`, `*.swift`	Swift Package	swift test
`.xcodeproj`, `.swift`	iOS/Xcode	xcodebuild test
Both Python and Swift	Multi-language	Run both

Constraints:

•You MUST detect the language before running tests
•You MUST use appropriate commands for the detected language
•You SHOULD note the detected language in the report

1. Clone Attempt

Fetch the attempt repository for local analysis.

Constraints:

•You MUST clone into ./reviews/{attempt_repo}
•You MUST verify the clone succeeded before proceeding
•You MUST NOT modify any files in the cloned repo

bash

gh repo clone brazil-bench/{attempt_repo} ./reviews/{attempt_repo}

2. Verify Spec Integrity

Confirm the spec.md was not modified from the template.

Constraints:

•You MUST compare spec.md against the template version
•You MUST fail the evaluation if spec.md was modified
•You SHOULD use a checksum comparison

bash

gh repo clone brazil-bench/benchmark-template ./reviews/_template --depth 1
diff ./reviews/{attempt_repo}/spec.md ./reviews/_template/spec.md

3. Run Conformance Tests

Execute the test suite defined in the spec against the implementation.

Constraints:

•You MUST attempt to run all tests specified in spec.md
•You MUST capture pass/fail counts and output
•You SHOULD timeout tests after 60 seconds each
•You MAY retry flaky tests once
•If tests fail due to missing dependencies, follow the dependency resolution steps below

Python Test Commands

bash

cd ./reviews/{attempt_repo}

# Run pytest with verbose output
pytest --tb=short -v 2>&1 | tee test_output.log

# Get summary counts
pytest --tb=no -q 2>&1 | tail -5

Swift/iOS Test Commands

bash

cd ./reviews/{attempt_repo}

# Swift Package Manager
swift test 2>&1 | tee test_output.log

# Xcode project (iOS Simulator)
xcodebuild test \
    -project *.xcodeproj \
    -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    2>&1 | tee test_output.log

# Parse xcodebuild results
grep -E "(Test Case|passed|failed)" test_output.log

# Using xcpretty for cleaner output (if available)
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    | xcpretty --report junit

3a. Handle Missing Dependencies (Neo4j, etc.)

If tests fail due to missing external dependencies like Neo4j:

Step 1: Try to start the dependency via Docker

bash

# Check if Docker is available
docker --version

# Check for docker-compose files in the repo
ls ./reviews/{attempt_repo}/docker-compose*.yml

# If Neo4j docker-compose exists, start it
docker-compose -f ./reviews/{attempt_repo}/docker-compose.neo4j.yml up -d

# Or start Neo4j directly
docker run -d --name neo4j-eval -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:5

# Wait for Neo4j to be ready
sleep 10
docker logs neo4j-eval 2>&1 | tail -5

Step 2: If Docker unavailable or fails, look for evidence of prior test runs

Check these sources for test results:

bash

# Check git history for test-related commits
git log --oneline --all | grep -iE "(test|pass|100%|fix.*test)"

# Check for CI/CD logs or badges
cat ./reviews/{attempt_repo}/README.md | grep -iE "(pass|badge|ci|test)"

# Check prompts.txt for test execution evidence
cat ./reviews/{attempt_repo}/prompts.txt 2>/dev/null | grep -iE "(pytest|test|pass|fail|scenario)"

# Check for pytest cache with results
ls -la ./reviews/{attempt_repo}/.pytest_cache/ 2>/dev/null

# Check for coverage reports
ls -la ./reviews/{attempt_repo}/htmlcov/ ./reviews/{attempt_repo}/coverage.xml 2>/dev/null

Step 3: Document findings in the report

If tests cannot be run directly, document:

•Why tests couldn't run (missing Neo4j, etc.)
•Evidence found of prior test runs (commit messages, prompts.txt entries)
•Claimed test results from the attempt's documentation
•Mark as "CANNOT VERIFY" with explanation

Constraints for dependency handling:

•You MUST try Docker first if available
•You MUST search for evidence if Docker fails
•You MUST NOT claim tests pass without verification
•You SHOULD note the source of any claimed test results
•You SHOULD clean up Docker containers after evaluation: docker stop neo4j-eval && docker rm neo4j-eval

3b. Detect Skipped Tests

Skipped tests inflate test counts without providing actual verification. You MUST detect and report them separately.

Python: Detect Skipped Tests

Step 1: Run pytest with verbose output to capture skipped tests

bash

cd ./reviews/{attempt_repo}

# Run pytest and capture skip count
pytest --tb=no -v 2>&1 | grep -E "(PASSED|FAILED|SKIPPED|ERROR)" | head -100

# Get summary counts
pytest --tb=no -q 2>&1 | tail -5

# Look for skip patterns in test files
grep -r "pytest.skip\|@pytest.mark.skip\|skipif\|xfail" tests/ --include="*.py"

Step 2: Analyze test files for skip patterns

bash

# Count tests that call pytest.skip() inside the test body (worst pattern)
grep -r "pytest.skip(" tests/ --include="*.py" -l | wc -l

# Count tests with @pytest.mark.skip decorator
grep -r "@pytest.mark.skip" tests/ --include="*.py" | wc -l

# Count conditional skips (skipif)
grep -r "@pytest.mark.skipif" tests/ --include="*.py" | wc -l

Swift/iOS: Detect Skipped Tests

Step 1: Run swift test or xcodebuild and capture skipped tests

bash

cd ./reviews/{attempt_repo}

# Swift Package Manager - look for skipped in output
swift test 2>&1 | grep -E "(passed|failed|skipped)"

# Xcode - parse test results
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    2>&1 | grep -E "Test Case.*passed|Test Case.*failed|skipped"

Step 2: Analyze test files for skip patterns

bash

# Count XCTSkip usage (explicit skips)
grep -r "XCTSkip\|throw XCTSkip" Tests/ --include="*.swift" | wc -l

# Count disabled tests (func name doesn't start with test)
grep -r "func disabled_test\|// func test" Tests/ --include="*.swift" | wc -l

# Count tests with availability checks that skip
grep -r "@available\|#available" Tests/ --include="*.swift" -A 2 | grep -i skip | wc -l

# Look for conditional test execution
grep -r "guard.*else.*return\|if.*XCTSkip" Tests/ --include="*.swift" | wc -l

Swift Skip Patterns:

Pattern	Type	Assessment
`throw XCTSkip("reason")`	Explicit skip	Acceptable if documented
`#if !targetEnvironment(simulator)`	Conditional	Acceptable for device-only
`@available(iOS 16, *)`	Version skip	Acceptable
Renamed to `disabled_testFoo`	Hidden skip	Should be penalized
Empty test body	Stub	Should be penalized

Step 3: Calculate effective test count

Metric	How to Calculate
Total Tests	Number of test functions defined
Passed Tests	Tests that ran and passed
Skipped Tests	Tests marked skip or calling pytest.skip()
Effective Tests	Total - Skipped (tests that actually run)
Skip Ratio	Skipped / Total (percentage of tests that skip)

Constraints for skipped test handling:

•You MUST report skipped tests separately from passed tests
•You MUST calculate the "effective test count" (passed + failed, excluding skipped)
•You MUST flag ANY skipped tests for issue filing - zero tolerance for skips
•
You MUST distinguish between skip types for the issue description:
- •Conditional skips (@pytest.mark.skipif): Document reason in issue
- •Unconditional skips (pytest.skip() in body): Critical - tests never run
- •Decorator skips (@pytest.mark.skip): Document reason in issue
•You MUST NOT count skipped tests toward the test score in rankings
•You MUST file an issue for ANY skipped test (no acceptable skip threshold)

Example Analysis:

code

Total tests:     59
Passed:          44
Skipped:         15  (25% skip ratio - HIGH)
Failed:          0
Effective:       44  (use this for scoring, not 59)

Skip breakdown:
- pytest.skip() in body: 15 (integration tests that never run)
- @pytest.mark.skipif: 0
- @pytest.mark.skip: 0

Flag: INFLATED TEST COUNT - 15 tests skip unconditionally

Document in Report:

markdown

## Test Results

| Metric | Count |
|--------|-------|
| Total Tests | 59 |
| Passed | 44 |
| **Skipped** | **15** |
| Failed | 0 |
| **Effective Tests** | **44** |
| Skip Ratio | 25% |

⚠️ **Warning:** 15 tests (25%) are skipped and never execute.
These are integration tests that call `pytest.skip()` inside the test body.
The effective test count for scoring is 44, not 59.

3c. Self-Contained Integration Tests (REQUIRED)

Integration tests MUST be self-contained and actually run. Tests that skip because "Neo4j not available" or similar are not acceptable.

Requirement: Integration tests must start their own data stores as needed.

Detection Commands:

bash

cd ./reviews/{attempt_repo}

# Check for testcontainers usage (Python)
grep -r "testcontainers\|TestContainer\|DockerContainer" tests/ --include="*.py"

# Check for docker-compose in test setup
grep -r "docker-compose\|subprocess.*docker" tests/ --include="*.py"

# Check for pytest-docker fixture
grep -r "pytest-docker\|docker_compose" tests/ --include="*.py" pyproject.toml

# Check for in-memory alternatives (e.g., SQLite instead of Postgres)
grep -r "sqlite.*memory\|:memory:\|MockNeo4j\|FakeNeo4j" tests/ --include="*.py"

# Check for conftest fixtures that start services
grep -A 20 "@pytest.fixture" tests/conftest.py 2>/dev/null | grep -E "docker\|container\|start\|neo4j"

# Swift: Check for test containers
grep -r "Docker\|Container\|TestServer" Tests/ --include="*.swift"

Acceptable Patterns for Self-Contained Tests:

Pattern	Example	Assessment
testcontainers	`Neo4jContainer()` in fixture	✓ Best - automatic lifecycle
pytest-docker	`docker_compose_file` fixture	✓ Good - compose-based
conftest startup	Fixture runs `docker run neo4j`	✓ Acceptable - manual but works
In-memory mock	`MockNeo4jClient` class	✗ NOT acceptable - not persistent
External dependency	`pytest.skip("Neo4j not running")`	✗ NOT acceptable
CI-only tests	`@pytest.mark.skipif(not CI)`	✗ NOT acceptable
No integration tests	No tests for data layer	✗ NOT acceptable

Example: testcontainers Pattern (Python)

python

# conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer

@pytest.fixture(scope="session")
def neo4j_container():
    """Start Neo4j container for integration tests."""
    with Neo4jContainer("neo4j:5") as neo4j:
        yield neo4j

@pytest.fixture
def neo4j_client(neo4j_container):
    """Get client connected to test container."""
    return Neo4jClient(
        uri=neo4j_container.get_connection_url(),
        auth=("neo4j", "password")
    )

Example: pytest-docker Pattern

python

# conftest.py
import pytest

@pytest.fixture(scope="session")
def docker_compose_file():
    return "docker-compose.test.yml"

@pytest.fixture(scope="session")
def neo4j_service(docker_services):
    """Wait for Neo4j to be ready."""
    docker_services.wait_until_responsive(
        timeout=30.0,
        pause=0.5,
        check=lambda: is_neo4j_ready()
    )

Scoring Impact:

Integration Test Quality	Score Modifier
Self-contained (testcontainers/docker)	No penalty
In-memory mock (not persistent)	-10 points quality
Skips due to missing dependency	-10 points quality
No integration tests at all	-15 points quality

Constraints:

•You MUST check if integration tests are self-contained
•You MUST flag tests that skip due to external dependencies
•You MUST NOT accept "works on CI" as justification for skipping locally
•You SHOULD recommend testcontainers or pytest-docker patterns
•You SHOULD verify integration tests actually execute (not just exist)

Document in Report:

markdown

## Integration Test Quality

| Aspect | Status |
|--------|--------|
| Self-contained | Yes/No |
| Data store management | testcontainers / docker-compose / mock / external |
| Integration tests run | X passed, Y skipped |

⚠️ **Issue:** Integration tests skip when Neo4j is not running.
Tests should use testcontainers or pytest-docker to manage dependencies.

3d. Context Header Blocks (REQUIRED)

Every source code file MUST have a context header comment block that documents:

•Purpose - What the file/module does
•Interfaces - Key classes, functions, or APIs exposed
•Change History - Record of modifications (updated on every change)

Detection Commands:

bash

cd ./reviews/{attempt_repo}

# Python: Check for docstrings or header comments in source files
for f in $(find src -name "*.py" -not -name "__init__.py"); do
  echo "=== $f ==="
  head -50 "$f" | grep -E '""".*|^#.*Purpose|^#.*Context|CONTEXT BLOCK|Change History|Interfaces'
done

# Swift: Check for header comments
for f in $(find Sources -name "*.swift" 2>/dev/null); do
  echo "=== $f ==="
  head -50 "$f" | grep -E '///|/\*\*|Purpose|Context|History'
done

# Count files with context headers vs total
total=$(find src -name "*.py" -not -name "__init__.py" | wc -l)
with_header=$(find src -name "*.py" -not -name "__init__.py" -exec head -30 {} \; -exec echo "---" \; | grep -l "CONTEXT\|Purpose\|Module:" | wc -l)
echo "Files with headers: $with_header / $total"

Required Header Format (Python):

python

"""
================================================================================
CONTEXT BLOCK
================================================================================
File: {filename}
Module: {module.path}
Purpose: {one-line description}

Description:
    {detailed description of what this module does}

Interfaces:
    - {ClassName}: {brief description}
    - {function_name}(): {brief description}

Dependencies:
    - {module}: {why needed}

Change History:
    - {date}: {description of change}
    - {date}: Initial creation
================================================================================
"""

Required Header Format (Swift):

swift

//
//  {FileName}.swift
//  {ProjectName}
//
//  Purpose: {one-line description}
//
//  Interfaces:
//    - {ClassName}: {brief description}
//    - {functionName}(): {brief description}
//
//  Change History:
//    - {date}: {description of change}
//    - {date}: Initial creation
//

Assessment Criteria:

Coverage	Assessment	Score Impact
100% files have headers	Excellent	No penalty
75-99% files have headers	Good	-2 quality
50-74% files have headers	Partial	-5 quality
<50% files have headers	Poor	-10 quality

Constraints:

•You MUST check all source files for context headers
•You MUST verify headers include purpose, interfaces, and change history
•You MUST flag files missing headers for issue filing
•You SHOULD note which files have incomplete headers (missing sections)

Document in Report:

markdown

## Context Header Compliance

| Metric | Count |
|--------|-------|
| Source files | X |
| With headers | Y |
| Coverage | Z% |

### Files Missing Headers
- `src/module.py` - No header
- `src/utils.py` - Missing change history

### Assessment
{Excellent/Good/Partial/Poor} - {X}% coverage

4. Measure Code Metrics

Collect quantitative data about the implementation.

Constraints:

•You MUST capture: total lines of code, number of files, dependencies
•You SHOULD capture: cyclomatic complexity, test coverage
•You MAY capture: documentation coverage, type hint coverage

Python Metrics

bash

# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/src -name "*.py" | xargs wc -l

# Dependencies
cat ./reviews/{attempt_repo}/pyproject.toml | grep dependencies -A 50

# File count
find ./reviews/{attempt_repo}/src -name "*.py" | wc -l

Swift/iOS Metrics

bash

# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/Sources -name "*.swift" | xargs wc -l

# For Xcode projects
find ./reviews/{attempt_repo} -name "*.swift" -not -path "*/Tests/*" -not -path "*Test*" | xargs wc -l

# Dependencies (Swift Package Manager)
cat ./reviews/{attempt_repo}/Package.swift | grep -A 50 "dependencies:"

# Dependencies (CocoaPods)
cat ./reviews/{attempt_repo}/Podfile 2>/dev/null

# Dependencies (Xcode project - SPM)
grep -r "repositoryURL" ./reviews/{attempt_repo}/*.xcodeproj/project.pbxproj 2>/dev/null | head -20

# File count
find ./reviews/{attempt_repo}/Sources -name "*.swift" | wc -l

# Check for SwiftLint configuration
ls ./reviews/{attempt_repo}/.swiftlint.yml 2>/dev/null

5. Extract Git Metrics and Analyze Development Timeline

Analyze the development history to separate agent-driven work from human interactions.

Constraints:

•You MUST capture: total commits, time from first to last commit
•You MUST separate commits into agent-driven vs human-driven phases
•You MUST calculate autonomous duration (agent work only)
•You SHOULD capture: number of reverts, force pushes (if detectable)
•You SHOULD extract commit messages mentioning "fix", "revert", "oops"

5a. Gather Raw Git Data

bash

cd ./reviews/{attempt_repo}

# Full commit history with timestamps and messages
git log --format="%ai | %H | %s" --reverse

# Count total commits
git log --oneline | wc -l

# Find fix/revert commits
git log --format="%H %s" | grep -iE "(fix|revert|oops|wrong)"

# Get first and last commit times
git log --format="%ai" --reverse | head -1  # First commit
git log --format="%ai" | head -1             # Last commit

5b. Identify Development Phases

Analyze commit timestamps and messages to identify distinct phases:

Phase 1: Setup (Human)

•Initial commit, repo setup, file uploads
•Typically first 1-3 commits before implementation starts
•Look for: "Initial commit", "Add files", "upload", "setup"

Phase 2: Agent Implementation

•Bulk implementation work by the agent
•
Characterized by:
- •Rapid succession of commits (minutes apart)
- •Large code changes
- •Messages like "Implement", "Add", "Create"
- •Consistent commit patterns (same author, similar timing)

Phase 3: Agent Test Iteration

•Test fixing and iteration by the agent
•
Characterized by:
- •Commits mentioning "fix", "test", "pass"
- •Still rapid succession
- •Often shows progression: "Fix X" → "Fix Y" → "100% pass"

Phase 4: Human Intervention (Post-Completion)

•Human-driven changes after agent work completes
•
Characterized by:
- •Time gaps (hours/days after previous commits)
- •Different commit patterns or author info
- •Messages about data, documentation, cleanup
- •Changes not required by the spec

5c. Heuristics for Identifying Agent vs Human Commits

Agent commits typically show:

•Timestamps within minutes of each other
•Consistent formatting in commit messages
•Co-authored-by lines mentioning Claude/AI
•Large, comprehensive changes
•Focus on implementation and tests

Human commits typically show:

•Time gaps of hours or days from previous work
•Different commit message style
•Focus on data, docs, or polish
•Smaller, targeted changes
•Work done after "100% tests pass" milestone

bash

# Look for time gaps > 1 hour between commits (potential phase boundaries)
git log --format="%ai" --reverse | while read ts; do echo "$ts"; done

# Check for co-author lines indicating AI
git log --format="%b" | grep -i "co-authored"

# Check prompts.txt for session boundaries
cat prompts.txt 2>/dev/null | grep -E "^(Done|Session|Agent)"

5d. Calculate Duration Metrics

Metric	How to Calculate
Total Duration	Last commit - First commit
Agent Duration	Sum of time during agent phases only
Human Duration	Sum of time during human phases
Autonomous Duration	Phase 2 + Phase 3 (implementation + test fixing)

Example Timeline Analysis:

code

09:00:00 - Initial commit (Human Setup)
09:05:00 - Add spec file (Human Setup)
         --- Agent work begins ---
09:15:00 - Implement Phase 1 (Agent)
09:45:00 - Implement Phase 2 (Agent)
10:10:00 - Implement Phase 3 (Agent)
10:25:00 - Fix test issues (Agent)
10:40:00 - 100% tests pass (Agent)
         --- Agent work ends ---
         --- 2 day gap ---
Oct 3     - Add real data (Human)
Oct 3     - Update docs (Human)

Agent Duration: ~1h 25m (09:15 → 10:40)
Human Duration: ~5m setup + later changes
Autonomous Duration: ~1h 25m

5e. Document in Report

Include a Development Timeline section in the report:

markdown

## Development Duration Breakdown

| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | ~5 min | Initial commit, file upload |
| **Phase 1: Implementation** | ~55 min | Agent implements all phases |
| **Phase 2: Test Fixing** | ~30 min | Agent iterates to 100% pass |
| **Total Autonomous** | **~1h 25m** | Agent work only |
| **Phase 3: Human Intervention** | 2 days later | Data and docs added |

### Commit Analysis
- Total commits: 15
- Agent commits: 10 (09:15 - 10:40 on Day 1)
- Human commits: 5 (setup + Day 3 changes)
- Fix commits: 3 (normal iteration, not rework)

6. Analyze Against Spec

Review implementation completeness against spec.md requirements.

Constraints:

•You MUST evaluate against ALL 16 canonical requirements listed below
•You MUST assess each as: implemented, partial, missing
•You SHOULD note implementation approach for each
•You MUST NOT make subjective quality judgments beyond spec compliance
•You MUST use the exact requirement numbering for cross-attempt consistency

6.0 Canonical Requirements Checklist (16 Requirements)

All evaluations MUST use this exact checklist to ensure consistency across attempts.

Functional Requirements (6):

•[FR-1] Search and return match data from all CSV files
•[FR-2] Search and return player data
•[FR-3] Calculate basic statistics (wins, losses, goals)
•[FR-4] Compare teams head-to-head
•[FR-5] Handle team name variations correctly
•[FR-6] Return properly formatted responses

Query Performance (3): 7. [QP-1] Simple lookups respond in < 2 seconds 8. [QP-2] Aggregate queries respond in < 5 seconds 9. [QP-3] No timeout errors

Data Coverage (3): 10. [DC-1] All 6 CSV files are loadable and queryable 11. [DC-2] At least 20 sample questions can be answered 12. [DC-3] Cross-file queries work (player + match data)

Technical Requirements (4): 13. [TR-1] MCP server implementation with callable tools 14. [TR-2] BDD testing with Given-When-Then structure 15. [TR-3] UTF-8 encoding support (Portuguese characters: ã, ç, é, etc.) 16. [TR-4] Multiple date format handling (ISO, Brazilian DD/MM/YYYY, with time)

Report Format for Requirements:

markdown

## Requirements Checklist

### Functional Requirements (X/6)
- [x] [FR-1] Search and return match data from all CSV files
- [x] [FR-2] Search and return player data
- [ ] [FR-3] Calculate basic statistics (partial: missing draws)
...

### Query Performance (X/3)
- [x] [QP-1] Simple lookups respond in < 2 seconds
...

### Data Coverage (X/3)
- [x] [DC-1] All 6 CSV files are loadable and queryable
...

### Technical Requirements (X/4)
- [x] [TR-1] MCP server implementation with callable tools
- [x] [TR-2] BDD testing with Given-When-Then structure
...

**Total: X/16 requirements implemented**

6a. Real Data vs Simulated Data Assessment

Determine whether the implementation uses real external data or simulated/mock data.

Real Data Indicators:

•Data loaders for external sources (Kaggle, APIs, etc.)
•CSV/JSON files in data directory
•API client code with authentication
•Data normalization/mapping logic for external schemas

Simulated Data Indicators:

•Hardcoded test fixtures
•Factory/faker-generated data
•Mock data in test files only
•No external data loading code

Constraints for Real Data Implementations:

•You MUST note which external data source is used
•You MUST assess schema mapping quality (how well does the implementation adapt external schema to spec schema)
•
You MUST distinguish between:
- •Schema Implemented: The code defines models matching spec entities
- •Data Populated: The data loader can populate those fields from external source
- •Not Available in Source: Spec field cannot be populated because external data doesn't include it
•You SHOULD credit implementations that adapt to real-world data constraints
•You SHOULD note any enhancements beyond spec (e.g., additional fields from richer data sources)

Adjusted Compliance Scoring:

•
If real data is used and a spec field is "Not Available in Source", count it as:
- •Implemented if the model/schema supports the field
- •Note the data limitation separately
•
Example: If spec requires "attendance" but Kaggle data has no attendance:
- •Check if Match model has attendance field (schema compliance)
- •Note that field would be null with Kaggle data (data limitation)
- •This is NOT a failure - it's a data source constraint

6b. Documentation Quality Assessment

Evaluate the README.md for essential user documentation.

Required Elements:

•Setup Instructions: Prerequisites, installation steps, environment configuration
•MCP Server Setup: How to start the server, how to connect Claude
•Example Q&A: Sample questions and expected responses/output

Extraction Commands:

bash

# Check README content
head -100 ./reviews/{attempt_repo}/README.md

# Look for key documentation sections
grep -E "Quick Start|Installation|Setup|MCP|Example|Usage" ./reviews/{attempt_repo}/README.md

Documentation Quality Levels:

Level	Criteria	In Report
Excellent	All 3 elements + extras (architecture, API ref, troubleshooting)	"Comprehensive README"
Good	All 3 required elements present	"Good documentation"
Acceptable	2 of 3 elements	"Partial documentation"
Poor	0-1 elements	"Missing documentation"

Best Practice Reference:

•2025-10-30-python-hive: Excellent (Quick Start, MCP config, 15+ demo questions, architecture, troubleshooting)
•2025-12-15-python-claude-ruvector: Excellent (detailed setup, claude mcp add example, Q&A with output)

Include in Report:

markdown

## Documentation Quality

| Element | Present | Notes |
|---------|---------|-------|
| Setup Instructions | Yes/No | {details} |
| MCP Server Setup | Yes/No | {details} |
| Example Q&A | Yes/No | {details} |

**Assessment:** {Excellent/Good/Acceptable/Poor}

7. Generate Codebase Documentation

Generate comprehensive documentation for the implementation using the codebase-summary SOP.

Constraints:

•You MUST run the codebase-summary skill on the cloned repository
•You MUST output documentation to {output_dir}/{attempt_repo}-summary/
•You SHOULD use the generated documentation to inform the final report
•The documentation provides architecture, components, interfaces, and workflow analysis

code

summarize codebase reviews/{attempt_repo} to {output_dir}/{attempt_repo}-summary/

8. Generate Report

Produce structured evaluation output.

Constraints:

•You MUST write results to {output_dir}/{attempt_repo}.md
•You MUST include: attempt name, orchestration pattern, all metrics
•You MUST use consistent format for cross-attempt comparison
•You SHOULD include raw data as appendix

Output Format

markdown

# Evaluation: {attempt_repo}

## Summary
- **Pattern:** [swarm|hive|solo|...]
- **Spec Compliance:** X/Y requirements
- **Tests:** X passed, Y skipped, Z failed (X effective)
- **Autonomous Duration:** Xh Ym
- **Documentation:** See `{attempt_repo}-summary/`

## Metrics
| Metric | Value |
|--------|-------|
| Lines of Code | |
| Files | |
| Dependencies | |
| Commits (Total) | |
| Commits (Agent) | |
| Commits (Human) | |
| Fix Commits | |
| Tests (Total) | |
| Tests (Passed) | |
| Tests (Skipped) | |
| Tests (Effective) | |
| Skip Ratio | |

## Development Duration Breakdown

| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | | Initial commit, file upload |
| **Agent Implementation** | | Core implementation work |
| **Agent Test Iteration** | | Test fixing to 100% pass |
| **Total Autonomous** | | Agent work only |
| **Human Intervention** | | Post-completion changes |

### Timeline

{timestamp} - {commit message} ({phase}) ...

code


### Commit Analysis
- Total commits: X
- Agent commits: X (timespan)
- Human commits: X (description)
- Fix commits: X (context: normal iteration vs rework)

## Requirements Checklist
- [x] Requirement 1
- [ ] Requirement 2 (partial: notes)
- [ ] Requirement 3 (missing)

## Architecture Summary
(Key insights from generated codebase documentation)

## Raw Data
...

Troubleshooting

Clone fails

•Verify repo exists: gh repo view brazil-bench/{attempt_repo}
•Check permissions: repo must be public or you need access

Tests won't run due to missing dependencies

•Try starting Neo4j via Docker (see Step 3a above)
•If Docker unavailable, search for evidence of prior test runs
•Check git commits for "100% pass" or similar messages
•Check prompts.txt for pytest output
•Document as "CANNOT VERIFY" with evidence found

Neo4j connection errors

•Verify Neo4j is running: docker ps | grep neo4j
•Check credentials match: NEO4J_AUTH=neo4j/password
•Wait for startup: Neo4j needs ~10-15 seconds to initialize
•Check logs: docker logs neo4j-eval

Spec diff shows changes

•Fail the evaluation
•Note the changes in the report
•This invalidates the benchmark comparison

Codebase documentation fails

•Verify the codebase-summary skill is available
•Check that the codebase-path exists and contains code
•Ensure the output directory is writable
•Try running the skill standalone first to debug