AgentSkillsCN

Containerize E2e Experiments

容器化端到端实验

SKILL.md

Skill: Containerize E2E Experiments

Overview

AttributeValue
Date2026-01-15
ObjectiveContainerize E2E experiment execution for isolation, consistency, and multi-experiment workflows
Outcome✅ SUCCESS - Single-container architecture with credential mounting, interactive shell, Python 3.14.2
Duration~4 hours (multiple iterations)
Final StatusProduction-ready with 84% test pass rate

When to Use This Skill

Use this approach when you need to:

  • Isolate experiment execution from host environment
  • Run multiple experiments in the same container session
  • Upgrade Python versions without affecting host system
  • Ensure reproducible results across different machines
  • Simplify credential management for Claude Code CLI
  • Avoid nested containers (previous complex architecture)

Trigger Conditions:

  • User requests container execution for experiments
  • User wants to run experiments with different Python versions
  • User needs isolated environment for testing
  • User wants to run multiple experiments without container restart overhead

Verified Workflow

Phase 1: Architecture Decision

What worked:

code
❌ OLD: Host → Container per agent → Container per judge (nested)
✅ NEW: Host → Single container → All agents + judges (simple)

Key insight: Run the entire run_e2e_experiment.py inside one container, not orchestrate multiple containers from host.

Benefits:

  • No container startup overhead per execution
  • Simpler credential management (mount once)
  • Better debugging (everything in one place)
  • Easier resource control

Phase 2: Python Version Upgrade

Dockerfile change:

dockerfile
# OLD
FROM python:3.10-slim

# NEW
FROM python:3.14.2-slim

Compatibility fixes required:

python
# OLD (Python 3.11+)
from datetime import UTC, datetime
datetime.now(UTC)

# NEW (Python 3.10+ compatible, works on 3.14.2)
from datetime import datetime, timezone
datetime.now(timezone.utc)

Files updated: 22 files across:

  • scylla/e2e/*.py (8 files)
  • scylla/adapters/*.py (4 files)
  • scylla/executor/*.py (2 files)
  • scylla/reporting/*.py (4 files)
  • scylla/cli/main.py + orchestrator.py + test files

Phase 3: Credential Mounting (Critical!)

Problem: Claude Code credentials at ~/.claude/.credentials.json have restrictive permissions (600) and container user can't read them.

Solution - Two-step mounting:

  1. Wrapper script (scripts/run_experiment_in_container.sh):
bash
# Create temp directory with proper permissions
TEMP_CREDS_DIR="${PROJECT_DIR}/.tmp-container-creds"
mkdir -p "${TEMP_CREDS_DIR}"

# Copy with world-readable permissions
cp "${CREDS_FILE}" "${TEMP_CREDS_DIR}/.credentials.json"
chmod 644 "${TEMP_CREDS_DIR}/.credentials.json"

# Mount to container
-v "${TEMP_CREDS_DIR}:/tmp/host-creds:ro"

# Cleanup on exit
trap "rm -rf ${TEMP_CREDS_DIR}" EXIT
  1. Entrypoint script (docker/entrypoint.sh):
bash
ensure_clean_claude_environment() {
    mkdir -p "${HOME}/.claude"
    chmod 700 "${HOME}/.claude"

    if [[ -f "/tmp/host-creds/.credentials.json" ]]; then
        cp "/tmp/host-creds/.credentials.json" "${HOME}/.claude/.credentials.json"
        chmod 600 "${HOME}/.claude/.credentials.json"
        log_info "Copied credentials to ${HOME}/.claude/.credentials.json"
    fi
}

Critical: Must call ensure_clean_claude_environment() for both bash shell and python script launches!

Phase 4: Interactive Container Script

New script: scripts/launch_container_shell.sh

Purpose: Start persistent container for running multiple experiments without restart overhead.

Key features:

bash
# Launch interactive bash shell
docker run --rm -it \
    --name "${CONTAINER_NAME}" \
    --workdir /workspace \
    -v "${PROJECT_DIR}:/workspace" \
    -v "${TEMP_CREDS_DIR}:/tmp/host-creds:ro" \
    "${IMAGE_NAME}" \
    bash

Welcome message on startup:

code
==========================================
ProjectScylla Container Shell
==========================================
Credentials: /home/scylla/.claude/.credentials.json
  ✓ Credentials found

Run experiments:
  python scripts/run_e2e_experiment.py \
    --tiers-dir tests/fixtures/tests/test-001 \
    --tiers T0 --runs 1 -v
==========================================

Phase 5: Permission Fixes

Results directory:

bash
# In wrapper script
mkdir -p "${PROJECT_DIR}/results"
chmod 777 "${PROJECT_DIR}/results"

Output files from container:

bash
# In entrypoint (for legacy agent/judge container modes)
chmod 666 /output/result.json /output/stdout.log /output/stderr.log

Failed Attempts (Lessons Learned)

❌ Attempt 1: Nested Container Architecture

What we tried:

  • Host runs run_e2e_experiment.py
  • Each agent execution spawns a Docker container
  • Each judge execution spawns a Docker container

Why it failed:

  • Complex credential mounting (per-container setup)
  • Permission conflicts between host and containers
  • Container startup overhead for every execution
  • Difficult to debug (nested logs)
  • 150+ lines of container orchestration code

Time wasted: ~2 hours

Lesson: Simpler is better. Run everything in one container.

❌ Attempt 2: Direct Directory Mount for Credentials

What we tried:

bash
# Mount entire .claude directory
-v "${HOME}/.claude:/home/scylla/.claude:ro"

Why it failed:

code
Permission denied: cannot open directory '/home/scylla/.claude/'

Root cause: Host .claude directory has 700 permissions (owner-only), container user (UID 999) can't read it.

Time wasted: 30 minutes

Lesson: Can't directly mount restrictive directories. Need intermediate temp directory with relaxed permissions.

❌ Attempt 3: Skipping Credential Setup for Bash Shell

What we tried:

bash
# In entrypoint.sh
bash|sh)
    # Just launch shell, skip credential setup
    exec "$@"
    ;;

Why it failed:

  • User launches container
  • Credentials not set up
  • claude command fails with auth error
  • User has to manually run claude auth every time

Time wasted: 15 minutes (quick fix)

Lesson: ALWAYS call ensure_clean_claude_environment() before launching bash shell or python scripts.

❌ Attempt 4: Python 3.10 with datetime.UTC

What we tried:

  • Use Python 3.10 container
  • Keep from datetime import UTC in code

Why it failed:

python
ImportError: cannot import name 'UTC' from 'datetime'

Root cause: UTC only exists in Python 3.11+

Fix required: Update 22 files to use timezone.utc instead

Time wasted: 1 hour (finding all occurrences)

Lesson: Check Python version compatibility before building container. Use timezone.utc for Python 3.10+ compatibility.

Results & Verified Parameters

Final Test Results

Experiment: test-001 (Hello World task)

MetricValue
Score0.840 (84%)
GradeA
Status✅ PASS
Cost$0.1087
Duration256s (4.3 min)
Agent Time133s
Judge Time118s
Cache Efficiency99.97% (25,002 cached / 25,010 total input tokens)

Functional requirements: 100% achieved Code quality: 100% achieved Proportionality: 85.7% (deduction for __pycache__ cleanup) Overall quality: 85%

Container Specifications

Image: scylla-runner:latest

dockerfile
FROM python:3.14.2-slim
# Includes:
# - Node.js 20.x LTS
# - Claude Code CLI
# - Git, make, build tools
# - Scylla package (installed from source)

User: scylla (UID 999) Working Directory: /workspace Entrypoint: /entrypoint.sh

Wrapper Scripts

Single experiment (auto-exit):

bash
./scripts/run_experiment_in_container.sh \
    --tiers-dir tests/fixtures/tests/test-001 \
    --tiers T0 --runs 1 -v

Interactive shell (multiple experiments):

bash
./scripts/launch_container_shell.sh [container-name]

# Inside container:
python scripts/run_e2e_experiment.py --tiers-dir ... --tiers T0 --runs 1 -v
python scripts/run_e2e_experiment.py --tiers-dir ... --tiers T1 --runs 5 -v
exit

Volume Mounts

SourceTargetModePurpose
${PROJECT_DIR}/workspacerwCode and results
${TEMP_CREDS_DIR}/tmp/host-credsroCredentials (intermediate)

Key Configuration Files

Created:

  • docker/Dockerfile - Python 3.14.2 image
  • docker/entrypoint.sh - Credential setup + welcome message
  • scripts/run_experiment_in_container.sh - Single experiment runner
  • scripts/launch_container_shell.sh - Interactive shell launcher
  • docs/container-usage.md - Usage guide
  • docs/container-authentication.md - Auth troubleshooting

Modified:

  • 22 Python files: datetime.UTCdatetime.timezone.utc
  • scylla/e2e/models.py: use_containers = TrueFalse (deprecated)
  • scylla/e2e/subtest_executor.py: Disabled nested container execution

Key Takeaways

Architecture Patterns

  1. Single container > Nested containers - Simpler, faster, easier to debug
  2. Mount credentials via temp directory - Avoids permission conflicts
  3. Always setup credentials before shell - Call ensure_clean_claude_environment() for all entry points
  4. Interactive shell for iteration - Avoid container restart overhead

Technical Gotchas

  1. datetime.UTC requires Python 3.11+ - Use timezone.utc for 3.10+ compatibility
  2. Docker creates directories for missing file mounts - Use directory mount with file inside
  3. Host .claude permissions block container access - Use intermediate temp directory
  4. Entrypoint case statement order matters - bash|sh must call credential setup
  5. chmod 777 on results directory - Container user needs write access

Performance Wins

  1. Cache efficiency: 99.97% input token cache hit rate
  2. No container overhead: Single container vs. N containers per experiment
  3. Fast iteration: Interactive shell allows instant re-runs
  4. Cost effective: $0.11 per experiment run with excellent caching

Usage Examples

Quick Test

bash
./scripts/run_experiment_in_container.sh \
    --tiers-dir tests/fixtures/tests/test-001 \
    --tiers T0 --runs 1 --max-subtests 1 -v

Iterative Development

bash
./scripts/launch_container_shell.sh dev-session

# Inside container:
python scripts/run_e2e_experiment.py --tiers-dir tests/fixtures/tests/test-001 --tiers T0 --runs 1 -v
# Review results...

python scripts/run_e2e_experiment.py --tiers-dir tests/fixtures/tests/test-001 --tiers T0 --runs 1 --fresh -v
# Review results...

exit

Full Evaluation

bash
./scripts/run_experiment_in_container.sh \
    --tiers-dir tests/fixtures/tests/test-001 \
    --tiers T0 T1 T2 T3 T4 T5 T6 \
    --runs 10 \
    -v

Related Documentation

Tags

#docker #containers #credentials #authentication #python-upgrade #interactive-shell #e2e-testing #isolation