Skill: Containerize E2E Experiments
Overview
| Attribute | Value |
|---|---|
| Date | 2026-01-15 |
| Objective | Containerize E2E experiment execution for isolation, consistency, and multi-experiment workflows |
| Outcome | ✅ SUCCESS - Single-container architecture with credential mounting, interactive shell, Python 3.14.2 |
| Duration | ~4 hours (multiple iterations) |
| Final Status | Production-ready with 84% test pass rate |
When to Use This Skill
Use this approach when you need to:
- •✅ Isolate experiment execution from host environment
- •✅ Run multiple experiments in the same container session
- •✅ Upgrade Python versions without affecting host system
- •✅ Ensure reproducible results across different machines
- •✅ Simplify credential management for Claude Code CLI
- •✅ Avoid nested containers (previous complex architecture)
Trigger Conditions:
- •User requests container execution for experiments
- •User wants to run experiments with different Python versions
- •User needs isolated environment for testing
- •User wants to run multiple experiments without container restart overhead
Verified Workflow
Phase 1: Architecture Decision
What worked:
❌ OLD: Host → Container per agent → Container per judge (nested) ✅ NEW: Host → Single container → All agents + judges (simple)
Key insight: Run the entire run_e2e_experiment.py inside one container, not orchestrate multiple containers from host.
Benefits:
- •No container startup overhead per execution
- •Simpler credential management (mount once)
- •Better debugging (everything in one place)
- •Easier resource control
Phase 2: Python Version Upgrade
Dockerfile change:
# OLD FROM python:3.10-slim # NEW FROM python:3.14.2-slim
Compatibility fixes required:
# OLD (Python 3.11+) from datetime import UTC, datetime datetime.now(UTC) # NEW (Python 3.10+ compatible, works on 3.14.2) from datetime import datetime, timezone datetime.now(timezone.utc)
Files updated: 22 files across:
- •
scylla/e2e/*.py(8 files) - •
scylla/adapters/*.py(4 files) - •
scylla/executor/*.py(2 files) - •
scylla/reporting/*.py(4 files) - •
scylla/cli/main.py+orchestrator.py+ test files
Phase 3: Credential Mounting (Critical!)
Problem: Claude Code credentials at ~/.claude/.credentials.json have restrictive permissions (600) and container user can't read them.
Solution - Two-step mounting:
- •Wrapper script (
scripts/run_experiment_in_container.sh):
# Create temp directory with proper permissions
TEMP_CREDS_DIR="${PROJECT_DIR}/.tmp-container-creds"
mkdir -p "${TEMP_CREDS_DIR}"
# Copy with world-readable permissions
cp "${CREDS_FILE}" "${TEMP_CREDS_DIR}/.credentials.json"
chmod 644 "${TEMP_CREDS_DIR}/.credentials.json"
# Mount to container
-v "${TEMP_CREDS_DIR}:/tmp/host-creds:ro"
# Cleanup on exit
trap "rm -rf ${TEMP_CREDS_DIR}" EXIT
- •Entrypoint script (
docker/entrypoint.sh):
ensure_clean_claude_environment() {
mkdir -p "${HOME}/.claude"
chmod 700 "${HOME}/.claude"
if [[ -f "/tmp/host-creds/.credentials.json" ]]; then
cp "/tmp/host-creds/.credentials.json" "${HOME}/.claude/.credentials.json"
chmod 600 "${HOME}/.claude/.credentials.json"
log_info "Copied credentials to ${HOME}/.claude/.credentials.json"
fi
}
Critical: Must call ensure_clean_claude_environment() for both bash shell and python script launches!
Phase 4: Interactive Container Script
New script: scripts/launch_container_shell.sh
Purpose: Start persistent container for running multiple experiments without restart overhead.
Key features:
# Launch interactive bash shell
docker run --rm -it \
--name "${CONTAINER_NAME}" \
--workdir /workspace \
-v "${PROJECT_DIR}:/workspace" \
-v "${TEMP_CREDS_DIR}:/tmp/host-creds:ro" \
"${IMAGE_NAME}" \
bash
Welcome message on startup:
==========================================
ProjectScylla Container Shell
==========================================
Credentials: /home/scylla/.claude/.credentials.json
✓ Credentials found
Run experiments:
python scripts/run_e2e_experiment.py \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 --runs 1 -v
==========================================
Phase 5: Permission Fixes
Results directory:
# In wrapper script
mkdir -p "${PROJECT_DIR}/results"
chmod 777 "${PROJECT_DIR}/results"
Output files from container:
# In entrypoint (for legacy agent/judge container modes) chmod 666 /output/result.json /output/stdout.log /output/stderr.log
Failed Attempts (Lessons Learned)
❌ Attempt 1: Nested Container Architecture
What we tried:
- •Host runs
run_e2e_experiment.py - •Each agent execution spawns a Docker container
- •Each judge execution spawns a Docker container
Why it failed:
- •Complex credential mounting (per-container setup)
- •Permission conflicts between host and containers
- •Container startup overhead for every execution
- •Difficult to debug (nested logs)
- •150+ lines of container orchestration code
Time wasted: ~2 hours
Lesson: Simpler is better. Run everything in one container.
❌ Attempt 2: Direct Directory Mount for Credentials
What we tried:
# Mount entire .claude directory
-v "${HOME}/.claude:/home/scylla/.claude:ro"
Why it failed:
Permission denied: cannot open directory '/home/scylla/.claude/'
Root cause: Host .claude directory has 700 permissions (owner-only), container user (UID 999) can't read it.
Time wasted: 30 minutes
Lesson: Can't directly mount restrictive directories. Need intermediate temp directory with relaxed permissions.
❌ Attempt 3: Skipping Credential Setup for Bash Shell
What we tried:
# In entrypoint.sh
bash|sh)
# Just launch shell, skip credential setup
exec "$@"
;;
Why it failed:
- •User launches container
- •Credentials not set up
- •
claudecommand fails with auth error - •User has to manually run
claude authevery time
Time wasted: 15 minutes (quick fix)
Lesson: ALWAYS call ensure_clean_claude_environment() before launching bash shell or python scripts.
❌ Attempt 4: Python 3.10 with datetime.UTC
What we tried:
- •Use Python 3.10 container
- •Keep
from datetime import UTCin code
Why it failed:
ImportError: cannot import name 'UTC' from 'datetime'
Root cause: UTC only exists in Python 3.11+
Fix required: Update 22 files to use timezone.utc instead
Time wasted: 1 hour (finding all occurrences)
Lesson: Check Python version compatibility before building container. Use timezone.utc for Python 3.10+ compatibility.
Results & Verified Parameters
Final Test Results
Experiment: test-001 (Hello World task)
| Metric | Value |
|---|---|
| Score | 0.840 (84%) |
| Grade | A |
| Status | ✅ PASS |
| Cost | $0.1087 |
| Duration | 256s (4.3 min) |
| Agent Time | 133s |
| Judge Time | 118s |
| Cache Efficiency | 99.97% (25,002 cached / 25,010 total input tokens) |
Functional requirements: 100% achieved
Code quality: 100% achieved
Proportionality: 85.7% (deduction for __pycache__ cleanup)
Overall quality: 85%
Container Specifications
Image: scylla-runner:latest
FROM python:3.14.2-slim # Includes: # - Node.js 20.x LTS # - Claude Code CLI # - Git, make, build tools # - Scylla package (installed from source)
User: scylla (UID 999)
Working Directory: /workspace
Entrypoint: /entrypoint.sh
Wrapper Scripts
Single experiment (auto-exit):
./scripts/run_experiment_in_container.sh \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 --runs 1 -v
Interactive shell (multiple experiments):
./scripts/launch_container_shell.sh [container-name] # Inside container: python scripts/run_e2e_experiment.py --tiers-dir ... --tiers T0 --runs 1 -v python scripts/run_e2e_experiment.py --tiers-dir ... --tiers T1 --runs 5 -v exit
Volume Mounts
| Source | Target | Mode | Purpose |
|---|---|---|---|
${PROJECT_DIR} | /workspace | rw | Code and results |
${TEMP_CREDS_DIR} | /tmp/host-creds | ro | Credentials (intermediate) |
Key Configuration Files
Created:
- •
docker/Dockerfile- Python 3.14.2 image - •
docker/entrypoint.sh- Credential setup + welcome message - •
scripts/run_experiment_in_container.sh- Single experiment runner - •
scripts/launch_container_shell.sh- Interactive shell launcher - •
docs/container-usage.md- Usage guide - •
docs/container-authentication.md- Auth troubleshooting
Modified:
- •22 Python files:
datetime.UTC→datetime.timezone.utc - •
scylla/e2e/models.py:use_containers = True→False(deprecated) - •
scylla/e2e/subtest_executor.py: Disabled nested container execution
Key Takeaways
Architecture Patterns
- •Single container > Nested containers - Simpler, faster, easier to debug
- •Mount credentials via temp directory - Avoids permission conflicts
- •Always setup credentials before shell - Call
ensure_clean_claude_environment()for all entry points - •Interactive shell for iteration - Avoid container restart overhead
Technical Gotchas
- •
datetime.UTCrequires Python 3.11+ - Usetimezone.utcfor 3.10+ compatibility - •Docker creates directories for missing file mounts - Use directory mount with file inside
- •Host
.claudepermissions block container access - Use intermediate temp directory - •Entrypoint case statement order matters -
bash|shmust call credential setup - •
chmod 777on results directory - Container user needs write access
Performance Wins
- •Cache efficiency: 99.97% input token cache hit rate
- •No container overhead: Single container vs. N containers per experiment
- •Fast iteration: Interactive shell allows instant re-runs
- •Cost effective: $0.11 per experiment run with excellent caching
Usage Examples
Quick Test
./scripts/run_experiment_in_container.sh \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 --runs 1 --max-subtests 1 -v
Iterative Development
./scripts/launch_container_shell.sh dev-session # Inside container: python scripts/run_e2e_experiment.py --tiers-dir tests/fixtures/tests/test-001 --tiers T0 --runs 1 -v # Review results... python scripts/run_e2e_experiment.py --tiers-dir tests/fixtures/tests/test-001 --tiers T0 --runs 1 --fresh -v # Review results... exit
Full Evaluation
./scripts/run_experiment_in_container.sh \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 T1 T2 T3 T4 T5 T6 \
--runs 10 \
-v
Related Documentation
Tags
#docker #containers #credentials #authentication #python-upgrade #interactive-shell #e2e-testing #isolation