openenv-benchmark

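Run OpenEnv scaling and concurrency benchmark experiments. Use this skill when deploying benchmark infrastructure (local uvicorn, local Docker, HF Spaces, SLURM single-node, SLURM multi-node), running test_scaling.py tests, or analyzing experiment results. Triggers include requests to start a benchmark, test scaling, measure concurrency, compare HTTP and WebSocket performance, or review experiment reports.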

SKILL.md
--- frontmatter
name: openenv-benchmark
description: Run OpenEnv scaling and concurrency benchmark experiments. Use when deploying benchmark infrastructure (local uvicorn, local docker, HF Spaces, SLURM single-node, SLURM multi-node), running test_scaling.py tests, or analyzing experiment results. Triggers on requests to benchmark, test scaling, measure concurrency, compare HTTP vs WebSocket performance, or review experiment reports.

OpenEnv Benchmark Experiments

Run scaling experiments to measure maximum concurrent batch sizes across infrastructure options.

Workflow Overview

  1. Deploy infrastructure (choose one: local-uvicorn, local-docker, hf-spaces, slurm-single, slurm-multi)
  2. Run scaling tests with tests/test_scaling.py
  3. Analyze results with experiments/scripts/analyze_results.py

Step 1: Deploy Infrastructure

Prerequisites

bash
pip install -e .  # or pip install -e ".[analysis]" for matplotlib
python -c "from benchmark.server.app import app; print('OK')"

Infrastructure Options

| Infrastructure | Deploy Command | URL | Max Batch |
|---|---|---|---|
| local-uvicorn | `./deploy/local/run_uvicorn.sh` | `http://localhost:8000` | 64-128 |
| local-docker | `./deploy/local/run_docker.sh` | `http://localhost:8000` | 64-128 |
| hf-spaces | `./deploy/hf_spaces/deploy.sh --repo-id USER/openenv-benchmark` | `https://USER-openenv-benchmark.hf.space` | 10-32 |
| slurm-single | `sbatch deploy/slurm/serve_single.sh` | `http://${SLURM_NODE_IP}:8000` | 128-256 |
| slurm-multi | `./deploy/slurm/alloc.sh` then `./deploy/slurm/serve_multi.sh` | `http://${ENVOY_IP}:8000` | 256-512 |

Deploy Commands

Local Uvicorn (configurable workers):

bash
WORKERS=8 PORT=8000 MAX_CONCURRENT_ENVS=200 ./deploy/local/run_uvicorn.sh

Local Docker:

bash
./deploy/local/run_docker.sh
# Or manually: docker run -d --name openenv-benchmark -p 8000:8000 -e WORKERS=4 openenv-benchmark:latest

HF Spaces:

bash
export HF_USER="your-username"
./deploy/hf_spaces/deploy.sh --repo-id ${HF_USER}/openenv-benchmark
# Wake up before testing:
curl https://${HF_USER}-openenv-benchmark.hf.space/health

SLURM Single Node:

bash
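# --parsable makes sbatch print only the job ID, so it can be captured directly
# (more reliable than taking the first job that squeue lists for the user)
export JOB_ID=$(sbatch --parsable deploy/slurm/serve_single.sh)
# %N is the allocated node's hostname; it stays empty until the job is running
export SLURM_NODE_IP=$(squeue -j "$JOB_ID" -h -o "%N")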
# Wait for server:
while ! curl -s http://${SLURM_NODE_IP}:8000/health > /dev/null 2>&1; do sleep 5; done

SLURM Multi-Node (with Envoy load balancer):

bash
WORKERS=4 CPUS_PER_WORKER=4 ./deploy/slurm/alloc.sh  # Opens interactive shell
./deploy/slurm/serve_multi.sh
source openenv-connection.env
echo "URL: $OPENENV_URL"

Verify Deployment

bash
curl http://localhost:8000/health
python tests/test_scaling.py --url http://localhost:8000 -n 5 -w 0.5
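
The same readiness check can be scripted rather than re-run by hand. The following is a minimal sketch and not part of the repository: `http_probe` and `wait_until_healthy` are hypothetical helper names, and only the `/health` endpoint is taken from the commands above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts, and HTTP error codes.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    max_wait_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() every `interval_s` seconds until it succeeds or `max_wait_s` elapses."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until_healthy(lambda: http_probe("http://localhost:8000/health"))` blocks until the local server responds or five minutes pass. Unlike an unbounded shell loop, it gives up after `max_wait_s`, which helps when a deployment never comes up.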

Step 2: Run Scaling Tests

test_scaling.py CLI Reference

| Option | Default | Description |
|---|---|---|
| `--url`, `-u` | `http://localhost:8000` | Server URL |
| `--requests`, `-n` | 10 | Concurrent requests (batch size) |
| `--wait`, `-w` | 1.0 | Wait time per request (seconds) |
| `--mode`, `-m` | ws | Test mode: http or ws |
| `--requests-grid` | - | Comma-separated batch sizes for grid sweep |
| `--wait-grid` | - | Comma-separated wait times for grid sweep |
| `--reps` | 1 | Repetitions per configuration |
| `--compare` | false | Run both HTTP and WebSocket |
| `--output-dir`, `-o` | - | Output directory for JSONL/CSV |
| `--timeout`, `-t` | 120.0 | Timeout per request |
Standard Experiment

Full grid sweep comparing HTTP vs WebSocket:

bash
python tests/test_scaling.py \
    --url http://localhost:8000 \
    --requests-grid 1,2,4,8,16,32,64,128 \
    --wait-grid 0.1,1.0,5.0 \
    --reps 3 \
    --compare \
    --output-dir experiments/results/local-uvicorn/$(date +%Y-%m-%d)
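
To get a feel for the size of this sweep before starting it, the grid can be enumerated offline. The sketch below assumes that `--compare` runs every configuration once per protocol and that each run takes at least its wait time. Both are assumptions about test_scaling.py rather than documented behaviour, so the time estimate is only a rough lower bound.

```python
from itertools import product
from typing import List, Sequence, Tuple

Run = Tuple[str, int, float, int]  # (mode, batch size, wait seconds, repetition)


def enumerate_runs(
    requests_grid: Sequence[int],
    wait_grid: Sequence[float],
    reps: int,
    modes: Sequence[str] = ("http", "ws"),
) -> List[Run]:
    """List every (mode, batch size, wait, repetition) combination in the sweep."""
    return list(product(modes, requests_grid, wait_grid, range(1, reps + 1)))


def min_wall_time_s(runs: Sequence[Run]) -> float:
    """Lower bound on total duration if runs execute one after another."""
    return sum(wait for _mode, _n, wait, _rep in runs)


if __name__ == "__main__":
    runs = enumerate_runs([1, 2, 4, 8, 16, 32, 64, 128], [0.1, 1.0, 5.0], reps=3)
    print(f"{len(runs)} runs, at least {min_wall_time_s(runs):.0f} s of wait time")
```

Under these assumptions the standard experiment consists of 2 x 8 x 3 x 3 = 144 runs and spends roughly five minutes in wait time alone, before any connection or scheduling overhead.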

Quick Validation Test

bash
python tests/test_scaling.py \
    --url http://localhost:8000 \
    --requests-grid 1,4,16,64 \
    --wait-grid 1.0 \
    --reps 1 \
    --mode ws \
    --output-dir experiments/results/local-uvicorn/quick-test

Infrastructure-Specific Recommendations

  • HF Spaces Free Tier: Use --requests-grid 1,2,4,8,16 --timeout 180
  • SLURM Single: Use --requests-grid 1,2,4,8,16,32,64,128,256
  • SLURM Multi: Use --requests-grid 1,2,4,8,16,32,64,128,256,512

Step 3: Analyze Results

Output Files

Tests generate:

  • raw.jsonl - Per-session detailed results (request_id, latencies, pid, session_hash, host_url, errors)
  • summary.csv - Aggregated statistics (success rates, p50/p90/p95/p99 latencies, throughput, effective_concurrency)
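
The percentile columns in summary.csv summarize the per-session latencies recorded in raw.jsonl. As an illustration of what a value such as p95 means, here is a nearest-rank percentile in plain Python. This is a sketch for intuition only. The benchmark scripts may use a different percentile definition (for example, linear interpolation), so small differences from summary.csv are possible.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest value with at least q% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_s = [0.12, 0.10, 0.11, 0.50, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18]
    for q in (50, 90, 95, 99):
        print(f"p{q}: {percentile(latencies_s, q)}")
```

With only ten samples, a single slow session (0.50 s here) determines both p95 and p99, which is one reason to use `--reps` greater than 1 for small batch sizes.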

analyze_results.py CLI Reference

bash
# Analyze single experiment
python experiments/scripts/analyze_results.py \
    --input experiments/results/local-uvicorn/2026-01-09

# Analyze all infrastructures
python experiments/scripts/analyze_results.py --all

# Custom success threshold (default 95%)
python experiments/scripts/analyze_results.py \
    --input experiments/results/local-uvicorn/2026-01-09 \
    --success-threshold 0.90

| Option | Description |
|---|---|
| `--input`, `-i` | Input directory with raw.jsonl and summary.csv |
| `--all` | Analyze all infrastructures in experiments/results/ |
| `--output`, `-o` | Output directory for figures (default: experiments/reports/figures/) |
| `--success-threshold` | Success rate threshold for max batch (default: 0.95) |
| `--tables-only` | Generate tables only, skip figures |
| `--figures-only` | Generate figures only, skip tables |

Generated Reports

  • experiments/reports/tables.md - Markdown tables (max batch, protocol comparison, latency breakdown)
  • experiments/reports/figures/ - PNG plots (max_batch_comparison.png, scaling_curves.png, latency_heatmap.png)
  • experiments/reports/EXPERIMENT_LOG.md - Run history

Key Metrics to Review

  1. Max Batch Size: Largest concurrent batch that reaches the success threshold (95% by default, configurable with --success-threshold)
  2. Protocol Comparison: WebSocket (WS) throughput is typically 10-20x higher than HTTP
  3. Latency Breakdown: connect_p50, reset_p50, step_p50, total_p99
  4. Distribution Metrics: unique_pids, unique_sessions, unique_hosts (verify load balancing)
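
The max batch size metric can be reproduced by hand from per-batch success rates. The sketch below assumes the rule is "largest batch whose success rate is at or above the threshold". The exact rule in analyze_results.py is not shown here, and `max_passing_batch` is a hypothetical name.

```python
from typing import Dict, Optional


def max_passing_batch(success_by_batch: Dict[int, float], threshold: float = 0.95) -> Optional[int]:
    """Return the largest batch size whose success rate meets the threshold, or None."""
    passing = [batch for batch, rate in success_by_batch.items() if rate >= threshold]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    example = {1: 1.0, 2: 1.0, 4: 1.0, 8: 0.99, 16: 0.96, 32: 0.91, 64: 0.40}
    print(max_passing_batch(example))        # default threshold of 0.95
    print(max_passing_batch(example, 0.90))  # relaxed threshold
```

Lowering the threshold can only keep or raise the reported maximum, so compare infrastructures using the same threshold.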

Verify Load Balancing (Multi-Node)

bash
python -c "
import json
hosts = set()
with open('experiments/results/slurm-multi/$(date +%Y-%m-%d)/raw.jsonl') as f:
    for line in f:
        data = json.loads(line)
        if data.get('host_url'):
            hosts.add(data['host_url'])
print(f'Unique hosts: {len(hosts)}')
print(hosts)
"

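Counting unique hosts shows whether every node received traffic, but not how evenly it was spread. The sketch below counts sessions per `host_url` (the field named in raw.jsonl above) and reports the ratio between the busiest and the least busy host. The function names are hypothetical, and the ratio is just one possible imbalance measure.

```python
import json
from collections import Counter
from typing import Iterable


def requests_per_host(jsonl_lines: Iterable[str]) -> Counter:
    """Count records per host_url, skipping blank lines and records without a host."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        host = record.get("host_url")
        if host:
            counts[host] += 1
    return counts


def imbalance_ratio(counts: Counter) -> float:
    """Busiest host's count divided by the least busy host's count (1.0 means perfectly even)."""
    if not counts:
        raise ValueError("no hosts recorded")
    return max(counts.values()) / min(counts.values())
```

To apply it to a results file, pass an open file handle, for example `requests_per_host(open(path))` where `path` points at a raw.jsonl file. A host that received no traffic at all does not appear in the counts, so the unique host count above should still be compared with the number of worker nodes.
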
Cleanup

bash
# Local uvicorn
pkill -f "uvicorn benchmark.server.app"

# Local docker
docker stop openenv-benchmark && docker rm openenv-benchmark

# SLURM
scancel $JOB_ID  # or exit the allocation shell

Troubleshooting

| Issue | Solution |
|---|---|
| Port in use | `lsof -i :8000` then `kill -9 <PID>` |
| Connection refused | Verify server running: `curl http://localhost:8000/health` |
| High error rate | Reduce `MAX_CONCURRENT_ENVS` or increase `WORKERS` |
| HF Space sleeping | Send health check requests to wake up |
| SLURM job won't start | Check `sinfo -p hopper-cpu` for partition availability |
| Uneven load distribution | Verify all worker nodes started, check Envoy config |
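
The "Port in use" and "Connection refused" cases can also be checked from Python before deploying. The sketch below tests whether something is accepting TCP connections on a port. `port_in_use` is a hypothetical helper, and it only detects listeners reachable at the given address.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, i.e. something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "in use" if port_in_use(8000) else "free"
    print(f"Port 8000 is {state}")
```

If the port is reported as in use before a deployment, an earlier server is probably still running. If it is reported as free after a deployment, the server did not start.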