vLLM Benchmark
Runs both prefill and decode throughput sweeps in a single invocation. Time-based measurement: sends streaming requests at max concurrency, measures throughput over a fixed window, then cancels and moves on.
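The time-based measurement can be sketched as follows. This is a minimal illustration, not the code in `bench_sweep.py`: the function name `windowed_throughput` and the `(timestamp, tokens)` event format are assumptions made for the example. It counts only tokens that arrive inside the measurement window, after warmup, and divides by the window length.

```python
from typing import Iterable, Tuple


def windowed_throughput(
    events: Iterable[Tuple[float, int]],
    warmup_s: float,
    duration_s: float,
) -> float:
    """Return tokens/second over the window [warmup_s, warmup_s + duration_s).

    `events` holds (seconds_since_start, tokens_received) pairs, e.g. one
    pair per streamed chunk. The event format is hypothetical.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    window_start = warmup_s
    window_end = warmup_s + duration_s
    # Tokens seen during warmup or after the window closes are ignored.
    total_tokens = sum(
        tokens for t, tokens in events if window_start <= t < window_end
    )
    return total_tokens / duration_s


# Example: 10 s warmup, 15 s window. Only the three middle events count.
example_events = [(0.5, 100), (10.0, 50), (12.0, 150), (24.9, 100), (25.0, 999)]
example_rate = windowed_throughput(example_events, warmup_s=10.0, duration_s=15.0)
```

Because the window is fixed, requests do not need to run to completion; they can be cancelled as soon as the window closes.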
Outputs per model:
- `benchmark_{model}.md`: markdown report with server config, prefill table, and decode tables
- `benchmark_{model}.json`: raw data
- `prefill_{model}.png`: prefill throughput plot
- `decode_{model}.png`: decode throughput plot
Step 1: Launch the vLLM server
Start the server as a background task before running benchmarks.
Single-node (TP <= 8):
```bash
uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0
```
Multi-node (requires Ray cluster):
```bash
VLLM_HOST_IP=<ray_head_ip> uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --pipeline-parallel-size <pp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0
```
Key notes:
- Multi-node only: `VLLM_HOST_IP` must be set to the Ray head node's internal/cluster IP. Find it with `ray status` or `hostname -I`.
- `--no-enable-prefix-caching`: prefix caching must be disabled for benchmarking. All requests use the same prompt, so cached prefixes would inflate prefill numbers by roughly 40x.
- `--enable-chunked-prefill`: allows interleaving prefill and decode for better throughput.
- `--max-num-seqs`: must be >= the highest concurrency level being benchmarked. Otherwise the server caps the batch size and queues the remaining requests, which bottlenecks the results.
- Wait for the server to be healthy before benchmarking: `curl http://localhost:8000/health`
- Read the `KV cache size: X tokens` line from the server logs; you need this value for `--kv-cache-tokens`.
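The last two notes can be automated. The sketch below polls the `/health` endpoint until it answers with HTTP 200, then extracts the token count from a `KV cache size: X tokens` log line. The helper names, the timeout defaults, and the tolerance for thousands separators in the number are assumptions for illustration; the exact log format may vary between vLLM versions.

```python
import re
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def is_healthy(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(
    url: str,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = is_healthy,
) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Assumed pattern: matches "KV cache size: 1,234,567 tokens" with or
# without thousands separators.
KV_CACHE_RE = re.compile(r"KV cache size:\s*([\d,]+)\s*tokens")


def parse_kv_cache_tokens(log_text: str) -> Optional[int]:
    """Return the first KV cache size found in `log_text`, or None."""
    match = KV_CACHE_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Usage (not executed here, since it needs a running server):
#   if wait_until_healthy("http://localhost:8000/health"):
#       tokens = parse_kv_cache_tokens(open("server.log").read())
```

The `probe` parameter exists only so the polling loop can be exercised without a live server.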
Step 2: Run the benchmark
Collect these parameters from the user (defaults shown):
- `--base-url` (default: `http://localhost:8000`)
- `--model` (required; the model name served by vLLM)
- `--kv-cache-tokens` (required; total KV cache capacity in tokens, from server logs)
- `--tp` (default: `1`; tensor parallel size, for report metadata)
- `--kv-cache-dtype` (default: `auto`; for report metadata)
- `--ratios` (default: `0.25,0.5`; input token fractions for decode sweep)
- `--concurrency-levels` (default: `32,64,128,256,512,1024,2048`)
- `--max-model-len` (default: `65536`)
- `--warmup` (default: `10.0`; seconds to warm up before measuring)
- `--duration` (default: `15.0`; seconds to measure throughput)
- `--auto-warmup` (flag; compute decode warmup from prefill results so measurement starts after all slots finish prefilling)
- `--warmup-margin` (default: `1.5`; multiplier on auto-warmup estimate)
- `--output-dir` (default: `./bench-results`)
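To make `--auto-warmup` and `--warmup-margin` concrete, here is one plausible way such an estimate could be computed. This is an assumption for illustration, not the formula used by `bench_sweep.py`: it takes the time needed to prefill every slot at the measured prefill throughput and scales it by the margin. The function and parameter names are hypothetical.

```python
def estimate_decode_warmup(
    concurrency: int,
    input_tokens_per_request: int,
    prefill_tokens_per_s: float,
    margin: float = 1.5,
) -> float:
    """Estimate seconds to wait before measuring decode throughput.

    Assumed model: all `concurrency` slots must finish prefilling
    `input_tokens_per_request` tokens each, at the prefill throughput
    measured earlier. `margin` pads the estimate.
    """
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    total_prefill_tokens = concurrency * input_tokens_per_request
    return margin * total_prefill_tokens / prefill_tokens_per_s


# Example: 64 slots x 4096 input tokens at 32768 tokens/s is 8 s of
# prefill; with the default 1.5 margin the warmup becomes 12 s.
example_warmup = estimate_decode_warmup(64, 4096, 32768.0)
```

Under this model, a fixed `--warmup` of 10 seconds would start measuring too early at high concurrency or with long prompts, which is the situation the flag is meant to avoid.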
```bash
uv run python .claude/skills/vllm-benchmark-skill/scripts/bench_sweep.py \
  --base-url <url> \
  --model <model> \
  --kv-cache-tokens <tokens> \
  --tp <tp_size> \
  --auto-warmup \
  --output-dir <dir>
```
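Before launching the command above, it can help to confirm that the server's `--max-num-seqs` covers every concurrency level in the sweep, since the server queues anything beyond that cap. A small pre-flight check might look like the following; the helper names are hypothetical and not part of `bench_sweep.py`.

```python
from typing import List


def parse_levels(spec: str) -> List[int]:
    """Parse a comma-separated list such as '32,64,128' into integers."""
    return [int(part) for part in spec.split(",") if part.strip()]


def uncovered_levels(levels: List[int], max_num_seqs: int) -> List[int]:
    """Return the concurrency levels that exceed the server's --max-num-seqs."""
    return [level for level in levels if level > max_num_seqs]


# Example: the default sweep against a server started with --max-num-seqs 1024.
default_levels = parse_levels("32,64,128,256,512,1024,2048")
too_high = uncovered_levels(default_levels, max_num_seqs=1024)
```

If the returned list is non-empty, either restart the server with a larger `--max-num-seqs` or drop those levels from `--concurrency-levels`.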
Important: only run ONE benchmark at a time against a server. Running multiple benchmarks concurrently produces garbage results since they compete for GPU resources.
After the script completes, show the user the generated markdown report at <output-dir>/benchmark_{model}.md.