vLLM Benchmark

Runs both prefill and decode throughput sweeps in a single invocation. Time-based measurement: sends streaming requests at max concurrency, measures throughput over a fixed window, then cancels and moves on.

Outputs per model:

•benchmark_{model}.md — markdown report with server config, prefill table, decode tables
•benchmark_{model}.json — raw data
•prefill_{model}.png — prefill throughput plot
•decode_{model}.png — decode throughput plot

Step 1: Launch the vLLM server

Start the server as a background task before running benchmarks.

Single-node (TP <= 8):

bash

uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0

Multi-node (requires Ray cluster):

bash

VLLM_HOST_IP=<ray_head_ip> uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --pipeline-parallel-size <pp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0

Key notes:

•Multi-node only: VLLM_HOST_IP must be set to the Ray head node's internal/cluster IP. Find it with ray status or hostname -I.
•--no-enable-prefix-caching — must be disabled for benchmarking since all requests use the same prompt, which would inflate prefill numbers ~40x if cached.
•--enable-chunked-prefill — allows interleaving prefill and decode for better throughput.
•--max-num-seqs — must be >= the highest concurrency level being benchmarked, otherwise the server caps the batch size and queues the rest, bottlenecking the results.
•Wait for the server to be healthy before benchmarking: curl http://localhost:8000/health
•Read the KV cache size: X tokens line from server logs — you need this for --kv-cache-tokens.

Step 2: Run the benchmark

Collect these parameters from the user (defaults shown):

•--base-url (default: http://localhost:8000)
•--model (required — the model name served by vLLM)
•--kv-cache-tokens (required — total KV cache capacity in tokens, from server logs)
•--tp (default: 1 — tensor parallel size, for report metadata)
•--kv-cache-dtype (default: auto — for report metadata)
•--ratios (default: 0.25,0.5 — input token fractions for decode sweep)
•--concurrency-levels (default: 32,64,128,256,512,1024,2048)
•--max-model-len (default: 65536)
•--warmup (default: 10.0 — seconds to warm up before measuring)
•--duration (default: 15.0 — seconds to measure throughput)
•--auto-warmup (flag — compute decode warmup from prefill results so measurement starts after all slots finish prefilling)
•--warmup-margin (default: 1.5 — multiplier on auto-warmup estimate)
•--output-dir (default: ./bench-results)

bash

uv run python .claude/skills/vllm-benchmark-skill/scripts/bench_sweep.py \
  --base-url <url> \
  --model <model> \
  --kv-cache-tokens <tokens> \
  --tp <tp_size> \
  --auto-warmup \
  --output-dir <dir>

Important: only run ONE benchmark at a time against a server. Running multiple benchmarks concurrently produces garbage results since they compete for GPU resources.

After the script completes, show the user the generated markdown report at <output-dir>/benchmark_{model}.md.