
vllm-benchmark

Run prefill and decode throughput benchmarks against a vLLM server and generate a Markdown report with plots.

SKILL.md
--- frontmatter
name: vllm-benchmark
description: Run prefill + decode throughput benchmarks against a vLLM server and generate a markdown report with plots.
disable-model-invocation: true

vLLM Benchmark

Runs both prefill and decode throughput sweeps in a single invocation. Measurement is time-based: the script sends streaming requests at max concurrency, measures throughput over a fixed window, then cancels the outstanding requests and moves on to the next configuration.

Outputs per model:

  • benchmark_{model}.md — markdown report with server config, prefill table, decode tables
  • benchmark_{model}.json — raw data
  • prefill_{model}.png — prefill throughput plot
  • decode_{model}.png — decode throughput plot

Step 1: Launch the vLLM server

Start the server as a background task before running benchmarks.

Single-node (TP <= 8):

bash
uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0
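
The command above runs in the foreground. Since the server has to run as a background task and its logs are needed later, one possible wrapper is sketched below. The function name `launch_vllm_server`, the default log file `vllm_server.log`, and the `DRY_RUN` switch are illustrative assumptions, not part of the skill; the flags are the ones listed above.

```bash
# Hypothetical helper: start the single-node server in the background
# and capture its output in a log file for later inspection.
launch_vllm_server() {
  local model="$1" tp="$2" max_seqs="$3"
  local log_file="${4:-vllm_server.log}"   # assumed log location

  # Build the command line from the flags documented in Step 1.
  set -- uv run python -m vllm.entrypoints.openai.api_server \
    --model "$model" \
    --dtype bfloat16 \
    --tensor-parallel-size "$tp" \
    --gpu-memory-utilization 0.95 \
    --enable-chunked-prefill \
    --no-enable-prefix-caching \
    --max-num-seqs "$max_seqs" \
    --max-model-len 65536 \
    --port 8000 \
    --host 0.0.0.0

  # DRY_RUN=1 (an assumption of this sketch) prints the command instead of running it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
    return 0
  fi

  # Detach from the terminal, send stdout and stderr to the log, print the PID.
  nohup "$@" > "$log_file" 2>&1 &
  echo "$!"
}
```

A call such as `launch_vllm_server <model> <tp> <max_concurrency>` prints the PID of the background process, which can later be passed to `kill` to stop the server.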

Multi-node (requires Ray cluster):

bash
VLLM_HOST_IP=<ray_head_ip> uv run python -m vllm.entrypoints.openai.api_server \
  --model <model> \
  --dtype bfloat16 \
  --tensor-parallel-size <tp> \
  --pipeline-parallel-size <pp> \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --max-num-seqs <max_concurrency> \
  --max-model-len 65536 \
  --port 8000 \
  --host 0.0.0.0

Key notes:

  • Multi-node only: VLLM_HOST_IP must be set to the Ray head node's internal/cluster IP. Find it with ray status or hostname -I.
  • --no-enable-prefix-caching: disables prefix caching. This is required for benchmarking because all requests use the same prompt; with caching enabled, prefill numbers would be inflated by roughly 40x.
  • --enable-chunked-prefill — allows interleaving prefill and decode for better throughput.
  • --max-num-seqs — must be >= the highest concurrency level being benchmarked, otherwise the server caps the batch size and queues the rest, bottlenecking the results.
  • Wait for the server to be healthy before benchmarking: curl http://localhost:8000/health
  • Find the line containing KV cache size: X tokens in the server logs. X is the value to pass to --kv-cache-tokens.
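
The last two notes can be scripted. The following is a minimal sketch, assuming the server output was redirected to a log file and that the token count in the log may contain thousands separators. Both function names are hypothetical, and the timeout and polling interval are arbitrary defaults.

```bash
# Hypothetical helper: poll the health endpoint until it answers or a timeout expires.
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local timeout_s="${2:-600}"    # assumed upper bound on model load time
  local interval_s="${3:-5}"
  local deadline
  deadline=$(( $(date +%s) + timeout_s ))

  # curl -f makes HTTP errors count as failures; --max-time bounds each probe.
  until curl -sf --max-time 5 -o /dev/null "$url"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "server not healthy after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Hypothetical helper: extract X from the last "KV cache size: X tokens" log line.
kv_cache_tokens_from_log() {
  local log_file="$1"
  sed -nE 's/.*KV cache size: ([0-9,]+) tokens.*/\1/p' "$log_file" \
    | tail -n 1 \
    | tr -d ','    # drop thousands separators, if any
}
```

With a log file at hand, `kv_cache_tokens_from_log <log_file>` prints a plain integer suitable for --kv-cache-tokens; it prints nothing if no matching line is found, so the result is worth checking before use.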

Step 2: Run the benchmark

Collect these parameters from the user (defaults shown):

  • --base-url (default: http://localhost:8000)
  • --model (required — the model name served by vLLM)
  • --kv-cache-tokens (required — total KV cache capacity in tokens, from server logs)
  • --tp (default: 1 — tensor parallel size, for report metadata)
  • --kv-cache-dtype (default: auto — for report metadata)
  • --ratios (default: 0.25,0.5 — input token fractions for decode sweep)
  • --concurrency-levels (default: 32,64,128,256,512,1024,2048)
  • --max-model-len (default: 65536)
  • --warmup (default: 10.0 — seconds to warm up before measuring)
  • --duration (default: 15.0 — seconds to measure throughput)
  • --auto-warmup (flag — compute decode warmup from prefill results so measurement starts after all slots finish prefilling)
  • --warmup-margin (default: 1.5 — multiplier on auto-warmup estimate)
  • --output-dir (default: ./bench-results)

bash
uv run python .claude/skills/vllm-benchmark-skill/scripts/bench_sweep.py \
  --base-url <url> \
  --model <model> \
  --kv-cache-tokens <tokens> \
  --tp <tp_size> \
  --auto-warmup \
  --output-dir <dir>
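
Because the server's --max-num-seqs must be at least the highest value in --concurrency-levels, a quick check before starting the sweep can catch a mismatch early. This is only a sketch; `max_concurrency` and `check_max_num_seqs` are hypothetical helpers, not part of bench_sweep.py.

```bash
# Hypothetical helper: print the largest value in a comma-separated list.
max_concurrency() {
  printf '%s\n' "$1" | tr ',' '\n' | sort -n | tail -n 1
}

# Hypothetical helper: fail if the server's --max-num-seqs is below the
# highest concurrency level that the sweep will request.
check_max_num_seqs() {
  local levels="$1" max_num_seqs="$2"
  local highest
  highest="$(max_concurrency "$levels")"
  if [ "$max_num_seqs" -lt "$highest" ]; then
    echo "--max-num-seqs=$max_num_seqs is below the highest concurrency level $highest" >&2
    return 1
  fi
}
```

For example, `check_max_num_seqs 32,64,128,256,512,1024,2048 1024` fails, because the default sweep goes up to 2048 concurrent requests.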

Important: run only ONE benchmark at a time against a server. Concurrent benchmarks compete for GPU resources, so their results are unreliable.
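
One way to guard against accidental concurrent runs is a simple lock around the benchmark command. The sketch below uses an atomic `mkdir` as the lock; the lock path and the `run_exclusive` name are assumptions, and the lock only protects invocations that share the same path on the same machine.

```bash
# Hypothetical lock location; override BENCH_LOCK_DIR to use another path.
BENCH_LOCK_DIR="${BENCH_LOCK_DIR:-/tmp/vllm-benchmark.lock}"

# Run a command only if no other run_exclusive call currently holds the lock.
run_exclusive() {
  local rc
  # mkdir is atomic: it succeeds for exactly one caller while the directory exists.
  if ! mkdir "$BENCH_LOCK_DIR" 2>/dev/null; then
    echo "another benchmark appears to be running (lock: $BENCH_LOCK_DIR)" >&2
    return 1
  fi
  "$@"
  rc=$?
  # Release the lock and pass the command's exit status through.
  rmdir "$BENCH_LOCK_DIR"
  return "$rc"
}
```

Prefixing the benchmark command with `run_exclusive` makes a second invocation fail immediately instead of skewing both runs. If a run is killed before it releases the lock, the directory has to be removed by hand.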

After the script completes, show the user the generated markdown report at <output-dir>/benchmark_{model}.md.