AgentSkillsCN

vllm-deployment

部署vLLM以实现高性能LLM推理。涵盖Docker CPU/GPU部署以及云虚拟机 provision,配备与OpenAI兼容的API端点。

SKILL.md
--- frontmatter
name: vllm-deployment
description: |
  Deploy vLLM for high-performance LLM inference. Covers Docker CPU/GPU deployments and cloud VM provisioning with OpenAI-compatible API endpoints.
license: MIT
tags:
  - vllm
  - llm
  - inference
  - gpu
  - ai
  - machine-learning
  - docker
  - openai-api
metadata:
  author: Stakpak <team@stakpak.dev>
  version: "1.0.3"

vLLM Model Serving and Inference

Quick Start

Docker (CPU)

bash
docker run --rm -p 8000:8000 \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32
# Access: http://localhost:8000

Docker (GPU)

bash
docker run --rm -p 8000:8000 \
  --gpus all \
  --shm-size=4g \
  <vllm-gpu-image> \
  --model <model-name>
# Access: http://localhost:8000

Docker Deployment

1. Assess Hardware Requirements

HardwareMinimum RAMRecommended
CPU2x model size4x model size
GPUModel size + 2GBModel size + 4GB VRAM
  • Check model documentation for specific requirements
  • Consider quantized variants to reduce memory footprint
  • Allocate 50-100GB storage for model downloads

2. Pull the Container Image

bash
# CPU image (check vLLM docs for latest tag)
docker pull <vllm-cpu-image>

# GPU image (check vLLM docs for latest tag)
docker pull <vllm-gpu-image>

Notes:

  • Use CPU-specific images for CPU inference
  • Use CUDA-enabled images matching your GPU architecture
  • Verify CPU instruction set compatibility (AVX512, AVX2)

3. Configure and Run

CPU Deployment:

bash
docker run --rm \
  --shm-size=4g \
  --cap-add SYS_NICE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-7 \
  <vllm-cpu-image> \
  --model <model-name> \
  --dtype float32 \
  --max-model-len 2048

GPU Deployment:

bash
docker run --rm \
  --gpus all \
  --shm-size=4g \
  -p 8000:8000 \
  <vllm-gpu-image> \
  --model <model-name> \
  --dtype auto \
  --max-model-len 4096

4. Verify Deployment

bash
# Check health
curl http://localhost:8000/health

# List models
curl http://localhost:8000/v1/models

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 10}'

5. Update

bash
docker pull <vllm-image>
docker stop <container-id>
# Re-run with same parameters

Cloud VM Deployment

1. Provision Infrastructure

bash
# Create security group with rules:
# - TCP 22 (SSH)
# - TCP 8000 (API)

# Launch instance with:
# - Sufficient RAM/VRAM for model
# - Docker pre-installed (or install manually)
# - 50-100GB root volume
# - Public IP for external access

2. Connect and Deploy

bash
ssh -i <key-file> <user>@<instance-ip>

# Install Docker if not present
# Pull and run vLLM container (see Docker Deployment section)

3. Verify External Access

bash
# From local machine
curl http://<instance-ip>:8000/health
curl http://<instance-ip>:8000/v1/models

4. Cleanup

bash
# Stop container
docker stop <container-id>

# Terminate instance to stop costs
# Delete associated resources (volumes, security groups) if temporary

Configuration Reference

Environment Variables

VariablePurposeExample
VLLM_CPU_KVCACHE_SPACEKV cache size in GB (CPU)4
VLLM_CPU_OMP_THREADS_BINDCPU core binding (CPU)0-7
CUDA_VISIBLE_DEVICESGPU device selection0,1
HF_TOKENHuggingFace authenticationhf_xxx

Docker Flags

FlagPurpose
--shm-size=4gShared memory for IPC
--cap-add SYS_NICENUMA optimization (CPU)
--security-opt seccomp=unconfinedMemory policy syscalls (CPU)
--gpus allGPU access
-p 8000:8000Port mapping

vLLM Arguments

ArgumentPurposeExample
--modelModel name/path<model-name>
--dtypeData typefloat32, auto, bfloat16
--max-model-lenMax context length2048
--tensor-parallel-sizeMulti-GPU parallelism2

API Endpoints

EndpointMethodPurpose
/healthGETHealth check
/v1/modelsGETList available models
/v1/completionsPOSTText completion
/v1/chat/completionsPOSTChat completion
/metricsGETPrometheus metrics

Production Checklist

  • Verify model fits in available memory
  • Configure appropriate data type for hardware
  • Set up firewall/security group rules
  • Test API endpoints before production use
  • Configure monitoring (Prometheus metrics)
  • Set up health check alerts
  • Document model and configuration used
  • Plan for model updates and rollbacks

Troubleshooting

IssueSolution
Container exits immediatelyIncrease RAM or use smaller model
Slow inference (CPU)Verify OMP thread binding configuration
Connection refused externallyCheck firewall/security group rules
Model download failsSet HF_TOKEN for gated models
Out of memory during inferenceReduce max_model_len or batch size
Port already in useChange host port mapping
Warmup takes too longNormal for large models (1-5 min)

References