Server Lifecycle Management
Manages the full cycle: kill stale -> launch -> health check -> benchmark -> collect results -> kill.
Step 1: Clean Environment
Kill any stale processes from previous runs:
```bash
# Force-kill leftover server and benchmark processes
pkill -9 -f sglang 2>/dev/null
pkill -9 -f aiperf 2>/dev/null
sleep 3
# Show per-GPU memory usage
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```
Verify GPU memory is free. If not, find and kill the holding process:
```bash
# List the PIDs that still hold GPU memory
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
```
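The "is GPU memory free" check can also be scripted rather than read by eye. The following is a minimal sketch, assuming the `index, used MiB, total MiB` CSV layout produced by the `--query-gpu` call above. The function name `busy_gpus` and the 500 MiB default threshold are illustrative choices, not part of the original workflow.

```shell
# busy_gpus [THRESHOLD_MIB]   (hypothetical helper)
# Reads "index, used MiB, total MiB" lines on stdin and prints the index
# of every GPU whose used memory exceeds the threshold (default 500 MiB).
busy_gpus() {
  local threshold="${1:-500}"
  awk -F', *' -v limit="$threshold" '
    {
      used = $2
      sub(/ MiB$/, "", used)          # strip the unit suffix
      if ((used + 0) > (limit + 0)) print $1
    }
  '
}

# Example invocation (not executed here):
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader | busy_gpus
```

Empty output means every GPU is below the threshold. Any printed index points at a GPU to investigate with the `--query-compute-apps` query.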
Step 2: Launch Server
Determine which project/venv to use:
- Check whether the user specified a project context
- Default to `~/sglang-poc-pin` if it exists
- Read the project's CLAUDE.md for standard launch args
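The selection rule above can be sketched as a small function. This is an illustration only: the name `pick_project_dir` and its single optional argument are assumptions, and reading CLAUDE.md for launch args is left to the caller.

```shell
# pick_project_dir [EXPLICIT_DIR]   (hypothetical helper)
# Prints the project directory to use: the explicit one if given,
# otherwise ~/sglang-poc-pin when that directory exists.
pick_project_dir() {
  local explicit="${1:-}"
  local default_dir="$HOME/sglang-poc-pin"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  elif [ -d "$default_dir" ]; then
    printf '%s\n' "$default_dir"
  else
    echo "no project specified and $default_dir not found" >&2
    return 1
  fi
}

# Example invocation (not executed here):
#   project="$(pick_project_dir)" && source "$project/.venv/bin/activate"
```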
Standard PIN dev launch:
```bash
source ~/sglang-poc-pin/.venv/bin/activate
CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B-FP8 \
  --port 30000 \
  --mem-fraction-static 0.50 \
  --tp-size 2 \
  --enable-hierarchical-cache \
  --hicache-ratio 2.0 \
  --hicache-write-policy write_through \
  --trust-remote-code \
  --log-level info \
  --watchdog-timeout 120000 \
  --enable-metrics \
  --enable-cache-report \
  --context-length 32768
```
Run as background task. Poll for health:
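```bash
# Retry for up to 5 minutes (60 attempts, 5 s apart)
for i in $(seq 1 60); do
  # -f makes curl fail on HTTP error responses, so only a successful /health counts
  curl -sf localhost:30000/health 2>/dev/null && echo " Server ready" && break
  sleep 5
done
```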
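A polling loop like this one exits silently if the server never comes up, so the benchmark may start against a dead endpoint. One possible way to surface the timeout is a generic retry helper that returns non-zero when every attempt fails. The name `wait_until` and its argument order are assumptions made for this sketch.

```shell
# wait_until ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds. Returns 0 on the first success,
# or 1 after ATTEMPTS consecutive failures.
wait_until() {
  local attempts="$1" interval="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Example invocation (not executed here):
#   wait_until 60 5 curl -sf localhost:30000/health \
#     || { echo "server never became healthy" >&2; exit 1; }
```

Passing the probe as a command keeps the helper independent of curl, so the same function can wait on a port check or a log line instead.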
Step 3: Run Benchmark
Use the appropriate benchmark script or aiperf command. Always:
- Run baseline first, then treatment (e.g., pinned)
- If results matter, run both orderings (A/B then B/A) to control for ordering bias
- Use a fresh server for each phase (kill + relaunch between phases)
- Save results to `~/design/benchmarks/results/`
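The ordering rules above can be sketched as a driver that walks both orderings and delegates each phase to a callback. Everything here is hypothetical scaffolding: `run_both_orderings` and the callback contract (ordering label, then phase name) are assumptions, and the callback is expected to do the kill, relaunch, health check, and benchmark for one phase.

```shell
# run_both_orderings PHASE_RUNNER   (hypothetical helper)
# Calls PHASE_RUNNER <ordering> <phase> four times:
#   AB: baseline, then pinned
#   BA: pinned, then baseline
# Stops at the first phase that fails.
run_both_orderings() {
  local run_phase="$1"
  "$run_phase" AB baseline &&
    "$run_phase" AB pinned &&
    "$run_phase" BA pinned &&
    "$run_phase" BA baseline
}

# Example callback (not executed here); each call should start from a fresh server:
#   run_phase() {
#     local ordering="$1" phase="$2"
#     # kill stale processes, relaunch, wait for health, then benchmark,
#     # writing results under ~/design/benchmarks/results/ with both labels
#   }
#   run_both_orderings run_phase
```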
Example with aiperf:
```bash
cd ~/aiperf
uv run aiperf profile Qwen/Qwen3-14B-FP8 \
  --url http://localhost:30000 \
  --endpoint-type chat \
  --input-file ~/datasets/long_multiturn_opus.jsonl \
  --custom-dataset-type multi-turn \
  --concurrency 16 \
  --streaming \
  --request-timeout-seconds 300 \
  --artifact-dir ~/design/benchmarks/results/
```
Example with benchmark script:
```bash
source ~/sglang-poc-pin/.venv/bin/activate
python ~/design/benchmarks/pin_benchmark_v8.py \
  --depths 10 \
  --phase baseline \
  --output-dir /tmp/benchmark_results
```
Step 4: Collect and Verify
```bash
# Check metrics during run
curl -s localhost:30000/metrics | grep -E 'hicache|cache_hit|evicted|num_requests' | grep -v '^#'

# Copy results
cp /tmp/benchmark_results/results.json ~/design/benchmarks/results/

# Check PIN-specific logs if relevant
grep '\[PIN\]' /tmp/sglang_server_*.log | tail -20
```
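Copying `results.json` under a fixed name means a later phase can overwrite an earlier one. A minimal sketch of a labeled copy follows, assuming a phase label and a timestamp are enough to keep runs apart. The helper name `archive_results` and the `<label>_<timestamp>.json` naming scheme are illustrative, not an existing convention of the benchmark scripts.

```shell
# archive_results SRC_FILE DEST_DIR LABEL   (hypothetical helper)
# Copies SRC_FILE to DEST_DIR/LABEL_<YYYYmmdd_HHMMSS>.json and prints the new path.
# Fails if SRC_FILE is missing or empty.
archive_results() {
  local src="$1" dest_dir="$2" label="$3"
  local stamp dest
  if [ ! -s "$src" ]; then
    echo "missing or empty results file: $src" >&2
    return 1
  fi
  stamp="$(date +%Y%m%d_%H%M%S)"
  dest="$dest_dir/${label}_${stamp}.json"
  mkdir -p "$dest_dir" && cp "$src" "$dest" && printf '%s\n' "$dest"
}

# Example invocation (not executed here):
#   archive_results /tmp/benchmark_results/results.json ~/design/benchmarks/results baseline
```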
Step 5: Cleanup
Always kill server after benchmarks:
```bash
pkill -9 -f sglang 2>/dev/null
pkill -9 -f aiperf 2>/dev/null
sleep 2
# Count surviving processes; should be 0
ps aux | grep -E 'sglang|aiperf' | grep -v grep | wc -l
```
Datasets
All in ~/datasets/:
- `long_multiturn_opus.jsonl` -- 10 synthetic multi-turn sessions (flood/eviction workload)
- `claude_history_sonnet.jsonl` -- single real Claude Code session (VIP workload)
- `claude_history_10_sessions.jsonl` -- 10 real sessions, 585 turns
- `claude_history_10_sessions_with_thinking.jsonl` -- same sessions, with thinking blocks included
aiperf Concurrency Guidelines
- Quick smoke test: `--concurrency 1 --request-count 5`
- Standard load test: `--concurrency` set between 16 and 32
- Stress test: `--concurrency 64`
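These guidelines can be captured as a lookup so that scripts do not hard-code the numbers. The sketch below is illustrative: the profile names `smoke`, `standard`, and `stress`, the function name `aiperf_load_flags`, and the choice of 16 as the default for the standard range are all assumptions.

```shell
# aiperf_load_flags PROFILE [CONCURRENCY]   (hypothetical helper)
# Prints the aiperf load flags for a named profile.
# For "standard", CONCURRENCY must be within 16..32 (default 16).
aiperf_load_flags() {
  case "$1" in
    smoke)
      echo "--concurrency 1 --request-count 5"
      ;;
    standard)
      local n="${2:-16}"
      if [ "$n" -ge 16 ] && [ "$n" -le 32 ]; then
        echo "--concurrency $n"
      else
        echo "standard concurrency must be 16..32, got $n" >&2
        return 1
      fi
      ;;
    stress)
      echo "--concurrency 64"
      ;;
    *)
      echo "unknown profile: $1" >&2
      return 1
      ;;
  esac
}

# Example invocation (not executed here); the substitution is left unquoted
# on purpose so the flags split into separate words:
#   uv run aiperf profile Qwen/Qwen3-14B-FP8 --url http://localhost:30000 $(aiperf_load_flags smoke)
```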
Notes
- If the server hangs with no errors logged, that is a bug -- check for a silent scheduler spin
- `--watchdog-timeout 120000` prevents false watchdog kills during long benchmarks
- `--mem-fraction-static 0.50` leaves room for HiCache host memory
- `--hicache-ratio 2.0` sizes the host-side cache at 2x the GPU KV cache pool
- The aiperf repo has its own CLAUDE.md with architecture details -- read it before making aiperf changes
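As a rough worked example of the HiCache ratio, the host-side cache size can be estimated by multiplying the GPU-side KV cache pool by the ratio. This sketch assumes the ratio is applied to the KV cache pool rather than to total GPU memory. The helper name `hicache_host_gib` and the 20 GiB pool size in the example are hypothetical; read the real pool size from the server startup log.

```shell
# hicache_host_gib DEVICE_POOL_GIB RATIO   (hypothetical helper)
# Prints the estimated host-side cache size in GiB, to one decimal place.
hicache_host_gib() {
  awk -v pool="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", pool * ratio }'
}

# Example: a hypothetical 20 GiB device pool with --hicache-ratio 2.0
#   hicache_host_gib 20 2.0    # prints 40.0
```

With `--tp-size 2` it is worth checking whether that estimate applies per GPU rank, since host RAM has to cover the total.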