Add New Model to SGLang Cookbook
This is an interactive, multi-step workflow. Collect inputs incrementally from the user as needed — do NOT ask for everything upfront.
Phase 1: Collect Initial Inputs
Ask the user for:
- •Model Card - HuggingFace model name or URL (e.g.,
Qwen/Qwen3-Coder-Nextorhttps://huggingface.co/deepseek-ai/DeepSeek-V3). Fetch the page to extract model description, supported capabilities, and other details. If the model is not yet public or the page is inaccessible, ask the user to paste as much information as they know (model name, parameter count, architecture, capabilities, context length, etc.). - •Model Variants - Check if the model family has multiple size variants (e.g., 480B/30B) and quantization options (e.g., BF16/FP8). Ask the user which variants to include. This affects:
- •ConfigGenerator:
modelSizeoption withmodelConfigsper size,quantizationoption for BF16/FP8 - •YAML:
modelslist with differentbase_nameentries,quantizationsarrays - •Documentation: model name references and deployment examples
- •See
Qwen3CoderConfigGeneratorandQwen3NextConfigGeneratorfor multi-variant examples.
- •ConfigGenerator:
- •Full Deployment Command - Complete SGLang launch command including all strategy and optimization flags (tp, dp, ep, enable_dp_attention, etc.). If the model card already provides an SGLang deployment command, offer it as a default option.
- •SGLang Version - The SGLang version the user is testing on (e.g.,
v0.5.8). This determines the version subdirectory for YAML configs (data/models/src/<version>/).
Phase 2: Create Scaffolding
With model card and deployment command in hand, read ALL reference templates first, then create files.
Reference Templates
Read these files to understand existing patterns before creating anything:
- •Doc template: Find a similar model doc under
docs/autoregressive/(e.g.,Qwen3-Coder.mdfor a Qwen model,DeepSeek-V3_2.mdfor DeepSeek) - •ConfigGenerator template: Find a similar generator under
src/components/autoregressive/(e.g.,Qwen3NextConfigGenerator/index.js,Qwen3CoderConfigGenerator/index.js) - •YAML template:
data/models/src/<version>/<similar-model>.yaml(e.g.,qwen3next.yaml). Listdata/models/src/to see available version directories. - •Sidebar:
sidebars.js - •Vendors:
data/models/vendors.yaml
Important Patterns
- •ConfigGenerator components are placed FLAT under
src/components/autoregressive/<ModelNameConfigGenerator>/index.js(NOT nested in vendor folders) - •YAML source files are in
data/models/src/<version>/(NOT directly indata/models/src/). The version directory corresponds to the SGLang version being tested. Create it if it doesn't exist. - •The base
ConfigGeneratorcomponent is atsrc/components/base/ConfigGenerator - •AMD GPUs (MI300X/MI325X/MI355X) typically need
--attention-backend triton - •For models with
commandRuleon options, use the pattern from existing generators to apply rules viaObject.entries(this.options).forEach(...)
Step 1: Create documentation file
Create docs/autoregressive/<Vendor>/<ModelName>.md with ALL sections pre-populated:
- •Section 1: Model introduction (from model card)
- •Section 2: SGLang installation instructions
- •Section 3: Model deployment section (embed the config generator component)
- •Section 4: Model invocation — include deployment command at the TOP of this section, then pre-fill test scripts (code generation, streaming, tool calling if supported) with
TODOplaceholders for outputs. This allows the user to copy commands directly from the rendered page. - •Section 5: Benchmarks — pre-fill with benchmark commands and
TODOplaceholders for results
Benchmark commands reference:
- •GSM8K:
python benchmark/gsm8k/bench_sglang.py --port <port> - •MMLU:
cd benchmark/mmlu && bash download_data.sh && python3 bench_sglang.py --port <port> - •Latency benchmark:
python3 -m sglang.bench_serving --backend sglang --num-prompts 10 --max-concurrency 1 ... - •Throughput benchmark:
python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --max-concurrency 100 ...
Important:
- •When the output contains nested markdown code blocks (e.g., model outputs python code), use four backticks ```````` for the outer block to avoid rendering issues.
- •Do NOT hardcode sampling parameters (
temperature,top_p,top_k) in code examples. SGLang automatically applies the recommended parameters from the model'sgeneration_config.json.
Step 2: Update sidebar
Edit sidebars.js to add the new entry under the appropriate vendor category.
Step 3: Create config generator component
Create src/components/autoregressive/<ModelName>ConfigGenerator/index.js based on the deployment command provided.
Key considerations:
- •Use the base
ConfigGeneratorcomponent - •Define
modelConfigswith per-hardware tp/ep settings - •Add
commandRulefor optional features (tool calling, reasoning parser, etc.) - •Default all optimizations and parsers to Enabled (e.g., tool call parser, reasoning parser) — users can disable if needed
- •Consider hardware-specific flags (e.g., AMD triton backend)
For models with multiple variants, add selector options:
- •
modelSizeoption: e.g.,{ id: '480b', label: '480B', subtitle: 'MOE' }— maps to differentmodelConfigsentries with per-hardware tp/ep - •
quantizationoption: e.g.,{ id: 'bf16', label: 'BF16' }, { id: 'fp8', label: 'FP8' }— affects model name suffix (e.g.,-FP8) and may require extra flags (e.g.,--ep 2,--trust-remote-code) - •
thinkingoption: e.g.,{ id: 'instruct', label: 'Instruct' }, { id: 'thinking', label: 'Thinking' }— affects model name suffix and reasoning parser
See Qwen3CoderConfigGenerator (multi-size + quantization) and Qwen3NextConfigGenerator (quantization + thinking) for reference patterns.
Step 4: Add model configuration
Create data/models/src/<version>/<modelname>.yaml to define hardware configs and parallelism strategies. Use the SGLang version from Phase 1. Create the version directory if it doesn't exist.
Phase 3: Compile, Validate, and Start Dev Server
Run compilation and validation:
source .venv/bin/activate && python data/scripts/compile_models.py cd data/schema && npm install && npm test
Start dev server to verify rendering:
npm start
Verify the new page renders correctly at http://localhost:3000.
Phase 4: Interactive Testing
The user will deploy the model and run the test scripts from the documentation page. They will paste results back, and we update the TODO placeholders with actual outputs.
This covers:
- •Model Invocation — code generation, streaming, tool calling results
- •Speed Benchmarks — latency and throughput results
- •Accuracy Benchmarks — GSM8K, MMLU results
Phase 5: Configuration Tips
Ask the user for any extra information:
- •Recommended settings
- •Known issues or limitations
- •Optimization tips
Add these to the documentation.
Phase 6: Final Review
Review the complete documentation for:
- •Nested code block formatting (use ```````` for outer blocks containing
```) - •Consistent port numbers across all commands
- •No duplicate deployment commands (reference the one at the top of Section 4)
- •All
TODOplaceholders replaced with actual results