Add New Model Workflow
You are orchestrating the addition of a new model to this llama.cpp Docker project. Follow the phases below in order. Use the specified agent at each phase. Do NOT skip phases unless explicitly told to.
Arguments: The user may provide a candidate name (matching a file in models/documentation/CANDIDATES/) or a HuggingFace model name.
Before you start
- •Read
AI_INSTRUCTIONS.mdfor project context - •Read
docs/gpu-strategy-guide.mdfor GPU placement strategies - •Read
docs/lessons_learned.mdfor known pitfalls - •If a candidate name was given, read the model card from
models/documentation/CANDIDATES/ - •Read
models.confto understand existing profile patterns
Phase 1: Evaluate (model-manager agent)
Use the model-manager agent to analyze the candidate:
- •What kind of model? (Dense vs MoE, params, layers, experts, architecture)
- •What quantized options exist? How large are they?
- •What are the model's specialties/use cases?
- •Does it fit on the hardware? (RTX 4090 24GB + RTX 5070 Ti 16GB + 64GB RAM)
- •Which quant is the best fit? (balance quality vs VRAM vs speed)
You MUST ask the user which quant to use after presenting options with sizes and trade-offs. The user may also tell you upfront which quant they want.
Phase 2: Download & organize (user action)
Tell the user:
- •The exact download command:
huggingface-cli download <repo> <file(s)> --local-dir models/<ModelName>/<QuantName>/ - •Where to place files following existing hierarchy (see
models/structure) - •For multi-part GGUF: all parts must be in the same directory
Wait for the user to confirm files are in place before continuing.
Phase 3: Create production profile (gpu-optimizer agent)
Use the gpu-optimizer agent to create the production profile in models.conf:
The agent will:
- •Read the model card
- •Follow
docs/gpu-strategy-guide.mddecision tree - •Check
docs/lessons_learned.mdfor pitfalls - •Study existing optimized profiles in
models.confas reference - •Calculate VRAM budget (model weights + KV cache at target context + compute buffers)
- •Select strategy (A/B/C/D) and generate
-otregex if needed - •Add production profile to
models.confwith:- •NAME, DESCRIPTION, SPEED (marked as
~estimated) - •MODEL path, CTX_SIZE, N_GPU_LAYERS=99, FIT=off
- •EXTRA_ARGS with
--jinja -np 1and placement flags - •Comments: architecture source, strategy rationale
- •Comment:
# NOT YET TESTED — run ./start.sh <id> and share startup logs
- •NAME, DESCRIPTION, SPEED (marked as
Phase 4: Find sampler settings (research in main conversation)
Research recommended sampler settings for this model:
- •Check the model card for official recommendations
- •Check Unsloth docs if applicable (UD quants)
- •Determine: temperature, top_p, top_k, min_p, repeat_penalty
- •Check if the model needs special system prompt handling (like GPT-OSS reasoning levels)
- •Add sampler flags to EXTRA_ARGS in the production profile
- •Add a new section to
docs/client-settings.mdwith settings and explanation
Phase 5: Test (user + gpu-optimizer agent)
- •Tell the user to run
./start.sh <new-profile-id> - •Ask the user to share the server startup log
- •Use the gpu-optimizer agent to analyze:
- •Did it load successfully? Any OOM?
- •VRAM usage per GPU — room for optimization?
- •Actual speed (t/s) — update SPEED field
- •Graph splits — acceptable?
- •Iterate if needed: adjust layer split, batch sizes, etc.
- •Remove
# NOT YET TESTEDcomment, add actual measured values
Phase 6: Create bench profile (optional, gpu-optimizer + benchmark agents)
Ask the user if they want a bench profile. If yes:
Use the gpu-optimizer agent to create a bench profile in models.conf:
- •Section ID:
[bench-<model-id>] - •CTX_SIZE=10240 (HumanEval worst case ~8.4K)
- •
-b 512 -ub 512(small prompts) - •More GPU layers than production (less KV cache = more room for weights)
- •
--reasoning-format noneif it's a thinking model (chain-of-thought must go into content field for evalplus) - •No sampler args (evalplus sends temperature=0)
Use the benchmark agent to update pipeline files:
- •
benchmarks/evalplus/bench-client.conf— add section if model needs a system prompt - •
benchmarks/evalplus/generate-report.py— add toDISPLAY_NAMESdict andREFERENCE_MAPdict (if official scores exist) - •
benchmarks/evalplus/README.md— add to profiles table
Note: benchmark.sh auto-discovers bench-* profiles from models.conf — no code changes needed there.
Phase 7: Run benchmark (optional, user action)
If the user wants to benchmark:
cd benchmarks/evalplus source .venv/bin/activate ./benchmark.sh bench-<model-id>
Results auto-generated in results/REPORT.md. Use the benchmark agent to help interpret results.
Phase 8: Update documentation (doc-keeper agent)
If the model card was in models/documentation/CANDIDATES/, move it to models/documentation/ now — the model has graduated from candidate to active.
Use the doc-keeper agent to verify and update all cross-references:
Checklist of files to verify:
- •
README.md— Models table, Candidate models table (remove if graduated), Repository Structure - •
AI_INSTRUCTIONS.md— project hierarchy (model card location, new directories) - •
ROADMAP.md— Current Status section - •
docs/client-settings.md— capability table, sampler table, per-model section - •
docs/gpu-strategy-guide.md— model reference table (if applicable) - •
.claude/agents/model-manager.md— directory structure - •
.claude/agents/gpu-optimizer.md— model reference table - •
benchmarks/evalplus/README.md— if bench profile was added
Phase summary
| Phase | Agent | What happens |
|---|---|---|
| 1. Evaluate | model-manager | Analyze model, advise on quant |
| 2. Download | (user) | User downloads and places files |
| 3. Profile | gpu-optimizer | Calculate VRAM, create models.conf entry |
| 4. Samplers | (main conversation) | Research and set sampler settings |
| 5. Test | gpu-optimizer | Analyze logs, iterate on placement |
| 6. Bench profile | gpu-optimizer + benchmark | Create bench config, update pipeline |
| 7. Benchmark | (user) + benchmark | Run EvalPlus, interpret results |
| 8. Docs | doc-keeper | Cross-reference audit, update all docs |