Model Architect (From Scratch)
Use this skill to design a model architecture for pretraining from scratch. It should output practical configs that match compute, memory, and quality goals.
Scale Templates
Provide architecture templates by parameter scale:
- •100M class
- •300M–700M class
- •1B–3B class
- •7B+ class
For each template, define reasonable defaults for:
- •hidden size
- •number of layers
- •number of attention heads
- •intermediate (FFN) size
- •context length
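As a starting point, the per-scale defaults can be held in a small lookup table. The sketch below is illustrative only: the `ScaleTemplate` class, the `TEMPLATES` name, and every numeric value are assumptions chosen to land roughly in each parameter class (FFN sizes follow the common SwiGLU rule of about 8/3 times the hidden size, rounded up to a multiple of 256), not values prescribed by this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleTemplate:
    """Hypothetical container for the five defaults listed above."""
    hidden_size: int
    num_layers: int
    num_heads: int
    ffn_size: int
    context_length: int

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads


# Assumed, illustrative starting points; tune against the actual budget.
TEMPLATES = {
    "100M":      ScaleTemplate(768, 12, 12, 2048, 2048),
    "300M-700M": ScaleTemplate(1024, 24, 16, 2816, 2048),
    "1B-3B":     ScaleTemplate(2048, 24, 16, 5632, 4096),
    "7B+":       ScaleTemplate(4096, 32, 32, 11008, 4096),
}


def validate(template: ScaleTemplate) -> None:
    """Reject shapes that cannot be split evenly across attention heads."""
    if template.hidden_size % template.num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    if template.head_dim % 2 != 0:
        # RoPE rotates pairs of dimensions, so head_dim must be even.
        raise ValueError("head_dim must be even for RoPE")


for name, template in TEMPLATES.items():
    validate(template)
```

Treat these rows as defaults to adjust, for example by trading layers for hidden size when latency matters more than depth.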
Component Selections
Use modern decoder-only transformer defaults unless the user requests otherwise:
- •GQA (Grouped Query Attention) for memory/throughput efficiency at scale.
- •SwiGLU feed-forward activation.
- •RoPE positional encoding.
- •RMSNorm normalization.
When deviating from these defaults, explain why and state the expected tradeoffs.
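To make the GQA tradeoff concrete, a rough KV-cache calculation can compare full multi-head attention with grouped-query attention. This is a minimal sketch: the function name `kv_cache_bytes` and the 7B-style shape (32 layers, head dimension 128, 4096-token context, 16-bit cache) are assumptions for illustration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed 7B-style shape; only the number of KV heads differs.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA cache: {mha_bytes / 1024**3:.2f} GiB")
print(f"GQA cache: {gqa_bytes / 1024**3:.2f} GiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks in proportion to the ratio of query heads to KV heads, which is the memory benefit to weigh against any quality impact when choosing the group count.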
Config Output
Emit a Hugging Face Transformers-compatible config.json suitable for training scripts.
Include all required architecture fields and tokenizer special token ids.
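One way to emit the config is to build a plain dictionary and serialize it. The sketch below assumes a Llama-style layout, since that architecture matches the GQA, SwiGLU, RoPE, and RMSNorm defaults; the field names follow the Llama config as commonly published, but they should be checked against the installed Transformers version. The helper name `build_config`, the vocabulary size, and the special token ids are hypothetical and must come from the actual tokenizer.

```python
import json


def build_config(vocab_size, hidden_size, num_layers, num_heads, num_kv_heads,
                 ffn_size, context_length, bos_token_id, eos_token_id,
                 pad_token_id=None, tie_word_embeddings=False):
    """Hypothetical helper returning a Llama-style config dictionary."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    return {
        "architectures": ["LlamaForCausalLM"],
        "model_type": "llama",
        "vocab_size": vocab_size,
        "hidden_size": hidden_size,
        "intermediate_size": ffn_size,
        "num_hidden_layers": num_layers,
        "num_attention_heads": num_heads,
        "num_key_value_heads": num_kv_heads,   # fewer than heads => GQA
        "max_position_embeddings": context_length,
        "hidden_act": "silu",                  # gate activation used by SwiGLU
        "rms_norm_eps": 1e-5,                  # assumed default
        "rope_theta": 10000.0,                 # assumed default
        "tie_word_embeddings": tie_word_embeddings,
        "bos_token_id": bos_token_id,
        "eos_token_id": eos_token_id,
        "pad_token_id": pad_token_id,
    }


# Placeholder values: vocab size and token ids depend on the tokenizer.
config = build_config(vocab_size=32000, hidden_size=2048, num_layers=24,
                      num_heads=16, num_kv_heads=8, ffn_size=5632,
                      context_length=4096, bos_token_id=1, eos_token_id=2)
config_json = json.dumps(config, indent=2)
print(config_json)  # write this string to config.json for the training script
```

Validating divisibility before serialization catches shape errors that would otherwise surface only when the training script instantiates the model.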
Parameter + Memory Calculator
Always provide estimates for:
- •total parameter count (with per-block breakdown)
- •optimizer state memory
- •activation memory (approximated from batch size and sequence length)
- •checkpoint size (fp16/bf16 and optional fp32)
Report both training-time and inference-time memory expectations.
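The estimates above can be sketched as a small calculator. Everything here is an approximation under stated assumptions: a Llama-style block without biases, AdamW in mixed precision (16-bit weights and gradients plus fp32 master weights and two fp32 moment buffers, about 16 bytes per parameter), and a coarse activation formula that was derived for GPT-style blocks without recomputation and is only a rough guide for SwiGLU models. The names `ArchSpec`, `param_breakdown`, and `memory_estimate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchSpec:
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    ffn_size: int
    tie_embeddings: bool = False


def param_breakdown(a: ArchSpec) -> dict:
    """Parameter counts for a Llama-style decoder (no bias terms assumed)."""
    kv_dim = a.num_kv_heads * (a.hidden_size // a.num_heads)
    attn = 2 * a.hidden_size * a.hidden_size + 2 * a.hidden_size * kv_dim  # q,o + k,v
    ffn = 3 * a.hidden_size * a.ffn_size        # gate, up, down projections
    norms = 2 * a.hidden_size                   # two RMSNorm weights per block
    embed = a.vocab_size * a.hidden_size
    lm_head = 0 if a.tie_embeddings else embed
    per_block = attn + ffn + norms
    total = a.num_layers * per_block + embed + lm_head + a.hidden_size  # + final norm
    return {"attention": attn, "ffn": ffn, "norms": norms, "per_block": per_block,
            "embedding": embed, "lm_head": lm_head, "total": total}


def memory_estimate(a: ArchSpec, batch_size: int, seq_len: int,
                    store_attention_scores: bool = False) -> dict:
    """Byte estimates; treat every figure as a planning number, not a guarantee."""
    n = param_breakdown(a)["total"]
    tokens_x_hidden = batch_size * seq_len * a.hidden_size
    # Coarse per-layer activation bytes for 16-bit training without recomputation.
    # The quadratic term is dropped when the attention kernel does not store scores.
    quad = 5 * a.num_heads * seq_len / a.hidden_size if store_attention_scores else 0
    activations = int(a.num_layers * tokens_x_hidden * (34 + quad))
    return {
        "checkpoint_16bit": 2 * n,
        "checkpoint_fp32": 4 * n,
        "optimizer_state": 12 * n,              # fp32 master weights + Adam m and v
        "train_weights_grads_optimizer": 16 * n,
        "activations": activations,
        "inference_weights_16bit": 2 * n,       # KV cache must be added separately
    }


# Assumed 7B-style shape with full multi-head attention and a 32k vocabulary.
spec = ArchSpec(vocab_size=32000, hidden_size=4096, num_layers=32,
                num_heads=32, num_kv_heads=32, ffn_size=11008)
counts = param_breakdown(spec)
memory = memory_estimate(spec, batch_size=1, seq_len=4096)
print(f"total params: {counts['total']:,}")
for key, value in memory.items():
    print(f"{key}: {value / 1024**3:.1f} GiB")
```

For the assumed shape the total comes to 6,738,415,616 parameters. Training-time memory is dominated by the optimizer state and activations, while inference-time memory is roughly the 16-bit weights plus the KV cache.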
Decision Protocol
- •Start with target quality, latency, and budget constraints.
- •Produce at least two candidate designs when tradeoffs are non-trivial.
- •Recommend one primary architecture with rationale.
- •Highlight a scaling path (for example 100M -> 1B -> 7B) for future expansion.
Deliverables
- •config.json (HF-compatible)
- •architecture_report.md (tradeoffs + estimates)
- •capacity_plan.md (GPU/memory implications)