Model Architect (From Scratch)

Use this skill to design a model architecture for pretraining from scratch. The output should be practical configs that fit the stated compute, memory, and quality targets.

Scale Templates

Provide architecture templates by parameter scale:

•100M class
•300M–700M class
•1B–3B class
•7B+ class

For each template, define reasonable defaults for:

•hidden size
•number of layers
•number of attention heads
•intermediate (FFN) size
•context length
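
As a rough illustration, the templates could be held in a small lookup table like the one below. Every number is an illustrative assumption rather than a tuned recommendation, and the names SCALE_TEMPLATES and head_dim are hypothetical.

```python
# Illustrative starting points only; every value here is an assumption to be tuned
# against the user's actual compute, memory, and quality targets.
SCALE_TEMPLATES = {
    "100M": {
        "hidden_size": 768, "num_layers": 12, "num_heads": 12,
        "ffn_size": 2048, "context_length": 2048,
    },
    "300M-700M": {
        "hidden_size": 1024, "num_layers": 24, "num_heads": 16,
        "ffn_size": 2816, "context_length": 2048,
    },
    "1B-3B": {
        "hidden_size": 2048, "num_layers": 24, "num_heads": 16,
        "ffn_size": 5632, "context_length": 4096,
    },
    "7B+": {
        "hidden_size": 4096, "num_layers": 32, "num_heads": 32,
        "ffn_size": 11008, "context_length": 4096,
    },
}


def head_dim(template):
    """Return the per-head dimension, rejecting templates where it is not an integer."""
    hidden, heads = template["hidden_size"], template["num_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden // heads


for name, template in SCALE_TEMPLATES.items():
    print(name, "head_dim =", head_dim(template))
```

The divisibility check is worth keeping in any real template table, since a hidden size that does not split evenly across heads usually fails only once the model is instantiated.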

Component Selections

Use modern decoder-only transformer defaults unless the user requests otherwise:

•GQA (Grouped Query Attention) for memory/throughput efficiency at scale.
•SwiGLU feed-forward activation.
•RoPE positional encoding.
•RMSNorm normalization.

When deviating from these defaults, explain why and state the expected tradeoffs.
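
To make the GQA tradeoff concrete, the sketch below compares inference-time KV cache size for full multi-head attention against a grouped variant. The function name kv_cache_bytes and the example shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, 16-bit cache) are assumptions chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Size of the key/value cache in bytes.

    The leading factor of 2 covers the separate K and V tensors;
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical 7B-class shape: 32 layers, 32 query heads, head_dim 128.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)

print(f"MHA cache: {mha / 1024**3:.2f} GiB")  # 2.00 GiB
print(f"GQA cache: {gqa / 1024**3:.2f} GiB")  # 0.50 GiB
```

Under these assumptions, cutting the key/value heads from 32 to 8 shrinks the cache by the same factor of four, which is the kind of tradeoff to report when choosing or deviating from the GQA default.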

Config Output

Emit a Hugging Face Transformers-compatible config.json that training scripts can load directly. Include all required architecture fields and the tokenizer's special token IDs.
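
A minimal sketch of such a config follows, assuming a Llama-style architecture so that the field names match LlamaConfig in Transformers. The sizes correspond to a hypothetical 100M-class design, and the vocabulary size and special token IDs are placeholders that must be replaced with the values from the actual tokenizer.

```python
import json

# Assumption: Llama-style field names; other model classes use different keys.
config = {
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "vocab_size": 32000,              # placeholder: must equal the tokenizer's vocab size
    "hidden_size": 768,
    "intermediate_size": 2048,        # SwiGLU FFN width
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "num_key_value_heads": 4,         # GQA: 3 query heads share each KV head
    "hidden_act": "silu",             # SiLU gate, which gives SwiGLU in the Llama MLP
    "max_position_embeddings": 2048,  # context length
    "rms_norm_eps": 1e-5,
    "rope_theta": 10000.0,
    "tie_word_embeddings": True,
    "torch_dtype": "bfloat16",
    "bos_token_id": 1,                # placeholder token IDs: read these from the tokenizer
    "eos_token_id": 2,
    "pad_token_id": 0,
}

# Basic shape checks before emitting the file.
assert config["hidden_size"] % config["num_attention_heads"] == 0
assert config["num_attention_heads"] % config["num_key_value_heads"] == 0

config_json = json.dumps(config, indent=2)
print(config_json)
```

Writing config_json to a file named config.json produces the deliverable; the two assertions catch the most common shape mistakes before a training script ever loads it.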

Parameter + Memory Calculator

Always provide estimates for:

•total parameter count (with per-block breakdown)
•optimizer state memory
•activation memory (approximated from batch size and sequence length)
•checkpoint size (fp16/bf16 and optional fp32)

Report both training-time and inference-time memory expectations.
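
The estimates above can be sketched as a small calculator. This version assumes a Llama-style block (GQA attention without biases, SwiGLU MLP, two RMSNorm weights per block), bf16 weights and gradients, and AdamW with fp32 master weights plus two fp32 moments. The activation term uses the commonly cited per-layer approximation s*b*h*(34 + 5*a*s/h) bytes for a standard 16-bit GPT block without recomputation or fused attention, so it is only a coarse proxy for this architecture. The function names count_params and estimate_memory are hypothetical.

```python
def count_params(vocab_size, hidden, layers, heads, kv_heads, ffn, tie_embeddings=True):
    """Parameter count with a per-block breakdown (no bias terms assumed)."""
    head_dim = hidden // heads
    attn = 2 * hidden * hidden                # q_proj and o_proj
    attn += 2 * hidden * kv_heads * head_dim  # k_proj and v_proj (smaller under GQA)
    mlp = 3 * hidden * ffn                    # gate, up, and down projections (SwiGLU)
    norms = 2 * hidden                        # two RMSNorm weight vectors per block
    per_block = attn + mlp + norms
    embedding = vocab_size * hidden
    lm_head = 0 if tie_embeddings else vocab_size * hidden
    total = embedding + layers * per_block + hidden + lm_head  # + final RMSNorm
    return {
        "embedding": embedding,
        "attention_per_block": attn,
        "mlp_per_block": mlp,
        "norms_per_block": norms,
        "per_block": per_block,
        "lm_head": lm_head,
        "total": total,
    }


def estimate_memory(total_params, layers, hidden, heads, batch_size, seq_len):
    """Rough memory estimates in bytes under the assumptions stated above."""
    weights_bf16 = 2 * total_params
    grads_bf16 = 2 * total_params
    optimizer_adamw = 12 * total_params  # fp32 master weights + two fp32 moments
    # Coarse activation estimate: no recomputation, no fused attention kernel.
    activations = int(layers * seq_len * batch_size * hidden
                      * (34 + 5 * heads * seq_len / hidden))
    return {
        "checkpoint_bf16": 2 * total_params,
        "checkpoint_fp32": 4 * total_params,
        "optimizer_adamw": optimizer_adamw,
        "activations": activations,
        "training_total": weights_bf16 + grads_bf16 + optimizer_adamw + activations,
        "inference_weights": weights_bf16,  # KV cache must be added separately
    }


# Hypothetical 100M-class design.
params = count_params(vocab_size=32000, hidden=768, layers=12, heads=12, kv_heads=4, ffn=2048)
memory = estimate_memory(params["total"], layers=12, hidden=768, heads=12,
                         batch_size=8, seq_len=2048)

print(f"total params: {params['total']:,}")  # 100,092,672
for key, value in memory.items():
    print(f"{key}: {value / 1024**3:.2f} GiB")
```

Treat the activation figure as an upper-end guide only: activation checkpointing and fused attention kernels can reduce it substantially, and the per-parameter byte counts change with the optimizer and precision actually used.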

Decision Protocol

•Start with target quality, latency, and budget constraints.
•Produce at least two candidate designs when tradeoffs are non-trivial.
•Recommend one primary architecture with rationale.
•Highlight a scaling path (for example 100M -> 1B -> 7B) for future expansion.

Deliverables

•config.json (HF-compatible)
•architecture_report.md (tradeoffs + estimates)
•capacity_plan.md (GPU/memory implications)