Tokenizer Trainer (BPE)
Use this skill to train a tokenizer for new model pretraining. Tokenizer quality is treated as a first-class performance lever, not an afterthought.
Defaults
- •Preferred algorithms: BPE via Hugging Face tokenizers or SentencePiece BPE.
- •Vocabulary size must be user-configurable in the 32K to 50K range.
- •Train on cleaned corpus samples representative of final pretraining distribution.
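
One way to enforce the vocabulary-size default is a small config object that rejects out-of-range values early. This is a minimal sketch, not a prescribed interface: the class name `TokenizerTrainConfig`, its fields, and the reading of "32K to 50K" as 32,000 to 50,000 inclusive are all assumptions. Adjust the bounds if the intended limits are, for example, 32,768 or 50,257.

```python
from dataclasses import dataclass

# Assumed interpretation of the "32K to 50K" range (inclusive bounds).
MIN_VOCAB_SIZE = 32_000
MAX_VOCAB_SIZE = 50_000


@dataclass(frozen=True)
class TokenizerTrainConfig:
    """Hypothetical config holder; field names are illustrative only."""

    vocab_size: int = MIN_VOCAB_SIZE
    algorithm: str = "bpe"

    def __post_init__(self) -> None:
        # Fail fast so a bad value never reaches a long training run.
        if not MIN_VOCAB_SIZE <= self.vocab_size <= MAX_VOCAB_SIZE:
            raise ValueError(
                f"vocab_size={self.vocab_size} is outside "
                f"[{MIN_VOCAB_SIZE}, {MAX_VOCAB_SIZE}]"
            )
        if self.algorithm != "bpe":
            raise ValueError(f"unsupported algorithm: {self.algorithm!r}")
```

For example, `TokenizerTrainConfig(vocab_size=40_000)` constructs normally, while `TokenizerTrainConfig(vocab_size=8_000)` raises `ValueError`.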
Required Special Tokens
Always include these special tokens:
- •<BOS>
- •<EOS>
- •<PAD>
- •<UNK>
Keep token IDs stable once model training starts to avoid a checkpoint/tokenizer mismatch.
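
A simple guard for ID stability is to compare the special-token IDs of the frozen vocabulary against any retrained or re-exported one. The sketch below assumes both vocabularies are available as plain token-to-ID dictionaries; the function name `special_token_drift` is hypothetical.

```python
from typing import Dict, Optional, Tuple

SPECIAL_TOKENS = ["<BOS>", "<EOS>", "<PAD>", "<UNK>"]


def special_token_drift(
    frozen_vocab: Dict[str, int],
    candidate_vocab: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {token: (frozen_id, candidate_id)} for every special token
    that is missing from either vocabulary or whose ID changed."""
    drift = {}
    for token in SPECIAL_TOKENS:
        old_id = frozen_vocab.get(token)
        new_id = candidate_vocab.get(token)
        if old_id is None or old_id != new_id:
            drift[token] = (old_id, new_id)
    return drift
```

An empty result means the special-token IDs match. Any non-empty result should block the candidate from being used with existing checkpoints.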
Training Procedure
- •Sample training text from balanced cleaned corpus.
- •Normalize text consistently with downstream data loader assumptions.
- •Train candidate tokenizer(s) with multiple vocab sizes when needed.
- •Compare compression efficiency, OOV behavior, and domain fragmentation.
- •Select final tokenizer and freeze artifacts for training.
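
The candidate-training step above can be sketched with the Hugging Face tokenizers library. Several choices here are assumptions rather than requirements of this skill: byte-level pre-tokenization, the function name `train_bpe_candidates`, and passing the corpus as an in-memory list. The library is imported inside the function only so the snippet loads even where tokenizers is not installed.

```python
from typing import Any, Dict, Iterable, List

SPECIAL_TOKENS = ["<BOS>", "<EOS>", "<PAD>", "<UNK>"]


def train_bpe_candidates(
    texts: List[str], vocab_sizes: Iterable[int]
) -> Dict[int, Any]:
    """Train one BPE tokenizer per requested vocab size (hypothetical helper)."""
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    candidates = {}
    for vocab_size in vocab_sizes:
        tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
        # Assumption: byte-level BPE, so any input string can be encoded.
        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
        tokenizer.decoder = decoders.ByteLevel()
        trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            # Listed first so the special tokens receive the lowest IDs.
            special_tokens=SPECIAL_TOKENS,
            initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        )
        tokenizer.train_from_iterator(texts, trainer=trainer)
        candidates[vocab_size] = tokenizer
    return candidates
```

A call such as `train_bpe_candidates(sample_texts, [32_000, 40_000, 50_000])` returns one tokenizer per size, where `sample_texts` stands in for the cleaned, normalized corpus sample. Each candidate can then be written out with `tokenizer.save("tokenizer.json")` once a winner is chosen.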
Evaluation Checklist
- •Token-per-byte ratio and sequence length distribution.
- •Domain-specific tokenization quality (e.g., code, math, biomedical terms).
- •Unknown token pressure and fragmentation hotspots.
- •Compatibility with model context window and training throughput goals.
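
Several checklist items reduce to simple counts over a held-out sample. The following is a rough sketch assuming the tokenizer is exposed as an `encode` callable that maps a string to a sequence of token IDs; the function name, the returned keys, and the nearest-rank 95th percentile are illustrative choices.

```python
import math
from typing import Callable, Dict, List, Sequence


def evaluate_tokenizer(
    encode: Callable[[str], Sequence[int]],
    texts: List[str],
    unk_id: int,
) -> Dict[str, float]:
    """Compute compression, unknown-token, and length statistics."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = []
    total_bytes = 0
    unk_count = 0
    for text in texts:
        ids = encode(text)
        lengths.append(len(ids))
        total_bytes += len(text.encode("utf-8"))
        unk_count += sum(1 for token_id in ids if token_id == unk_id)
    total_tokens = sum(lengths)
    if total_bytes == 0 or total_tokens == 0:
        raise ValueError("sample produced no bytes or no tokens")
    lengths.sort()
    # Nearest-rank percentile: smallest length covering 95% of samples.
    p95_rank = max(1, math.ceil(0.95 * len(lengths)))
    return {
        "tokens_per_byte": total_tokens / total_bytes,
        "unk_rate": unk_count / total_tokens,
        "mean_len": total_tokens / len(lengths),
        "p95_len": float(lengths[p95_rank - 1]),
        "max_len": float(lengths[-1]),
    }
```

With a Hugging Face tokenizers object `tok`, the callable could be `lambda s: tok.encode(s).ids` and `unk_id` could come from `tok.token_to_id("<UNK>")`. Running the function separately on each domain slice (code, math, biomedical text) gives a per-domain view of fragmentation.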
Output Format (HF-Compatible)
Export artifacts in Hugging Face-compatible layout:
- •tokenizer.json
- •tokenizer_config.json
- •special_tokens_map.json
- •vocab.json + merges.txt (for BPE where applicable)
Optionally include a small validation report (tokenizer_report.md) with metrics and chosen settings.
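
Before freezing the artifacts, it can help to confirm that the export directory actually contains the expected files. This is a minimal sketch; the helper name `missing_artifacts` and the `require_bpe_files` flag are hypothetical.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]
# Only expected when the BPE vocab and merges are exported separately.
BPE_FILES = ["vocab.json", "merges.txt"]


def missing_artifacts(export_dir: str, require_bpe_files: bool = True) -> List[str]:
    """Return the names of expected artifact files absent from export_dir."""
    root = Path(export_dir)
    expected = REQUIRED_FILES + (BPE_FILES if require_bpe_files else [])
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means the layout is complete. The check covers file presence only, so loading the directory with transformers remains the real compatibility test.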
Tools/Libraries
- •Hugging Face tokenizers
- •sentencepiece (BPE mode)
- •transformers (for compatibility checks)