Tokenizer Design
When to Use
Design or modify tokenizers when training a model from scratch, adapting a model to a new domain/language with poor tokenization coverage, or optimizing inference efficiency via vocabulary tuning.
Algorithm Selection
Decision Table: Tokenizer Algorithm
| Algorithm | Library | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| BPE | tokenizers, tiktoken | Deterministic, widely adopted | Greedy merges can miss global optima | GPT-family, general LLMs |
| WordPiece | tokenizers | Likelihood-driven merges | Slower training than BPE | BERT-family models |
| Unigram | sentencepiece, tokenizers | Probabilistic, multiple segmentations | More complex implementation | Multilingual, T5/XLNet |
| SentencePiece (BPE) | sentencepiece | Language-agnostic, raw text input | Less control over pre-tokenization | Multilingual, non-space languages |
| Byte-level BPE | tokenizers, tiktoken | No UNK tokens, full coverage | Longer sequences for non-Latin scripts | GPT-2/3/4, Llama 3 |
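To make the "greedy merges" entry in the BPE row concrete, here is a minimal pure-Python sketch of BPE training on a toy word-frequency table. The function name `learn_bpe_merges` and the toy corpus are illustrative assumptions; real libraries add pre-tokenization, a byte-level alphabet, and far faster data structures.

```python
from collections import Counter


def learn_bpe_merges(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Greedily learn BPE merges from a {word: frequency} table (toy sketch)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Greedy step: commit to the single most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2))
# [('u', 'g'), ('h', 'ug')]
```

Each iteration commits to the locally best pair and never revisits it, which is why the table lists "can miss global optima" as the weakness.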
Decision Table: Vocabulary Size
| Vocab Size | Token Fertility | Training Cost | Best For |
|---|---|---|---|
| 8K-16K | High (more tokens/word) | Low | Small models, single language |
| 32K | Balanced | Medium | General monolingual (BERT ~30K, Llama 2 32K) |
| 50K-64K | Lower | Medium-High | Large monolingual LLMs (GPT-2/GPT-3 ~50K) |
| 100K-128K | Low | High | Multilingual models |
| 200K+ | Very low | Very high | Massive multilingual (Gemma) |
Training BPE from Scratch
Using HuggingFace tokenizers Library
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors


def train_bpe_tokenizer(
    files: list[str],
    vocab_size: int = 32000,
    min_frequency: int = 2,
    output_path: str = "tokenizer.json",
):
    """Train a byte-level BPE tokenizer from scratch."""
    tokenizer = Tokenizer(models.BPE())
    # Byte-level pre-tokenization (GPT-2 style)
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<|endoftext|>", "<|pad|>", "<|begin|>", "<|end|>"],
        show_progress=True,
        # Seed the vocabulary with all 256 byte symbols so no input maps to UNK
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train(files, trainer)
    tokenizer.save(output_path)
    return tokenizer


# Usage
tokenizer = train_bpe_tokenizer(
    files=["corpus_part1.txt", "corpus_part2.txt"],
    vocab_size=32000,
)

# Verify the round trip: decoding the IDs should reproduce the input text
encoded = tokenizer.encode("Hello, world! This is a test.")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
SentencePiece Training
import sentencepiece as spm


def train_sentencepiece(
    input_file: str,
    model_prefix: str = "sp_model",
    vocab_size: int = 32000,
    model_type: str = "unigram",  # or "bpe"
    character_coverage: float = 0.9995,
):
    """Train a SentencePiece model.

    character_coverage: 1.0 for small character sets (e.g. Latin-script
    languages), 0.9995 for rich character sets such as Chinese or Japanese.
    """
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        # Byte fallback for unknown characters
        byte_fallback=True,
        # Normalization
        normalization_rule_name="identity",  # or "nfkc"
        # Training efficiency: sample at most 5M sentences, shuffled
        input_sentence_size=5_000_000,
        shuffle_input_sentence=True,
        # Piece shape: cap piece length, split digits, allow whitespace-only pieces
        max_sentencepiece_length=16,
        split_digits=True,
        allow_whitespace_only_pieces=True,
    )
    sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
    return sp


# Usage
sp = train_sentencepiece("corpus.txt", vocab_size=32000, model_type="unigram")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world'] (pieces depend on the trained model)
print(sp.encode("Hello world", out_type=int))  # e.g. [1234, 5678] (IDs are illustrative)
Extending an Existing Tokenizer
Adding Domain-Specific Tokens
from transformers import AutoTokenizer


def extend_tokenizer(
    base_model: str,
    new_tokens: list[str],
    output_dir: str = "extended_tokenizer",
):
    """Extend a pretrained tokenizer with domain-specific tokens.

    Returns the tokenizer and number of tokens added (for model resize).
    After calling this, resize the model to match:
    model.resize_token_embeddings(len(tokenizer)).
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # Show how the base tokenizer currently splits a few candidates
    for token in new_tokens[:5]:
        encoded = tokenizer.tokenize(token)
        print(f"  '{token}' -> {encoded} ({len(encoded)} tokens)")
    # add_tokens skips tokens already in the vocabulary and returns the count added
    num_added = tokenizer.add_tokens(new_tokens)
    print(f"Added {num_added} tokens. New vocab size: {len(tokenizer)}")
    tokenizer.save_pretrained(output_dir)
    return tokenizer, num_added
Fertility Analysis
Measuring Tokenizer Efficiency
from collections import Counter

import numpy as np


def fertility_analysis(tokenizer, texts: list[str]):
    """Analyze tokenizer fertility (tokens per word) on a corpus.

    Lower fertility = more efficient tokenization.
    Expects a Hugging Face transformers tokenizer (encode returns a list of IDs).
    """
    total_tokens = 0
    total_words = 0
    token_lengths = []
    unk_count = 0
    token_freq = Counter()
    unk_token_id = getattr(tokenizer, "unk_token_id", None)
    for text in texts:
        # Words are approximated by whitespace splitting
        words = text.split()
        total_words += len(words)
        # Exclude BOS/EOS/CLS/SEP so they do not inflate the token count
        encoded = tokenizer.encode(text, add_special_tokens=False)
        total_tokens += len(encoded)
        token_lengths.append(len(encoded))
        if unk_token_id is not None:
            unk_count += encoded.count(unk_token_id)
        tokens = tokenizer.convert_ids_to_tokens(encoded)
        token_freq.update(tokens)
    fertility = total_tokens / max(total_words, 1)
    unk_rate = unk_count / max(total_tokens, 1)
    # Vocabulary entries that never appeared in this corpus
    unused_vocab = len(tokenizer) - len(token_freq)
    print(f"Fertility (tokens/word): {fertility:.2f}")
    print(f"UNK rate: {unk_rate:.4%}")
    print(f"Avg sequence length: {np.mean(token_lengths):.1f}")
    print(f"Unused vocab tokens: {unused_vocab} / {len(tokenizer)}")
    print(f"Top 20 tokens: {token_freq.most_common(20)}")
    return {
        "fertility": fertility,
        "unk_rate": unk_rate,
        "avg_seq_len": np.mean(token_lengths),
        "unused_vocab": unused_vocab,
    }
Gotchas and Anti-Patterns
Vocab Size vs. Performance Tradeoffs
A larger vocabulary shortens sequences (lower fertility) but enlarges the embedding table and the output softmax. For a 100K vocab model, the embedding matrix alone is ~400MB at float32 when the hidden size is about 1,024 (100,000 × 1,024 × 4 bytes ≈ 410MB), and the cost grows linearly with hidden size. Returns diminish above roughly 64K for monolingual English. Profile actual downstream task performance, not just fertility.
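The memory arithmetic behind that figure is easy to reproduce. The sketch below assumes a hidden size of 1,024, which is an illustrative value chosen because it matches the ~400MB figure; the helper name `embedding_table_bytes` is hypothetical.

```python
def embedding_table_bytes(vocab_size: int, hidden_size: int, bytes_per_param: int = 4) -> int:
    """Size of one vocab_size x hidden_size embedding matrix in bytes (float32 by default)."""
    return vocab_size * hidden_size * bytes_per_param


# hidden_size=1024 is an assumption for illustration, not a value from any specific model.
for vocab in (32_000, 64_000, 100_000, 256_000):
    megabytes = embedding_table_bytes(vocab, hidden_size=1024) / 1e6
    print(f"vocab={vocab:>7,}  embedding={megabytes:,.1f} MB")
```

If the output projection is not tied to the input embeddings, the same amount is paid a second time for the softmax layer.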
Special Token Handling
Special tokens must be consistent between the tokenizer and the model. Common failure: adding <tool_call> to the tokenizer but forgetting to call model.resize_token_embeddings(len(tokenizer)) and to update the matching IDs in the model config (pad_token_id, eos_token_id, and so on). Also: never give a special token an ID that falls outside the model's embedding matrix. Always verify with tokenizer.all_special_tokens after modification.
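One consistency check that is easy to automate: every special-token ID needs a row in the model's embedding matrix. The sketch below is library-free and the function name is hypothetical; with transformers, the mapping could plausibly be built with `tokenizer.convert_tokens_to_ids` and the row count read from `model.get_input_embeddings().weight.shape[0]`.

```python
def find_out_of_range_tokens(token_ids: dict[str, int], embedding_rows: int) -> dict[str, int]:
    """Return the tokens whose IDs have no corresponding embedding row."""
    return {
        token: idx
        for token, idx in token_ids.items()
        if not 0 <= idx < embedding_rows
    }


# Illustrative values: a 32,000-row model and a tokenizer that just gained one token.
special_ids = {"<|endoftext|>": 0, "<|pad|>": 1, "<tool_call>": 32000}
print(find_out_of_range_tokens(special_ids, embedding_rows=32000))
# {'<tool_call>': 32000}
```

A non-empty result means the model has not been resized to match the tokenizer yet.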
Tokenizer-Model Mismatch
Using a tokenizer other than the one the model was trained with causes silent performance degradation. The model's embedding layer learned one representation per ID in the original vocabulary, so a different tokenizer maps text to IDs whose embeddings mean something else. Swapping tokenizers without retraining embeddings is almost always wrong. When extending, fine-tune the model on domain data after resizing embeddings.
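As a rough picture of what resizing does to the embedding table, the sketch below appends rows for new tokens and initializes each to the mean of the existing rows. Mean initialization is a common heuristic and an assumption here, not something the text above prescribes; the function name is hypothetical and plain lists stand in for a real tensor.

```python
def extend_embedding_rows(embeddings: list[list[float]], num_added: int) -> list[list[float]]:
    """Append num_added rows, each initialized to the mean of the existing rows."""
    n_rows, dim = len(embeddings), len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / n_rows for j in range(dim)]
    # Copy rows so the original table is untouched and new rows are independent objects.
    return [list(row) for row in embeddings] + [list(mean_row) for _ in range(num_added)]


old_table = [[1.0, 0.0], [3.0, 4.0]]
new_table = extend_embedding_rows(old_table, num_added=2)
print(len(new_table))  # 4
print(new_table[2])    # [2.0, 2.0]
```

The new rows start from a neutral point but carry no learned meaning, which is why fine-tuning on domain data is still needed afterwards.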
Byte Fallback Behavior
SentencePiece byte_fallback=True encodes unknown characters as byte sequences (e.g., <0xE2><0x80><0x99> for a curly quote). This eliminates UNK tokens but inflates sequence length for out-of-distribution scripts. A CJK-heavy input through a Latin-trained tokenizer with byte fallback can produce 3-4x longer sequences than a properly trained multilingual tokenizer.
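The length inflation follows directly from UTF-8: one fallback piece is emitted per byte. The sketch below only mimics the `<0xNN>` piece format shown above using the standard library; it does not call SentencePiece, and the helper name is hypothetical.

```python
def byte_fallback_pieces(char: str) -> list[str]:
    """Spell a character as byte-fallback style pieces, one per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


print(byte_fallback_pieces("\u2019"))     # ['<0xE2>', '<0x80>', '<0x99>']  (right curly quote)
print(len(byte_fallback_pieces("\u8a9e")))  # 3  (a CJK character, U+8A9E)
print(len(byte_fallback_pieces("a")))     # 1
```

Most CJK characters occupy three bytes in UTF-8, so every character missing from the vocabulary costs three tokens instead of one.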
Pre-tokenization Leakage
Pre-tokenization rules (whitespace splitting, punctuation handling) bake in language assumptions. English-centric pre-tokenizers break on CJK (no spaces), agglutinative languages (Turkish, Finnish), or code (meaningful whitespace). SentencePiece avoids this by operating on raw text, but loses control over word boundary behavior.
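A small experiment shows the leakage. The regex below is a deliberately simplified, hypothetical stand-in for an English-centric pre-tokenizer (runs of word characters, plus single punctuation marks); real pre-tokenizers are more elaborate, but they fail in the same ways.

```python
import re


def whitespace_pretokenize(text: str) -> list[str]:
    """Split into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# English: boundaries land where a reader would expect.
print(whitespace_pretokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']

# Japanese: no spaces, so the whole sentence is one undivided chunk.
print(whitespace_pretokenize("私は東京に住む"))
# ['私は東京に住む']

# Code: the newline and indentation are discarded entirely.
print(whitespace_pretokenize("if x:\n    return 1"))
# ['if', 'x', ':', 'return', '1']
```

The subword model only ever sees what the pre-tokenizer hands it, so these boundary decisions cannot be undone later in training.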
Normalization Side Effects
NFKC normalization (SentencePiece's default rule, nmt_nfkc, is an NFKC variant) maps compatibility characters to their canonical equivalents. This silently converts characters in code (full-width plus to +), math symbols, and CJK variants. Use identity normalization for code-heavy or multilingual corpora where character preservation matters.
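The effect can be inspected before training with Python's standard `unicodedata` module. This sketch uses plain Unicode NFKC, which only approximates SentencePiece's own rule set, and the helper name `nfkc_changes` is hypothetical.

```python
import unicodedata


def nfkc_changes(text: str) -> list[tuple[str, str]]:
    """Return (original, normalized) pairs for each character that NFKC rewrites."""
    changes = []
    for char in text:
        normalized = unicodedata.normalize("NFKC", char)
        if normalized != char:
            changes.append((char, normalized))
    return changes


print(nfkc_changes("a\uff0bb"))  # [('＋', '+')]   full-width plus becomes ASCII plus
print(nfkc_changes("x\u00b2"))   # [('²', '2')]   superscript two becomes a plain digit
print(nfkc_changes("\ufb01le"))  # [('ﬁ', 'fi')]  ligature expands to two letters
```

Running a check like this over a corpus sample shows which characters a normalizing tokenizer would be unable to reproduce on decode.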
Training Data Bias in Vocabulary
The tokenizer's merge rules reflect the frequency distribution of its training corpus. A tokenizer trained on English Wikipedia produces poor subwords for medical text, legal documents, or code. As a rule of thumb, domain-specific tokenizer training or targeted vocabulary extension is necessary when fertility on domain text rises above 2.5x the baseline.
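The 2.5x rule can be turned into a quick check. The sketch below takes any callable that maps text to a list of tokens; `toy_tokenize`, which chops words into 4-character chunks, is a hypothetical stand-in for a real tokenizer and exists only to make the example runnable.

```python
from typing import Callable


def fertility(tokenize: Callable[[str], list[str]], texts: list[str]) -> float:
    """Tokens per whitespace-delimited word over a list of texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / max(total_words, 1)


def needs_domain_vocab(tokenize, baseline_texts, domain_texts, threshold: float = 2.5) -> bool:
    """True when domain fertility exceeds threshold x baseline fertility."""
    return fertility(tokenize, domain_texts) > threshold * fertility(tokenize, baseline_texts)


def toy_tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: split each word into chunks of at most 4 characters.
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


baseline = ["the cat sat on the mat"]
domain = ["electroencephalography immunohistochemistry"]
print(fertility(toy_tokenize, baseline))                   # 1.0
print(fertility(toy_tokenize, domain))                     # 5.5
print(needs_domain_vocab(toy_tokenize, baseline, domain))  # True
```

In practice the baseline should be text similar to what the tokenizer was trained on, so the ratio isolates the domain shift.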