AgentSkillsCN

tokenizer-design

训练 BPE、WordPiece、SentencePiece 以及 Unigram 分词器,优化词汇表,拓展领域覆盖范围,同时支持多语言设计。

SKILL.md
--- frontmatter
name: tokenizer-design
description: BPE, WordPiece, SentencePiece, and Unigram tokenizer training, vocabulary optimization, domain extension, and multilingual design.

Tokenizer Design

When to Use

Design or modify tokenizers when training a model from scratch, adapting a model to a new domain/language with poor tokenization coverage, or optimizing inference efficiency via vocabulary tuning.

Algorithm Selection

Decision Table: Tokenizer Algorithm

AlgorithmLibraryStrengthsWeaknessesBest For
BPEtokenizers, tiktokenDeterministic, widely adoptedGreedy merges can miss global optimaGPT-family, general LLMs
WordPiecetokenizersLikelihood-driven mergesSlower training than BPEBERT-family models
UnigramSentencePieceProbabilistic, multiple segmentationsMore complex implementationMultilingual, T5/XLNet
SentencePiece (BPE)sentencepieceLanguage-agnostic, raw text inputLess control over pre-tokenizationMultilingual, non-space languages
Byte-level BPEtokenizersNo UNK tokens, full coverageLonger sequences for non-Latin scriptsGPT-2/3/4, Llama

Decision Table: Vocabulary Size

Vocab SizeToken FertilityTraining CostBest For
8K-16KHigh (more tokens/word)LowSmall models, single language
32KBalancedMediumGeneral monolingual (BERT, GPT-2)
50K-64KLowerMedium-HighLarge monolingual LLMs
100K-128KLowHighMultilingual models
200K+Very lowVery highMassive multilingual (Gemma)

Training BPE from Scratch

Using HuggingFace tokenizers Library

python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors

def train_bpe_tokenizer(
    files: list[str],
    vocab_size: int = 32000,
    min_frequency: int = 2,
    output_path: str = "tokenizer.json",
):
    """Train a byte-level BPE tokenizer from scratch."""
    tokenizer = Tokenizer(models.BPE())

    # Byte-level pre-tokenization (GPT-2 style)
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<|endoftext|>", "<|pad|>", "<|begin|>", "<|end|>"],
        show_progress=True,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )

    tokenizer.train(files, trainer)
    tokenizer.save(output_path)
    return tokenizer


# Usage
tokenizer = train_bpe_tokenizer(
    files=["corpus_part1.txt", "corpus_part2.txt"],
    vocab_size=32000,
)

# Verify
encoded = tokenizer.encode("Hello, world\! This is a test.")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")

SentencePiece Training

python
import sentencepiece as spm


def train_sentencepiece(
    input_file: str,
    model_prefix: str = "sp_model",
    vocab_size: int = 32000,
    model_type: str = "unigram",  # or "bpe"
    character_coverage: float = 0.9995,
):
    """Train a SentencePiece model.

    character_coverage: 1.0 for Latin-heavy, 0.9995 for multilingual.
    """
    spm.SentencePieceTrainer.train(
        input=input_file,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type=model_type,
        character_coverage=character_coverage,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        # Byte fallback for unknown characters
        byte_fallback=True,
        # Normalization
        normalization_rule_name="identity",  # or "nfkc"
        # Training efficiency
        input_sentence_size=5_000_000,
        shuffle_input_sentence=True,
        # Rare word handling
        max_sentencepiece_length=16,
        split_digits=True,
        allow_whitespace_only_pieces=True,
    )

    sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
    return sp


# Usage
sp = train_sentencepiece("corpus.txt", vocab_size=32000, model_type="unigram")
print(sp.encode("Hello world", out_type=str))   # ['_Hello', '_world']
print(sp.encode("Hello world", out_type=int))   # [1234, 5678]

Extending an Existing Tokenizer

Adding Domain-Specific Tokens

python
from transformers import AutoTokenizer


def extend_tokenizer(
    base_model: str,
    new_tokens: list[str],
    output_dir: str = "extended_tokenizer",
):
    """Extend a pretrained tokenizer with domain-specific tokens.

    Returns the tokenizer and number of tokens added (for model resize).
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model)

    # Check fertility before adding
    for token in new_tokens[:5]:
        encoded = tokenizer.tokenize(token)
        print(f"  '{token}' -> {encoded} ({len(encoded)} tokens)")

    num_added = tokenizer.add_tokens(new_tokens)
    print(f"Added {num_added} tokens. New vocab size: {len(tokenizer)}")

    tokenizer.save_pretrained(output_dir)
    return tokenizer, num_added

Fertility Analysis

Measuring Tokenizer Efficiency

python
from collections import Counter
import numpy as np


def fertility_analysis(tokenizer, texts: list[str]):
    """Analyze tokenizer fertility (tokens per word) on a corpus.

    Lower fertility = more efficient tokenization.
    """
    total_tokens = 0
    total_words = 0
    token_lengths = []
    unk_count = 0
    token_freq = Counter()

    unk_token_id = getattr(tokenizer, "unk_token_id", None)

    for text in texts:
        words = text.split()
        total_words += len(words)

        encoded = tokenizer.encode(text)
        total_tokens += len(encoded)
        token_lengths.append(len(encoded))

        if unk_token_id is not None:
            unk_count += encoded.count(unk_token_id)

        tokens = tokenizer.convert_ids_to_tokens(encoded)
        token_freq.update(tokens)

    fertility = total_tokens / max(total_words, 1)
    unk_rate = unk_count / max(total_tokens, 1)
    unused_vocab = len(tokenizer) - len(token_freq)

    print(f"Fertility (tokens/word): {fertility:.2f}")
    print(f"UNK rate: {unk_rate:.4%}")
    print(f"Avg sequence length: {np.mean(token_lengths):.1f}")
    print(f"Unused vocab tokens: {unused_vocab} / {len(tokenizer)}")
    print(f"Top 20 tokens: {token_freq.most_common(20)}")

    return {
        "fertility": fertility,
        "unk_rate": unk_rate,
        "avg_seq_len": np.mean(token_lengths),
        "unused_vocab": unused_vocab,
    }

Gotchas and Anti-Patterns

Vocab Size vs. Performance Tradeoffs

Larger vocab reduces sequence length (lower fertility) but increases embedding table size and softmax computation. For a 100K vocab model, the embedding matrix alone is ~400MB at float32. Diminishing returns above 64K for monolingual English. Profile actual downstream task performance, not just fertility.

Special Token Handling

Special tokens must be consistent between tokenizer and model config. Common failure: adding <tool_call> to the tokenizer but forgetting to update model.config.special_tokens_map. Also: never place special tokens in positions that the model's positional encoding doesn't cover. Always verify with tokenizer.all_special_tokens after modification.

Tokenizer-Model Mismatch

Using a tokenizer trained on a different corpus than the model causes silent performance degradation. The model's embedding layer learned representations for the original tokenizer's vocabulary distribution. Swapping tokenizers without retraining embeddings is almost always wrong. When extending, fine-tune the model on domain data after resizing embeddings.

Byte Fallback Behavior

SentencePiece byte_fallback=True encodes unknown characters as byte sequences (e.g., <0xE2><0x80><0x99> for a curly quote). This eliminates UNK tokens but inflates sequence length for out-of-distribution scripts. A CJK-heavy input through a Latin-trained tokenizer with byte fallback can produce 3-4x longer sequences than a properly trained multilingual tokenizer.

Pre-tokenization Leakage

Pre-tokenization rules (whitespace splitting, punctuation handling) bake in language assumptions. English-centric pre-tokenizers break on CJK (no spaces), agglutinative languages (Turkish, Finnish), or code (meaningful whitespace). SentencePiece avoids this by operating on raw text, but loses control over word boundary behavior.

Normalization Side Effects

NFKC normalization (SentencePiece default) maps visually similar characters to canonical forms. This silently converts characters in code (full-width plus to +), math symbols, and CJK variants. Use identity normalization for code-heavy or multilingual corpora where character preservation matters.

Training Data Bias in Vocabulary

The tokenizer's merge rules reflect training corpus frequency distribution. A tokenizer trained on English Wikipedia produces poor subwords for medical text, legal documents, or code. Domain-specific tokenizer training or targeted vocabulary extension is necessary when fertility spikes above 2.5x the baseline on domain text.