xserv/docs/07-tokenizer.md
Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00


Phase 7: BPE Tokenizer — Design Document

Goal

Implement a Byte-Pair Encoding tokenizer from scratch, compatible with the HuggingFace tokenizer.json format. Supports GPT-2 and Qwen3.

Crate: xserv-tokenizer

crates/xserv-tokenizer/src/
├── lib.rs
├── bpe.rs          # BPE encode/decode core algorithm
└── chat.rs         # Chat template formatting

Dependencies

  • serde + serde_json: parse tokenizer.json
  • regex: pre-tokenization patterns

BPE Algorithm

Encode

  1. Pre-tokenize: split text by regex (GPT-2 pattern)
  2. Each word → byte sequence → initial token list (one token per byte)
  3. Repeatedly merge the highest-priority (lowest-rank) adjacent pair until no mergeable pair remains
  4. Map merged tokens to IDs via vocab
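
The encode steps for a single pre-tokenized word can be sketched as follows. This is a minimal illustration, not the crate's actual code: the `Ranks` alias and the `bpe_merge` name are hypothetical, and the rank table is keyed by byte strings rather than token IDs to keep the example self-contained. It relies only on the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical rank table: (left, right) byte strings -> merge priority.
/// A lower rank means the merge was learned earlier and is applied first.
type Ranks = HashMap<(Vec<u8>, Vec<u8>), usize>;

/// Apply BPE merges to one word, starting from one token per byte.
fn bpe_merge(word: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    let mut tokens: Vec<Vec<u8>> = word.iter().map(|&b| vec![b]).collect();
    loop {
        // Scan every adjacent pair and remember the leftmost one with the lowest rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index of left token)
        for i in 0..tokens.len().saturating_sub(1) {
            // Cloning per lookup is wasteful; a real implementation would key on token IDs.
            let key = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(best_rank, _)| rank < best_rank) {
                    best = Some((rank, i));
                }
            }
        }
        // Stop when no adjacent pair appears in the merge table.
        let Some((_, i)) = best else { break };
        let right = tokens.remove(i + 1);
        tokens[i].extend_from_slice(&right);
    }
    tokens
}
```

With merges `(l, l)`, `(h, e)`, `(he, ll)` at ranks 0, 1, 2, the word `hello` becomes the two tokens `hell` and `o`. Step 4 then looks each resulting byte string up in the vocab.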

Decode

Token IDs → lookup vocab → concatenate bytes → UTF-8 decode
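
A minimal sketch of that decode path, assuming the ID-to-bytes table already holds raw bytes (that is, the GPT-2 byte-to-unicode mapping was reversed when the vocab was loaded). The `decode` function name and the choice to skip unknown IDs are assumptions made for illustration.

```rust
/// Decode token IDs by concatenating their byte strings, then reading the
/// result as UTF-8. Unknown IDs are skipped here (an assumption; an error
/// would be an equally valid choice).
fn decode(ids: &[u32], vocab_rev: &[Vec<u8>]) -> String {
    let mut bytes = Vec::new();
    for &id in ids {
        if let Some(token) = vocab_rev.get(id as usize) {
            bytes.extend_from_slice(token);
        }
    }
    // A token boundary can fall inside a multi-byte character, so UTF-8
    // decoding happens only after concatenation. Invalid sequences are
    // replaced with U+FFFD instead of failing.
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Decoding each token separately would corrupt characters whose bytes are split across tokens, which is why the concatenation comes first.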

Key Data Structures

use std::collections::HashMap;
use regex::Regex;

pub struct Tokenizer {
    vocab: HashMap<Vec<u8>, u32>,        // token bytes → ID
    vocab_rev: Vec<Vec<u8>>,             // ID → token bytes
    merges: Vec<(Vec<u8>, Vec<u8>)>,     // ordered merge rules
    merge_ranks: HashMap<(u32, u32), usize>,  // (id_a, id_b) → priority
    special_tokens: HashMap<String, u32>,
    pre_tokenize_regex: Regex,
}
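
One plausible way to derive `merge_ranks` from `vocab` and `merges` is sketched below. The `build_merge_ranks` helper is hypothetical, and silently skipping merge rules whose tokens are missing from the vocab is an assumption; the real loader may treat that as an error.

```rust
use std::collections::HashMap;

/// Build the (id_a, id_b) -> priority table from the ordered merge list.
/// The position of a rule in `merges` is its rank (lower = higher priority).
/// Rules that mention a token absent from `vocab` are skipped (assumption).
fn build_merge_ranks(
    vocab: &HashMap<Vec<u8>, u32>,
    merges: &[(Vec<u8>, Vec<u8>)],
) -> HashMap<(u32, u32), usize> {
    merges
        .iter()
        .enumerate()
        .filter_map(|(rank, (a, b))| {
            let id_a = *vocab.get(a)?;
            let id_b = *vocab.get(b)?;
            Some(((id_a, id_b), rank))
        })
        .collect()
}
```

Keying the table on ID pairs means the merge loop can look up a candidate pair without allocating or hashing byte strings.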

Test Plan

  • Encode + decode roundtrip verified (GPT-2 tokenizer, English text)
  • Special tokens handled (<|endoftext|>)
  • Integrated into GPT-2 inference pipeline, generates coherent text

Takeaways

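  1. GPT-2 byte-to-unicode mapping: in GPT-2's vocab, every byte is mapped to a Unicode character. Printable ASCII (0x21-0x7E) maps to itself, as do the Latin-1 ranges 0xA1-0xAC and 0xAE-0xFF. The remaining 68 bytes (space, control characters, etc.) map to code points from U+0100 upward, so the space byte becomes U+0120 (Ġ). Decoding must apply the reverse mapping. This table is critical to the correctness of the BPE tokenizer.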

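  2. Rust regex has no lookahead: GPT-2's pre-tokenization regex uses a (?!\S) negative lookahead, which Rust's regex crate does not support. The current implementation simply drops the lookahead. For text whose words are separated by single spaces the split is unchanged, but a run of two or more whitespace characters before a word splits differently: with the lookahead, the run gives back its last character (a trailing space then attaches to the following word), while without it the whole run is one piece. Token IDs can therefore differ from the Python reference on such input. Matching Python behavior exactly requires the fancy-regex crate.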

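  3. O(n²) cost of BPE merging: the current implementation rescans the entire token sequence on every merge to find the highest-priority pair, for a complexity of O(n² × |merges|). This is adequate for short text; long text needs a priority-queue optimization. In inference, prompts are usually under 10K tokens, so this is acceptable for now.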
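
For takeaway 1, the byte-to-unicode table can be sketched as below, following the construction used by GPT-2's reference tokenizer: three ranges of bytes keep their own code point, and every other byte is assigned 256, 257, ... in ascending byte order. The function names are illustrative and not taken from the crate.

```rust
use std::collections::HashMap;

/// Build the GPT-2 byte -> char table.
fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next = 256u32; // next free code point for a remapped byte
    for b in 0u32..256 {
        // Bytes in these ranges map to the code point with the same value.
        let keep = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if keep {
            b
        } else {
            let assigned = next;
            next += 1;
            assigned
        };
        table[b as usize] = char::from_u32(code_point).expect("valid code point");
    }
    table
}

/// Reverse table (char -> byte) needed when decoding vocab strings.
fn unicode_to_bytes(table: &[char; 256]) -> HashMap<char, u8> {
    table.iter().enumerate().map(|(b, &c)| (c, b as u8)).collect()
}
```

Under this construction the space byte 0x20 lands on U+0120 and the newline byte 0x0A on U+010A, which is why GPT-2 vocab entries for words with a leading space start with `Ġ`.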
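
For takeaway 3, one possible shape of the priority-queue optimization is sketched below. It is an illustration under assumptions, not the planned implementation: tokens sit in an index-based doubly linked list, candidate pairs go into a min-heap, and stale heap entries are discarded lazily when popped. The `Ranks` alias, `rank_at`, and `bpe_merge_heap` are hypothetical names, and the rank table is keyed by byte strings for self-containment.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Hypothetical rank table: (left, right) byte strings -> priority (lower first).
type Ranks = HashMap<(Vec<u8>, Vec<u8>), usize>;

/// Rank of the pair that currently starts at slot `i`, if that pair is mergeable.
fn rank_at(
    toks: &[Option<Vec<u8>>],
    next: &[Option<usize>],
    ranks: &Ranks,
    i: usize,
) -> Option<usize> {
    let left = toks[i].as_ref()?;
    let right = toks[next[i]?].as_ref()?;
    ranks.get(&(left.clone(), right.clone())).copied()
}

fn bpe_merge_heap(word: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    let n = word.len();
    // Slots form a doubly linked list; `None` marks a slot that was merged away.
    let mut toks: Vec<Option<Vec<u8>>> = word.iter().map(|&b| Some(vec![b])).collect();
    let mut next: Vec<Option<usize>> =
        (0..n).map(|i| if i + 1 < n { Some(i + 1) } else { None }).collect();
    let mut prev: Vec<Option<usize>> = (0..n).map(|i| i.checked_sub(1)).collect();

    // Min-heap of (rank, left slot); ties on rank resolve to the leftmost pair.
    let mut heap = BinaryHeap::new();
    for i in 0..n {
        if let Some(r) = rank_at(&toks, &next, ranks, i) {
            heap.push(Reverse((r, i)));
        }
    }

    while let Some(Reverse((r, i))) = heap.pop() {
        // Lazy invalidation: skip entries whose pair changed after they were pushed.
        if rank_at(&toks, &next, ranks, i) != Some(r) {
            continue;
        }
        let j = next[i].expect("validated by rank_at");
        let right = toks[j].take().expect("validated by rank_at");
        toks[i].as_mut().expect("validated by rank_at").extend_from_slice(&right);
        // Unlink slot j from the list.
        next[i] = next[j];
        if let Some(k) = next[j] {
            prev[k] = Some(i);
        }
        // Only the two pairs touching slot i can have changed; re-queue them.
        if let Some(p) = prev[i] {
            if let Some(r2) = rank_at(&toks, &next, ranks, p) {
                heap.push(Reverse((r2, p)));
            }
        }
        if let Some(r2) = rank_at(&toks, &next, ranks, i) {
            heap.push(Reverse((r2, i)));
        }
    }
    toks.into_iter().flatten().collect()
}
```

Each merge pushes at most two new heap entries, so a word of n bytes needs on the order of n log n heap operations instead of a full rescan per merge. The per-lookup clones would disappear in a version keyed on token IDs.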