Phase 6 — Model Loading (xserv-model): - safetensors parser with single/sharded file support - ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming) - Weight loading flow: safetensors → mmap → CPU Tensor → GPU Phase 7 — BPE Tokenizer (xserv-tokenizer): - Full BPE encode/decode from tokenizer.json - GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes) - Pre-tokenization regex, special token handling - Chat template support structure Phase 8 — GPT-2 Complete Inference: - GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f - Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits - QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug) - Greedy sampling from last-position logits - Interactive CLI: xserv-cli <model-dir> [--max-tokens N] Verified: GPT-2 124M generates coherent English text on RTX 5090. "The future of AI is uncertain. The future of AI is uncertain..." "Once upon a time, the world was a place of great beauty..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2.3 KiB
2.3 KiB
Phase 7: BPE Tokenizer — Design Document
Goal
从零实现 Byte-Pair Encoding tokenizer,兼容 HuggingFace tokenizer.json 格式。支持 GPT-2 和 Qwen3。
Crate: xserv-tokenizer
crates/xserv-tokenizer/src/
├── lib.rs
├── bpe.rs # BPE encode/decode core algorithm
└── chat.rs # Chat template formatting
Dependencies
serde+serde_json: parse tokenizer.jsonregex: pre-tokenization patterns
BPE Algorithm
Encode
- Pre-tokenize: split text by regex (GPT-2 pattern)
- Each word → byte sequence → initial token list (one token per byte)
- Repeatedly merge highest-priority pair until no more merges
- Map merged tokens to IDs via vocab
Decode
Token IDs → lookup vocab → concatenate bytes → UTF-8 decode
Key Data Structures
pub struct Tokenizer {
vocab: HashMap<Vec<u8>, u32>, // token bytes → ID
vocab_rev: Vec<Vec<u8>>, // ID → token bytes
merges: Vec<(Vec<u8>, Vec<u8>)>, // ordered merge rules
merge_ranks: HashMap<(u32, u32), usize>, // (id_a, id_b) → priority
special_tokens: HashMap<String, u32>,
pre_tokenize_regex: Regex,
}
Test Plan
- Encode + decode roundtrip verified (GPT-2 tokenizer, English text)
- Special tokens handled (endoftext)
- Integrated into GPT-2 inference pipeline, generates coherent text
Takeaways
-
GPT-2 byte-to-unicode 映射:GPT-2 的 vocab 中,每个 byte 都映射到一个 Unicode 字符。可打印 ASCII (0x21-0x7E) 映射到自身,其余字节(空格、控制字符等)映射到 U+0100 以上的 Unicode 码点。解码时需要反向映射。这个映射表是 BPE tokenizer 正确性的关键。
-
Rust regex 不支持 lookahead:GPT-2 的 pre-tokenization regex 使用了
(?!\S)lookahead,Rust 的regexcrate 不支持。简化为去掉 lookahead 后功能等价(whitespace 仍然被正确分词)。如果需要精确匹配 Python 行为,需要fancy-regexcrate。 -
BPE merge 的 O(n²) 复杂度:当前实现每次 merge 扫描整个 token 序列找最高优先级 pair,复杂度 O(n² × |merges|)。对于短文本够用,长文本需要 priority queue 优化。推理场景中 prompt 通常 < 10K tokens,暂时可接受。