xserv/docs/07-tokenizer.md
Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00


# Phase 7: BPE Tokenizer — Design Document
## Goal
Implement a Byte-Pair Encoding tokenizer from scratch, compatible with the HuggingFace `tokenizer.json` format. Supports GPT-2 and Qwen3.
## Crate: `xserv-tokenizer`
```
crates/xserv-tokenizer/src/
├── lib.rs
├── bpe.rs # BPE encode/decode core algorithm
└── chat.rs # Chat template formatting
```
## Dependencies
- `serde` + `serde_json`: parse tokenizer.json
- `regex`: pre-tokenization patterns
## BPE Algorithm
### Encode
1. Pre-tokenize: split text by regex (GPT-2 pattern)
2. Each word → byte sequence → initial token list (one token per byte)
3. Repeatedly merge highest-priority pair until no more merges
4. Map merged tokens to IDs via vocab
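Steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the crate's actual code: the function names `bpe_merge` and `tokens_to_ids` are hypothetical, and the rank table is keyed by byte-string pairs (rather than by ID pairs) so the example stays self-contained. A lower rank is assumed to mean a higher merge priority.

```rust
use std::collections::HashMap;

/// Hypothetical helper: apply BPE merges to one pre-tokenized word.
/// `ranks` maps an adjacent (left, right) pair to its priority;
/// a lower rank is assumed to mean "merge earlier".
pub fn bpe_merge(word: &[u8], ranks: &HashMap<(Vec<u8>, Vec<u8>), usize>) -> Vec<Vec<u8>> {
    // Step 2: start with one token per byte.
    let mut tokens: Vec<Vec<u8>> = word.iter().map(|&b| vec![b]).collect();
    loop {
        // Step 3: scan every adjacent pair and keep the leftmost one
        // with the lowest rank, stored as (rank, index).
        let mut best: Option<(usize, usize)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            let key = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // No mergeable pair left: done.
        let Some((_, i)) = best else { break };
        // Merge the pair at i: append the right token to the left one.
        let right = tokens.remove(i + 1);
        tokens[i].extend_from_slice(&right);
    }
    tokens
}

/// Step 4: map merged tokens to IDs; returns None if any token is not in the vocab.
pub fn tokens_to_ids(tokens: &[Vec<u8>], vocab: &HashMap<Vec<u8>, u32>) -> Option<Vec<u32>> {
    tokens.iter().map(|t| vocab.get(t).copied()).collect()
}
```

With the rules `(l, o)` at rank 0 and `(lo, w)` at rank 1, the word `lower` becomes `low`, `e`, `r`. The sketch merges one pair per scan, which keeps it short at the cost of a rescan after every merge.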
### Decode
Token IDs → lookup vocab → concatenate bytes → UTF-8 decode
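The decode pipeline above could look roughly like the sketch below. The free function `decode` and its handling of unknown IDs (skipped) and invalid UTF-8 (replaced with U+FFFD) are assumptions for illustration; the document does not say how the crate treats these cases.

```rust
/// Hypothetical decode sketch: ID → token bytes → concatenated bytes → String.
/// `vocab_rev[id]` is assumed to hold the raw bytes of token `id`.
pub fn decode(ids: &[u32], vocab_rev: &[Vec<u8>]) -> String {
    let mut bytes = Vec::new();
    for &id in ids {
        // Assumption: IDs outside the vocab are silently skipped.
        if let Some(token) = vocab_rev.get(id as usize) {
            bytes.extend_from_slice(token);
        }
    }
    // Decode only after concatenation: a token boundary may fall in the
    // middle of a multi-byte UTF-8 character. Invalid sequences become U+FFFD.
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Decoding each token to a string on its own would corrupt characters whose bytes are split across tokens, which is why the bytes are concatenated first.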
## Key Data Structures
```rust
use std::collections::HashMap;

use regex::Regex;

pub struct Tokenizer {
    vocab: HashMap<Vec<u8>, u32>,            // token bytes → ID
    vocab_rev: Vec<Vec<u8>>,                 // ID → token bytes
    merges: Vec<(Vec<u8>, Vec<u8>)>,         // ordered merge rules
    merge_ranks: HashMap<(u32, u32), usize>, // (id_a, id_b) → priority
    special_tokens: HashMap<String, u32>,    // special token text → ID
    pre_tokenize_regex: Regex,               // pre-tokenization pattern
}
```
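`vocab_rev` and `merge_ranks` can be derived from `vocab` and `merges`. The sketch below shows one plausible way to build them; the helper names `build_merge_ranks` and `build_vocab_rev` are hypothetical, and it assumes that a rule's position in `merges` is its priority and that rules whose tokens are missing from the vocab are ignored.

```rust
use std::collections::HashMap;

/// Hypothetical helper: turn ordered merge rules into an (id_a, id_b) → rank table.
/// Assumption: the index of a rule in `merges` is its priority.
pub fn build_merge_ranks(
    vocab: &HashMap<Vec<u8>, u32>,
    merges: &[(Vec<u8>, Vec<u8>)],
) -> HashMap<(u32, u32), usize> {
    let mut ranks = HashMap::new();
    for (rank, (a, b)) in merges.iter().enumerate() {
        // Skip rules that mention tokens absent from the vocab.
        if let (Some(&id_a), Some(&id_b)) = (vocab.get(a), vocab.get(b)) {
            // Keep the first (highest-priority) occurrence of a pair.
            ranks.entry((id_a, id_b)).or_insert(rank);
        }
    }
    ranks
}

/// Hypothetical helper: invert token bytes → ID into an ID-indexed table.
/// IDs with no token are left as empty byte strings.
pub fn build_vocab_rev(vocab: &HashMap<Vec<u8>, u32>) -> Vec<Vec<u8>> {
    let len = vocab.values().map(|&id| id as usize + 1).max().unwrap_or(0);
    let mut rev = vec![Vec::new(); len];
    for (bytes, &id) in vocab {
        rev[id as usize] = bytes.clone();
    }
    rev
}
```

Keying the ranks by ID pairs keeps the inner merge loop free of byte-string hashing, which is presumably why the struct stores both `merges` and `merge_ranks`.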
## Test Plan
- [x] Encode + decode roundtrip verified (GPT-2 tokenizer, English text)
- [x] Special tokens handled (endoftext)
- [x] Integrated into GPT-2 inference pipeline, generates coherent text
## Takeaways
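1. **GPT-2 byte-to-unicode mapping**: In the GPT-2 vocab, every byte maps to a Unicode character. Printable ASCII (0x21-0x7E) maps to itself, as do the printable Latin-1 ranges 0xA1-0xAC and 0xAE-0xFF; the remaining bytes (space, control characters, etc.) map to code points from U+0100 upward. Decoding needs the inverse mapping. This table is critical to the correctness of the BPE tokenizer.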
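2. **Rust regex does not support lookahead**: GPT-2's pre-tokenization regex uses the `(?!\S)` lookahead, which Rust's `regex` crate does not support. The pattern was simplified by dropping the lookahead. Whitespace is still split into tokens correctly for typical text, although a run of several whitespace characters before a word can be split differently from the original pattern. If exact Python behavior is required, the `fancy-regex` crate is needed.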
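3. **O(n²) complexity of BPE merging**: The current implementation scans the entire token sequence on every merge to find the highest-priority pair, giving O(n² × |merges|) complexity. This is sufficient for short text; long text needs a priority-queue optimization. In inference, prompts are usually < 10K tokens, so this is acceptable for now.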
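The byte-to-unicode table from the first takeaway can be sketched as below. It follows the layout of GPT-2's reference mapping: printable bytes keep their own code point, and every other byte is assigned the next free code point starting at U+0100, in ascending byte order. The function names are hypothetical and not taken from `xserv-tokenizer`.

```rust
use std::collections::HashMap;

/// Hypothetical helper: GPT-2 style byte → char table.
pub fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut shifted = 0u32; // number of non-printable bytes seen so far
    for b in 0u32..256 {
        // Bytes that map to themselves: printable ASCII and printable Latin-1.
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if printable {
            char::from_u32(b).unwrap()
        } else {
            // Space, control characters, etc. are shifted to U+0100 and above.
            let c = char::from_u32(0x100 + shifted).unwrap();
            shifted += 1;
            c
        };
    }
    table
}

/// Inverse table used when decoding: char → original byte.
pub fn unicode_to_bytes() -> HashMap<char, u8> {
    bytes_to_unicode()
        .iter()
        .enumerate()
        .map(|(b, &c)| (c, b as u8))
        .collect()
}
```

Under this mapping the space byte 0x20 becomes U+0120 (`Ġ`) and the newline byte 0x0A becomes U+010A (`Ċ`), which is why those characters appear at the start of many GPT-2 vocab entries.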