Phase 6 — Model Loading (xserv-model): - safetensors parser with single/sharded file support - ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming) - Weight loading flow: safetensors → mmap → CPU Tensor → GPU Phase 7 — BPE Tokenizer (xserv-tokenizer): - Full BPE encode/decode from tokenizer.json - GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes) - Pre-tokenization regex, special token handling - Chat template support structure Phase 8 — GPT-2 Complete Inference: - GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f - Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits - QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug) - Greedy sampling from last-position logits - Interactive CLI: xserv-cli <model-dir> [--max-tokens N] Verified: GPT-2 124M generates coherent English text on RTX 5090. "The future of AI is uncertain. The future of AI is uncertain..." "Once upon a time, the world was a place of great beauty..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
58 lines
2.3 KiB
Markdown
58 lines
2.3 KiB
Markdown
# Phase 7: BPE Tokenizer — Design Document
|
||
|
||
## Goal
|
||
|
||
从零实现 Byte-Pair Encoding tokenizer,兼容 HuggingFace `tokenizer.json` 格式。支持 GPT-2 和 Qwen3。
|
||
|
||
## Crate: `xserv-tokenizer`
|
||
|
||
```
|
||
crates/xserv-tokenizer/src/
|
||
├── lib.rs
|
||
├── bpe.rs # BPE encode/decode core algorithm
|
||
└── chat.rs # Chat template formatting
|
||
```
|
||
|
||
## Dependencies
|
||
|
||
- `serde` + `serde_json`: parse tokenizer.json
|
||
- `regex`: pre-tokenization patterns
|
||
|
||
## BPE Algorithm
|
||
|
||
### Encode
|
||
1. Pre-tokenize: split text by regex (GPT-2 pattern)
|
||
2. Each word → byte sequence → initial token list (one token per byte)
|
||
3. Repeatedly merge highest-priority pair until no more merges
|
||
4. Map merged tokens to IDs via vocab
|
||
|
||
### Decode
|
||
Token IDs → lookup vocab → concatenate bytes → UTF-8 decode
|
||
|
||
## Key Data Structures
|
||
|
||
```rust
|
||
pub struct Tokenizer {
|
||
vocab: HashMap<Vec<u8>, u32>, // token bytes → ID
|
||
vocab_rev: Vec<Vec<u8>>, // ID → token bytes
|
||
merges: Vec<(Vec<u8>, Vec<u8>)>, // ordered merge rules
|
||
merge_ranks: HashMap<(u32, u32), usize>, // (id_a, id_b) → priority
|
||
special_tokens: HashMap<String, u32>,
|
||
pre_tokenize_regex: Regex,
|
||
}
|
||
```
|
||
|
||
## Test Plan
|
||
|
||
- [x] Encode + decode roundtrip verified (GPT-2 tokenizer, English text)
|
||
- [x] Special tokens handled (endoftext)
|
||
- [x] Integrated into GPT-2 inference pipeline, generates coherent text
|
||
|
||
## Takeaways
|
||
|
||
1. **GPT-2 byte-to-unicode 映射**:GPT-2 的 vocab 中,每个 byte 都映射到一个 Unicode 字符。可打印 ASCII (0x21-0x7E) 映射到自身,其余字节(空格、控制字符等)映射到 U+0100 以上的 Unicode 码点。解码时需要反向映射。这个映射表是 BPE tokenizer 正确性的关键。
|
||
|
||
2. **Rust regex 不支持 lookahead**:GPT-2 的 pre-tokenization regex 使用了 `(?!\S)` lookahead,Rust 的 `regex` crate 不支持。简化为去掉 lookahead 后功能等价(whitespace 仍然被正确分词)。如果需要精确匹配 Python 行为,需要 `fancy-regex` crate。
|
||
|
||
3. **BPE merge 的 O(n²) 复杂度**:当前实现每次 merge 扫描整个 token 序列找最高优先级 pair,复杂度 O(n² × |merges|)。对于短文本够用,长文本需要 priority queue 优化。推理场景中 prompt 通常 < 10K tokens,暂时可接受。
|