
Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 22:04:00 +08:00


Phase 7: BPE Tokenizer — Design Document

Goal

Implement a Byte-Pair Encoding (BPE) tokenizer from scratch that is compatible with the HuggingFace tokenizer.json format. It supports GPT-2 and Qwen3.

Crate: `xserv-tokenizer`

crates/xserv-tokenizer/src/
├── lib.rs
├── bpe.rs          # BPE encode/decode core algorithm
└── chat.rs         # Chat template formatting

Dependencies

serde + serde_json: parse tokenizer.json
regex: pre-tokenization patterns

BPE Algorithm

Encode

1. Pre-tokenize: split the text with a regex (GPT-2 pattern)
2. Each word → byte sequence → initial token list (one token per byte)
3. Repeatedly merge the highest-priority adjacent pair until no mergeable pair remains
4. Map the merged tokens to IDs via the vocab
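
The merge loop and the final ID lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: `bpe_merge`, `encode_word`, and the `Ranks` alias are hypothetical names, and ranks are keyed on byte-string pairs (rather than the ID pairs of `merge_ranks`) to keep the example self-contained. Pre-tokenization is assumed to have already produced `word`.

```rust
use std::collections::HashMap;

/// Merge priority table (assumed shape): a lower rank is merged earlier.
type Ranks = HashMap<(Vec<u8>, Vec<u8>), usize>;

/// Apply BPE merges to one pre-tokenized word (hypothetical helper).
fn bpe_merge(word: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    // Start with one token per byte.
    let mut tokens: Vec<Vec<u8>> = word.iter().map(|&b| vec![b]).collect();

    loop {
        // Scan every adjacent pair and remember the leftmost one with the lowest rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index of left token)
        for i in 0..tokens.len().saturating_sub(1) {
            let key = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            // Merge the chosen pair into the left token and rescan.
            Some((_, i)) => {
                let right = tokens.remove(i + 1);
                tokens[i].extend(right);
            }
            // No adjacent pair has a merge rule: done.
            None => break,
        }
    }
    tokens
}

/// Map merged tokens to IDs; returns None if a token is missing from the vocab.
fn encode_word(word: &[u8], ranks: &Ranks, vocab: &HashMap<Vec<u8>, u32>) -> Option<Vec<u32>> {
    bpe_merge(word, ranks)
        .iter()
        .map(|t| vocab.get(t).copied())
        .collect()
}
```

With the toy rules `(l, o)` at rank 0 and `(lo, w)` at rank 1, `bpe_merge(b"lower", &ranks)` yields the tokens `low`, `e`, `r`.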

Decode

Token IDs → lookup vocab → concatenate bytes → UTF-8 decode
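
A minimal sketch of the decode path, assuming `vocab_rev` already holds raw token bytes (that is, the byte-to-unicode mapping was reversed when the vocab was loaded). The free function `decode` and the choice of lossy UTF-8 decoding are illustrative assumptions, not necessarily what the crate does.

```rust
/// Decode token IDs back to text (hypothetical helper).
/// `vocab_rev[id]` is assumed to hold the raw bytes of token `id`.
/// Returns None if an ID is out of range.
fn decode(ids: &[u32], vocab_rev: &[Vec<u8>]) -> Option<String> {
    let mut bytes = Vec::new();
    for &id in ids {
        bytes.extend_from_slice(vocab_rev.get(id as usize)?);
    }
    // One token may hold only part of a multi-byte character, so UTF-8 is
    // decoded once over the whole concatenated buffer, not per token.
    // Invalid sequences are replaced with U+FFFD instead of failing (assumed policy).
    Some(String::from_utf8_lossy(&bytes).into_owned())
}
```

Decoding per token would corrupt characters whose bytes are split across tokens, which is why the bytes are concatenated first.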

Key Data Structures

use std::collections::HashMap;
use regex::Regex;

pub struct Tokenizer {
    vocab: HashMap<Vec<u8>, u32>,        // token bytes → ID
    vocab_rev: Vec<Vec<u8>>,             // ID → token bytes
    merges: Vec<(Vec<u8>, Vec<u8>)>,     // ordered merge rules
    merge_ranks: HashMap<(u32, u32), usize>,  // (id_a, id_b) → priority
    special_tokens: HashMap<String, u32>,
    pre_tokenize_regex: Regex,
}
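
As an illustration of how `merge_ranks` relates to `merges` and `vocab`, the sketch below derives the rank table from the ordered merge list. `build_merge_ranks` is a hypothetical helper; skipping rules whose halves are missing from the vocab is an assumption made here for simplicity.

```rust
use std::collections::HashMap;

/// Derive (id_a, id_b) → priority from the ordered merge rules (hypothetical helper).
/// A rule's position in `merges` is its priority: earlier means merged first.
fn build_merge_ranks(
    vocab: &HashMap<Vec<u8>, u32>,
    merges: &[(Vec<u8>, Vec<u8>)],
) -> HashMap<(u32, u32), usize> {
    merges
        .iter()
        .enumerate()
        // Rules that reference a token missing from the vocab are skipped (assumption).
        .filter_map(|(rank, (a, b))| Some(((*vocab.get(a)?, *vocab.get(b)?), rank)))
        .collect()
}
```

Keying on ID pairs lets the encoder look up a candidate merge without allocating or hashing byte strings in the inner loop.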

Test Plan

Encode + decode roundtrip verified (GPT-2 tokenizer, English text)
Special tokens handled (<|endoftext|>)
Integrated into GPT-2 inference pipeline, generates coherent text

Takeaways

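GPT-2 byte-to-unicode mapping: in the GPT-2 vocab, every byte is mapped to a Unicode character. Printable ASCII (0x21-0x7E) maps to itself, as do the Latin-1 ranges 0xA1-0xAC and 0xAE-0xFF; the remaining bytes (space, control characters, etc.) map to Unicode code points from U+0100 upward, so a space becomes U+0120 (Ġ). Decoding has to apply the reverse mapping. This table is critical to the correctness of the BPE tokenizer.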
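
The table can be built in a few lines. The sketch below follows the construction used by the GPT-2 reference encoder as I understand it: printable ASCII plus two Latin-1 ranges keep their own code point, and every other byte is assigned the next free code point starting at U+0100, in ascending byte order. The function names are illustrative, not the crate's API.

```rust
use std::collections::HashMap;

/// Build the byte → Unicode character table (illustrative name).
fn bytes_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next = 0u32; // how many bytes have been remapped above U+0100 so far
    for b in 0u32..256 {
        let keeps_identity = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if keeps_identity {
            // Printable byte: use the code point with the same value.
            char::from_u32(b).unwrap()
        } else {
            // Space, control characters, etc.: take the next slot from U+0100 up.
            let c = char::from_u32(0x100 + next).unwrap();
            next += 1;
            c
        };
    }
    table
}

/// Reverse table used when decoding (illustrative name).
fn unicode_to_bytes() -> HashMap<char, u8> {
    bytes_to_unicode()
        .iter()
        .enumerate()
        .map(|(b, &c)| (c, b as u8))
        .collect()
}
```

The mapping is a bijection over all 256 bytes, which is what makes the encode and decode roundtrip lossless.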
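Rust regex does not support lookahead: GPT-2's pre-tokenization regex uses the lookahead (?!\S), which Rust's regex crate does not support. The implementation simplifies the pattern by dropping the lookahead. Whitespace is still split off correctly and words separated by single spaces tokenize the same way, but a run of several whitespace characters before a word can split differently from the Python reference. Matching Python behavior exactly requires the fancy-regex crate.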
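O(n²) complexity of BPE merging: the current implementation scans the whole token sequence on every merge to find the highest-priority pair, for a complexity of O(n² × |merges|). This is adequate for short text; long text needs a priority-queue optimization. In inference, prompts are usually under 10K tokens, so this is acceptable for now.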
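
One possible shape for the priority-queue optimization is sketched below. It is not part of the current implementation: it keeps tokens in an index-based linked list, pushes candidate pairs into a min-heap keyed on (rank, position), and discards stale heap entries lazily. All names are hypothetical, and ranks are keyed on byte-string pairs to keep the example self-contained.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

type Ranks = HashMap<(Vec<u8>, Vec<u8>), usize>;
/// Heap entry: (rank, left index, right index, left length, right length).
type Cand = Reverse<(usize, usize, usize, usize, usize)>;

/// Push the pair (i, j) onto the heap if a merge rule exists for it.
fn push_pair(heap: &mut BinaryHeap<Cand>, tok: &[Option<Vec<u8>>], ranks: &Ranks, i: usize, j: usize) {
    if let (Some(a), Some(b)) = (&tok[i], &tok[j]) {
        if let Some(&rank) = ranks.get(&(a.clone(), b.clone())) {
            heap.push(Reverse((rank, i, j, a.len(), b.len())));
        }
    }
}

/// Heap-based variant of the merge loop (hypothetical helper).
fn bpe_merge_heap(word: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    let n = word.len();
    // Slot i holds a live token, or None once it was merged into its left neighbour.
    let mut tok: Vec<Option<Vec<u8>>> = word.iter().map(|&b| Some(vec![b])).collect();
    let mut prev: Vec<Option<usize>> = (0..n).map(|i| i.checked_sub(1)).collect();
    let mut next: Vec<Option<usize>> =
        (0..n).map(|i| if i + 1 < n { Some(i + 1) } else { None }).collect();

    let mut heap: BinaryHeap<Cand> = BinaryHeap::new();
    for i in 0..n.saturating_sub(1) {
        push_pair(&mut heap, &tok, ranks, i, i + 1);
    }

    while let Some(Reverse((_, i, j, li, lj))) = heap.pop() {
        // Tokens only grow, so a length mismatch means the entry is stale.
        let fresh = next[i] == Some(j)
            && tok[i].as_ref().map_or(false, |t| t.len() == li)
            && tok[j].as_ref().map_or(false, |t| t.len() == lj);
        if !fresh {
            continue;
        }
        // Merge j into i and unlink j from the list.
        let right = tok[j].take().unwrap();
        tok[i].as_mut().unwrap().extend(right);
        next[i] = next[j];
        // Only the two pairs touching the merged token need new heap entries.
        if let Some(k) = next[i] {
            prev[k] = Some(i);
            push_pair(&mut heap, &tok, ranks, i, k);
        }
        if let Some(p) = prev[i] {
            push_pair(&mut heap, &tok, ranks, p, i);
        }
    }
    tok.into_iter().flatten().collect()
}
```

Each merge now costs a few heap operations instead of a full rescan, so a word of n bytes needs on the order of n log n heap operations. The sketch is intended to pick the same leftmost, lowest-rank pair as a full scan, but it has only been reasoned through on small examples and would need to be checked against the existing encoder before use.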
