Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00
2026-05-21 21:17:23 +08:00
Description
No description provided
2 MiB
Languages
Rust 67.5%
Python 15.1%
Cuda 13.5%
Shell 3.9%