xserv/docs/06-model-loading.md
Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00


Phase 6: Model Loading — Design Document

Goal

Load model weights from HuggingFace safetensors files into GPU Tensors. Parse config.json to obtain the model's structural parameters.

Crate: xserv-model

crates/xserv-model/src/
├── lib.rs
├── config.rs       # ModelConfig from config.json
├── loader.rs       # safetensors weight loading
└── gpt2.rs         # (Phase 8) GPT-2 model definition

Dependencies

  • safetensors crate: parse safetensors format
  • serde + serde_json: deserialize config.json
  • memmap2: mmap the weight file for zero-copy access (the safetensors crate parses a byte slice, so the mapping is passed to it rather than created by it)

Weight Loading Flow

safetensors file (disk)
  → mmap the file (memmap2)
  → safetensors crate parses header (tensor names, shapes, dtypes, offsets)
  → for each tensor:
      → slice raw bytes at its offsets (zero-copy view into the mmap)
      → create CPU Tensor from raw bytes
      → .to_device(Cuda(0)) → GPU Tensor
  → return HashMap<String, Tensor>
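
To make the first steps of this flow concrete, here is a minimal std-only sketch of the on-disk layout the safetensors crate parses: an 8-byte little-endian header length, the JSON header, then the raw tensor data. The function names (`split_safetensors`, `f32_from_le_bytes`, `demo_file`) are hypothetical and not part of xserv-model. The real loader would rely on the safetensors crate for header parsing rather than doing this by hand.

```rust
/// Split a safetensors buffer into (JSON header text, raw data region).
/// Layout: u64 little-endian header length N, N bytes of JSON, then tensor bytes.
/// Hypothetical helper for illustration only.
fn split_safetensors(buf: &[u8]) -> Option<(&str, &[u8])> {
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(buf.get(..8)?);
    let header_len = u64::from_le_bytes(len_bytes);
    // Reject headers that claim to extend past the end of the buffer.
    if header_len > (buf.len() - 8) as u64 {
        return None;
    }
    let header_end = 8 + header_len as usize;
    let header = std::str::from_utf8(&buf[8..header_end]).ok()?;
    // The data region is borrowed, not copied; data_offsets in the header
    // are relative to the start of this slice.
    Some((header, &buf[header_end..]))
}

/// Decode little-endian F32 bytes into values. This is the copying step
/// that corresponds to "create CPU Tensor from raw bytes".
fn f32_from_le_bytes(raw: &[u8]) -> Option<Vec<f32>> {
    if raw.len() % 4 != 0 {
        return None;
    }
    Some(
        raw.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

/// Build a tiny in-memory file holding one F32 tensor "w" of shape [2].
fn demo_file() -> Vec<u8> {
    let header = r#"{"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}"#;
    let mut buf = (header.len() as u64).to_le_bytes().to_vec();
    buf.extend_from_slice(header.as_bytes());
    buf.extend_from_slice(&1.0f32.to_le_bytes());
    buf.extend_from_slice(&2.5f32.to_le_bytes());
    buf
}
```

Calling `split_safetensors(&demo_file())` yields the JSON header and an 8-byte data slice. Passing `&data[0..8]` (the tensor's `data_offsets`) to `f32_from_le_bytes` recovers the two values. With an mmap in place of the `Vec<u8>`, the slicing step stays zero-copy and only the decode step allocates.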

Config Parsing

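use serde::Deserialize;

// Field names follow the modern HF config.json keys.
// GPT-2 uses n_embd/n_head/n_layer/n_positions instead (see Takeaways).
#[derive(Deserialize)]
pub struct ModelConfig {
    pub architectures: Option<Vec<String>>,
    pub model_type: Option<String>,
    pub hidden_size: usize,
    pub intermediate_size: Option<usize>,
    pub num_attention_heads: usize,
    pub num_key_value_heads: Option<usize>, // grouped-query attention only
    pub num_hidden_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: Option<usize>,
    pub layer_norm_eps: Option<f64>, // LayerNorm-based models
    pub rms_norm_eps: Option<f64>,   // RMSNorm-based models
    pub rope_theta: Option<f64>,     // RoPE base frequency
    pub tie_word_embeddings: Option<bool>,
}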
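
The struct above uses the modern HF key names, while GPT-2's config.json uses n_embd/n_head/n_layer/n_positions. One possible way to support both is to keep each spelling as an optional field and resolve them through accessors. The sketch below is std-only and illustrative: `RawConfig`, `num_layers()`, `max_positions()`, and `head_dim()` are hypothetical names, and only `hidden()` and `num_heads()` are mentioned in this document. With serde, `#[serde(alias = "n_embd")]` on the field may be a more compact alternative.

```rust
/// Raw fields as they may appear in config.json. Every field is optional
/// because GPT-2 and modern HF configs use different key names.
/// Hypothetical type for illustration; not the actual ModelConfig.
#[derive(Debug, Default, Clone)]
pub struct RawConfig {
    pub hidden_size: Option<usize>,             // modern HF
    pub n_embd: Option<usize>,                  // GPT-2
    pub num_attention_heads: Option<usize>,     // modern HF
    pub n_head: Option<usize>,                  // GPT-2
    pub num_hidden_layers: Option<usize>,       // modern HF
    pub n_layer: Option<usize>,                 // GPT-2
    pub max_position_embeddings: Option<usize>, // modern HF
    pub n_positions: Option<usize>,             // GPT-2
}

impl RawConfig {
    /// Hidden width, preferring the modern name and falling back to GPT-2's.
    pub fn hidden(&self) -> Option<usize> {
        self.hidden_size.or(self.n_embd)
    }

    pub fn num_heads(&self) -> Option<usize> {
        self.num_attention_heads.or(self.n_head)
    }

    pub fn num_layers(&self) -> Option<usize> {
        self.num_hidden_layers.or(self.n_layer)
    }

    pub fn max_positions(&self) -> Option<usize> {
        self.max_position_embeddings.or(self.n_positions)
    }

    /// Per-head dimension; None if a value is missing or does not divide evenly.
    pub fn head_dim(&self) -> Option<usize> {
        let (hidden, heads) = (self.hidden()?, self.num_heads()?);
        if heads == 0 || hidden % heads != 0 {
            None
        } else {
            Some(hidden / heads)
        }
    }
}
```

Callers then read `cfg.hidden()` and `cfg.num_heads()` without caring which naming scheme the file used. A config with neither spelling present surfaces as `None` instead of a deserialization failure.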

Test Plan

  • Load GPT-2 124M: 160 tensors loaded successfully
  • Parse GPT-2 config.json: hidden=768, layers=12, heads=12, vocab=50257
  • Sharded loading path implemented (for larger models)
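
As a cross-check on the numbers in this test plan, the parameter and tensor counts can be derived from the config values. The sketch below assumes the standard GPT-2 layout (fused QKV projection, 4× MLP width, biases on every linear layer and LayerNorm, tied output embedding) and n_positions = 1024, which is GPT-2's published value but is not stated in this document. The tensor count additionally assumes the checkpoint stores one causal-mask buffer per layer, which is an assumption about the file contents. Both function names are hypothetical.

```rust
/// Number of learned parameters in a GPT-2 style model.
/// Assumes: fused QKV, MLP inner width = 4 * hidden, biases everywhere,
/// and a tied lm_head (so it adds no parameters).
fn gpt2_param_count(vocab: usize, n_positions: usize, hidden: usize, layers: usize) -> usize {
    let inner = 4 * hidden;
    let wte = vocab * hidden; // token embeddings
    let wpe = n_positions * hidden; // position embeddings
    let layer_norm = 2 * hidden; // weight + bias
    // c_attn (hidden -> 3*hidden) and c_proj (hidden -> hidden), each with bias
    let attn = hidden * 3 * hidden + 3 * hidden + hidden * hidden + hidden;
    // c_fc (hidden -> inner) and c_proj (inner -> hidden), each with bias
    let mlp = hidden * inner + inner + inner * hidden + hidden;
    let per_layer = 2 * layer_norm + attn + mlp;
    wte + wpe + layers * per_layer + layer_norm // final term is ln_f
}

/// Number of named tensors expected in the checkpoint.
/// Per layer: 2 LayerNorms and 4 linear layers, each weight + bias = 12 tensors,
/// plus one optional causal-mask buffer (assumed, not confirmed by this document).
fn gpt2_tensor_count(layers: usize, has_mask_buffer: bool) -> usize {
    let per_layer = 12 + if has_mask_buffer { 1 } else { 0 };
    let embeddings = 2; // wte, wpe
    let final_norm = 2; // ln_f weight + bias
    embeddings + layers * per_layer + final_norm
}
```

With vocab=50257, n_positions=1024, hidden=768, layers=12, `gpt2_param_count` gives 124,439,808, which matches the "124M" label and corresponds to about 498 MB of F32 data. `gpt2_tensor_count(12, true)` gives 160, which is consistent with the count in the test plan under the mask-buffer assumption.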

Takeaways

  1. GPT-2 vs. modern HF config naming: GPT-2 uses n_embd/n_head/n_layer/n_positions rather than hidden_size/num_attention_heads and so on. ModelConfig needs to support both naming schemes and expose unified accessor methods (hidden(), num_heads(), etc.).

  2. Zero-copy reads with safetensors: the file is mmapped, the safetensors crate parses the header to get each tensor's offset and shape, and the raw bytes are then read zero-copy. For GPT-2's roughly 500 MB weight file, loading is fast.

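  3. Network issues when downloading models: HuggingFace is unreachable from networks in China. Use modelscope.cn or hf-mirror.com as alternatives. For large files (>100 MB), the redirect to the CDN may also fail; modelscope's snapshot_download is more reliable.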