Phase 6 - Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 - BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 - GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
- "The future of AI is uncertain. The future of AI is uncertain..."
- "Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Phase 6: Model Loading — Design Document
## Goal

Load model weights from HuggingFace safetensors files into GPU tensors, and parse `config.json` to obtain the model's architecture parameters.

## Crate: `xserv-model`

```
crates/xserv-model/src/
├── lib.rs
├── config.rs    # ModelConfig from config.json
├── loader.rs    # safetensors weight loading
└── gpt2.rs      # (Phase 8) GPT-2 model definition
```

## Dependencies

- `safetensors` crate: parse the safetensors format
- `serde` + `serde_json`: deserialize config.json
- `memmap2`: mmap for zero-copy file access (the mapped bytes are what the `safetensors` crate parses)

## Weight Loading Flow

```
safetensors file (disk)
  → safetensors crate parses header (tensor names, shapes, dtypes, offsets)
  → mmap raw data
  → for each tensor:
      → read bytes at offset
      → create CPU Tensor from raw bytes
      → .to_device(Cuda(0)) → GPU Tensor
  → return HashMap<String, Tensor>
```

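The on-disk layout behind the first steps of this flow can be sketched without any dependencies. A safetensors file starts with an 8-byte little-endian header length, followed by that many bytes of JSON header, followed by the raw tensor data; the `data_offsets` in the header are relative to the start of the data region. The code below is a minimal illustration under those assumptions, not the `xserv-model` loader: `split_safetensors` and `f32_from_le_bytes` are hypothetical helper names, `buf` stands in for the mmap'd file, and JSON parsing and the GPU upload are left out.

```rust
/// Split a safetensors buffer into (JSON header text, raw data region).
/// Returns None if the buffer is truncated or the header is not valid UTF-8.
/// Hypothetical helper for illustration; `buf` stands in for the mmap'd file.
pub fn split_safetensors(buf: &[u8]) -> Option<(&str, &[u8])> {
    // The first 8 bytes hold the header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(buf.get(..8)?);
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // The JSON header follows immediately after the length prefix.
    let header_end = 8usize.checked_add(header_len)?;
    let header = std::str::from_utf8(buf.get(8..header_end)?).ok()?;

    // Everything after the header is tensor data; offsets are relative to it.
    Some((header, &buf[header_end..]))
}

/// Decode `data[begin..end]` as little-endian f32 values.
/// This copies into an owned Vec, which corresponds to the
/// "create CPU Tensor from raw bytes" step rather than a zero-copy view.
pub fn f32_from_le_bytes(data: &[u8], begin: usize, end: usize) -> Option<Vec<f32>> {
    let bytes = data.get(begin..end)?;
    if bytes.len() % 4 != 0 {
        return None; // not a whole number of f32 values
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

In a real loader the header text would go to a JSON parser (the `safetensors` crate does this for you), and the `begin`/`end` pair for each tensor would come from its `data_offsets` entry.
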
## Config Parsing

```rust
use serde::Deserialize;

/// Model hyperparameters deserialized from a HuggingFace `config.json`.
/// Field names follow the modern HF naming scheme.
#[derive(Deserialize)]
pub struct ModelConfig {
    pub architectures: Option<Vec<String>>,
    pub model_type: Option<String>,
    pub hidden_size: usize,
    pub intermediate_size: Option<usize>,
    pub num_attention_heads: usize,
    // Present only for models with grouped or multi-query attention.
    pub num_key_value_heads: Option<usize>,
    pub num_hidden_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: Option<usize>,
    // Epsilon for LayerNorm-based or RMSNorm-based models, respectively.
    pub layer_norm_eps: Option<f64>,
    pub rms_norm_eps: Option<f64>,
    // RoPE base frequency, for models that use rotary embeddings.
    pub rope_theta: Option<f64>,
    pub tie_word_embeddings: Option<bool>,
}
```

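GPT-2's `config.json` uses `n_embd`, `n_head`, `n_layer`, and `n_positions` rather than the modern names, so a struct that requires `hidden_size` would not deserialize it as written. One possible shape for unified accessors is sketched below. It is a dependency-free illustration, not the actual `ModelConfig`: the `DualNamedConfig` type is hypothetical, its fields are filled in by hand instead of by serde, and `num_layers()`, `head_dim()`, and the `Option` return types are assumptions made for the example.

```rust
/// Hypothetical config that keeps both naming schemes as optional fields.
#[derive(Debug, Default, Clone)]
pub struct DualNamedConfig {
    // Modern HF naming.
    pub hidden_size: Option<usize>,
    pub num_attention_heads: Option<usize>,
    pub num_hidden_layers: Option<usize>,
    // GPT-2 naming.
    pub n_embd: Option<usize>,
    pub n_head: Option<usize>,
    pub n_layer: Option<usize>,
}

impl DualNamedConfig {
    /// Hidden dimension: prefer the modern name, fall back to GPT-2's.
    pub fn hidden(&self) -> Option<usize> {
        self.hidden_size.or(self.n_embd)
    }

    /// Number of attention heads under either naming scheme.
    pub fn num_heads(&self) -> Option<usize> {
        self.num_attention_heads.or(self.n_head)
    }

    /// Number of transformer blocks under either naming scheme.
    pub fn num_layers(&self) -> Option<usize> {
        self.num_hidden_layers.or(self.n_layer)
    }

    /// Per-head dimension, or None if it is undefined or does not divide evenly.
    pub fn head_dim(&self) -> Option<usize> {
        let hidden = self.hidden()?;
        let heads = self.num_heads()?;
        if heads == 0 || hidden % heads != 0 {
            None
        } else {
            Some(hidden / heads)
        }
    }
}
```

With serde, an alternative that avoids duplicate fields would be to put `#[serde(alias = "n_embd")]` (and similar) on the modern-named fields; which approach the crate takes is a design choice this sketch does not settle.
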
## Test Plan

- [x] Load GPT-2 124M: 160 tensors loaded successfully
- [x] Parse GPT-2 config.json: hidden=768, layers=12, heads=12, vocab=50257
- [x] Sharded loading path implemented (for larger models)

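For the sharded path, HuggingFace checkpoints ship a `model.safetensors.index.json` whose `weight_map` object maps each tensor name to the shard file that holds it. A loader can group tensor names by shard so that each file is mapped only once. The sketch below shows only that grouping step and makes several assumptions: `group_by_shard` is a hypothetical helper, the `weight_map` is taken as already parsed into `(tensor name, file name)` pairs, and how `xserv-model` actually organizes this is not stated here.

```rust
use std::collections::BTreeMap;

/// Group tensor names by the shard file that contains them.
/// Input pairs are (tensor name, shard file name), as found in `weight_map`.
pub fn group_by_shard<'a, I>(weight_map: I) -> BTreeMap<&'a str, Vec<&'a str>>
where
    I: IntoIterator<Item = (&'a str, &'a str)>,
{
    let mut shards: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for (tensor, file) in weight_map {
        shards.entry(file).or_default().push(tensor);
    }
    // Sort names within each shard so the load order is deterministic.
    for names in shards.values_mut() {
        names.sort_unstable();
    }
    shards
}
```

Each key of the returned map is then one file to open and map, and its value lists the tensors to read from that mapping.
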
## Takeaways

1. **GPT-2 vs. modern HF config naming**: GPT-2 uses `n_embd`/`n_head`/`n_layer`/`n_positions` instead of `hidden_size`/`num_attention_heads` and so on. `ModelConfig` needs to support both naming schemes and expose unified accessor methods (`hidden()`, `num_heads()`, etc.).

2. **Zero-copy reads with safetensors**: The weight file is mmap'd, the `safetensors` crate parses the header to get each tensor's offset and shape, and the raw bytes are then read without copying. For GPT-2's roughly 500 MB weight file, loading is fast.

3. **Network issues when downloading models**: HuggingFace is unreachable from networks in China. Use modelscope.cn or hf-mirror.com as alternatives. For large files (>100 MB), the redirect to the CDN may also fail; modelscope's `snapshot_download` is more reliable.