Phase 6: Model Loading (xserv-model)
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7: BPE Tokenizer (xserv-tokenizer)
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8: GPT-2 Complete Inference
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
  "The future of AI is uncertain. The future of AI is uncertain..."
  "Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 8: GPT-2 Complete Inference — Design Document (Milestone ①)
Goal
Wire everything together: load GPT-2 124M, tokenize the input, run the forward pass, sample tokens, and decode the output. This is the first time the model "speaks".
Model Architecture (GPT-2 124M)
hidden_size = 768
num_heads = 12
num_layers = 12
vocab_size = 50257
max_position_embeddings = 1024
activation = GELU
normalization = LayerNorm (pre-LN)
tied embeddings (lm_head == wte)
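As a sanity check on these hyperparameters, the sketch below puts them in a small Rust struct and derives the head dimension and the parameter count. The struct and method names are hypothetical and are not taken from xserv. `intermediate_size = 4 * hidden_size` is an assumption (the usual GPT-2 MLP width) that this document does not state, and `c_proj`/`c_fc` are the Hugging Face checkpoint names for the projection layers.

```rust
/// Hypothetical config struct mirroring the hyperparameters listed above.
#[derive(Debug, Clone, Copy)]
pub struct Gpt2Config {
    pub hidden_size: usize,
    pub num_heads: usize,
    pub num_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
    /// Assumption: MLP width is 4 * hidden_size (not stated in this document).
    pub intermediate_size: usize,
}

impl Gpt2Config {
    pub fn gpt2_124m() -> Self {
        Self {
            hidden_size: 768,
            num_heads: 12,
            num_layers: 12,
            vocab_size: 50257,
            max_position_embeddings: 1024,
            intermediate_size: 4 * 768,
        }
    }

    /// Per-head width: hidden_size split evenly across heads.
    pub fn head_dim(&self) -> usize {
        assert_eq!(self.hidden_size % self.num_heads, 0);
        self.hidden_size / self.num_heads
    }

    /// Parameter count with tied embeddings (lm_head reuses wte, so it adds nothing).
    pub fn num_parameters(&self) -> usize {
        let h = self.hidden_size;
        let m = self.intermediate_size;
        let embeddings = self.vocab_size * h + self.max_position_embeddings * h; // wte + wpe
        let layernorm = 2 * h; // weight + bias
        let attention = (h * 3 * h + 3 * h) + (h * h + h); // c_attn + c_proj
        let mlp = (h * m + m) + (m * h + h); // c_fc + c_proj
        let per_layer = 2 * layernorm + attention + mlp; // ln_1, ln_2, attention, MLP
        embeddings + self.num_layers * per_layer + layernorm // final term is ln_f
    }
}
```

Under these assumptions the head dimension is 64 and the total is about 124.4M parameters, which matches the "124M" in the model name.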
Forward Pass
tokens [S]
  → wte[tokens] + wpe[0..S] → [S, 768]
  → for each layer:
      residual = x
      x = layernorm(x, ln_1)
      x = attention(x)        # Q,K,V from linear, MHA, output linear
      x = x + residual
      residual = x
      x = layernorm(x, ln_2)
      x = mlp(x)              # linear→GELU→linear
      x = x + residual
  → layernorm(x, ln_f)
  → logits = x @ wte.T → [S, 50257]
  → sample(logits[-1]) → next token
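To make the pre-LN pattern concrete, here is a minimal CPU sketch of the MLP half of one layer (layernorm → linear → GELU → linear → residual add) for a single token vector. It is not the xserv implementation: the function and struct names are hypothetical, the weights are plain row-major `[in, out]` slices (the Conv1D layout described in the takeaways), and both the tanh-approximated GELU and `eps = 1e-5` are assumptions based on the standard GPT-2 configuration.

```rust
/// LayerNorm over one token vector: (x - mean) / sqrt(var + eps) * gamma + beta.
pub fn layernorm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

/// GELU, tanh approximation (assumed here; the document only says "GELU").
pub fn gelu(x: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// y = x @ weight + bias, with weight stored row-major as [in, out].
pub fn linear(x: &[f32], weight: &[f32], bias: &[f32]) -> Vec<f32> {
    let (n_in, n_out) = (x.len(), bias.len());
    assert_eq!(weight.len(), n_in * n_out);
    let mut y = bias.to_vec();
    for (i, xi) in x.iter().enumerate() {
        for (o, yo) in y.iter_mut().enumerate() {
            *yo += xi * weight[i * n_out + o];
        }
    }
    y
}

/// Hypothetical container for the weights of one MLP sub-layer.
pub struct MlpWeights {
    pub ln_g: Vec<f32>,
    pub ln_b: Vec<f32>,
    pub fc_w: Vec<f32>,   // [hidden, intermediate]
    pub fc_b: Vec<f32>,   // [intermediate]
    pub proj_w: Vec<f32>, // [intermediate, hidden]
    pub proj_b: Vec<f32>, // [hidden]
}

/// Pre-LN sub-layer: normalize first, apply the MLP, then add the untouched input back.
pub fn mlp_sublayer(x: &[f32], w: &MlpWeights) -> Vec<f32> {
    let h = layernorm(x, &w.ln_g, &w.ln_b, 1e-5);
    let h: Vec<f32> = linear(&h, &w.fc_w, &w.fc_b).into_iter().map(gelu).collect();
    let h = linear(&h, &w.proj_w, &w.proj_b);
    x.iter().zip(&h).map(|(a, b)| a + b).collect()
}
```

The attention half follows the same shape: normalize, transform, add the residual. Because the residual is added to the un-normalized input, a sub-layer whose output projection is all zeros returns its input unchanged.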
Sampling
- Greedy: argmax
- Temperature: logits / T → softmax → sample
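- Top-K: keep the k largest logits, set the rest to -inf
- Top-P: sort by probability (descending), keep the leading tokens whose cumulative probability stays within p, set the rest to -inf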
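The four strategies can be sketched on a plain logits slice as follows. This is an illustrative sketch rather than the xserv sampler, and every function name is hypothetical. Two choices are assumptions: the top-p filter also keeps the token that crosses the threshold, so the kept set is never empty, and the random draw is passed in as `u` in [0, 1) because the Rust standard library has no random number generator.

```rust
/// Greedy: index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature: softmax(logits / T), computed with the max subtracted for stability.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0);
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-K: mask everything below the k-th largest logit (ties at the threshold are kept).
pub fn top_k_filter(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Top-P: keep the most probable tokens until their cumulative probability reaches p.
pub fn top_p_filter(logits: &mut [f32], p: f32) {
    let probs = softmax_with_temperature(&*logits, 1.0);
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a])); // descending probability
    let mut keep = vec![false; logits.len()];
    let mut cumulative = 0.0;
    for &i in &order {
        keep[i] = true;
        cumulative += probs[i];
        if cumulative >= p {
            break;
        }
    }
    for (l, k) in logits.iter_mut().zip(keep) {
        if !k {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Inverse-CDF draw: `u` is a uniform random number in [0, 1) supplied by the caller.
pub fn sample_from(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1
}
```

A typical order would be to filter the logits with top-k and/or top-p, apply `softmax_with_temperature`, then call `sample_from`. Masked entries carry -inf, so they come out of the softmax with probability 0.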
CLI Binary
$ cargo run --release --bin xserv-cli -- --model path/to/gpt2
xserv> The future of AI is
GPT-2> ...generated text...
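Behind each prompt, the CLI has to run a generation loop. The sketch below shows what a greedy loop without a KV cache might look like. The `forward` closure and the `eos` parameter are hypothetical stand-ins, not xserv APIs: `forward` takes the full token sequence and returns the last-position logits, and `eos` is an optional stop token.

```rust
/// Greedy generation without a KV cache: every step re-runs `forward` on the whole sequence.
/// `forward` is a hypothetical stand-in for the model; it returns last-position logits.
pub fn generate_greedy<F>(
    prompt: &[u32],
    max_tokens: usize,
    eos: Option<u32>,
    mut forward: F,
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_tokens {
        // Full forward pass over prompt + everything generated so far.
        let logits = forward(&tokens);
        // Greedy choice: index of the largest logit.
        let mut next = 0usize;
        for (i, &v) in logits.iter().enumerate() {
            if v > logits[next] {
                next = i;
            }
        }
        let next = next as u32;
        tokens.push(next);
        if eos == Some(next) {
            break;
        }
    }
    // Return only the newly generated tokens.
    tokens.split_off(prompt.len())
}
```

Passing the model as a closure keeps the loop testable with a toy function. The returned token ids would then go through the tokenizer's decode step before being printed.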
Test Plan
- Greedy generation produces coherent English text
- Interactive CLI works (pipe and interactive mode)
- Multiple prompts verified: "The future of AI is", "Once upon a time"
Takeaways
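- QKV split + head reshape layout trap (the most critical bug): GPT-2's c_attn output [S, 3H] has to be split into Q/K/V and then reshaped to [1, num_heads, S, head_dim]. The key mistake: reshaping directly from [S, num_heads, head_dim] to [1, num_heads, S, head_dim] is not a transpose. A reshape only reinterprets the shape of the flat data; it does not reorder it. The data must be written out explicitly in the target [batch, head, seq, dim] layout. The same applies to merge_heads, which also needs an explicit reorder.
- CPU round-trip as a correctness-first strategy: add_tensors, add_bias, split_qkv, and merge_heads are all implemented via a CPU round-trip. This is slow (every call incurs a GPU→CPU→GPU copy), but it ensures correctness. Phase 15 will replace these operations with dedicated CUDA kernels.
- GPT-2's Conv1D weight layout: GPT-2 uses Conv1D rather than Linear, and stores weights as [in, out] (not the [out, in] of a standard Linear). The computation is x @ weight, with no transpose. This differs from the [out, in] layout used by Qwen3/LLaMA, which Phase 10 needs to account for.
- Repetition under greedy decoding: GPT-2 124M falls into loops very easily under greedy decoding ("The world was a place of great danger, and..."). This is known behavior that temperature plus top-k/top-p sampling can mitigate. The current implementation has greedy only; sampling will be added later.
- Performance cost of having no KV cache: every generated token reruns the full forward pass (O(S²) attention). Generating 50 tokens takes 50 full forward passes, and the attention cost of each pass keeps growing with the sequence. The KV cache in Phase 9 will bring decode down to O(S) per token.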
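The layout trap in the first takeaway is easiest to see with explicit index arithmetic. The sketch below is a hypothetical CPU version of the head split and merge. The names echo split_qkv and merge_heads from the text, but the signatures are assumptions. It copies one of Q/K/V out of the [S, 3H] c_attn output into [num_heads, S, head_dim] order. The leading batch dimension of 1 does not change the flat layout, so it is omitted.

```rust
/// Copy Q (which = 0), K (1) or V (2) out of a row-major [S, 3H] buffer
/// into row-major [num_heads, S, head_dim]. This is a real transpose, not a reshape.
pub fn split_heads(
    qkv: &[f32],
    seq: usize,
    num_heads: usize,
    head_dim: usize,
    which: usize,
) -> Vec<f32> {
    let hidden = num_heads * head_dim;
    assert_eq!(qkv.len(), seq * 3 * hidden);
    assert!(which < 3);
    let mut out = vec![0.0; seq * hidden];
    for s in 0..seq {
        for h in 0..num_heads {
            for d in 0..head_dim {
                // Source: row s, slot `which` of the 3H columns, then head h, then element d.
                let src = s * 3 * hidden + which * hidden + h * head_dim + d;
                // Destination: head-major, so all positions of one head are contiguous.
                let dst = (h * seq + s) * head_dim + d;
                out[dst] = qkv[src];
            }
        }
    }
    out
}

/// Inverse reorder: [num_heads, S, head_dim] back to [S, H].
pub fn merge_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    let hidden = num_heads * head_dim;
    assert_eq!(x.len(), seq * hidden);
    let mut out = vec![0.0; x.len()];
    for h in 0..num_heads {
        for s in 0..seq {
            for d in 0..head_dim {
                out[s * hidden + h * head_dim + d] = x[(h * seq + s) * head_dim + d];
            }
        }
    }
    out
}
```

With S = 2, two heads, and head_dim = 2, the Q rows of a buffer filled with 0..24 are [0, 1, 2, 3] and [12, 13, 14, 15]. A bare reshape would leave them in that order, whereas the head-major layout is [0, 1, 12, 13, 2, 3, 14, 15]: head 0 sees positions 0 and 1 first, then head 1.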