xserv/docs/08-gpt2.md
Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00


Phase 8: GPT-2 Complete Inference — Design Document (Milestone ①)

Goal

Wire everything together: load GPT-2 124M, tokenize the input, run the forward pass, sample tokens, and decode the output. This is the first point at which the model "speaks".

Model Architecture (GPT-2 124M)

hidden_size = 768
num_heads = 12
num_layers = 12
vocab_size = 50257
max_position_embeddings = 1024
activation = GELU
normalization = LayerNorm (pre-LN)
tied embeddings (lm_head == wte)
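
As a sanity check on the hyperparameters above, the following is a minimal sketch that derives the per-head dimension and the total parameter count. `Gpt2Config` and its methods are hypothetical names, not taken from the xserv code. The sketch assumes the standard GPT-2 block layout (a fused QKV projection, an MLP with inner size 4 × hidden_size, biases on every linear layer) and counts the tied `wte` matrix once.

```rust
/// Hypothetical config struct mirroring the hyperparameters listed above.
#[derive(Debug, Clone, Copy)]
pub struct Gpt2Config {
    pub hidden_size: usize,
    pub num_heads: usize,
    pub num_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
}

impl Gpt2Config {
    /// Values for GPT-2 124M as listed in this document.
    pub const fn gpt2_124m() -> Self {
        Self {
            hidden_size: 768,
            num_heads: 12,
            num_layers: 12,
            vocab_size: 50257,
            max_position_embeddings: 1024,
        }
    }

    /// Width of one attention head.
    pub const fn head_dim(&self) -> usize {
        self.hidden_size / self.num_heads
    }

    /// Parameter count, assuming the standard GPT-2 block layout.
    /// lm_head is tied to wte, so it adds no parameters of its own.
    pub const fn param_count(&self) -> usize {
        let h = self.hidden_size;
        let embeddings = self.vocab_size * h + self.max_position_embeddings * h;
        let layernorm = 2 * h; // gain + bias
        let attention = (h * 3 * h + 3 * h) + (h * h + h); // c_attn + c_proj
        let mlp = (h * 4 * h + 4 * h) + (4 * h * h + h); // c_fc + c_proj
        let per_layer = 2 * layernorm + attention + mlp;
        embeddings + self.num_layers * per_layer + layernorm // + ln_f
    }
}
```

With these values `head_dim()` is 64 and `param_count()` is 124,439,808, which is where the "124M" in the model name comes from.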

Forward Pass

tokens [S]
  → wte[tokens] + wpe[0..S]           → [S, 768]
  → for each layer:
      residual = x
      x = layernorm(x, ln_1)
      x = attention(x)                 # Q,K,V from linear, MHA, output linear
      x = x + residual
      residual = x
      x = layernorm(x, ln_2)
      x = mlp(x)                       # linear→GELU→linear
      x = x + residual
  → layernorm(x, ln_f)
  → logits = x @ wte.T                → [S, 50257]
  → sample(logits[-1])                 → next token
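
The pre-LN residual wiring and the tied output projection above can be sketched on plain `f32` slices. This is a CPU reference for a single position, not the xserv implementation: the function names are hypothetical, the sublayer (attention or MLP) is passed in as a closure, and `wte` is assumed to be stored row-major as `[vocab, hidden]`.

```rust
/// LayerNorm over one hidden vector:
/// (x - mean) / sqrt(var + eps) * gamma + beta.
pub fn layernorm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

/// One pre-LN half-block: x + sublayer(layernorm(x)).
/// `sublayer` stands in for attention or the MLP.
pub fn pre_ln_sublayer<F>(x: &[f32], gamma: &[f32], beta: &[f32], sublayer: F) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let normed = layernorm(x, gamma, beta, 1e-5);
    let out = sublayer(&normed);
    // The residual is the un-normalized input, as in the pseudocode above.
    x.iter().zip(&out).map(|(r, o)| r + o).collect()
}

/// Tied output projection for one position: logits[t] = dot(h, wte[t]).
/// `wte` is assumed row-major with shape [vocab, hidden].
pub fn tied_logits(h: &[f32], wte: &[f32], vocab: usize) -> Vec<f32> {
    let d = h.len();
    (0..vocab)
        .map(|t| {
            wte[t * d..(t + 1) * d]
                .iter()
                .zip(h)
                .map(|(w, x)| w * x)
                .sum::<f32>()
        })
        .collect()
}
```

Because the projection reuses `wte`, no separate `lm_head` matrix has to be loaded or stored.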

Sampling

  • Greedy: argmax
  • Temperature: logits / T → softmax → sample
  • Top-K: keep top-k logits, rest = -inf
  • Top-P: sort by probability in descending order, keep the leading tokens whose cumulative probability stays within p, rest = -inf
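
The four strategies above can be sketched with the standard library alone. All function names here are hypothetical. To avoid depending on an RNG crate, the caller supplies a uniform random number `u` in [0, 1). One deliberate difference from the bullet above: the top-p filter in this sketch also keeps the token that crosses the threshold `p`, a common variant that guarantees at least one token survives.

```rust
/// Greedy: index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature softmax: softmax(logits / T), stabilized by subtracting the max.
pub fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-K: keep the k largest logits, set the rest to -inf.
/// Ties at the threshold are all kept, so more than k may survive.
pub fn top_k_filter(logits: &[f32], k: usize) -> Vec<f32> {
    if k == 0 || k >= logits.len() {
        return logits.to_vec();
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    logits
        .iter()
        .map(|&l| if l >= threshold { l } else { f32::NEG_INFINITY })
        .collect()
}

/// Top-P: keep the most probable tokens until their cumulative
/// probability reaches p (crossing token included), then renormalize.
pub fn top_p_filter(probs: &[f32], p: f32) -> Vec<f32> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a])); // descending
    let mut kept = vec![0.0f32; probs.len()];
    let mut cum = 0.0f32;
    for &i in &order {
        kept[i] = probs[i];
        cum += probs[i];
        if cum >= p {
            break;
        }
    }
    let total: f32 = kept.iter().sum();
    kept.iter().map(|v| v / total).collect()
}

/// Inverse-CDF sampling: `u` is a uniform random number in [0, 1)
/// supplied by the caller.
pub fn sample_from(probs: &[f32], u: f32) -> usize {
    let mut cum = 0.0f32;
    for (i, &p) in probs.iter().enumerate() {
        cum += p;
        if u < cum {
            return i;
        }
    }
    // Rounding fallback: last token with non-zero probability.
    probs.iter().rposition(|&p| p > 0.0).unwrap_or(0)
}
```

A typical chain would be `top_k_filter` on the raw logits, then `softmax` with the chosen temperature, then `top_p_filter`, then `sample_from`.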

CLI Binary

$ cargo run --release --bin xserv-cli -- --model path/to/gpt2

xserv> The future of AI is
GPT-2> ...generated text...

Test Plan

  • Greedy generation produces coherent English text
  • Interactive CLI works (pipe and interactive mode)
  • Multiple prompts verified: "The future of AI is", "Once upon a time"

Takeaways

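  1. The layout trap in QKV split + head reshape (the most critical bug): GPT-2's c_attn output [S, 3H] must be split into Q/K/V and then reshaped to [1, num_heads, S, head_dim]. The key mistake: reshaping directly from [S, num_heads, head_dim] to [1, num_heads, S, head_dim] is not equivalent to a transpose. Reshape only reinterprets the shape of the flat data; it does not reorder the data. The data must be written out manually in the target [batch, head, seq, dim] layout. Likewise, merge_heads needs a manual reorder.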
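
A minimal CPU sketch of the reorder described in takeaway 1, on a flat row-major `f32` buffer with batch size 1. The function names echo `merge_heads` from the text, but the signatures are assumptions rather than the xserv API. The point is the index arithmetic: element `(s, h, d)` moves from offset `(s * heads + h) * dim + d` to offset `(h * seq + s) * dim + d`.

```rust
/// [S, num_heads, head_dim] -> [num_heads, S, head_dim] (batch = 1).
/// A plain reshape would leave the flat buffer untouched, which is the bug.
pub fn split_heads(x: &[f32], seq: usize, heads: usize, dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * heads * dim);
    let mut out = vec![0.0f32; x.len()];
    for s in 0..seq {
        for h in 0..heads {
            for d in 0..dim {
                out[(h * seq + s) * dim + d] = x[(s * heads + h) * dim + d];
            }
        }
    }
    out
}

/// Inverse reorder: [num_heads, S, head_dim] -> [S, num_heads, head_dim].
pub fn merge_heads(x: &[f32], seq: usize, heads: usize, dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * heads * dim);
    let mut out = vec![0.0f32; x.len()];
    for h in 0..heads {
        for s in 0..seq {
            for d in 0..dim {
                out[(s * heads + h) * dim + d] = x[(h * seq + s) * dim + d];
            }
        }
    }
    out
}
```

For `seq = 2, heads = 2, dim = 2` and input `[0, 1, 2, 3, 4, 5, 6, 7]`, `split_heads` yields `[0, 1, 4, 5, 2, 3, 6, 7]`, whereas a reshape alone would keep the original order and mix tokens across heads.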

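  2. CPU round-trip as a correctness-first strategy: add_tensors, add_bias, split_qkv, and merge_heads are all implemented through a CPU round-trip. This is slow (every call incurs a GPU→CPU→GPU copy) but guarantees correctness. Phase 15 will replace these operations with dedicated CUDA kernels.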

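  3. GPT-2's Conv1D weight layout: GPT-2 uses Conv1D rather than Linear, so weights are stored as [in, out] (not the [out, in] of a standard Linear). The computation is x @ weight, with no transpose needed. This differs from the [out, in] layout used by Qwen3/LLaMA, which Phase 10 needs to account for.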
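
The difference between the two weight layouts in takeaway 3 comes down to one index expression. The sketch below applies a linear layer to a single input vector under both conventions; the function names are hypothetical and the weights are assumed to be flat row-major buffers.

```rust
/// GPT-2 Conv1D convention: w is [in, out], y = x @ w + b.
pub fn linear_in_out(x: &[f32], w: &[f32], b: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(w.len(), n_in * n_out);
    let mut y = b.to_vec();
    for i in 0..n_in {
        for o in 0..n_out {
            y[o] += x[i] * w[i * n_out + o];
        }
    }
    y
}

/// Standard Linear convention: w is [out, in], y = x @ w^T + b.
pub fn linear_out_in(x: &[f32], w: &[f32], b: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(w.len(), n_in * n_out);
    let mut y = b.to_vec();
    for o in 0..n_out {
        for i in 0..n_in {
            y[o] += x[i] * w[o * n_in + i];
        }
    }
    y
}

/// Transpose a row-major [rows, cols] matrix into [cols, rows].
/// One possible way to convert between the two layouts at load time.
pub fn transpose(w: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = w[r * cols + c];
        }
    }
    out
}
```

Feeding an `[in, out]` weight into the `[out, in]` routine without transposing produces wrong values silently whenever `n_in == n_out`, since the buffer sizes match and no shape check fires.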

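  4. Repetition under greedy decoding: GPT-2 124M falls into loops very easily under greedy decoding ("The world was a place of great danger, and..."). This is known behavior; temperature plus top-k/top-p sampling can mitigate it. The current implementation supports greedy only; sampling will be added later.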

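  5. The performance cost of having no KV cache: every generated token reruns the full forward pass, with O(S²) attention. Generating 50 tokens takes 50 full forward passes, and the attention cost of each pass keeps growing. The KV cache in Phase 9 will bring decode down to O(S) per token.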
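
To put rough numbers on takeaway 5, the sketch below counts attention score entries (query-key pairs) per head per layer, with and without a KV cache. It is a back-of-the-envelope model under stated assumptions, not a measurement: the function names are hypothetical, the no-cache path is counted as a full S × S score matrix on every step (no savings from the causal mask), and the cached path is counted as one prefill over the prompt followed by one query per new token.

```rust
/// Score entries without a KV cache: step t reruns the forward pass over
/// all s = prompt_len + t tokens and builds an s x s score matrix.
pub fn attention_pairs_no_cache(prompt_len: usize, new_tokens: usize) -> usize {
    (0..new_tokens)
        .map(|t| {
            let s = prompt_len + t;
            s * s
        })
        .sum()
}

/// Score entries with a KV cache: one prefill over the prompt, then each
/// later step scores a single new query against all cached keys.
pub fn attention_pairs_with_cache(prompt_len: usize, new_tokens: usize) -> usize {
    if new_tokens == 0 {
        return 0;
    }
    let prefill = prompt_len * prompt_len;
    let decode = (1..new_tokens).map(|t| prompt_len + t).sum::<usize>();
    prefill + decode
}
```

Under this model, a 5-token prompt with 50 generated tokens costs 53,925 score entries without a cache and 1,495 with one, roughly a 36× difference that widens as the sequence grows.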