Gahow Wang e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 22:04:00 +08:00

Phase 8: GPT-2 Complete Inference — Design Document (Milestone ①)

Goal

Wire everything together: load GPT-2 124M, tokenize the input, run the forward pass, sample tokens, and decode the output. This is the first point at which the model "speaks".

Model Architecture (GPT-2 124M)

hidden_size = 768
num_heads = 12
num_layers = 12
vocab_size = 50257
max_position_embeddings = 1024
activation = GELU
normalization = LayerNorm (pre-LN)
tied embeddings (lm_head == wte)
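As a sanity check on the numbers above, the configuration can be written down as a small struct and the parameter count derived from it. This is only a sketch: `Gpt2Config` and its methods are hypothetical names, not the actual `ModelConfig` in xserv-model, and the count assumes the standard GPT-2 layout (MLP inner width of 4 × hidden_size, biases on every linear layer, tied lm_head not counted twice).

```rust
/// Hypothetical config struct; field names mirror the list above.
#[derive(Debug, Clone, Copy)]
pub struct Gpt2Config {
    pub hidden_size: usize,
    pub num_heads: usize,
    pub num_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
}

impl Gpt2Config {
    /// Values for GPT-2 124M as listed in this document.
    pub const fn gpt2_124m() -> Self {
        Self {
            hidden_size: 768,
            num_heads: 12,
            num_layers: 12,
            vocab_size: 50257,
            max_position_embeddings: 1024,
        }
    }

    /// Per-head width: hidden_size split evenly across heads.
    pub const fn head_dim(&self) -> usize {
        self.hidden_size / self.num_heads
    }

    /// Total parameters. Assumes MLP inner width = 4 * hidden_size and
    /// biases on all linear layers; lm_head is tied to wte, so it adds nothing.
    pub fn param_count(&self) -> usize {
        let h = self.hidden_size;
        let inner = 4 * h;
        let embeddings = self.vocab_size * h + self.max_position_embeddings * h;
        let layer_norm = 2 * h; // gamma + beta
        let attn = (h * 3 * h + 3 * h) + (h * h + h); // c_attn + c_proj
        let mlp = (h * inner + inner) + (inner * h + h); // c_fc + c_proj
        let per_layer = 2 * layer_norm + attn + mlp;
        embeddings + self.num_layers * per_layer + layer_norm // + ln_f
    }
}
```

Under these assumptions `Gpt2Config::gpt2_124m().head_dim()` is 64 and `param_count()` lands at roughly 124M, which matches the model's name.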

Forward Pass

tokens [S]
  → wte[tokens] + wpe[0..S]           → [S, 768]
  → for each layer:
      residual = x
      x = layernorm(x, ln_1)
      x = attention(x)                 # Q,K,V from linear, MHA, output linear
      x = x + residual
      residual = x
      x = layernorm(x, ln_2)
      x = mlp(x)                       # linear→GELU→linear
      x = x + residual
  → layernorm(x, ln_f)
  → logits = x @ wte.T                → [S, 50257]
  → sample(logits[-1])                 → next token
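The per-layer wiring above (normalize, run the sublayer, add the residual) can be sketched on plain CPU slices. This is an illustration rather than the xserv implementation: `layer_norm` and `pre_ln_block` are hypothetical helpers, the attention and MLP sublayers are passed in as closures, tensors are flat row-major `[S, dim]` buffers, and the epsilon of 1e-5 is an assumption.

```rust
/// LayerNorm over the last dimension of a row-major [rows, dim] buffer.
pub fn layer_norm(x: &[f32], dim: usize, gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(x.len());
    for row in x.chunks(dim) {
        let mean = row.iter().sum::<f32>() / dim as f32;
        let var = row.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / dim as f32;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, v) in row.iter().enumerate() {
            out.push((v - mean) * inv_std * gamma[i] + beta[i]);
        }
    }
    out
}

/// One pre-LN transformer block: the residual is taken *before* each LayerNorm.
/// `ln_1` and `ln_2` are (gamma, beta) pairs; `attention` and `mlp` are
/// stand-ins for the real sublayers.
pub fn pre_ln_block<A, M>(
    x: &[f32],
    dim: usize,
    ln_1: (&[f32], &[f32]),
    ln_2: (&[f32], &[f32]),
    attention: A,
    mlp: M,
) -> Vec<f32>
where
    A: Fn(&[f32]) -> Vec<f32>,
    M: Fn(&[f32]) -> Vec<f32>,
{
    const EPS: f32 = 1e-5; // assumed epsilon
    // x = x + attention(layernorm(x, ln_1))
    let normed = layer_norm(x, dim, ln_1.0, ln_1.1, EPS);
    let attn_out = attention(&normed);
    let x: Vec<f32> = x.iter().zip(&attn_out).map(|(r, a)| r + a).collect();
    // x = x + mlp(layernorm(x, ln_2))
    let normed = layer_norm(&x, dim, ln_2.0, ln_2.1, EPS);
    let mlp_out = mlp(&normed);
    x.iter().zip(&mlp_out).map(|(r, m)| r + m).collect()
}
```

A quick way to check the wiring is to pass sublayers that return zeros: the block then has to return its input unchanged, because only the residual path contributes.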

Sampling

Greedy: argmax
Temperature: logits / T → softmax → sample
Top-K: keep top-k logits, rest = -inf
Top-P: sort by prob, keep the smallest prefix whose cumsum reaches p
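The four strategies can be sketched as small functions over a logits slice. All names here are hypothetical and this is not the xserv sampler. Two assumptions are worth flagging: top-p follows the common convention of keeping the token that crosses the threshold, and the random draw is passed in as `u` in [0, 1) so the sketch needs no RNG crate.

```rust
/// Greedy: index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature: softmax(logits / T), with the max subtracted for stability.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-K: keep the k largest logits, set the rest to -inf.
/// Logits tied with the k-th value are all kept.
pub fn top_k_filter(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a));
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Top-P: keep the smallest high-probability prefix whose cumulative mass
/// reaches p, zero the rest, then renormalize.
pub fn top_p_filter(probs: &mut [f32], p: f32) {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut cumulative = 0.0;
    let mut keep = order.len();
    for (rank, &idx) in order.iter().enumerate() {
        cumulative += probs[idx];
        if cumulative >= p {
            keep = rank + 1;
            break;
        }
    }
    for &idx in &order[keep..] {
        probs[idx] = 0.0;
    }
    let sum: f32 = probs.iter().sum();
    for q in probs.iter_mut() {
        *q /= sum;
    }
}

/// Inverse-CDF draw: `u` is a uniform random number in [0, 1).
pub fn sample_from(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &q) in probs.iter().enumerate() {
        cumulative += q;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1
}
```

A typical pipeline would apply `top_k_filter` to the logits, convert with `softmax_with_temperature`, apply `top_p_filter`, and then call `sample_from`; greedy decoding skips all of that and calls `argmax` directly.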

CLI Binary

$ cargo run --release --bin xserv-cli -- --model path/to/gpt2

xserv> The future of AI is
GPT-2> ...generated text...

Test Plan

Greedy generation produces coherent English text
Interactive CLI works (pipe and interactive mode)
Multiple prompts verified: "The future of AI is", "Once upon a time"

Takeaways

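The QKV split + head reshape layout trap (the most critical bug): GPT-2's c_attn output of shape [S, 3H] must be split into Q/K/V and then arranged as [1, num_heads, S, head_dim]. The key mistake: reshaping directly from [S, num_heads, head_dim] to [1, num_heads, S, head_dim] is not a transpose. A reshape only reinterprets the shape of the flat data; it does not reorder it. The data has to be written out explicitly in the target [batch, head, seq, dim] layout. merge_heads needs the same explicit reordering.
CPU round-trip as a correctness-first strategy: add_tensors, add_bias, split_qkv, and merge_heads are all implemented via a CPU round-trip. This is slow (every call does a GPU→CPU→GPU copy), but it ensures correctness. Phase 15 will replace these operations with dedicated CUDA kernels.
GPT-2's Conv1D weight layout: GPT-2 uses Conv1D rather than Linear, and stores weights as [in, out] (not the [out, in] of a standard Linear). The computation is x @ weight, with no transpose. This differs from the [out, in] layout used by Qwen3/LLaMA, which Phase 10 needs to account for.
Repetition under greedy decoding: GPT-2 124M falls into loops very easily under greedy decoding ("The world was a place of great danger, and..."). This is known behavior that temperature plus top-k/top-p sampling can mitigate. The current implementation is greedy only; sampling will be added later.
Performance cost of having no KV cache: every generated token reruns the full forward pass (O(S²) attention). Generating 50 tokens takes 50 full forward passes, and the attention cost of each pass keeps growing with the sequence. The KV cache in Phase 9 will bring decode down to O(S) per token.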
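The reshape-versus-transpose trap from the first takeaway is easiest to see with explicit index arithmetic. The sketch below is a CPU illustration, not the project's split_qkv or merge_heads code: `split_heads` and this `merge_heads` are hypothetical helpers that work on flat row-major buffers, with the batch dimension of 1 omitted since it does not change the flat layout.

```rust
/// Reorder a row-major [seq, num_heads, head_dim] buffer into
/// [num_heads, seq, head_dim]. A plain reshape would leave the flat data
/// untouched, which is exactly the bug described above.
pub fn split_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * num_heads * head_dim);
    let mut out = vec![0.0; x.len()];
    for s in 0..seq {
        for h in 0..num_heads {
            for d in 0..head_dim {
                let src = (s * num_heads + h) * head_dim + d; // [s, h, d]
                let dst = (h * seq + s) * head_dim + d; // [h, s, d]
                out[dst] = x[src];
            }
        }
    }
    out
}

/// Inverse of `split_heads`: [num_heads, seq, head_dim] back to
/// [seq, num_heads, head_dim], i.e. [seq, hidden].
pub fn merge_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * num_heads * head_dim);
    let mut out = vec![0.0; x.len()];
    for h in 0..num_heads {
        for s in 0..seq {
            for d in 0..head_dim {
                let src = (h * seq + s) * head_dim + d; // [h, s, d]
                let dst = (s * num_heads + h) * head_dim + d; // [s, h, d]
                out[dst] = x[src];
            }
        }
    }
    out
}
```

With seq = 2, num_heads = 2, head_dim = 2 and input values 0 through 7, the split output interleaves rows from the two positions, so it differs from the input buffer that a bare reshape would have produced. Running `merge_heads` on the result restores the original order.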
