# Phase 8: GPT-2 Complete Inference (Design Document, Milestone ①)

## Goal

Wire everything together: load GPT-2 124M, tokenize the input, run the forward pass, sample tokens, and decode the output. This is the first time we see the model "speak".

## Model Architecture (GPT-2 124M)

```
hidden_size = 768
num_heads = 12
num_layers = 12
vocab_size = 50257
max_position_embeddings = 1024
activation = GELU
normalization = LayerNorm (pre-LN)
tied embeddings (lm_head == wte)
```

## Forward Pass

```
tokens [S]
→ wte[tokens] + wpe[0..S] → [S, 768]
→ for each layer:
    residual = x
    x = layernorm(x, ln_1)
    x = attention(x)        # Q,K,V from linear, MHA, output linear
    x = x + residual
    residual = x
    x = layernorm(x, ln_2)
    x = mlp(x)              # linear→GELU→linear
    x = x + residual
→ layernorm(x, ln_f)
→ logits = x @ wte.T → [S, 50257]
→ sample(logits[-1]) → next token
```

## Sampling

- Greedy: argmax
- Temperature: logits / T → softmax → sample
- Top-K: keep the top-k logits, set the rest to -inf
- Top-P: sort by probability (descending), keep the shortest prefix whose cumulative probability reaches p

## CLI Binary

```
$ cargo run --release --bin xserv-cli -- --model path/to/gpt2
xserv> The future of AI is
GPT-2> ...generated text...
```

## Test Plan

- [x] Greedy generation produces coherent English text
- [x] Interactive CLI works (pipe and interactive mode)
- [x] Multiple prompts verified: "The future of AI is", "Once upon a time"

## Takeaways

1. **The layout trap in QKV split + head reshape (the most critical bug)**: GPT-2's `c_attn` outputs `[S, 3H]`, which has to be split into Q/K/V and then laid out as `[1, num_heads, S, head_dim]`. The key mistake: a direct `reshape` from `[S, num_heads, head_dim]` to `[1, num_heads, S, head_dim]` is not a transpose! Reshape only reinterprets the shape of the flat data; it does not reorder the data. The values must be written out manually in the target `[batch, head, seq, dim]` layout. For the same reason, `merge_heads` also needs a manual reorder.

2. **CPU round-trip as a correctness-first strategy**: `add_tensors`, `add_bias`, `split_qkv`, and `merge_heads` are all implemented through a CPU round-trip. This is slow (every call pays for a GPU→CPU→GPU copy), but it guarantees correctness. Phase 15 will replace these operations with dedicated CUDA kernels.

3. **GPT-2's Conv1D weight layout**: GPT-2 uses `Conv1D` rather than `Linear`, and its weights are stored as `[in, out]` (not the `[out, in]` of a standard Linear). The computation is `x @ weight`, with no transpose needed. This differs from the `[out, in]` layout of Qwen3/LLaMA, which Phase 10 needs to watch for.

4. **Repetition under greedy decoding**: with greedy decoding, GPT-2 124M falls into loops very easily ("The world was a place of great danger, and..."). This is known behavior, and temperature plus top-k/top-p sampling can mitigate it. The current implementation only has greedy decoding; sampling will be added later.

5. **The performance cost of having no KV cache**: every generated token reruns the complete forward pass (O(S²) attention). Generating 50 tokens takes 50 full forward passes, and the attention cost of each pass keeps growing. The KV cache in Phase 9 will bring decode down to O(S) per token.
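
The layout trap from takeaway 1 can be sketched on plain CPU slices. The function names, signatures, and the `which` selector below are hypothetical and not taken from the xserv code; the sketch only assumes the fused `c_attn` output is row-major `[S, 3H]` with Q, K, V occupying consecutive blocks of `H` columns.

```rust
/// Hypothetical helper (not the project's actual API): copy one of Q/K/V out
/// of a fused row-major `[S, 3H]` buffer into `[num_heads, S, head_dim]`
/// order. The batch dimension of 1 is implicit. `which`: 0 = Q, 1 = K, 2 = V.
fn split_heads(
    qkv: &[f32],
    seq: usize,
    num_heads: usize,
    head_dim: usize,
    which: usize,
) -> Vec<f32> {
    let hidden = num_heads * head_dim;
    assert_eq!(qkv.len(), seq * 3 * hidden);
    assert!(which < 3);
    let mut out = vec![0.0f32; seq * hidden];
    for s in 0..seq {
        for h in 0..num_heads {
            for d in 0..head_dim {
                // Source: row s, block `which`, head h, element d.
                let src = s * 3 * hidden + which * hidden + h * head_dim + d;
                // Destination: head-major, so each head's S rows are contiguous.
                let dst = (h * seq + s) * head_dim + d;
                out[dst] = qkv[src];
            }
        }
    }
    out
}

/// Inverse reorder: `[num_heads, S, head_dim]` back to row-major `[S, H]`.
fn merge_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    let hidden = num_heads * head_dim;
    assert_eq!(x.len(), seq * hidden);
    let mut out = vec![0.0f32; seq * hidden];
    for h in 0..num_heads {
        for s in 0..seq {
            for d in 0..head_dim {
                out[s * hidden + h * head_dim + d] = x[(h * seq + s) * head_dim + d];
            }
        }
    }
    out
}
```

With `seq = 2`, `num_heads = 2`, `head_dim = 2` and a buffer filled with `0..24`, the Q block in `[S, H]` order is `[0, 1, 2, 3, 12, 13, 14, 15]`, while `split_heads` returns `[0, 1, 12, 13, 2, 3, 14, 15]`. A bare reshape would have left the first ordering untouched, which is exactly the bug described above.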
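
The strategies listed under Sampling could be combined roughly as follows. This is a minimal sketch, not the project's sampler: the `sample` signature is hypothetical, `top_k == 0` is assumed to mean "no top-k limit", `temperature <= 0` is assumed to mean greedy, and the uniform random number `u` in `[0, 1)` is passed in by the caller so the sketch needs no RNG crate.

```rust
/// Index of the largest logit (greedy decoding).
fn argmax(logits: &[f32]) -> usize {
    assert!(!logits.is_empty());
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Hypothetical sampler: temperature, then top-k, then top-p, then one draw.
/// `u` is a caller-supplied uniform random number in [0, 1).
fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, u: f32) -> usize {
    assert!(!logits.is_empty());
    if temperature <= 0.0 {
        return argmax(logits); // assumption: T <= 0 selects greedy decoding
    }
    // Token indices sorted by logit, largest first.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));

    // Top-K: dropping the tail is equivalent to setting those logits to -inf.
    let k = if top_k == 0 { idx.len() } else { top_k.min(idx.len()) };
    idx.truncate(k);

    // Softmax over logits / T, shifted by the max for numerical stability.
    let max = logits[idx[0]] / temperature;
    let mut probs: Vec<f32> = idx
        .iter()
        .map(|&i| (logits[i] / temperature - max).exp())
        .collect();
    let sum: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= sum;
    }

    // Top-P: keep the shortest prefix whose cumulative probability reaches p.
    let mut cum = 0.0;
    let mut keep = probs.len();
    for (n, &p) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);

    // Inverse-CDF draw over the renormalized survivors.
    let total: f32 = probs.iter().sum();
    let mut acc = 0.0;
    for (n, &p) in probs.iter().enumerate() {
        acc += p / total;
        if u < acc {
            return idx[n];
        }
    }
    idx[keep - 1] // guard against rounding when u is very close to 1
}
```

Taking `u` as a parameter keeps the function deterministic and easy to test; a real decode loop would draw `u` from whatever random source the project uses.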
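
The Conv1D layout note from takeaway 3 comes down to which index is contiguous in the weight buffer. The two functions below are illustrative names only (neither is claimed to exist in the project); they compute the same kind of projection over a row-major input `[S, in]`, one reading the weights as `[in, out]` and the other as `[out, in]`.

```rust
/// GPT-2 Conv1D style: weights stored row-major as `[in, out]`, y = x @ w + b.
fn linear_in_out(x: &[f32], w: &[f32], bias: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(w.len(), n_in * n_out);
    assert_eq!(bias.len(), n_out);
    assert_eq!(x.len() % n_in, 0);
    let seq = x.len() / n_in;
    let mut y = vec![0.0f32; seq * n_out];
    for s in 0..seq {
        for o in 0..n_out {
            let mut acc = bias[o];
            for i in 0..n_in {
                acc += x[s * n_in + i] * w[i * n_out + o]; // w[i][o]
            }
            y[s * n_out + o] = acc;
        }
    }
    y
}

/// Standard Linear style: weights stored row-major as `[out, in]`, y = x @ w^T + b.
fn linear_out_in(x: &[f32], w: &[f32], bias: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(w.len(), n_in * n_out);
    assert_eq!(bias.len(), n_out);
    assert_eq!(x.len() % n_in, 0);
    let seq = x.len() / n_in;
    let mut y = vec![0.0f32; seq * n_out];
    for s in 0..seq {
        for o in 0..n_out {
            let mut acc = bias[o];
            for i in 0..n_in {
                acc += x[s * n_in + i] * w[o * n_in + i]; // w[o][i]
            }
            y[s * n_out + o] = acc;
        }
    }
    y
}
```

Feeding the same weight buffer to both functions gives different results, so a loader that assumes the wrong layout produces plausible-looking but wrong activations rather than an error.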