# Phase 8: GPT-2 Complete Inference — Design Document (Milestone ①)

## Goal

Wire everything together: load GPT-2 124M, tokenize the input, run the forward pass, sample tokens, and decode the output. This is the first milestone at which the model "speaks".

## Model Architecture (GPT-2 124M)

```
hidden_size = 768
num_heads = 12
num_layers = 12
vocab_size = 50257
max_position_embeddings = 1024
activation = GELU
normalization = LayerNorm (pre-LN)
tied embeddings (lm_head == wte)
```
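As a sanity check on the numbers above, the hyperparameters can be captured in a small config struct and used to recompute the parameter count. This is a minimal sketch: the struct and method names are hypothetical, and the MLP inner width of `4 * hidden_size` is an assumption (the standard GPT-2 choice, not stated in the table above).

```rust
/// Hypothetical config struct mirroring the hyperparameter table above.
#[derive(Debug, Clone, Copy)]
pub struct Gpt2Config {
    pub hidden_size: usize,
    pub num_heads: usize,
    pub num_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
}

impl Gpt2Config {
    pub const fn gpt2_124m() -> Self {
        Self {
            hidden_size: 768,
            num_heads: 12,
            num_layers: 12,
            vocab_size: 50257,
            max_position_embeddings: 1024,
        }
    }

    /// Per-head width: hidden_size is split evenly across heads.
    pub const fn head_dim(&self) -> usize {
        self.hidden_size / self.num_heads
    }

    /// Assumption: the MLP inner width is 4 * hidden_size.
    pub const fn mlp_size(&self) -> usize {
        4 * self.hidden_size
    }

    /// Parameter count with tied embeddings (lm_head reuses wte, so it adds nothing).
    pub const fn num_parameters(&self) -> usize {
        let h = self.hidden_size;
        let m = self.mlp_size();
        let embeddings = self.vocab_size * h + self.max_position_embeddings * h; // wte + wpe
        let layer_norm = 2 * h; // scale + shift
        let attention = (h * 3 * h + 3 * h) + (h * h + h); // c_attn + output projection
        let mlp = (h * m + m) + (m * h + h); // two linears with biases
        let per_layer = 2 * layer_norm + attention + mlp;
        embeddings + self.num_layers * per_layer + layer_norm // final term is ln_f
    }
}
```

With these values `head_dim()` is 64 and `num_parameters()` comes out to 124,439,808, which is where the "124M" name comes from.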

## Forward Pass

```
tokens [S]
  → wte[tokens] + wpe[0..S]           → [S, 768]
  → for each layer:
      residual = x
      x = layernorm(x, ln_1)
      x = attention(x)                 # Q,K,V from linear, MHA, output linear
      x = x + residual
      residual = x
      x = layernorm(x, ln_2)
      x = mlp(x)                       # linear→GELU→linear
      x = x + residual
  → layernorm(x, ln_f)
  → logits = x @ wte.T                → [S, 50257]
  → sample(logits[-1])                 → next token
```
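The pre-LN pattern in the pseudocode above (normalize, apply the sublayer, add the residual) can be sketched on the CPU for a single row of `x`. This is an illustrative reference, not the project's implementation: the function names are hypothetical, `eps = 1e-5` and the tanh approximation of GELU are assumptions about what the GPT-2 checkpoint expects, and the sublayer is passed in as a closure so the same helper covers both the attention and MLP halves of a layer.

```rust
/// LayerNorm over one row: normalize to zero mean and unit variance,
/// then apply the learned scale (gamma) and shift (beta).
pub fn layernorm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| (v - mean) * inv_std * g + b)
        .collect()
}

/// GELU, tanh approximation (assumed variant).
pub fn gelu(x: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// One pre-LN sublayer step: x + sublayer(layernorm(x)).
/// The residual is taken from the un-normalized input.
pub fn pre_ln_residual<F>(x: &[f32], gamma: &[f32], beta: &[f32], sublayer: F) -> Vec<f32>
where
    F: Fn(&[f32]) -> Vec<f32>,
{
    let normed = layernorm(x, gamma, beta, 1e-5); // eps is an assumption
    let out = sublayer(&normed);
    x.iter().zip(&out).map(|(r, o)| r + o).collect()
}
```

A full layer would call `pre_ln_residual` twice, once with the attention sublayer and `ln_1`, then once with the MLP sublayer and `ln_2`.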

## Sampling

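- Greedy: argmax over the logits of the last position
- Temperature: logits / T → softmax → sample
- Top-K: keep the k largest logits, set the rest to -inf
- Top-P: sort by probability (descending), keep the smallest prefix whose cumulative probability reaches p, renormalize, then sample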
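The four strategies above can be sketched as small functions over a logits slice. This is a hedged illustration with hypothetical names, not the project's sampler. To stay free of external crates, the random draw is passed in as a uniform value `u` in `[0, 1)`; a real implementation would take it from an RNG.

```rust
/// Greedy: index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature: softmax(logits / T), stabilized by subtracting the max logit.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-K: keep the k largest logits, set the rest to -inf.
/// Logits tied with the k-th largest are all kept.
pub fn top_k_filter(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Top-P: keep the smallest set of most probable tokens whose cumulative
/// probability reaches p, zero out the rest, and renormalize.
pub fn top_p_filter(probs: &mut [f32], p: f32) {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a])); // descending by prob
    let mut cumulative = 0.0;
    let mut keep = order.len();
    for (rank, &idx) in order.iter().enumerate() {
        cumulative += probs[idx];
        if cumulative >= p {
            keep = rank + 1; // include the token that crosses p
            break;
        }
    }
    for &idx in &order[keep..] {
        probs[idx] = 0.0;
    }
    let sum: f32 = probs.iter().sum();
    for q in probs.iter_mut() {
        *q /= sum;
    }
}

/// Draw an index from `probs` given a uniform random `u` in [0, 1).
pub fn sample_from(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Rounding guard: fall back to the last token with nonzero probability.
    probs.iter().rposition(|&p| p > 0.0).unwrap_or(0)
}
```

A typical combination would apply `top_k_filter` to the logits, convert with `softmax_with_temperature`, apply `top_p_filter`, and then call `sample_from`.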

## CLI Binary

```
$ cargo run --release --bin xserv-cli -- --model path/to/gpt2

xserv> The future of AI is
GPT-2> ...generated text...
```

## Test Plan

- [x] Greedy generation produces coherent English text
- [x] Interactive CLI works (pipe and interactive mode)
- [x] Multiple prompts verified: "The future of AI is", "Once upon a time"

## Takeaways

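1. **Layout trap in QKV split + head reshape (the most critical bug)**: GPT-2's `c_attn` outputs `[S, 3H]`, which must be split into Q/K/V and then rearranged into `[1, num_heads, S, head_dim]`. The key mistake: a direct `reshape` from `[S, num_heads, head_dim]` to `[1, num_heads, S, head_dim]` is not a transpose. Reshape only reinterprets the shape of the flat data; it does not reorder the data. The data must be written out manually in the target `[batch, head, seq, dim]` layout. Likewise, `merge_heads` needs a manual reorder.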

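2. **CPU round-trip as a correctness-first strategy**: `add_tensors`, `add_bias`, `split_qkv`, and `merge_heads` are all implemented via a CPU round-trip. This is slow (every call incurs a GPU→CPU→GPU copy), but it guarantees correctness. Phase 15 will replace these operations with dedicated CUDA kernels.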

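3. **GPT-2's Conv1D weight layout**: GPT-2 uses `Conv1D` rather than `Linear`, and stores weights as `[in, out]` (not the `[out, in]` of a standard Linear). The computation is `x @ weight`, with no transpose. This differs from the `[out, in]` layout used by Qwen3/LLaMA, which Phase 10 needs to account for.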

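4. **Repetition under greedy decoding**: with greedy decoding, GPT-2 124M falls into loops very easily ("The world was a place of great danger, and..."). This is known behavior, and temperature with top-k/top-p sampling can mitigate it. The current implementation only supports greedy; sampling will be added later.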

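5. **Performance cost of having no KV cache**: every generated token requires rerunning the complete forward pass (O(S²) attention). Generating 50 tokens takes 50 full forward passes, and the attention cost of each pass keeps growing as the sequence gets longer. The KV cache in Phase 9 will bring decode down to O(S) per token.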
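The layout trap from takeaway 1 can be made concrete with a CPU sketch of the manual reorder. The function names echo `split_qkv` and `merge_heads` from the takeaways, but the signatures and bodies here are hypothetical illustrations, not the project's code. The batch dimension is 1, so it does not affect the flat layout and is omitted.

```rust
/// Split the c_attn output [S, 3H] into three [S, H] buffers.
/// Each row is laid out as [q (H) | k (H) | v (H)].
pub fn split_qkv(qkv: &[f32], seq: usize, hidden: usize) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    assert_eq!(qkv.len(), seq * 3 * hidden);
    let (mut q, mut k, mut v) = (Vec::new(), Vec::new(), Vec::new());
    for row in qkv.chunks_exact(3 * hidden) {
        q.extend_from_slice(&row[..hidden]);
        k.extend_from_slice(&row[hidden..2 * hidden]);
        v.extend_from_slice(&row[2 * hidden..]);
    }
    (q, k, v)
}

/// Reorder [S, num_heads, head_dim] (the same flat data as [S, H])
/// into [num_heads, S, head_dim]. This is a transpose of the first two
/// axes, so every head_dim-sized chunk has to move.
pub fn split_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * num_heads * head_dim);
    let mut out = vec![0.0; x.len()];
    for s in 0..seq {
        for h in 0..num_heads {
            let src = (s * num_heads + h) * head_dim;
            let dst = (h * seq + s) * head_dim;
            out[dst..dst + head_dim].copy_from_slice(&x[src..src + head_dim]);
        }
    }
    out
}

/// Inverse of split_heads: [num_heads, S, head_dim] back to [S, num_heads, head_dim].
pub fn merge_heads(x: &[f32], seq: usize, num_heads: usize, head_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), seq * num_heads * head_dim);
    let mut out = vec![0.0; x.len()];
    for h in 0..num_heads {
        for s in 0..seq {
            let src = (h * seq + s) * head_dim;
            let dst = (s * num_heads + h) * head_dim;
            out[dst..dst + head_dim].copy_from_slice(&x[src..src + head_dim]);
        }
    }
    out
}
```

For `seq = 2`, `num_heads = 2`, `head_dim = 2`, the input `[0, 1, 2, 3, 4, 5, 6, 7]` becomes `[0, 1, 4, 5, 2, 3, 6, 7]` after `split_heads`. A plain reshape would have left the buffer unchanged, silently mixing data from different heads and positions.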