fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:53:28 +08:00
parent d8493bd70f
commit ee68d3565d
38 changed files with 3012 additions and 259 deletions
--- a/docs/00-roadmap.md
+++ b/docs/00-roadmap.md
@@ -9,7 +9,7 @@
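 | Abstraction level | Level 0.5 | Hand-written CUDA kernels, switchable with cuBLAS for benchmark comparison |
 | Hardware | 8×RTX 5090 (Blackwell, CC 12.0, 32GB GDDR7) | PCIe Gen5 x16 interconnect only, no NVLink (see the hardware topology below) |
 | Language | Rust + CUDA (C/C++) | Rust calls CUDA through FFI |
-| Starting models | GPT-2 124M → Qwen3-7B | From simple to practical |
+| Starting models | GPT-2 124M → Qwen3-8B | From simple to practical |
 | Precision | BF16/FP16 | FP8 extension later |
 | Tensor | Custom implementation | Learn tensor abstraction design end to end |
 | Tokenizer | Custom BPE implementation | Learn how tokenization works |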
@@ -101,7 +101,7 @@ Phase 8: GPT-2 完整推理 ◄──────────── 里程碑
    │
 Phase 9: KV Cache + Autoregressive Generation
    │
-Phase 10: Qwen3-7B support ◄─────────── Milestone ② 7B model inference
+Phase 10: Qwen3-8B support ◄─────────── Milestone ② 8B model inference
    │
 Phase 11: Paged Attention + KV Cache Manager
    │
@@ -109,7 +109,7 @@ Phase 12: Continuous Batching + Request Scheduler
    │
 Phase 13: HTTP API + SSE Streaming ◄── Milestone ③ end-to-end API available
    │
-Phase 14: Flash Attention v2
+Phase 14: Flash Attention (FA2 for SM120)
    │
 Phase 15: Performance optimization ◄──────────────── Milestone ④ 50% vLLM throughput
    │
@@ -625,8 +625,8 @@ safetensors file (disk)

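 - [ ] Load GPT-2 124M (`openai-community/gpt2`) and print every tensor's name, shape, and dtype
 - [ ] Spot-check the first 10 values of several tensors against PyTorch `from_pretrained`
-- [ ] Load the sharded Qwen3-7B weights and verify that every tensor loads successfully
-- [ ] Performance: measure weight loading time for the 7B model (full mmap → GPU path)
+- [ ] Load the sharded Qwen3-8B weights and verify that every tensor loads successfully
+- [ ] Performance: measure weight loading time for the 8B model (full mmap → GPU path)
 - [ ] Error handling: missing tensors, dtype mismatches, nonexistent files, and similar cases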

 ---
@@ -869,15 +869,15 @@ weights × V_cache [B, H, S, D] → output [B, H, 1, D]

 ---

-## Phase 10: Qwen3-7B Support (Milestone ②)
+## Phase 10: Qwen3-8B Support (Milestone ②)

 **Crate**: `xserv-model`

-**Goal**: Extend the model definitions to support Qwen3-7B and verify output correctness.
+**Goal**: Extend the model definitions to support Qwen3-8B and verify output correctness.

 ### Architecture comparison

-| Feature | GPT-2 (124M) | Qwen3-7B |
+| Feature | GPT-2 (124M) | Qwen3-8B |
 |------|-------------|----------|
 | Normalization | LayerNorm (pre-LN) | RMSNorm (pre-LN) |
 | Position Encoding | Learned absolute (wpe) | RoPE (no separate parameters) |
@@ -885,8 +885,8 @@ weights × V_cache [B, H, S, D] → output [B, H, 1, D]
 | Activation | GELU | SwiGLU (SiLU gate) |
 | FFN | Linear(H→4H) → GELU → Linear(4H→H) | gate_proj + up_proj → SiLU gate → down_proj |
 | Vocab Size | 50,257 | ~152,000 |
-| Hidden Size | 768 | 3,584 (7B) |
-| Layers | 12 | 28 |
+| Hidden Size | 768 | 4,096 (8B) |
+| Layers | 12 | 36 |
 | Tied Embeddings | Yes | No |

 ### Components to add or modify
@@ -948,16 +948,16 @@ pub struct Qwen3DecoderLayer {
 ### Memory budget (BF16, single 5090 with 32GB)

 ```
-Model weights:  7B × 2B = ~14 GB
-KV cache:  28 layers × 2(KV) × 8 heads × 4096 tokens × 128 dim × 2B ≈ 4.5 GB
+Model weights:  8B × 2B = ~16 GB
+KV cache:  36 layers × 2(KV) × 8 heads × 4096 tokens × 128 dim × 2B ≈ 0.6 GB
 Activation (single request): ~1 GB
 ────────────────────────
-Total: ~19.5 GB (single request), leaving ~12 GB for additional concurrency
+Total: ~17.6 GB (single request), leaving ~14 GB for additional concurrency
 ```
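
The KV-cache term in the budget above is a plain product of model dimensions. A minimal sketch of that arithmetic, assuming the figures quoted in the block (36 layers, 8 KV heads, head dim 128, 4096 tokens, 2-byte BF16). The helper names `kv_cache_bytes` and `to_gib` are hypothetical and not part of xserv:

```rust
/// Bytes needed to cache K and V for one sequence.
/// Illustrative helper only, not an xserv API.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    tokens: u64,
    bytes_per_elem: u64,
) -> u64 {
    // The factor 2 accounts for storing both K and V.
    layers * 2 * kv_heads * head_dim * tokens * bytes_per_elem
}

/// Convert a byte count to GiB for display.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}
```

Evaluating `to_gib(kv_cache_bytes(36, 8, 128, 4096, 2))` gives the per-sequence cache size for a 4096-token context; varying `tokens` shows how the cache scales with context length and therefore how many concurrent sequences fit in the remaining memory.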

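 ### Acceptance tests

-- [ ] Load Qwen3-7B weights onto a single 5090 and print the model structure and parameter count
+- [ ] Load Qwen3-8B weights onto a single 5090 and print the model structure and parameter count
 - [ ] Compare prefill logits with HF transformers: input "你好" → top-5 logits match
 - [ ] English generation: "What is the capital of France?" → produces a reasonable answer
 - [ ] Chinese generation: "请介绍一下量子计算" → produces fluent Chinese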
@@ -1196,7 +1196,7 @@ GET  /health                 # 健康检查
 **Chat Completion Request**:
 ```json
 {
-  "model": "qwen3-7b",
+  "model": "qwen3-8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 1+1?"}
@@ -1211,13 +1211,13 @@ GET  /health                 # 健康检查

 **SSE Streaming Response**:
 ```
-data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
+data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

-data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-7b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
+data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

-data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-7b","choices":[{"index":0,"delta":{"content":" answer"},"finish_reason":null}]}
+data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":" answer"},"finish_reason":null}]}

-data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-7b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
+data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

 data: [DONE]
 ```
@@ -1228,7 +1228,7 @@ data: [DONE]
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
-  "model": "qwen3-7b",
+  "model": "qwen3-8b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The answer is 2."},
@@ -1278,7 +1278,7 @@ Client (curl / Python OpenAI SDK)
  ```bash
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
-    -d '{"model":"qwen3-7b","messages":[{"role":"user","content":"Hello"}],"stream":true}'
+    -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Hello"}],"stream":true}'
  ```
   SSE output should appear token by token

@@ -1287,7 +1287,7 @@ Client (curl / Python OpenAI SDK)
  from openai import OpenAI
  client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
  for chunk in client.chat.completions.create(
-      model="qwen3-7b",
+      model="qwen3-8b",
      messages=[{"role": "user", "content": "What is 1+1?"}],
      stream=True
  ):
@@ -1302,12 +1302,26 @@ Client (curl / Python OpenAI SDK)

 ---

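-## Phase 14: Flash Attention v2
+## Phase 14: Flash Attention (FA2 for SM120)

 **Crate**: `xserv-kernels`
 **CUDA source**: `csrc/attention/flash_attention.cu`

-**Goal**: Implement a Flash Attention v2 CUDA kernel that substantially reduces attention memory usage and improves speed.
+**Goal**: Implement a Flash Attention CUDA kernel that substantially reduces attention memory usage and improves speed.
+
+### Hardware compatibility notes
+
+Flash Attention has reached its fourth generation (FA4, arxiv 2603.05451), but each version has explicit hardware dependencies:
+
+| Version | Target architecture | Key hardware features | RTX 5090 compatible |
+|------|---------|------------|--------------|
+| FA2 | Generic CUDA (SM75+) | Standard shared memory + HMMA | **Yes** ✅ |
+| FA3 | Hopper SM90 (H100) | TMA + WGMMA + warp specialization | No |
+| FA4 | Blackwell SM100 (B200/B300) | TMEM + async MMA + 2-CTA mode | No |
+
+**The RTX 5090 (SM120, CC 12.0) uses the consumer Blackwell architecture (GB202), a different silicon design from datacenter Blackwell (B200, SM100). SM120 physically lacks the TMEM (Tensor Memory) subsystem, so the FA4 kernels cannot run on the 5090. This is a hardware-level difference, not a software limitation.**
+
+This project therefore implements the **FA2 algorithm** in standard CUDA (shared memory + HMMA). FA2's core optimizations, online softmax tiling and never materializing the N×N score matrix (extra memory grows linearly rather than quadratically with sequence length), apply on any architecture.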

 ### Core idea

@@ -1323,16 +1337,18 @@ Flash Attention 的解法:
 - Split Q, K, V into tiles and compute in SRAM (shared memory)
 - Use the **online softmax trick**: update the running max and running sum as each tile is processed

-### Algorithm (Forward Pass)
+### Algorithm (Forward Pass, FA2)
+
+FA2's improvement over FA1: the outer loop iterates over Q tiles (rather than K/V tiles), which reduces HBM reads and writes.

 ```
 Br, Bc = tile sizes for Q and K/V respectively

-for each Q tile (q_start..q_start+Br):
+for each Q tile (q_start..q_start+Br):                    ← outer loop: Q tiles
    load Q_tile [Br, D] to shared memory
-    initialize: O_tile = 0, l = 0, m = -inf    // running sum and max
+    initialize: O_tile = 0, l = 0, m = -inf               // running sum and max

-    for each K,V tile (kv_start..kv_start+Bc):
+    for each K,V tile (kv_start..kv_start+Bc):             ← inner loop: K/V tiles
        load K_tile [Bc, D], V_tile [Bc, D] to shared memory

        // Compute attention scores for this tile pair
@@ -1345,6 +1361,8 @@ for each Q tile (q_start..q_start+Br):
        m_new = max(m, rowmax(S_tile))          // new running max
        P_tile = exp(S_tile - m_new)            // safe exp
        l_new = exp(m - m_new) * l + rowsum(P_tile)  // update running sum
+
+        // Rescale and accumulate output
        O_tile = diag(exp(m - m_new)) * O_tile + P_tile @ V_tile
        m = m_new
        l = l_new
@@ -1356,9 +1374,12 @@ for each Q tile (q_start..q_start+Br):
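 ### Implementation notes

 1. **Tile size selection**:
-   - Limited by shared memory (5090 Blackwell CC 12.0: the per-SM shared memory limit must be confirmed by measurement)
-   - Q_tile, K_tile, V_tile, and S_tile must all be resident at the same time
-   - Typical values: Br=Bc=128 for D=128, BF16
+   - 5090 SM120: shared memory per SM = 100 KB (to be confirmed by measurement)
+   - Q_tile, K_tile, V_tile, and S_tile must all be resident at the same time
+   - BF16: Q_tile [Br, D] = Br × 128 × 2B; K_tile [Bc, D] = Bc × 128 × 2B; V_tile is the same size as K_tile
+   - S_tile [Br, Bc] stays in FP32 = Br × Bc × 4B
+   - Recommended starting point: Br=Bc=64, head_dim=128 → 3 × 16 KB (Q, K, V) + 16 KB (S) = ~64 KB of shared memory
+   - Optimized version: Br=Bc=128 needs 3 × 32 KB + 64 KB = ~160 KB, which exceeds the per-SM limit, so the tiles would have to be split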

 2. **Causal mask optimization**:
    - If a K/V tile lies entirely in the "future" of the Q tile (kv_start > q_end) → skip the whole tile
@@ -1369,10 +1390,14 @@ for each Q tile (q_start..q_start+Br):
    - Load Q, K, V in BF16 (saves bandwidth)
    - Convert the final O back to BF16 when writing it out

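-4. **Combining with Paged Attention**:
-   - Flash Attention's K/V tile traversal must be adapted to indirect addressing
-   - Each tile looks up block_table to obtain its physical address
-   - This is the core of "Flash-Decoding" / "FlashInfer"
+4. **GQA support**:
+   - When there are fewer K/V heads than Q heads, index inside the kernel with `kv_head = q_head / (num_q_heads / num_kv_heads)`
+   - No repeat_kv operation is needed; the mapping is resolved inside the kernel
+
+5. **Decode attention specialization**:
+   - During decode Q has a single row (Br=1), so attention degenerates to vector-matrix attention
+   - A dedicated decode attention kernel can be written (similar to FlashDecoding)
+   - Perform a parallel reduction along the KV sequence dimension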
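
The online-softmax update in the forward-pass pseudocode can be checked on the CPU before any CUDA is written. The sketch below is a simplified reference, assuming a single query row and scalar values (head dim 1) to keep it short. The function names are hypothetical, and the final division by `l` is the standard normalization step, assumed here because that part of the pseudocode falls outside the diff hunk:

```rust
/// Reference: softmax(scores) · values computed in one pass over the full row.
fn attention_row_direct(scores: &[f32], values: &[f32]) -> f32 {
    let m = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let l: f32 = scores.iter().map(|s| (s - m).exp()).sum();
    scores.iter().zip(values).map(|(s, v)| (s - m).exp() * v).sum::<f32>() / l
}

/// Same quantity computed tile by tile (tile width `bc` > 0) using the
/// running max `m`, running sum `l`, and rescaled accumulator `o`.
fn attention_row_online(scores: &[f32], values: &[f32], bc: usize) -> f32 {
    let (mut m, mut l, mut o) = (f32::NEG_INFINITY, 0.0f32, 0.0f32);
    for (s_tile, v_tile) in scores.chunks(bc).zip(values.chunks(bc)) {
        let tile_max = s_tile.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(tile_max);
        // Rescale factor for everything accumulated so far.
        // On the first tile m is -inf, so the factor is exp(-inf) = 0.
        let scale = (m - m_new).exp();
        let p: Vec<f32> = s_tile.iter().map(|s| (s - m_new).exp()).collect();
        l = scale * l + p.iter().sum::<f32>();
        o = scale * o + p.iter().zip(v_tile).map(|(p, v)| p * v).sum::<f32>();
        m = m_new;
    }
    // Final normalization by the running sum.
    o / l
}
```

Comparing the two functions over several tile widths is a cheap way to validate the rescaling logic; a GPU kernel test could then compare against `attention_row_direct` with a tolerance suited to BF16 inputs.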

 ### Acceptance tests

@@ -1386,8 +1411,9 @@ for each Q tile (q_start..q_start+Br):
 | 8192 | OOM? | MB | OOM? | ms |
 | 32768 | OOM | MB | OOM | ms |

-- [ ] Integrate into Qwen3-7B and compare end-to-end decode latency
+- [ ] Integrate into Qwen3-8B and compare end-to-end decode latency
 - [ ] Profile: analyze compute utilization and memory throughput with `ncu`
+- [ ] GQA support: no repeat_kv overhead

 ---

@@ -1441,7 +1467,7 @@ ncu --target-processes all --set full ./target/release/xserv-server

 ### Acceptance tests

-- [ ] Install vLLM and run Qwen3-7B on the same machine
+- [ ] Install vLLM and run Qwen3-8B on the same machine
 - [ ] Benchmark comparison:

 | Metric | vLLM | xserv | Ratio |
@@ -1488,7 +1514,7 @@ ncu --target-processes all --set full ./target/release/xserv-server

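 - **Lossless**: rejection sampling guarantees that the output distribution matches that of the target model alone
 - **Speedup conditions**: the draft model is fast enough and its distribution is close to the target's
-- **Draft model choice**: Qwen3-0.5B / Qwen3-1.5B as the draft for Qwen3-7B
+- **Draft model choice**: Qwen3-0.6B / Qwen3-1.7B as the draft for Qwen3-8B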

 ### KV cache handling

@@ -1578,7 +1604,7 @@ Row Parallel: down_proj 按行切分

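 ### Acceptance tests

-- [ ] TP=2: Qwen3-7B output is identical to single-GPU output (TP=1)
+- [ ] TP=2: Qwen3-8B output is identical to single-GPU output (TP=1)
 - [ ] TP=4: per-GPU weight memory is about 1/4 of the total
 - [ ] Scaling benchmark (GPUs 0-3, same group):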

@@ -1646,7 +1672,7 @@ tensor_fp8 = cast_to_fp8(tensor / scale)
 | FP8 E4M3 | X.XX | +0.XX |
 | INT8 weight-only | X.XX | +0.XX |

-- [ ] Memory: FP8 weights take about half the space of BF16 (~7 GB for 7B model)
+- [ ] Memory: FP8 weights take about half the space of BF16 (~8 GB for 8B model)
 - [ ] Performance: FP8 GEMM throughput vs BF16 GEMM

 ---
@@ -1727,7 +1753,7 @@ Text  → Tokenizer   → Text Tokens   ────────────→
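 | Milestone | Phase | Acceptance criteria |
 |--------|-------|---------|
 | ① GPT-2 inference | 8 | Prompt entered via CLI, GPT-2 generates coherent text, logits match PyTorch |
-| ② Qwen3-7B inference | 10 | 7B model handles Chinese and English dialogue, multi-turn chat template is correct |
+| ② Qwen3-8B inference | 10 | 8B model handles Chinese and English dialogue, multi-turn chat template is correct |
 | ③ E2E API | 13 | HTTP streaming API, callable from the Python OpenAI SDK, correct under 10 concurrent requests |
 | ④ Performance target | 15 | throughput >= 50% vLLM, profiling report completed |
 | ⑤ Multi-GPU inference | 17 | TP=2/4 inference correct on same-group GPUs, scaling benchmark completed |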
--- a/docs/10-qwen3.md
+++ b/docs/10-qwen3.md
@@ -1,12 +1,12 @@
-# Phase 10: Qwen3-7B Support — Design Document (Milestone ②)
+# Phase 10: Qwen3-8B Support — Design Document (Milestone ②)

 ## Goal

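-Extend the model definitions to support the Qwen3-7B architecture and verify output correctness. Key differences from GPT-2: RMSNorm, RoPE, GQA, SwiGLU, and untied embeddings.
+Extend the model definitions to support the Qwen3-8B architecture and verify output correctness. Key differences from GPT-2: RMSNorm, RoPE, GQA, SwiGLU, and untied embeddings.

 ## Architecture differences (GPT-2 → Qwen3)

-| Feature | GPT-2 | Qwen3-7B |
+| Feature | GPT-2 | Qwen3-8B |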
 |------|-------|----------|
 | Norm | LayerNorm(gamma, beta) | RMSNorm(gamma only) |
 | Position | Learned absolute (wpe) | RoPE (no params) |
@@ -15,8 +15,8 @@
 | FFN | 2 Linear (fc, proj) + GELU | 3 Linear (gate, up, down) + SwiGLU |
 | Weight layout | [in, out] (Conv1D style) | [out, in] (standard Linear) |
 | Tied embeddings | Yes | No (separate lm_head) |
-| hidden_size | 768 | 3584 |
-| num_layers | 12 | 28 |
+| hidden_size | 768 | 4096 |
+| num_layers | 12 | 36 |
 | head_dim | 64 | 128 |

 ## Weight Names (HuggingFace)
@@ -67,17 +67,17 @@ out  = down_proj(out)    # [S, 18944] @ [18944, 3584]^T → [S, 3584]
 ## Memory budget (BF16, single 5090)

 ```
-Weights: 7B × 2B = ~14 GB (BF16)
-        7B × 4B = ~28 GB (FP32): does not fit, BF16 is required
+Weights: 8B × 2B = ~16 GB (BF16)
+        8B × 4B = ~32 GB (FP32): does not fit, BF16 is required
 KV cache (S=256, B=1): ~0.1 GB
-Total: ~14 GB (BF16), runs on a single GPU
+Total: ~16 GB (BF16), runs on a single GPU
 ```

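-**Key point**: Qwen3-7B must use BF16 to run on a single 5090 (32GB). GPT-2 currently runs in FP32, so a BF16 forward pass must be supported.
+**Key point**: Qwen3-8B must use BF16 to run on a single 5090 (32GB). GPT-2 currently runs in FP32, so a BF16 forward pass must be supported.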

 ## Implementation Plan

-1. Download the Qwen3-7B model (BF16, ~14GB)
+1. Download the Qwen3-8B model (BF16, ~16GB)
 2. Implement the Qwen3 model structure (qwen3.rs)
 3. Support the BF16 forward pass (linear_transpose for [out, in] weights)
 4. Implement GQA (K/V repeat in split)
--- a/docs/TO-BE-FIXED.md
+++ b/docs/TO-BE-FIXED.md
@@ -0,0 +1,287 @@
+# xserv — To Be Fixed
+
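+> Fix list produced by the strictest review. Every fix has explicit acceptance criteria; reward hacking is prohibited.
+> Priority: P0 (blocks usability) > P1 (serious bug/performance) > P2 (important improvement) > P3 (design debt)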
+
+---
+
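+## FIX-01: Global cuBLAS handle, eliminate per-call creation [P0-Performance]
+
+**Problem**: Every `matmul` / `batched_matmul` call in `gemm.rs` runs `cublasCreate_v2` + `cublasDestroy_v2`. A single Qwen3-8B forward pass issues about 168 matmuls, and each handle create/destroy costs several milliseconds.
+
+**Fix requirements**:
+- Use a thread-local or global singleton cuBLAS handle
+- The handle lives for the whole process and is never created or destroyed inside matmul
+- `CublasContext` supports switching streams via `set_stream`
+
+**Acceptance criteria**:
+1. `grep -rn "cublasCreate_v2" crates/xserv-kernels/src/gemm.rs` matches exactly once (at initialization)
+2. The bodies of `matmul` and `batched_matmul` no longer contain `CublasContext::new()`
+3. The build succeeds and all existing gemm_test cases pass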
+
+---
+
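+## FIX-02: Remove unnecessary cudaDeviceSynchronize calls [P0-Performance]
+
+**Problem**: Almost every kernel wrapper ends with `xserv_cuda::device::synchronize()` (that is, `cudaDeviceSynchronize`), which stalls the GPU pipeline after every launch.
+
+**Fix requirements**:
+- Delete the `device::synchronize()` calls in all kernel wrappers
+- Synchronize only when GPU data must be read back to the CPU (for example `sample_greedy`, `to_device(Cpu)`, benchmarks)
+- The `Tensor::to_device(Cpu)` path already synchronizes implicitly (`cudaMemcpy` is synchronous), so no extra sync is needed
+- If kernels run on the null stream (the default stream), `cudaMemcpy` implicitly waits for all prior work on the default stream
+
+**Acceptance criteria**:
+1. `grep -rn "device::synchronize" crates/xserv-kernels/src/` returns 0 lines
+2. `grep -rn "device::synchronize" crates/xserv-model/src/` matches only in the benchmark binary, not in the forward path
+3. The build succeeds and all existing tests pass
+4. Inference results are bit-exact with the pre-fix output (greedy decoding of the same prompt yields the same token sequence)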
+
+---
+
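+## FIX-03: Fix the chat template [P0-Feature]
+
+**Problem**: `build_prompt` in `api.rs` simply concatenates text and emits no ChatML special tokens, so the prompt the Qwen3 model receives has no conversation structure.
+
+**Fix requirements**:
+- Generate a prompt in the Qwen3 ChatML format:
+  ```
+  <|im_start|>system\n{content}<|im_end|>\n<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n
+  ```
+- If there is no system message, skip the system part
+- If there are multiple alternating assistant/user turns, emit them in order
+- The prompt always ends with `<|im_start|>assistant\n` (so the model generates the assistant reply)
+
+**Acceptance criteria**:
+1. Unit test: given `[{role: "user", content: "Hello"}]`, the generated prompt string contains `<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n`
+2. Unit test: given four messages (system + user + assistant + user), the format is correct
+3. The build succeeds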
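
The ChatML layout required above can be sketched in a few lines. This is an illustration under assumptions: `Message` and `build_chatml_prompt` are hypothetical names rather than the actual `api.rs` types, and the roles are passed through verbatim without validation:

```rust
/// One chat message; `role` is expected to be "system", "user", or "assistant".
/// Hypothetical type for illustration, not the real request struct.
struct Message<'a> {
    role: &'a str,
    content: &'a str,
}

/// Render messages in ChatML form, in the order given.
/// A missing system message is simply absent from the output.
fn build_chatml_prompt(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str("<|im_start|>");
        prompt.push_str(m.role);
        prompt.push('\n');
        prompt.push_str(m.content);
        prompt.push_str("<|im_end|>\n");
    }
    // Always end with an open assistant turn so the model writes the reply.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
```

Because the special tokens appear as literal text here, the tokenizer must map `<|im_start|>` and `<|im_end|>` to their single special-token ids rather than splitting them into ordinary BPE pieces.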
+
+---
+
+## FIX-04: Fix the hardcoded EOS in `is_finished` [P0-Feature]
+
+**Problem**: `engine.rs:160` hardcodes `last == 151645` as the EOS check.
+
+**Fix requirements**:
+- Add an `eos_token_id: Option<u32>` field to the `Sequence` struct
+- Obtain the EOS token ID from the tokenizer in `make_sequence`
+- `is_finished` uses that field for the check
+
+**Acceptance criteria**:
+1. `grep -rn "151645" crates/xserv-server/` returns 0 lines
+2. `is_finished` contains no hardcoded token ID
+3. The build succeeds
+
+---
+
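+## FIX-05: Fix `Storage::device()` losing device information [P1-Bug]
+
+**Problem**: `storage.rs:43` returns `Device::Cuda(0)` for every GPU storage and does not track the actual device.
+
+**Fix requirements**:
+- Add a `device: u32` field to `StorageInner::Cuda`
+- `Storage::cuda()` accepts a device parameter, or infers it from `GpuBuffer`
+- `Storage::device()` returns the actual device
+- Update every call site that creates `Storage::cuda()`
+
+**Acceptance criteria**:
+1. For a tensor created on `Device::Cuda(3)`, `tensor.device()` returns `Device::Cuda(3)`
+2. The build succeeds and existing tests pass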
+
+---
+
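+## FIX-06: Fix the `unsqueeze` stride computation [P1-Bug]
+
+**Problem**: The stride computation for unsqueeze at `tensor.rs:128` is wrong. Applying `unsqueeze(0)` to shape `[3,4]` with strides `[4,1]` yields strides `[4,4,1]`, whereas the correct result is `[12,4,1]`. The stride of a size-1 dimension does not affect addressing, but the wrong value makes `is_contiguous()` report false and triggers an unnecessary copy.
+
+**Fix requirements**:
+- Set the stride of the size-1 dimension to `shape[dim+1] * strides[dim+1]` when dim is not the last dimension (and to 1 when it is), so that the contiguous condition holds
+- Or, more simply: if the original tensor is contiguous, recompute contiguous strides after unsqueeze
+
+**Acceptance criteria**:
+1. Unit test: `unsqueeze(0)` on a contiguous `[3,4]` tensor leaves `is_contiguous()` returning true
+2. Unit test: `unsqueeze(1)` on a contiguous `[3,4]` tensor leaves `is_contiguous()` returning true
+3. Unit test: `unsqueeze(2)` on a contiguous `[3,4]` tensor leaves `is_contiguous()` returning true
+4. The build succeeds and existing tests pass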
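
The stride rule above can be exercised on plain shape and stride vectors. A minimal sketch, assuming shapes and strides are stored as `usize` element counts; the free functions below are hypothetical stand-ins for the corresponding `Tensor` methods:

```rust
/// Row-major (contiguous) strides for `shape`, in elements.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Insert a size-1 dimension at `dim` (0..=rank), giving it the stride
/// that a contiguous layout would assign to that position.
fn unsqueeze(shape: &[usize], strides: &[usize], dim: usize) -> (Vec<usize>, Vec<usize>) {
    let mut new_shape = shape.to_vec();
    let mut new_strides = strides.to_vec();
    // The old axis at `dim` becomes axis `dim + 1` after insertion, so this
    // is shape[dim+1] * strides[dim+1] in the new indexing; 1 if appended last.
    let stride = if dim < shape.len() { shape[dim] * strides[dim] } else { 1 };
    new_shape.insert(dim, 1);
    new_strides.insert(dim, stride);
    (new_shape, new_strides)
}

/// True when `strides` equal the row-major strides for `shape`.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    strides == contiguous_strides(shape).as_slice()
}
```

For the `[3,4]` case from the problem statement, `unsqueeze(&[3, 4], &[4, 1], 0)` produces strides `[12, 4, 1]`, and the other two insertion positions also remain contiguous under this rule.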
+
+---
+
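+## FIX-07: Use the caching allocator [P1-Performance]
+
+**Problem**: `CachingAllocator` is implemented but never used. Every GPU allocation goes straight to `cudaMalloc`.
+
+**Fix requirements**:
+- Create a global or thread-local `CachingAllocator` instance
+- Route allocation paths such as `Tensor::zeros` through the caching allocator
+- Or at minimum: route the temporary buffer allocation in `GpuKVCache::get_kv_len` through the caching allocator (this is the hottest allocation path)
+- `GpuBuffer::Drop` must cooperate with the allocator (return to the pool instead of calling cudaFree)
+
+**Acceptance criteria**:
+1. After 100 consecutive `get_kv_len` calls in the decode loop, `AllocStats.cuda_malloc_count` < 10 (most requests hit the cache)
+2. The build succeeds and existing tests pass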
+
+---
+
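+## FIX-08: Fix `CudaDeviceProp` FFI safety [P1-Bug]
+
+**Problem**: `ffi.rs:31` uses `_pad: [u8; 4096]` as an assumed total size for cudaDeviceProp. The actual struct in CUDA 12.9 may be larger.
+
+**Fix requirements**:
+- Delete the `CudaDeviceProp` struct (or keep only the minimal struct needed for the name field)
+- If only the name is needed: allocate a sufficiently large buffer (for example `[u8; 8192]`) and read the name directly at its offset (the first 256 bytes)
+- Or, more safely: use `cudaDeviceGetAttribute` plus a separate name query API (`device.rs` already uses getAttribute for the other properties; only the name is missing)
+
+**Acceptance criteria**:
+1. There is no `CudaDeviceProp` struct anymore, or the padding size is determined dynamically from `std::mem::size_of`
+2. `device_info()` still returns the correct device name
+3. The build succeeds and existing tests pass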
+
+---
+
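+## FIX-09: Fix the tokenizer byte_fallback panic [P1-Bug]
+
+**Problem**: At `bpe.rs:173-176` the Qwen3 tokenizer panics when it encounters a single byte that is not in the vocab.
+
+**Fix requirements**:
+- When `byte_fallback == true` and the single byte is not in the vocab, look up the special token of the form `<0xNN>`
+- Panic only if `<0xNN>` does not exist either (with a clear error message)
+
+**Acceptance criteria**:
+1. Encoding a string containing all 256 byte values with the Qwen3 tokenizer does not panic
+2. Decoding the encoded result returns a byte sequence identical to the original
+3. The build succeeds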
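
The lookup order described above can be sketched as follows. Several details are assumptions for illustration: the vocab is modeled as a `HashMap<String, u32>` (the real `bpe.rs` may key on bytes), the fallback tokens use uppercase hex such as `<0xFF>`, and the hypothetical `byte_to_token` returns an `Err` with a descriptive message where the fix text would panic, leaving that choice to the caller:

```rust
use std::collections::HashMap;

/// Resolve a single byte to a token id.
/// Tries the raw one-byte token first, then the `<0xNN>` fallback token.
fn byte_to_token(
    vocab: &HashMap<String, u32>,
    byte: u8,
    byte_fallback: bool,
) -> Result<u32, String> {
    // Direct hit: only ASCII bytes form a valid one-character string key.
    if byte.is_ascii() {
        if let Some(&id) = vocab.get(&(byte as char).to_string()) {
            return Ok(id);
        }
    }
    // Fallback: look for the special token spelled <0xNN>.
    if byte_fallback {
        let key = format!("<0x{:02X}>", byte);
        if let Some(&id) = vocab.get(&key) {
            return Ok(id);
        }
    }
    Err(format!(
        "byte 0x{:02X} has no vocab entry and no <0xNN> fallback token",
        byte
    ))
}
```

A caller that wants the behavior in the fix requirements can call `.expect(...)` on the result, which panics with the message above only when both lookups fail.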
+
+---
+
+## FIX-10: Implement SSE streaming [P2-Feature]
+
+**Problem**: The API supports only blocking responses, not SSE streaming.
+
+**Fix requirements**:
+- Add a `stream: Option<bool>` field to `ChatRequest`
+- When `stream == true`, respond with the `text/event-stream` content type
+- Send one SSE event per generated token, in an OpenAI-compatible format:
+  ```
+  data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"token"},"finish_reason":null}]}
+  ```
+- Send `data: [DONE]` at the end
+- Non-streaming behavior is unchanged
+
+**Acceptance criteria**:
+1. A `curl` request with `stream: true` shows SSE output line by line
+2. Each SSE data line is valid JSON containing `choices[0].delta.content`
+3. The last line is `data: [DONE]`
+4. Non-streaming requests still work
+5. The build succeeds
+
+---
+
+## FIX-11: Fix usage statistics [P2-Feature]
+
+**Problem**: The usage fields returned by the API are all 0.
+
+**Fix requirements**:
+- Track the number of prompt tokens and completion tokens
+- Return the correct usage in non-streaming responses
+- Optionally include usage in the last streaming chunk (or just before `[DONE]`)
+
+**Acceptance criteria**:
+1. For a non-streaming request, `usage.prompt_tokens` > 0 and `usage.completion_tokens` > 0
+2. `usage.total_tokens == usage.prompt_tokens + usage.completion_tokens`
+3. The build succeeds
+
+---
+
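+## FIX-12: Avoid repeated allocation in `GpuKVCache::get_kv_len` [P2-Performance]
+
+**Problem**: Every `get_kv_len` call allocates fresh memory with `GpuBuffer::alloc`, once per layer per step in the decode loop.
+
+**Fix requirements**:
+- Option A: return a view/slice into the existing preallocated buffer (zero allocation); the Tensor must be constructed with the correct strides pointing into the padded buffer
+- Option B: preallocate output buffers in GpuKVCache and have get_kv_len do a D2D copy into the fixed buffers (2 output buffers per layer)
+- Option A is better but more complex to implement
+
+**Acceptance criteria**:
+1. After 100 consecutive `get_kv_len` calls, the number of `cudaMalloc` calls is <= 2 (the initial allocation)
+2. The returned tensor data is correct (bit-exact with the pre-change output)
+3. The build succeeds and existing tests pass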
+
+---
+
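+## FIX-13: Implement sampling strategies [P2-Feature]
+
+**Problem**: Only greedy sampling exists; there is no temperature / top-k / top-p.
+
+**Fix requirements**:
+- Implement a `SamplingParams { temperature, top_k, top_p }` struct
+- temperature: compute `logits = logits / temperature`, apply softmax, then sample according to the probabilities
+- top_k: keep the top-k logits and set the rest to -inf
+- top_p: accumulate probabilities in descending order and truncate once the sum is >= p
+- greedy is expressed as `temperature = 0` or as a separate mode
+- `GenerateRequest` accepts the sampling params
+- The API layer parses the temperature / top_k / top_p parameters
+
+**Acceptance criteria**:
+1. temperature=0.0 matches the greedy result
+2. temperature=1.0 produces different results across repeated generations of the same prompt
+3. top_k=1 matches the greedy result
+4. The build succeeds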
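
The three filters compose in a fixed order: temperature scaling, top-k truncation, then top-p truncation. The sketch below is one way to arrange them, under assumptions that are not from the source: the field types, the conventions that `top_k = 0` and `top_p = 1.0` disable a filter, non-empty `logits`, and a caller-supplied uniform random number `u` in place of an RNG dependency:

```rust
/// Field names follow the struct named in the fix; types and the
/// "disabled" conventions are assumptions of this sketch.
#[derive(Clone, Copy)]
struct SamplingParams {
    temperature: f32, // 0.0 means greedy
    top_k: usize,     // 0 disables top-k
    top_p: f32,       // 1.0 disables top-p
}

/// Pick a token id from non-empty `logits`; `u` is uniform in [0, 1).
fn sample(logits: &[f32], params: SamplingParams, u: f32) -> usize {
    // Candidate ids sorted by logit, highest first (stable for ties).
    let mut ids: Vec<usize> = (0..logits.len()).collect();
    ids.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    if params.temperature <= 0.0 {
        return ids[0]; // greedy
    }
    // top-k: keep only the k highest logits.
    if params.top_k > 0 {
        ids.truncate(params.top_k);
    }
    // Softmax over the survivors after temperature scaling.
    let max = logits[ids[0]];
    let mut probs: Vec<f32> = ids
        .iter()
        .map(|&i| ((logits[i] - max) / params.temperature).exp())
        .collect();
    let sum: f32 = probs.iter().sum();
    probs.iter_mut().for_each(|p| *p /= sum);
    // top-p: shortest prefix whose cumulative probability reaches p.
    let mut keep = probs.len();
    let mut cum = 0.0f32;
    for (n, p) in probs.iter().enumerate() {
        cum += *p;
        if cum >= params.top_p {
            keep = n + 1;
            break;
        }
    }
    // Inverse-CDF sampling over the kept prefix, renormalized by its mass.
    let mass: f32 = probs[..keep].iter().sum();
    let mut acc = 0.0f32;
    for n in 0..keep {
        acc += probs[n] / mass;
        if u < acc {
            return ids[n];
        }
    }
    ids[keep - 1] // guard against rounding when u is close to 1
}
```

Passing `u` explicitly makes the function deterministic, which lets acceptance criteria 1 and 3 be checked with exact assertions; production code would draw `u` from a seeded RNG per request.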
+
+---
+
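+## FIX-14: Use a GPU kernel for GPU tensor contiguous() [P2-Performance]
+
+**Problem**: At `tensor.rs:148`, making a non-contiguous GPU tensor contiguous requires a GPU→CPU transfer, a strided copy on the CPU, and a CPU→GPU transfer.
+
+**Fix requirements**:
+- Implement a general strided copy GPU kernel (or at least a kernel for the common transpose case)
+- `contiguous()` on a GPU tensor completes entirely on the GPU
+
+**Acceptance criteria**:
+1. Calling `contiguous()` on a transposed tensor on the GPU triggers no `cudaMemcpy` H2D/D2H
+2. The result is bit-exact with the CPU implementation
+3. The build succeeds and existing tests pass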
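
A CPU reference for the strided copy is useful both as the bit-exact baseline in criterion 2 and as a description of the per-element index math. The sketch below is that reference, not the GPU kernel itself; `strided_copy` is a hypothetical name, and strides are assumed to be in elements:

```rust
/// Gather a strided view of `src` into a contiguous row-major buffer.
/// Each output index `i` is decomposed into coordinates, which is the
/// per-element work a one-thread-per-element GPU kernel would perform.
fn strided_copy(src: &[f32], shape: &[usize], strides: &[usize]) -> Vec<f32> {
    let numel: usize = shape.iter().product();
    let mut dst = Vec::with_capacity(numel);
    for i in 0..numel {
        let mut rem = i;
        let mut offset = 0;
        // Peel coordinates off from the innermost dimension outward.
        for d in (0..shape.len()).rev() {
            offset += (rem % shape[d]) * strides[d];
            rem /= shape[d];
        }
        dst.push(src[offset]);
    }
    dst
}
```

In a CUDA version the outer loop over `i` would become the global thread index, with shape and strides passed as small fixed-size arrays; the inner loop stays the same.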
+
+---
+
+## FIX-15: Eliminate GPT-2 CPU round-trips (split_qkv, merge_heads, add_bias) [P3-Performance]
+
+**Problem**: GPT-2's `split_qkv`, `merge_heads`, and `add_bias` all run on the CPU.
+
+**Fix requirements**:
+- `add_bias`: implement a broadcast-add GPU kernel ([S,N] + [N] → [S,N])
+- `split_qkv`: implement a GPU kernel that splits [S, 3H] into Q/K/V and reshapes each to [1, heads, S, D]
+- `merge_heads`: reuse the existing `merge_heads_gpu` kernel (currently BF16 only; an F32 version is needed)
+
+**Acceptance criteria**:
+1. In the GPT-2 forward path, `grep -n "to_device(Device::Cpu)"` matches only in `sample_greedy`
+2. Inference results match the pre-fix output (greedy decode bit-exact)
+3. The build succeeds and existing tests pass
+
+---
+
+## Fix priority order
+
+**Batch 1 (must be done first; the others depend on them)**:
+1. FIX-01: Global cuBLAS handle
+2. FIX-02: Remove device sync
+3. FIX-03: Chat template
+4. FIX-04: is_finished EOS
+
+**Batch 2 (important bug fixes)**:
+5. FIX-05: Storage device tracking
+6. FIX-06: unsqueeze stride
+7. FIX-08: CudaDeviceProp
+8. FIX-09: byte_fallback panic
+
+**Batch 3 (feature completion)**:
+9. FIX-10: SSE streaming
+10. FIX-11: Usage stats
+11. FIX-13: Sampling strategies
+
+**Batch 4 (performance optimization)**:
+12. FIX-07: Caching allocator
+13. FIX-12: KV cache alloc
+14. FIX-14: GPU contiguous
+15. FIX-15: GPT-2 CPU round-trip