fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:53:28 +08:00
parent d8493bd70f
commit ee68d3565d
38 changed files with 3012 additions and 259 deletions
--- a/docs/10-qwen3.md
+++ b/docs/10-qwen3.md
@@ -1,12 +1,12 @@
-# Phase 10: Qwen3-7B Support — Design Document (Milestone ②)
+# Phase 10: Qwen3-8B Support — Design Document (Milestone ②)

 ## Goal

-扩展模型定义支持 Qwen3-7B 架构，验证输出正确性。与 GPT-2 的关键差异：RMSNorm、RoPE、GQA、SwiGLU、不共享 embedding。
+扩展模型定义支持 Qwen3-8B 架构，验证输出正确性。与 GPT-2 的关键差异：RMSNorm、RoPE、GQA、SwiGLU、不共享 embedding。

 ## 架构差异 (GPT-2 → Qwen3)

-| 特性 | GPT-2 | Qwen3-7B |
+| 特性 | GPT-2 | Qwen3-8B |
 |------|-------|----------|
 | Norm | LayerNorm(gamma, beta) | RMSNorm(gamma only) |
 | Position | Learned absolute (wpe) | RoPE (no params) |
@@ -15,8 +15,8 @@
 | FFN | 2 Linear (fc, proj) + GELU | 3 Linear (gate, up, down) + SwiGLU |
 | Weight layout | [in, out] (Conv1D style) | [out, in] (standard Linear) |
 | Tied embeddings | Yes | No (separate lm_head) |
-| hidden_size | 768 | 3584 |
-| num_layers | 12 | 28 |
+| hidden_size | 768 | 4096 |
+| num_layers | 12 | 36 |
 | head_dim | 64 | 128 |

 ## Weight Names (HuggingFace)
@@ -67,17 +67,17 @@ out  = down_proj(out)    # [S, 18944] @ [18944, 3584]^T → [S, 3584]
 ## 显存预算 (BF16, 单卡 5090)

 ```
-权重: 7B × 2B = ~14 GB (BF16)
-        7B × 4B = ~28 GB (FP32) — 不够! 必须用 BF16
+权重: 8B × 2B = ~16 GB (BF16)
+        8B × 4B = ~32 GB (FP32) — 不够! 必须用 BF16
 KV cache (S=256, B=1): ~0.1 GB
-总计: ~14 GB (BF16), 单卡可运行
+总计: ~16 GB (BF16), 单卡可运行
 ```

-**关键**: Qwen3-7B 必须用 BF16 才能在单张 5090 (32GB) 上运行。当前 GPT-2 用 FP32，需要支持 BF16 forward pass。
+**关键**: Qwen3-8B 必须用 BF16 才能在单张 5090 (32GB) 上运行。当前 GPT-2 用 FP32，需要支持 BF16 forward pass。

 ## Implementation Plan

-1. 下载 Qwen3-7B 模型 (BF16, ~14GB)
+1. 下载 Qwen3-8B 模型 (BF16, ~14GB)
 2. 实现 Qwen3 模型结构 (qwen3.rs)
 3. 支持 BF16 forward pass (linear_transpose for [out, in] weights)
 4. 实现 GQA (K/V repeat in split)