Gahow Wang ee68d3565d fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
  for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:53:28 +08:00

xserv — To Be Fixed

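A fix list produced by the strictest possible code review. Every fix has explicit acceptance criteria; reward hacking is prohibited. Priority: P0 (blocks usability) > P1 (serious bug or performance issue) > P2 (important improvement) > P3 (design debt)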

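FIX-01: Global cuBLAS handle, eliminating per-call creation [P0-Performance]

Problem: In gemm.rs, every matmul / batched_matmul call runs cublasCreate_v2 + cublasDestroy_v2. One Qwen3-8B forward pass issues roughly 168 matmuls, and each handle create/destroy costs several milliseconds.

Fix requirements:

Use a thread-local or global singleton cuBLAS handle
The handle lives for the whole process and is never created or destroyed inside matmul
CublasContext supports set_stream for switching streams

Acceptance criteria:

grep -rn "cublasCreate_v2" crates/xserv-kernels/src/gemm.rs matches exactly once (at the initialization site)
The bodies of matmul and batched_matmul no longer contain CublasContext::new()
The build passes and all existing gemm_test tests pass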
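
The thread-local variant of this requirement could be sketched as follows. This is a minimal illustration, not the project's code: the `CublasContext` here is a stand-in that performs no FFI, and `with_cublas`, `matmul_stub`, `handles_created`, and the `usize` stream placeholder are hypothetical names introduced only for the example. A counter stands in for the number of `cublasCreate_v2` calls.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts handle creations; stands in for the number of cublasCreate_v2 calls.
static HANDLES_CREATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the real CublasContext; no FFI happens here.
pub struct CublasContext {
    stream: usize, // placeholder for a cudaStream_t
}

impl CublasContext {
    fn new() -> Self {
        // The real code would call cublasCreate_v2 at this point.
        HANDLES_CREATED.fetch_add(1, Ordering::SeqCst);
        CublasContext { stream: 0 }
    }

    /// The real code would forward this to the cuBLAS stream setter.
    pub fn set_stream(&mut self, stream: usize) {
        self.stream = stream;
    }

    pub fn stream(&self) -> usize {
        self.stream
    }
}

thread_local! {
    // Created lazily on first use in each thread and dropped at thread exit.
    static CUBLAS: RefCell<CublasContext> = RefCell::new(CublasContext::new());
}

/// Runs `f` with this thread's shared context instead of creating a new one.
pub fn with_cublas<R>(f: impl FnOnce(&mut CublasContext) -> R) -> R {
    CUBLAS.with(|ctx| f(&mut *ctx.borrow_mut()))
}

/// Stand-in for matmul: borrows the shared handle and never creates one.
pub fn matmul_stub() -> usize {
    with_cublas(|ctx| ctx.stream())
}

pub fn handles_created() -> usize {
    HANDLES_CREATED.load(Ordering::SeqCst)
}
```

With this shape, calling `matmul_stub` any number of times on one thread creates a single handle. A process-wide singleton behind a lock would be the alternative if handles must be shared across threads.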

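FIX-02: Remove unnecessary cudaDeviceSynchronize calls [P0-Performance]

Problem: Almost every kernel wrapper ends with xserv_cuda::device::synchronize() (that is, cudaDeviceSynchronize), which completely kills the GPU pipeline.

Fix requirements:

Delete the device::synchronize() calls from all kernel wrappers
Synchronize only when GPU data must be read back to the CPU (e.g. sample_greedy, to_device(Cpu), benchmarks)
The Tensor::to_device(Cpu) path already synchronizes implicitly (cudaMemcpy is synchronous), so no extra sync is needed
If kernels run on the null stream (the default stream), cudaMemcpy implicitly waits for all work queued on the default stream

Acceptance criteria:

grep -rn "device::synchronize" crates/xserv-kernels/src/ returns 0 lines
grep -rn "device::synchronize" crates/xserv-model/src/ matches only in the benchmark binary, not in the forward path
The build passes and all existing tests pass
Inference results are bit-exact with the pre-fix results (greedy decoding of the same prompt yields the same token sequence)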

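FIX-03: Fix the chat template [P0-Feature]

Problem: build_prompt in api.rs simply concatenates text and emits no ChatML special tokens, so the prompt the Qwen3 model receives has no conversation structure.

Fix requirements:

Generate a prompt in the Qwen3 ChatML format:

<|im_start|>system\n{content}<|im_end|>\n<|im_start|>user\n{content}<|im_end|>\n<|im_start|>assistant\n

If there is no system message, skip the system part
If there are multiple alternating assistant/user turns, emit them in order
The prompt always ends with <|im_start|>assistant\n (so the model generates the assistant reply)

Acceptance criteria:

Unit test: given [{role: "user", content: "Hello"}], the generated prompt string contains <|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n
Unit test: given four messages (system + user + assistant + user), the format is correct
The build passes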
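
The template above can be sketched as a small pure function. This is an illustrative version under assumptions: the `Message` struct is hypothetical, and the real request types in api.rs may look different.

```rust
/// Hypothetical message type; the real request types in api.rs may differ.
pub struct Message {
    pub role: String,
    pub content: String,
}

/// Renders messages in ChatML order and ends with an open assistant turn.
pub fn build_prompt(messages: &[Message]) -> String {
    let mut prompt = String::new();
    for m in messages {
        // Each turn: <|im_start|>{role}\n{content}<|im_end|>\n
        prompt.push_str("<|im_start|>");
        prompt.push_str(&m.role);
        prompt.push('\n');
        prompt.push_str(&m.content);
        prompt.push_str("<|im_end|>\n");
    }
    // Leave the assistant turn open so the model generates the reply.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
```

Because the function only renders the messages it is given, a request without a system message produces no system block, which matches the skip rule.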

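FIX-04: Fix the hardcoded EOS in `is_finished` [P0-Feature]

Problem: engine.rs:160 hardcodes last == 151645 as the EOS check.

Fix requirements:

Add an eos_token_id: Option<u32> field to the Sequence struct
In make_sequence, obtain the EOS token ID from the tokenizer
is_finished uses that field for its check

Acceptance criteria:

grep -rn "151645" crates/xserv-server/ returns 0 lines
The is_finished function contains no hardcoded token ID
The build passes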
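
A minimal sketch of the resulting check, assuming a simplified `Sequence`. Only `eos_token_id` comes from the requirement above; the other fields (`tokens`, `prompt_len`, `max_new_tokens`) are hypothetical and exist to make the example self-contained.

```rust
/// Simplified, hypothetical Sequence; only eos_token_id is taken from the fix list.
pub struct Sequence {
    pub tokens: Vec<u32>,       // prompt tokens followed by generated tokens
    pub prompt_len: usize,      // number of prompt tokens at the front
    pub max_new_tokens: usize,  // generation budget
    pub eos_token_id: Option<u32>,
}

impl Sequence {
    pub fn is_finished(&self) -> bool {
        let generated = self.tokens.len().saturating_sub(self.prompt_len);
        if generated >= self.max_new_tokens {
            return true;
        }
        // Compare against the tokenizer-provided id; no literal token id appears here.
        match (self.eos_token_id, self.tokens.last()) {
            (Some(eos), Some(&last)) => generated > 0 && last == eos,
            _ => false,
        }
    }
}
```

Requiring `generated > 0` is a design choice in this sketch: a prompt that happens to end with the EOS id does not finish the sequence before any token is produced.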

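FIX-05: Fix `Storage::device()` losing device information [P1-Bug]

Problem: storage.rs:43 returns Device::Cuda(0) for every GPU storage and does not track the actual device.

Fix requirements:

Add a device: u32 field to StorageInner::Cuda
Storage::cuda() takes a device parameter, or infers it from the GpuBuffer
Storage::device() returns the actual device
Update every call site that creates a Storage::cuda()

Acceptance criteria:

Create a tensor on Device::Cuda(3); tensor.device() returns Device::Cuda(3)
The build passes and existing tests pass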
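
One possible shape for the storage types after this fix is sketched below. The type names follow the text, but the fields are simplified assumptions: the real `StorageInner::Cuda` presumably wraps a GpuBuffer rather than a bare length.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda(u32),
}

enum StorageInner {
    Cpu(Vec<u8>),
    // Simplified: the real variant would hold a GPU buffer, not just a length.
    Cuda { len: usize, device: u32 },
}

pub struct Storage {
    inner: StorageInner,
}

impl Storage {
    pub fn cpu(bytes: Vec<u8>) -> Self {
        Storage { inner: StorageInner::Cpu(bytes) }
    }

    /// The device ordinal is recorded at construction time.
    pub fn cuda(len: usize, device: u32) -> Self {
        Storage { inner: StorageInner::Cuda { len, device } }
    }

    /// Reports the recorded ordinal instead of a constant Cuda(0).
    pub fn device(&self) -> Device {
        match &self.inner {
            StorageInner::Cpu(_) => Device::Cpu,
            StorageInner::Cuda { device, .. } => Device::Cuda(*device),
        }
    }

    pub fn len(&self) -> usize {
        match &self.inner {
            StorageInner::Cpu(bytes) => bytes.len(),
            StorageInner::Cuda { len, .. } => *len,
        }
    }
}
```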

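FIX-06: Fix the `unsqueeze` stride computation [P1-Bug]

Problem: The stride computation for unsqueeze in tensor.rs:128 is wrong. Applying unsqueeze(0) to a [3,4] tensor with strides [4,1] yields strides [4,4,1], whereas the correct result is [12,4,1]. The stride of a size-1 dimension does not affect addressing, but the wrong value makes is_contiguous() return false and triggers an unnecessary copy.

Fix requirements:

Set the stride of the size-1 dimension to shape[dim+1] * strides[dim+1] (when dim is not the last dimension), indexing into the post-insertion shape and strides, so that the contiguity condition holds
Or, more simply: if the original tensor is contiguous, recompute contiguous strides after unsqueeze

Acceptance criteria:

Unit test: a contiguous [3,4] tensor returns true from is_contiguous() after unsqueeze(0)
Unit test: a contiguous [3,4] tensor returns true from is_contiguous() after unsqueeze(1)
Unit test: a contiguous [3,4] tensor returns true from is_contiguous() after unsqueeze(2)
The build passes and existing tests pass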
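
The stride rule can be checked on plain shape and stride vectors. The sketch below assumes free functions over slices and a strict, exact-match contiguity test; the real Tensor methods in tensor.rs are not shown in this document and may be organized differently.

```rust
/// Row-major strides for a shape, e.g. [1, 3, 4] -> [12, 4, 1].
pub fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Strict check: strides must equal the row-major strides exactly.
pub fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    strides == contiguous_strides(shape).as_slice()
}

/// Inserts a size-1 axis at `dim` and gives it a contiguity-preserving stride.
pub fn unsqueeze(shape: &[usize], strides: &[usize], dim: usize) -> (Vec<usize>, Vec<usize>) {
    assert!(dim <= shape.len(), "dim out of range");
    // The axis that ends up at dim+1 is the old axis `dim`, so the new stride is
    // old shape[dim] * old strides[dim]; a trailing new axis gets stride 1.
    let stride = if dim < shape.len() { shape[dim] * strides[dim] } else { 1 };
    let mut new_shape = shape.to_vec();
    let mut new_strides = strides.to_vec();
    new_shape.insert(dim, 1);
    new_strides.insert(dim, stride);
    (new_shape, new_strides)
}
```

For the [3,4] example with strides [4,1], `unsqueeze(.., 0)` gives strides [12,4,1], matching the value the problem statement calls correct.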

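FIX-07: Use the caching allocator [P1-Performance]

Problem: CachingAllocator is implemented but never used. Every GPU allocation goes straight to cudaMalloc.

Fix requirements:

Create a global or thread-local CachingAllocator instance
Route Tensor::zeros and the other allocation paths through the caching allocator
Or at the very least: route the temporary buffer allocation in GpuKVCache::get_kv_len through the caching allocator (this is the hottest allocation path)
GpuBuffer::Drop must cooperate with the allocator (return to the pool instead of calling cudaFree)

Acceptance criteria:

Calling get_kv_len 100 times in a row in the decode loop leaves AllocStats.cuda_malloc_count < 10 (most calls hit the cache)
The build passes and existing tests pass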
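
The pooling behavior that the acceptance test measures could look roughly like the sketch below. It is a toy model under stated assumptions: addresses are plain integers rather than device pointers, the 512-byte size rounding is an arbitrary choice for the example, and `cache_hits`, `release`, and `round_up` are hypothetical names. Only `AllocStats.cuda_malloc_count` is taken from the text.

```rust
use std::collections::HashMap;

#[derive(Debug, Default)]
pub struct AllocStats {
    pub cuda_malloc_count: usize, // slow-path allocations (cudaMalloc in real code)
    pub cache_hits: usize,        // hypothetical counter for pool hits
}

/// Toy caching allocator: free blocks are pooled by rounded size.
pub struct CachingAllocator {
    pool: HashMap<usize, Vec<u64>>, // rounded size -> free block addresses
    next_addr: u64,                 // fake address source standing in for cudaMalloc
    pub stats: AllocStats,
}

/// Rounds sizes up to a 512-byte granule (an assumption made for this sketch).
fn round_up(size: usize) -> usize {
    (size + 511) / 512 * 512
}

impl CachingAllocator {
    pub fn new() -> Self {
        CachingAllocator { pool: HashMap::new(), next_addr: 0x1000, stats: AllocStats::default() }
    }

    pub fn alloc(&mut self, size: usize) -> u64 {
        let size = round_up(size);
        if let Some(addr) = self.pool.get_mut(&size).and_then(|blocks| blocks.pop()) {
            self.stats.cache_hits += 1;
            return addr;
        }
        // Cache miss: the real allocator would call cudaMalloc here.
        self.stats.cuda_malloc_count += 1;
        let addr = self.next_addr;
        self.next_addr += size as u64;
        addr
    }

    /// What GpuBuffer's Drop would do instead of cudaFree: return the block to the pool.
    pub fn release(&mut self, addr: u64, size: usize) {
        self.pool.entry(round_up(size)).or_default().push(addr);
    }
}
```

An allocate/release loop of identical sizes then reaches the slow path once, which is the pattern the 100-call acceptance check relies on.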

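FIX-08: Fix `CudaDeviceProp` FFI safety [P1-Bug]

Problem: ffi.rs:31 uses _pad: [u8; 4096] as a guess at the total size of cudaDeviceProp. The actual struct in CUDA 12.9 may be larger.

Fix requirements:

Delete the CudaDeviceProp struct (or keep only the minimal struct needed for the name field)
If only name is needed: allocate a sufficiently large buffer (e.g. [u8; 8192]) and read name directly at its offset (the first 256 bytes)
Or, more safely: use cudaDeviceGetAttribute plus a separate name query API (device.rs already uses getAttribute for the other attributes; only name is missing)

Acceptance criteria:

The CudaDeviceProp struct no longer exists, or the padding size is determined dynamically from std::mem::size_of
device_info() still returns the correct device name
The build passes and existing tests pass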

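FIX-09: Fix the tokenizer byte_fallback panic [P1-Bug]

Problem: In bpe.rs:173-176, the Qwen3 tokenizer panics when it encounters a single byte that is not in the vocab.

Fix requirements:

When byte_fallback == true and the single byte is not in the vocab, look up the special token in <0xNN> format
Panic (with a clear error message) only if the <0xNN> token does not exist either

Acceptance criteria:

Encoding a string containing all 256 byte values with the Qwen3 tokenizer does not panic
The byte sequence obtained by decoding the encoded result matches the original
The build passes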
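
The fallback lookup itself is small. The sketch below assumes the vocab is a `HashMap<String, u32>` and that byte tokens are spelled with two uppercase hex digits (as in `<0x0A>`); both are assumptions, and `byte_token_id` is a hypothetical helper name. It returns a `Result` so the caller decides where to panic with the descriptive message.

```rust
use std::collections::HashMap;

/// Looks up the <0xNN> fallback token for a byte that has no direct vocab entry.
/// Returns a descriptive error instead of panicking; the caller may `expect` it.
pub fn byte_token_id(vocab: &HashMap<String, u32>, byte: u8) -> Result<u32, String> {
    // Assumed spelling: two uppercase hex digits, e.g. <0x0A>.
    let key = format!("<0x{:02X}>", byte);
    vocab
        .get(&key)
        .copied()
        .ok_or_else(|| format!("byte 0x{:02X} is not in the vocab and token {} is missing", byte, key))
}
```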

FIX-10: Implement SSE streaming [P2-Feature]

Problem: The API supports only blocking responses, not SSE streaming.

Fix requirements:

Add a stream: Option<bool> field to ChatRequest
When stream == true, respond with the text/event-stream content type

Send one SSE event per generated token, in an OpenAI-compatible format:

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"token"},"finish_reason":null}]}

Finish by sending data: [DONE]
Non-streaming behavior is unchanged

Acceptance criteria:

A curl request with stream: true shows SSE output line by line
Every SSE data line is valid JSON and contains choices[0].delta.content
The last line is data: [DONE]
Non-streaming requests still work
The build passes
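
The event framing can be sketched without any web framework. The functions below are hypothetical helpers (`json_escape`, `sse_chunk`, `sse_done`) that build the chunk shown above by hand; a real server would more likely serialize a struct with a JSON library, which this dependency-free sketch deliberately avoids.

```rust
/// Escapes a string for embedding inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One SSE event carrying a single token; the blank line terminates the event.
pub fn sse_chunk(id: &str, token: &str) -> String {
    format!(
        "data: {{\"id\":\"{}\",\"object\":\"chat.completion.chunk\",\"choices\":[{{\"index\":0,\"delta\":{{\"content\":\"{}\"}},\"finish_reason\":null}}]}}\n\n",
        json_escape(id),
        json_escape(token)
    )
}

/// Terminal event that closes the stream.
pub fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```

Escaping matters here because a generated token can contain a quote or a newline, and a raw newline inside the payload would split the SSE event.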

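FIX-11: Fix usage statistics [P2-Feature]

Problem: The usage values returned by the API are all 0.

Fix requirements:

Track the number of prompt tokens and the number of completion tokens
Return the correct usage in non-streaming responses
In streaming mode, optionally include usage in the last chunk (or just before [DONE])

Acceptance criteria:

Send a non-streaming request: usage.prompt_tokens > 0 and usage.completion_tokens > 0
usage.total_tokens == usage.prompt_tokens + usage.completion_tokens
The build passes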

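FIX-12: Avoid repeated allocation in `GpuKVCache::get_kv_len` [P2-Performance]

Problem: Every call to get_kv_len allocates new memory with GpuBuffer::alloc, once per layer per step in the decode loop.

Fix requirements:

Option A: return a view/slice into the existing preallocated buffer (zero allocation); constructing the Tensor then requires the correct strides pointing into the padded buffer
Option B: preallocate output buffers in GpuKVCache and have get_kv_len do a D2D copy into the fixed buffers (2 output buffers per layer)
Option A is better but more complex to implement

Acceptance criteria:

After 100 consecutive calls to get_kv_len, the number of cudaMalloc calls is <= 2 (the initial allocations)
The returned tensor data is correct (bit-exact with the pre-change result)
The build passes and existing tests pass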

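FIX-13: Implement sampling strategies [P2-Feature]

Problem: Only greedy sampling exists; there is no temperature / top-k / top-p.

Fix requirements:

Implement a SamplingParams { temperature, top_k, top_p } struct
temperature: compute logits = logits / temperature, apply softmax, then sample according to the probabilities
top_k: keep the top-k logits and set the rest to -inf
top_p: accumulate probabilities in descending order until the sum reaches >= p, then truncate
greedy: available as temperature = 0 or as a separate mode
GenerateRequest accepts the sampling params
The API layer parses the temperature / top_k / top_p parameters

Acceptance criteria:

temperature=0.0 gives the same result as greedy
temperature=1.0 yields different results across repeated generations of the same prompt
top_k=1 gives the same result as greedy
The build passes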
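
The three filters compose as sketched below. This is one possible arrangement, not the project's implementation: the field types, the convention that `top_k == 0` and `top_p == 1.0` disable their filters, and the caller-supplied uniform number `u` in [0, 1) (used so the example needs no RNG crate) are all assumptions.

```rust
#[derive(Debug, Clone, Copy)]
pub struct SamplingParams {
    pub temperature: f32, // <= 0.0 selects greedy decoding
    pub top_k: usize,     // 0 disables top-k (assumed convention)
    pub top_p: f32,       // 1.0 disables top-p (assumed convention)
}

/// Samples a token index. `u` is a uniform random number in [0, 1) from the caller.
pub fn sample(logits: &[f32], params: &SamplingParams, u: f32) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");
    // Indices sorted by logit, descending; the stable sort keeps lower indices first on ties.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    if params.temperature <= 0.0 {
        return order[0]; // greedy
    }
    // top-k: dropping a candidate is equivalent to setting its logit to -inf.
    if params.top_k > 0 && params.top_k < order.len() {
        order.truncate(params.top_k);
    }
    // Softmax of logits / temperature over the survivors (max-subtracted for stability).
    let max = logits[order[0]];
    let mut probs: Vec<f32> = order
        .iter()
        .map(|&i| ((logits[i] - max) / params.temperature).exp())
        .collect();
    let sum: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= sum;
    }
    // top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    let mut keep = probs.len();
    let mut cum = 0.0f32;
    for (n, p) in probs.iter().enumerate() {
        cum += *p;
        if cum >= params.top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);
    // Inverse-CDF draw over the remaining (implicitly renormalized) mass.
    let total: f32 = probs.iter().sum();
    let mut threshold = u * total;
    for (n, p) in probs.iter().enumerate() {
        if threshold < *p {
            return order[n];
        }
        threshold -= *p;
    }
    order[probs.len() - 1] // guards against floating-point rounding
}
```

Taking `u` as a parameter also makes the sampler deterministic under test, which helps with the "same as greedy" acceptance checks.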

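FIX-14: Use a GPU kernel for GPU tensor contiguous() [P2-Performance]

Problem: In tensor.rs:148, making a non-contiguous GPU tensor contiguous requires a GPU→CPU transfer, a copy on the CPU, and then a CPU→GPU transfer.

Fix requirements:

Implement a general strided-copy GPU kernel (or at least a kernel for the common transpose case)
contiguous() on a GPU tensor completes entirely on the GPU

Acceptance criteria:

Calling contiguous() on a transposed tensor on the GPU triggers no cudaMemcpy H2D/D2H
The result is bit-exact with the CPU implementation
The build passes and existing tests pass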
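
The index arithmetic a general strided-copy kernel needs is shown below as a CPU reference in Rust. It is a sketch of the technique, not the project's kernel: each loop iteration corresponds to what one GPU thread would compute for its output element, and `strided_copy` is a hypothetical name. A reference like this can also serve as the oracle for the bit-exact acceptance check.

```rust
/// CPU reference for a strided copy: gathers `src` into a dense row-major buffer.
/// `shape` and `strides` (in elements) describe the non-contiguous view.
pub fn strided_copy(src: &[f32], shape: &[usize], strides: &[usize]) -> Vec<f32> {
    assert_eq!(shape.len(), strides.len(), "shape and strides must have the same rank");
    let numel: usize = shape.iter().product();
    let mut dst = Vec::with_capacity(numel);
    // On the GPU, `linear` would be the global thread index.
    for linear in 0..numel {
        // Decompose the dense output index into coordinates, innermost axis first,
        // and accumulate the matching source offset from the view's strides.
        let mut rem = linear;
        let mut offset = 0;
        for d in (0..shape.len()).rev() {
            offset += (rem % shape[d]) * strides[d];
            rem /= shape[d];
        }
        dst.push(src[offset]);
    }
    dst
}
```

For a [2,3] row-major buffer viewed as its transpose (shape [3,2], strides [1,3]), the function returns the elements in transposed order.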

FIX-15: Eliminate the GPT-2 CPU round-trip (split_qkv, merge_heads, add_bias) [P3-Performance]

Problem: GPT-2's split_qkv, merge_heads, and add_bias all run on the CPU.

Fix requirements:

add_bias: implement a broadcast-add GPU kernel ([S,N] + [N] → [S,N])
split_qkv: implement a GPU kernel that splits [S, 3H] into Q/K/V and reshapes each to [1, heads, S, D]
merge_heads: reuse the existing merge_heads_gpu kernel (currently BF16 only; an F32 version is needed)

Acceptance criteria:

In the GPT-2 forward path, grep -n "to_device(Device::Cpu)" matches only inside sample_greedy
Inference results match the pre-fix results (greedy decode is bit-exact)
The build passes and existing tests pass

Fix priority order

Batch 1 (must be done first; the others depend on them):

FIX-01: Global cuBLAS handle
FIX-02: Remove device sync
FIX-03: Chat template
FIX-04: is_finished EOS

Batch 2 (important bug fixes):

FIX-05: Storage device tracking
FIX-06: unsqueeze stride
FIX-08: CudaDeviceProp
FIX-09: byte_fallback panic

Batch 3 (feature completion):

FIX-10: SSE streaming
FIX-11: Usage stats
FIX-13: Sampling strategies

Batch 4 (performance optimization):

FIX-07: Caching allocator
FIX-12: KV cache alloc
FIX-14: GPU contiguous
FIX-15: GPT-2 CPU round-trip
