Files
xserv/csrc
Gahow Wang e5734b41fa speculative: batched-GEMV kernel for verify path (Phase 24 step 1)
Add launch_gemv_bf16_batched: runs M m=1 GEMVs in a single 3D grid
launch (z = batch row) with numerically identical output to M sequential
launch_gemv_bf16 calls — same K-block partial accumulation, same
fixed-order reduction. Verified on dash5 with 10 prompts × 32 tokens:
matched=true, verify_decode_mismatches=0.

Expose as matmul_batched_gemv(a: [M,K], b: [K,N]) → [M,N] in
xserv-kernels. Replace the old matmul_rows_gemv helper in qwen3
forward_verify_paged_decode_attention; the per-row loop over matmul_2d +
concat_rows is replaced by a single matmul_batched_gemv call that
allocates the partials buffer in one shot and launches 2 kernels instead
of 2*M.

Current speedup_e2e is 0.47× (same ballpark as Phase 23 0.44×);
the batched launch saves ~3 ms overhead but this is small relative to
the total 28 ms spec cost. The path forward (per docs/24 §4) is
higher acceptance rate or cheaper draft, not further kernel optimization.
2026-07-01 16:13:37 +08:00
..