xserv

Files

Gahow Wang e5734b41fa speculative: batched-GEMV kernel for verify path (Phase 24 step 1)

Add launch_gemv_bf16_batched, which runs M independent m=1 GEMVs in a
single 3D-grid launch (grid z = batch row). Its output is numerically
identical to M sequential launch_gemv_bf16 calls because it uses the
same K-block partial accumulation and the same fixed-order reduction.
Verified on dash5 with 10 prompts × 32 tokens: matched=true,
verify_decode_mismatches=0.

Expose it as matmul_batched_gemv(a: [M,K], b: [K,N]) → [M,N] in
xserv-kernels. In qwen3 forward_verify_paged_decode_attention, drop the
old matmul_rows_gemv helper: its per-row loop over matmul_2d +
concat_rows becomes a single matmul_batched_gemv call that allocates the
partials buffer once and launches 2 kernels instead of 2*M.
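
The numerical-identity claim can be illustrated with a CPU reference. This
is a minimal sketch, not the xserv kernel: it uses f32 instead of bf16, the
names gemv_blocked, batched_gemv and k_block are hypothetical, and the two
passes in batched_gemv only stand in for the two kernel launches. The point
is that the batched path performs the same additions in the same order per
row, so the results match bit for bit.

```rust
/// One m=1 GEMV: y[col] = sum over kk of a[kk] * b[kk * n + col].
/// K is split into blocks of `k_block`; each block produces one partial per
/// output column, and partials are then summed in ascending block order.
fn gemv_blocked(a: &[f32], b: &[f32], k: usize, n: usize, k_block: usize) -> Vec<f32> {
    let blocks = (k + k_block - 1) / k_block;
    let mut partials = vec![0.0f32; blocks * n];
    for blk in 0..blocks {
        let k_end = ((blk + 1) * k_block).min(k);
        for col in 0..n {
            let mut acc = 0.0f32;
            for kk in blk * k_block..k_end {
                acc += a[kk] * b[kk * n + col];
            }
            partials[blk * n + col] = acc;
        }
    }
    let mut y = vec![0.0f32; n];
    for col in 0..n {
        let mut acc = 0.0f32;
        for blk in 0..blocks {
            acc += partials[blk * n + col]; // fixed reduction order
        }
        y[col] = acc;
    }
    y
}

/// Batched variant for a: [M,K], b: [K,N] -> [M,N] (row-major).
/// The partials buffer [M, blocks, N] is allocated once, then two passes
/// cover every row (assumption: one pass per kernel launch).
fn batched_gemv(a: &[f32], b: &[f32], m: usize, k: usize, n: usize, k_block: usize) -> Vec<f32> {
    let blocks = (k + k_block - 1) / k_block;
    let mut partials = vec![0.0f32; m * blocks * n];
    // Pass 1: per-row, per-K-block partial sums (row plays the role of grid z).
    for row in 0..m {
        for blk in 0..blocks {
            let k_end = ((blk + 1) * k_block).min(k);
            for col in 0..n {
                let mut acc = 0.0f32;
                for kk in blk * k_block..k_end {
                    acc += a[row * k + kk] * b[kk * n + col];
                }
                partials[(row * blocks + blk) * n + col] = acc;
            }
        }
    }
    // Pass 2: reduce partials in the same ascending block order as above.
    let mut out = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            let mut acc = 0.0f32;
            for blk in 0..blocks {
                acc += partials[(row * blocks + blk) * n + col];
            }
            out[row * n + col] = acc;
        }
    }
    out
}
```

Comparing batched_gemv against gemv_blocked called once per row with
f32::to_bits gives an exact equality check, which is stricter than a
tolerance-based comparison.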

Current speedup_e2e is 0.47×, in the same ballpark as Phase 23 (0.44×).
The batched launch saves ~3 ms of overhead, which is small relative to
the total 28 ms spec cost. The path forward (per docs/24 §4) is a
higher acceptance rate or a cheaper draft, not further kernel
optimization.
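
Why acceptance rate and draft cost dominate can be seen in the standard
analytical model of speculative decoding. This is a sketch under stated
assumptions, not xserv's measurement code: alpha is the per-token acceptance
probability (assumed independent per token), gamma is the number of drafted
tokens per step, draft_cost_ratio is draft step cost divided by target step
cost, and verification is assumed to cost one target step. All function and
parameter names are hypothetical.

```rust
/// Expected tokens emitted per verify step: (1 - alpha^(gamma+1)) / (1 - alpha).
/// At alpha == 1 every draft token is accepted, giving gamma + 1.
fn expected_tokens_per_step(alpha: f64, gamma: u32) -> f64 {
    if (1.0 - alpha).abs() < 1e-12 {
        return (gamma + 1) as f64;
    }
    (1.0 - alpha.powi(gamma as i32 + 1)) / (1.0 - alpha)
}

/// Modeled speedup over plain decoding: tokens per step divided by the step
/// cost in units of one target forward pass (gamma draft steps + 1 verify).
fn modeled_speedup(alpha: f64, gamma: u32, draft_cost_ratio: f64) -> f64 {
    expected_tokens_per_step(alpha, gamma) / (gamma as f64 * draft_cost_ratio + 1.0)
}
```

In this model a low alpha or an expensive draft pushes the result below 1×
regardless of launch overhead. The model is optimistic, since real verify
passes cost more than a single decode step.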

2026-07-01 16:13:37 +08:00

activation

gpt-oss: drop debug syncs from forward; GPU broadcast bias-add

2026-06-12 17:02:59 +08:00

attention

cuda: fix remaining int32-address and nondeterministic-reduction bugs

2026-07-01 15:13:07 +08:00

embedding

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

gemm

speculative: batched-GEMV kernel for verify path (Phase 24 step 1)