xserv

gahow/xserv

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	42e13f33dd	docs: Phase 24 investigation notes and revised speedup plan	2026-07-01 15:35:11 +08:00

docs: Phase 24 investigation notes and revised speedup plan

Attempted the simple win: replace matmul_rows_gemv with matmul_2d in
forward_verify_paged_decode_attention. It was faster (0.44x -> 0.68x
on 5 prompts) but produced matched=false. Root cause is K/V drift, not
just logit rounding: matmul_2d at m=1 uses the custom GEMV path, at
m>=2 it uses cuBLAS GEMM, and the two produce different BF16 bits.
Verify then writes K/V with GEMM values where baseline decode would
have written GEMV values, so every downstream position drifts.
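
To illustrate how two mathematically equivalent matmul paths can disagree in their BF16 output, here is a minimal sketch. It does not reproduce the GEMV kernel or cuBLAS. It assumes FP32 accumulation with a final round to BF16, and it assumes the mismatch comes from a different summation order, which the commit does not state explicitly. The helper names (f32, to_bf16, reduce_sequential, reduce_tiled) are hypothetical.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round a float32 value to BF16 with round-to-nearest-even.
    NaN and infinity are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep the upper 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def reduce_sequential(products):
    """Left-to-right FP32 accumulation (assumed GEMV-like order)."""
    acc = 0.0
    for p in products:
        acc = f32(acc + p)
    return to_bf16(acc)

def reduce_tiled(products, tile=2):
    """Per-tile partial sums combined afterwards (assumed GEMM-like order)."""
    partials = []
    for i in range(0, len(products), tile):
        acc = 0.0
        for p in products[i:i + tile]:
            acc = f32(acc + p)
        partials.append(acc)
    total = 0.0
    for part in partials:
        total = f32(total + part)
    return to_bf16(total)

# Same four products, two accumulation orders.
products = [1.0 + 2.0**-8, 0.0, 2.0**-24, 2.0**-24]
seq = reduce_sequential(products)   # the two tiny terms are rounded away one by one
tiled = reduce_tiled(products)      # the tiny terms are summed first and survive
print(seq, tiled, seq == tiled)
```

In this constructed case the two orders land on opposite sides of a BF16 rounding boundary, so the stored values differ by one BF16 step. Once such a value is written into the K/V cache, every later position reads the diverged history.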

A near-tie fallback for the current row's logit does nothing to fix
already-diverged history, so it was reverted in the same session.

Docs/24 captures the finding and lays out the actual path forward:
implement a launch_gemv_bf16_batched kernel that runs gamma m=1 GEMVs
in a single launch with bit-identical output to gamma sequential
calls, then add draft-side CUDA graph and adaptive gamma.
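
The bit-identity requirement for the batched kernel can be sketched as a reference check. This is a CPU-side illustration, not launch_gemv_bf16_batched itself: gemv and gemv_batched are hypothetical stand-ins, FP32 accumulation with a final BF16 round is an assumption, and the arithmetic is only an approximation of GPU behavior. The point is that batching may change how often the weights are read, but each output's reduction order over k has to stay the same as in the m=1 path.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round a float32 value to BF16 (round-to-nearest-even, no NaN/inf handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def gemv(weight, x):
    """One m=1 GEMV: y[j] = sum_k weight[j][k] * x[k], fixed k order."""
    out = []
    for row in weight:
        acc = 0.0
        for w, xi in zip(row, x):
            acc = f32(acc + f32(w * xi))
        out.append(to_bf16(acc))
    return out

def gemv_batched(weight, xs):
    """Hypothetical batched form: gamma GEMVs in one call.
    Each weight row is visited once and reused for all inputs, but every
    (input, row) accumulator walks k in the same order as gemv()."""
    out = [[0.0] * len(weight) for _ in xs]
    for j, row in enumerate(weight):
        for b, x in enumerate(xs):
            acc = 0.0
            for w, xi in zip(row, x):
                acc = f32(acc + f32(w * xi))
            out[b][j] = to_bf16(acc)
    return out

def raw_bits(values):
    """Byte-level view, so the comparison is on bits rather than float equality."""
    return [struct.pack("<f", v) for v in values]

weight = [[1.0 + 2.0**-8, 0.0, 2.0**-24, 2.0**-24],
          [0.5, -0.25, 0.125, 3.0]]
xs = [[1.0, 1.0, 1.0, 1.0],
      [2.0, 0.5, -1.0, 0.25],
      [0.0, 1.0, 0.0, 1.0]]          # gamma = 3 draft positions

sequential = [gemv(weight, x) for x in xs]   # gamma separate m=1 calls
batched = gemv_batched(weight, xs)           # one batched call
bit_identical = all(raw_bits(a) == raw_bits(b) for a, b in zip(sequential, batched))
print(bit_identical)
```

A check of this shape, run against the real kernels, would catch K/V drift before it reaches the matched comparison.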

Also includes a back-of-envelope estimate showing that the current
acceptance rate (0.39) with verify=13ms lands close to 1.0x speedup
even with verify made free; hitting speedup_e2e > 1 needs
launch-overhead savings AND either higher acceptance or a cheaper
draft.
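
A back-of-envelope model of this kind can be sketched as follows. It is not the calculation in Docs/24. It uses the common speculative-decoding approximation in which each drafted token is accepted independently with probability alpha and each round yields at most gamma + 1 tokens, and it treats the reported 0.39 as that probability, which is an assumption. Only 0.39 and 13 ms come from the commit; gamma, the baseline decode time, and the draft time are arbitrary placeholders, so the printed numbers are not measurements.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft+verify round under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def speedup_e2e(alpha, gamma, t_base_ms, t_draft_ms, t_verify_ms, t_overhead_ms=0.0):
    """Tokens per round times baseline time per token, divided by round time."""
    round_ms = gamma * t_draft_ms + t_verify_ms + t_overhead_ms
    return expected_tokens_per_round(alpha, gamma) * t_base_ms / round_ms

# Values stated in the commit message.
ALPHA = 0.39
VERIFY_MS = 13.0

# HYPOTHETICAL placeholders, chosen only to exercise the formula.
GAMMA = 4
T_BASE_MS = 13.0
T_DRAFT_MS = 5.0

with_verify = speedup_e2e(ALPHA, GAMMA, T_BASE_MS, T_DRAFT_MS, VERIFY_MS)
verify_free = speedup_e2e(ALPHA, GAMMA, T_BASE_MS, T_DRAFT_MS, 0.0)
cheaper_draft = speedup_e2e(ALPHA, GAMMA, T_BASE_MS, T_DRAFT_MS / 2, 0.0)
higher_accept = speedup_e2e(0.7, GAMMA, T_BASE_MS, T_DRAFT_MS, 0.0)

for name, value in [("with verify", with_verify), ("verify free", verify_free),
                    ("cheaper draft", cheaper_draft), ("higher acceptance", higher_accept)]:
    print(f"{name}: {value:.2f}x")
```

With a low acceptance rate the expected tokens per round stay small, so the draft cost alone can consume most of the gain even when verify is free. That is the shape of the argument for needing a cheaper draft or higher acceptance on top of launch-overhead savings.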

Reverts: none (Phase 24 attempts never landed on main). This commit adds only the doc.

1 Commit