Attempted the simple win: replace matmul_rows_gemv with matmul_2d in
forward_verify_paged_decode_attention. It was faster (0.44x -> 0.68x
on 5 prompts) but produced matched=false. Root cause is K/V drift, not
just logit rounding: matmul_2d uses the custom GEMV path at m=1 and
cuBLAS GEMM at m>=2, and the two produce different BF16 bits.
Verify then writes K/V with GEMM values while baseline decode would
have written GEMV values, and every downstream position drifts.
A near-tie fallback for the current row's logit does nothing to fix
already-diverged history, so the change was reverted in the same session.
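
Why two correct matmul paths can disagree at the bit level: a minimal
CPU sketch, assuming the difference comes from accumulation order (the
usual cause, but not confirmed for these kernels). The helper names are
hypothetical stand-ins, not the project's GEMV or cuBLAS code, and the
inputs are chosen to exaggerate the effect.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bf16_bits(x):
    """float32 -> BF16 bit pattern, round-to-nearest-even (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def sum_sequential(values):
    """Left-to-right float32 accumulation (one possible reduction order)."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

def sum_pairwise(values):
    """Tree-style float32 accumulation (another possible reduction order)."""
    if len(values) == 1:
        return f32(values[0])
    mid = len(values) // 2
    return f32(sum_pairwise(values[:mid]) + sum_pairwise(values[mid:]))

# Same terms, different order of additions, different BF16 result.
terms = [1e8, 1.0, -1e8, 1.0]
print(hex(bf16_bits(sum_sequential(terms))))  # 0x3f80 (1.0)
print(hex(bf16_bits(sum_pairwise(terms))))    # 0x0 (0.0)
```

Once a value like this is written into the K/V cache, every later
position attends over it, which is why a per-row logit fallback cannot
repair the history.
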
Docs/24 captures the finding and lays out the actual path forward:
implement a launch_gemv_bf16_batched kernel that runs gamma m=1 GEMVs
in a single launch with bit-identical output to gamma sequential
calls, then add draft-side CUDA graph and adaptive gamma.
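
The bit-identity requirement can be stated as a check. This is a CPU
reference sketch only: gemv_row and gemv_batched are hypothetical
models of "one m=1 GEMV" and "gamma GEMVs in one launch", not the
planned launch_gemv_bf16_batched kernel. The point it illustrates is
that batching is safe when each row keeps the same reduction order,
and that the comparison should be on bit patterns rather than on
float equality.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_bits(x):
    """Raw float32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def gemv_row(weights, x):
    """One m=1 GEMV: y[i] = sum_k W[i][k] * x[k], fixed left-to-right order."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xk in zip(row, x):
            acc = f32(acc + f32(w * xk))
        out.append(acc)
    return out

def gemv_batched(weights, xs):
    """gamma input rows in one call, reusing the same per-row reduction."""
    return [gemv_row(weights, x) for x in xs]

def bit_identical(a, b):
    """True only if every element has the same float32 bit pattern."""
    return len(a) == len(b) and all(
        f32_bits(p) == f32_bits(q) for p, q in zip(a, b)
    )

W = [[0.1, 0.2, 0.3], [1e8, 1.0, -1e8]]
xs = [[1.0, 1.0, 1.0], [0.5, -2.0, 4.0]]  # gamma = 2 draft rows

batched = gemv_batched(W, xs)
sequential = [gemv_row(W, x) for x in xs]
print(all(bit_identical(b, s) for b, s in zip(batched, sequential)))  # True
```

Comparing bits also catches cases that == hides, such as 0.0 versus
-0.0.
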
The doc also includes a back-of-envelope estimate showing that, at the
current acceptance rate of 0.39 and verify=13ms, speedup lands close to
1.0x even with verify made free; hitting speedup_e2e > 1 needs
launch-overhead savings AND either higher acceptance or a cheaper draft.
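
A sketch of how such an estimate can be set up, assuming the textbook
speculative-decoding model (independent per-token acceptance, one
verify pass per cycle). Only the 0.39 acceptance rate and the 13 ms
verify cost come from the measurements above; gamma, t_base_ms and
t_draft_ms below are placeholder values, and the doc's own arithmetic
may differ.

```python
def expected_tokens_per_cycle(accept_rate, gamma):
    """Expected tokens emitted per draft+verify cycle: 1 + a + ... + a^gamma."""
    return sum(accept_rate ** i for i in range(gamma + 1))

def speedup_e2e(accept_rate, gamma, t_base_ms, t_draft_ms, t_verify_ms):
    """Tokens per ms with speculation divided by tokens per ms without."""
    cycle_ms = gamma * t_draft_ms + t_verify_ms
    return expected_tokens_per_cycle(accept_rate, gamma) * t_base_ms / cycle_ms

ACCEPT = 0.39     # measured acceptance rate (from this commit message)
VERIFY_MS = 13.0  # measured verify cost (from this commit message)
GAMMA = 4         # placeholder, not a measured value
T_BASE_MS = 10.0  # placeholder baseline decode step, not measured
T_DRAFT_MS = 4.0  # placeholder draft step, not measured

now = speedup_e2e(ACCEPT, GAMMA, T_BASE_MS, T_DRAFT_MS, VERIFY_MS)
free = speedup_e2e(ACCEPT, GAMMA, T_BASE_MS, T_DRAFT_MS, 0.0)
print(f"current verify: {now:.2f}x, verify free: {free:.2f}x")
```

Under this model the expected tokens per cycle can never exceed
1 / (1 - 0.39), about 1.64, however large gamma gets, so the draft cost
per cycle has to stay below roughly 1.64 baseline steps for any
speedup at all.
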
Reverts: none (Phase 24 attempts never landed on main). This commit
adds only the doc.