Swap forward_verify_paged_decode_attention_with_hidden's projections
from matmul_batched_gemv (per-row bit-exact GEMV) to matmul_2d (cuBLAS
GEMM at m>1). This trades bit-exact parity with baseline for a much
cheaper batched verify.
Micro-benchmark (bench-verify-cost.rs) shows the cost gap, measured
relative to one single-token decode step:
  batched-GEMV verify: 1.05× → 5.14× single decode (linear in batch)
  cuBLAS-GEMM verify:  1.04× → 1.20× single decode (nearly flat)
At batch=9 the GEMV path costs about 4.3× as much as the GEMM path
(5.14 / 1.20): cuBLAS amortizes K/V load across all queries, while
GEMV loads K/V for each row independently.
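The amortization argument above can be illustrated with a small CPU-side
cost model. This is only a sketch, not the actual kernels: the names
per_row_gemv and shared_gemm are hypothetical, and the "loads" counter
simply counts how often an element of the shared operand is read.

```rust
/// Per-row GEMV: every query row walks the whole shared matrix `w` again.
/// (Hypothetical stand-in, not the real matmul_batched_gemv.)
fn per_row_gemv(x: &[Vec<f32>], w: &[Vec<f32>], loads: &mut usize) -> Vec<Vec<f32>> {
    let n = w[0].len();
    x.iter()
        .map(|row| {
            let mut out = vec![0.0f32; n];
            for (k, w_row) in w.iter().enumerate() {
                for (j, &wv) in w_row.iter().enumerate() {
                    *loads += 1; // re-read for each row of x
                    out[j] += row[k] * wv;
                }
            }
            out
        })
        .collect()
}

/// GEMM-style loop: each element of `w` is read once and reused by all rows.
/// (Idealized model of the reuse; a real GEMM tiles and caches instead.)
fn shared_gemm(x: &[Vec<f32>], w: &[Vec<f32>], loads: &mut usize) -> Vec<Vec<f32>> {
    let n = w[0].len();
    let mut out = vec![vec![0.0f32; n]; x.len()];
    for (k, w_row) in w.iter().enumerate() {
        for (j, &wv) in w_row.iter().enumerate() {
            *loads += 1; // read once, shared across the batch
            for (i, row) in x.iter().enumerate() {
                out[i][j] += row[k] * wv;
            }
        }
    }
    out
}

fn main() {
    // 9 query rows against a 4x3 shared matrix (toy sizes).
    let x: Vec<Vec<f32>> = (0..9).map(|i| vec![i as f32, 1.0, 0.5, 2.0]).collect();
    let w: Vec<Vec<f32>> = (0..4).map(|k| vec![k as f32, 1.0, -1.0]).collect();
    let (mut gemv_loads, mut gemm_loads) = (0usize, 0usize);
    per_row_gemv(&x, &w, &mut gemv_loads);
    shared_gemm(&x, &w, &mut gemm_loads);
    println!("gemv loads = {gemv_loads}, gemm loads = {gemm_loads}");
}
```

In this toy model the per-row path reads the shared operand once per
row, so its load count scales with the batch, while the GEMM-style path
stays constant. Both loops here accumulate in the same order, so their
outputs match exactly; a real cuBLAS GEMM is free to choose a different
order.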
γ sweep on dash5 (Qwen3-8B + Qwen3-8B_eagle3), 50 prompts × 64 tokens:
  γ=2: acceptance=16.9%, speedup_e2e = 1.10× ← best
  γ=3: acceptance=11.6%, speedup_e2e = 1.06×
  γ=4: acceptance=8.9%,  speedup_e2e = 1.02×
  γ>4: speedup keeps dropping, because acceptance falls faster than
       the cheaper verify can compensate.
Tradeoff: matched=false. The spec output diverges from baseline
single-decode by a few tokens per prompt because cuBLAS GEMM at m>1
rounds BF16 differently from the custom GEMV at m=1, so the K/V bytes
written by verify are not bit-exact with what a single-token decode
would write. Downstream, this compounds into slightly different token
choices.
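The rounding effect can be reproduced in miniature. The sketch below is
an analogy under stated assumptions, not the real kernels: it uses f32
rather than BF16 (Rust's standard library has no BF16 type), and the two
hypothetical functions stand in for "one accumulation order" versus
"another accumulation order" over the same inputs.

```rust
/// Left-to-right accumulation, as a simple per-row loop might do.
fn sum_sequential(v: &[f32]) -> f32 {
    v.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Two partial sums combined at the end, as a tiled kernel might do.
fn sum_two_halves(v: &[f32]) -> f32 {
    let (lo, hi) = v.split_at(v.len() / 2);
    sum_sequential(lo) + sum_sequential(hi)
}

fn main() {
    // Same four inputs, two accumulation orders.
    let v = [1.0e8f32, 1.0, -1.0e8, 1.0];
    println!("sequential = {}", sum_sequential(&v));
    println!("two halves = {}", sum_two_halves(&v));
}
```

The exact sum is 2.0, yet the sequential order yields 1.0 and the
split order yields 0.0, because 1.0 is below the f32 spacing near 1e8
and is rounded away at different points. BF16 keeps far fewer mantissa
bits, so the same kind of order dependence appears at much smaller
magnitudes.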
The spec output is still a valid target-model output; it just takes a
different numerical path. Qualitatively the outputs are equivalent
(both are coherent English continuations of the prompt). This matches
the common interpretation of "lossless spec decoding": the target
distribution is preserved modulo BF16 rounding, rather than being
bit-exact with one specific numerical path.
New: crates/xserv-model/src/bin/bench-verify-cost.rs, a micro-benchmark
that measures verify cost at various batch sizes, isolating the impact
of the GEMV vs GEMM choice.
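For readers unfamiliar with this style of measurement, a timing harness
of this kind could look roughly like the following. This is a
hypothetical sketch and not the contents of bench-verify-cost.rs: the
names median_time, relative_cost, and busy are invented for
illustration, and the workload is a CPU stand-in. A GPU benchmark would
additionally need to synchronize the device before reading the timer.

```rust
use std::time::{Duration, Instant};

/// Run `f` `warmup` times untimed, then `iters` times timed; return the median.
fn median_time<F: FnMut()>(mut f: F, warmup: usize, iters: usize) -> Duration {
    assert!(iters > 0, "need at least one timed iteration");
    for _ in 0..warmup {
        f();
    }
    let mut samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[iters / 2]
}

/// Verify cost expressed as a multiple of one single-token decode step.
fn relative_cost(verify: Duration, single_decode: Duration) -> f64 {
    verify.as_secs_f64() / single_decode.as_secs_f64()
}

/// CPU stand-in workload; black_box keeps the loop from being optimized out.
fn busy(n: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..n {
        acc = std::hint::black_box(acc.wrapping_add(i));
    }
    acc
}

fn main() {
    let single = median_time(|| { busy(100_000); }, 3, 21);
    let verify = median_time(|| { busy(900_000); }, 3, 21);
    println!("verify / single = {:.2}x", relative_cost(verify, single));
}
```

Reporting the median of several timed runs after a warmup reduces the
influence of one-off stalls, and dividing by the single-decode time
gives the "× single decode" unit used in the numbers above.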