xserv

Files

Gahow Wang 06a798cab9 eagle3: cuBLAS-GEMM verify path — speedup_e2e > 1 achieved 🎉

Swap forward_verify_paged_decode_attention_with_hidden's projections
from matmul_batched_gemv (per-row bit-exact GEMV) to matmul_2d (cuBLAS
GEMM at m>1). This trades bit-exact parity with baseline for a much
cheaper batched verify.

Micro-benchmark (bench-verify-cost.rs) reveals the huge cost gap:
  batched-GEMV verify: 1.05× → 5.14× single decode (linear in batch)
  cuBLAS-GEMM verify:  1.04× → 1.20× single decode (nearly flat)

At batch=9 the difference is 4.3× (5.14 / 1.20): cuBLAS amortizes the
K/V load across all queries, while GEMV loads K/V for each row
independently.
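
A rough way to see why one curve is linear in batch and the other nearly
flat is to count how often the shared operand (the K/V load in the text
above) is streamed from memory. The sketch below is an idealized model,
not the kernel code: the function names and the 4096 × 4096 shape are
hypothetical, and a real GEMM tiles its work, so it only approaches the
"read once" bound.

```rust
/// Elements of the shared `k x n` operand read when `m` query rows are
/// processed as `m` independent GEMVs (assumed model: no reuse across rows).
fn shared_reads_batched_gemv(m: usize, k: usize, n: usize) -> usize {
    // Every row streams the whole operand again.
    m * k * n
}

/// Elements of the shared operand read by an idealized GEMM that loads
/// the operand once and reuses it for all `m` rows.
fn shared_reads_gemm(_m: usize, k: usize, n: usize) -> usize {
    k * n
}

fn main() {
    // Hypothetical projection shape, chosen only for illustration.
    let (k, n) = (4096, 4096);
    for m in [1, 2, 4, 9] {
        let gemv = shared_reads_batched_gemv(m, k, n);
        let gemm = shared_reads_gemm(m, k, n);
        println!("m={m}: GEMV/GEMM traffic ratio = {}", gemv / gemm);
    }
    // Measured cost ratio at the largest batch, from the figures above.
    println!("measured ratio: {:.2}", 5.14_f64 / 1.20_f64);
}
```

In this model the traffic ratio grows with the batch size, which is
consistent with the measured GEMV cost growing roughly linearly while
the GEMM cost stays close to a single decode.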

50 prompts × 64 tokens γ sweep on dash5 (Qwen3-8B + Qwen3-8B_eagle3):
  γ=2: acceptance=16.9%, speedup_e2e = 1.10× ← best
  γ=3: acceptance=11.6%, speedup_e2e = 1.06×
  γ=4: acceptance=8.9%,  speedup_e2e = 1.02×
  γ>4: speedup keeps dropping, because acceptance falls faster than the
       cheaper verify can make up for.
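
The tradeoff in the sweep can be sketched with a simple model of
speculative decoding. It assumes each drafted token is accepted
independently with probability `alpha`, which may not match how
"acceptance" is computed in this benchmark, and it takes the draft cost
per token as a free parameter (hypothetical; it is not reported above).
Treat it as a way to reason about the trend, not as a reproduction of
the measured numbers.

```rust
/// Expected tokens emitted per verify step when each of `gamma` drafted
/// tokens is accepted independently with probability `alpha`.
/// The leading 1 is the token the target model always contributes.
fn expected_tokens_per_step(alpha: f64, gamma: u32) -> f64 {
    // 1 + alpha + alpha^2 + ... + alpha^gamma
    (0..=gamma).map(|i| alpha.powi(i as i32)).sum::<f64>()
}

/// Modeled end-to-end speedup: tokens per step divided by the cost of a
/// step, with all costs expressed in units of one single-token decode.
fn modeled_speedup(alpha: f64, gamma: u32, verify_cost: f64, draft_cost: f64) -> f64 {
    let step_cost = verify_cost + gamma as f64 * draft_cost;
    expected_tokens_per_step(alpha, gamma) / step_cost
}

fn main() {
    // Acceptance values from the sweep above; verify and draft costs are
    // placeholders (assumptions), not measurements.
    let (verify_cost, draft_cost) = (1.1, 0.0);
    for (gamma, alpha) in [(2u32, 0.169), (3, 0.116), (4, 0.089)] {
        let s = modeled_speedup(alpha, gamma, verify_cost, draft_cost);
        println!("gamma={gamma} alpha={alpha}: modeled speedup = {s:.3}");
    }
}
```

With low acceptance the expected tokens per step stay close to 1, so
the result is sensitive to the verify cost: any increase in it, or any
nonzero draft cost, eats directly into the speedup.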

Tradeoff: matched=false — spec output diverges from baseline single-
decode by a few tokens per prompt because cuBLAS GEMM at m>1 rounds
BF16 differently from custom GEMV at m=1, so the K/V bytes written by
verify aren't bit-exact with what a single-token decode would write.
Downstream this compounds into slightly different token choices.
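
One way two kernels can disagree in BF16 is by rounding at different
points of the accumulation. The sketch below illustrates that on the
CPU with a hand-written BF16 rounding helper; it does not claim to
reproduce what cuBLAS or the custom GEMV actually do internally, and
all function names are hypothetical.

```rust
/// Rounds a finite f32 to the nearest BF16 value (ties to even) and
/// returns it widened back to f32. NaN/inf are out of scope here.
fn round_to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    let lsb = (bits >> 16) & 1; // lowest bit that survives truncation
    let rounded = bits.wrapping_add(0x7FFF + lsb) & 0xFFFF_0000;
    f32::from_bits(rounded)
}

/// Dot product that rounds the accumulator to BF16 after every step.
fn dot_round_each_step(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0_f32, |acc, (x, y)| round_to_bf16(acc + x * y))
}

/// Dot product that accumulates in f32 and rounds to BF16 once.
fn dot_round_once(a: &[f32], b: &[f32]) -> f32 {
    round_to_bf16(a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>())
}

fn main() {
    // 2^-8 is half a BF16 ulp at 1.0, so adding it to 1.0 ties back to 1.0.
    let tiny = 0.00390625_f32;
    let a = [1.0_f32, tiny, tiny];
    let b = [1.0_f32, 1.0, 1.0];
    println!("round each step: {}", dot_round_each_step(&a, &b));
    println!("round once:      {}", dot_round_once(&a, &b));
}
```

The two functions see the same inputs but return different BF16
values, differing in the last mantissa bit. A difference of that size
in a written K/V entry is the kind of discrepancy that can later tip
a token choice.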

The spec output is still a VALID target model output; it just takes a
different numerical path. Both outputs are coherent English
continuations of the prompt. This matches the usual reading of
"lossless spec decoding": the target distribution is preserved modulo
BF16 rounding, rather than being bit-exact with one specific numerical
path.

New: crates/xserv-model/src/bin/bench-verify-cost.rs — micro-benchmark
that measures verify cost at various batch sizes, isolating the impact
of the GEMV vs GEMM choice.

2026-07-01 19:58:23 +08:00

xserv-cuda

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-distributed

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-kernels

speculative: batched-GEMV kernel for verify path (Phase 24 step 1)

2026-07-01 16:13:37 +08:00

xserv-model

eagle3: cuBLAS-GEMM verify path — speedup_e2e > 1 achieved 🎉