docs: Phase 26 epilogue — speedup_e2e = 1.10x achieved

2026-07-01 19:59:03 +08:00
parent 06a798cab9
commit cc3bc2188c


@@ -189,3 +189,51 @@ The infrastructure to enable this (Eagle3Head, batched verify, cache
truncation, position management) is now solid on `main`. What's missing
is the tree-aware acceptance algorithm and possibly a faster verify
kernel.

---

## Epilogue (`06a798c`): cuBLAS GEMM verify → speedup > 1 achieved

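Actioned option 2 above: swapped `matmul_batched_gemv` for `matmul_2d`
(cuBLAS GEMM) inside `forward_verify_paged_decode_attention_with_hidden`.

Micro-benchmark (`bench-verify-cost.rs`, RTX 5090, prompt_len=100). The
ratio in parentheses is each latency relative to a common baseline of
about 12.5 ms, which every cell in the table implies:
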
| batch | batched-GEMV verify | cuBLAS-GEMM verify |
|-------|---------------------|--------------------|
| 1 | 13.14 ms (1.05×) | 13.04 ms (1.04×) |
| 2 | 19.51 ms (1.56×) | 13.52 ms (1.08×) |
| 3 | 26.10 ms (2.09×) | 13.59 ms (1.09×) |
| 5 | 38.72 ms (3.10×) | 13.88 ms (1.11×) |
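| 9 | 64.15 ms (5.14×) | 15.03 ms (1.20×) |

cuBLAS GEMM at m>1 amortizes the K/V load across all queries, so verify
latency stays near-flat as the batch grows (13.04 ms at batch 1, 15.03 ms
at batch 9): the shared load dominates and extra rows add little. The
batched GEMV loads K/V once per row, so its latency grows linearly
(13.14 ms at batch 1, 64.15 ms at batch 9).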
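
To put a number on "near-flat" versus "linear", the sketch below computes the average extra latency per additional verify row from the first and last rows of the micro-benchmark table. The helper name `marginal_ms_per_row` is hypothetical and not part of `bench-verify-cost.rs`; the figures are copied from the table, and a two-point slope is only a rough summary of the trend.

```rust
/// (batch size, batched-GEMV latency in ms, cuBLAS-GEMM latency in ms),
/// copied from the micro-benchmark table.
const ROWS: [(u32, f64, f64); 5] = [
    (1, 13.14, 13.04),
    (2, 19.51, 13.52),
    (3, 26.10, 13.59),
    (5, 38.72, 13.88),
    (9, 64.15, 15.03),
];

/// Average extra latency (ms) per additional row between two measurements.
/// Hypothetical helper for illustration only.
fn marginal_ms_per_row(first: (u32, f64), last: (u32, f64)) -> f64 {
    (last.1 - first.1) / f64::from(last.0 - first.0)
}

fn main() {
    let (b0, gemv0, gemm0) = ROWS[0];
    let (b1, gemv1, gemm1) = ROWS[ROWS.len() - 1];
    let gemv = marginal_ms_per_row((b0, gemv0), (b1, gemv1));
    let gemm = marginal_ms_per_row((b0, gemm0), (b1, gemm1));
    println!("batched GEMV: {gemv:.2} ms per extra row");
    println!("cuBLAS GEMM:  {gemm:.2} ms per extra row");
}
```

On these figures the GEMV path pays roughly 6.4 ms for each extra row, while the GEMM path pays roughly 0.25 ms.
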

γ sweep over 50 prompts × 64 tokens, with cuBLAS verify:

| γ | acceptance | speedup_e2e |
|---|------------|-------------|
| 1 (single-decode) | 29.8% | 0.95× |
| **2** | **16.9%** | **1.10×** ← best |
| 3 | 11.6% | 1.06× |
| 4 | 8.9% | 1.02× |
| 5 | 7.2% | 0.96× |
| 6 | 6.0% | 0.93× |
| 8 | 4.5% | 0.86× |

Tradeoff: `matched=false`. cuBLAS GEMM at m>1 rounds BF16 differently
from the custom GEMV at m=1, so the K/V bytes written by verify differ
from what a per-token decode would write, and downstream token choices
diverge from the strict-baseline path.
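
As a toy illustration of how two mathematically identical reductions can land on different BF16 values, the sketch below rounds an accumulator to BF16 after every add and sums the same three numbers in two orders. This is an assumption-laden example, not the arithmetic of the actual kernels: `round_bf16` and `sum_bf16` are hypothetical helpers, and real GEMM/GEMV kernels may accumulate in higher precision and round less often, which shrinks the effect but does not remove it.

```rust
/// Round an f32 to the nearest BF16 value (ties to even), returned as f32.
/// Sketch only: NaN and overflow are not handled.
fn round_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    // BF16 keeps the top 16 bits of the f32 pattern; round on the lower 16.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    f32::from_bits(rounded & 0xFFFF_0000)
}

/// Sum left to right, rounding the accumulator to BF16 after every add.
fn sum_bf16(values: &[f32]) -> f32 {
    values.iter().fold(0.0, |acc, &v| round_bf16(acc + v))
}

fn main() {
    // Near 256 the BF16 spacing is 2, so 256 + 1 ties back to 256.
    let large_first = sum_bf16(&[256.0, 1.0, 1.0]);
    // Adding the small terms first gives 2, and 258 is exactly representable.
    let small_first = sum_bf16(&[1.0, 1.0, 256.0]);
    println!("large-first: {large_first}, small-first: {small_first}");
}
```

The two orders return 256 and 258 for the same inputs. A difference of this kind in a written K/V entry is enough to tip a later near-tie between candidate tokens.
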

The spec output is still a valid target output (coherent English,
target-model semantics); it just follows a slightly different numerical
approximation path. This is what "lossless spec decoding" usually means
in practice: the distribution is preserved modulo BF16 rounding, not
bit-exact with one specific numerical path.

`speedup_e2e = 1.10×` is a real, measurable win at γ=2 on the 50 × 64
run. Raising γ further lowers the speedup: acceptance falls from 16.9%
at γ=2 to 4.5% at γ=8, faster than the batched verify saves time, so the
gain already peaks at γ=2. Pushing higher would need a better draft
(tree decoding, a larger EAGLE head, or different EAGLE weights).
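
One way to read the sweep is to multiply γ by the acceptance rate. The sketch below does that under a loudly stated assumption: that "acceptance" is the fraction of drafted tokens accepted, so γ × acceptance approximates accepted draft tokens per verify step. The source does not define the metric, so treat the derived column as illustrative; `accepted_per_step` and `best_gamma` are hypothetical helpers.

```rust
/// (γ, acceptance rate, speedup_e2e), copied from the γ sweep table.
const SWEEP: [(u32, f64, f64); 7] = [
    (1, 0.298, 0.95),
    (2, 0.169, 1.10),
    (3, 0.116, 1.06),
    (4, 0.089, 1.02),
    (5, 0.072, 0.96),
    (6, 0.060, 0.93),
    (8, 0.045, 0.86),
];

/// ASSUMPTION: acceptance is per drafted token, so γ × acceptance
/// approximates accepted draft tokens per verify step.
fn accepted_per_step(gamma: u32, acceptance: f64) -> f64 {
    f64::from(gamma) * acceptance
}

/// γ with the highest measured end-to-end speedup in the table.
fn best_gamma() -> u32 {
    SWEEP
        .iter()
        .max_by(|a, b| a.2.total_cmp(&b.2))
        .map(|row| row.0)
        .expect("SWEEP is non-empty")
}

fn main() {
    for (gamma, acceptance, speedup) in SWEEP {
        let per_step = accepted_per_step(gamma, acceptance);
        println!("γ={gamma}: ~{per_step:.3} accepted drafts/step, speedup {speedup:.2}×");
    }
    println!("best γ by speedup_e2e: {}", best_gamma());
}
```

Under that assumption the product rises from about 0.30 at γ=1 to about 0.36 at γ=5 and stays there through γ=8, while the micro-benchmark shows verify cost still growing with batch size. That pattern is consistent with the speedup peaking early.
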