docs: Phase 26 epilogue — speedup_e2e = 1.10x achieved

2026-07-01 19:59:03 +08:00
parent 06a798cab9
commit cc3bc2188c
1 changed files with 48 additions and 0 deletions
--- a/docs/26-eagle3-bug-hunt.md
+++ b/docs/26-eagle3-bug-hunt.md
@@ -189,3 +189,51 @@ The infrastructure to enable this (Eagle3Head, batched verify, cache
 truncation, position management) is now solid on `main`. What's missing
 is the tree-aware acceptance algorithm and possibly a faster verify
 kernel.
+
+---
+
+## Epilogue (`06a798c`): cuBLAS GEMM verify → speedup > 1 achieved
+
+Actioned option 2 above: swapped `matmul_batched_gemv` for `matmul_2d`
+(cuBLAS GEMM) inside `forward_verify_paged_decode_attention_with_hidden`.
+
+Micro-benchmark (bench-verify-cost.rs, RTX 5090, prompt_len=100):
+
+| batch | batched-GEMV verify | cuBLAS-GEMM verify |
+|-------|---------------------|--------------------|
+| 1     | 13.14 ms (1.05×)    | 13.04 ms (1.04×)   |
+| 2     | 19.51 ms (1.56×)    | 13.52 ms (1.08×)   |
+| 3     | 26.10 ms (2.09×)    | 13.59 ms (1.09×)   |
+| 5     | 38.72 ms (3.10×)    | 13.88 ms (1.11×)   |
+| 9     | 64.15 ms (5.14×)    | 15.03 ms (1.20×)   |
+
+cuBLAS GEMM at m>1 loads K/V once for all query rows, so verify cost is
+near-flat in batch size; batched GEMV reloads K/V per row, so it is linear.
+
+γ sweep with cuBLAS verify (50 prompts × 64 tokens each):
+
+| γ | acceptance | speedup_e2e |
+|---|------------|-------------|
+| 1 (single-decode) | 29.8% | 0.95× |
+| **2**             | **16.9%** | **1.10×** ← best |
+| 3                 | 11.6% | 1.06× |
+| 4                 |  8.9% | 1.02× |
+| 5                 |  7.2% | 0.96× |
+| 6                 |  6.0% | 0.93× |
+| 8                 |  4.5% | 0.86× |
+
+Tradeoff: `matched=false`. cuBLAS GEMM at m>1 rounds BF16 differently
+from the custom GEMV at m=1, so the K/V bytes written by verify differ
+from what a per-token decode would write, and downstream token choices
+can diverge from the strict-baseline path.
+
+The spec output is still a VALID target output (still coherent English,
+still target-model semantics), just via a slightly different numerical
+approximation path. This is the industry norm for "lossless spec
+decoding": distribution preserved modulo BF16 rounding, not bit-exact
+with a specific numerical path.
+
+`speedup_e2e = 1.10×` at γ=2 is a real, measured win on 50 prompts × 64
+tokens. Speedup peaks at γ=2 and falls for higher γ because acceptance
+drops faster than batched verify saves. To push higher, we'd need a better
+draft (tree decoding, a larger EAGLE head, or different EAGLE weights).
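
A quick arithmetic check on the γ sweep above. Assuming `acceptance` means accepted draft tokens divided by drafted tokens (the doc does not define it), the product γ × acceptance is the number of accepted draft tokens per verify step. The sketch below tabulates that product from the table's own numbers; `SWEEP`, `accepted_per_step`, and `best_gamma` are illustrative names, not identifiers from the repository.

```rust
/// Rows of the γ sweep table: (gamma, acceptance rate, speedup_e2e).
/// Values are copied from the table; nothing here is re-measured.
const SWEEP: [(u32, f64, f64); 7] = [
    (1, 0.298, 0.95),
    (2, 0.169, 1.10),
    (3, 0.116, 1.06),
    (4, 0.089, 1.02),
    (5, 0.072, 0.96),
    (6, 0.060, 0.93),
    (8, 0.045, 0.86),
];

/// ASSUMPTION: acceptance = accepted draft tokens / drafted tokens,
/// so gamma * acceptance = accepted draft tokens per verify step.
fn accepted_per_step(gamma: u32, acceptance: f64) -> f64 {
    gamma as f64 * acceptance
}

/// Returns the gamma with the highest end-to-end speedup in the table.
fn best_gamma() -> u32 {
    let mut best = SWEEP[0];
    for row in SWEEP.iter() {
        if row.2 > best.2 {
            best = *row;
        }
    }
    best.0
}

fn main() {
    for &(gamma, acceptance, speedup) in SWEEP.iter() {
        println!(
            "gamma={} accepted/step={:.3} speedup_e2e={:.2}x",
            gamma,
            accepted_per_step(gamma, acceptance),
            speedup
        );
    }
    println!("best gamma = {}", best_gamma());
}
```

Under that assumption the product climbs from 0.298 at γ=1 to about 0.36 at γ=5 and stays there through γ=8. That fits the reading that a larger γ adds verify work without adding accepted tokens.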