diff --git a/docs/26-eagle3-bug-hunt.md b/docs/26-eagle3-bug-hunt.md
index 19d8971..754e30c 100644
--- a/docs/26-eagle3-bug-hunt.md
+++ b/docs/26-eagle3-bug-hunt.md
@@ -189,3 +189,51 @@
 The infrastructure to enable this (Eagle3Head, batched verify, cache truncation,
 position management) is now solid on `main`. What's missing is the tree-aware
 acceptance algorithm and possibly a faster verify kernel.
+
+---
+
+## Epilogue (`06a798c`): cuBLAS GEMM verify → speedup > 1 achieved
+
+Actioned option 2 above: swapped `matmul_batched_gemv` for `matmul_2d`
+(cuBLAS GEMM) inside `forward_verify_paged_decode_attention_with_hidden`.
+
+Micro-benchmark (bench-verify-cost.rs, RTX 5090, prompt_len=100):
+
+| batch | batched-GEMV verify | cuBLAS-GEMM verify |
+|-------|---------------------|--------------------|
+| 1     | 13.14 ms (1.05×)    | 13.04 ms (1.04×)   |
+| 2     | 19.51 ms (1.56×)    | 13.52 ms (1.08×)   |
+| 3     | 26.10 ms (2.09×)    | 13.59 ms (1.09×)   |
+| 5     | 38.72 ms (3.10×)    | 13.88 ms (1.11×)   |
+| 9     | 64.15 ms (5.14×)    | 15.03 ms (1.20×)   |
+
+cuBLAS GEMM at m>1 amortizes K/V load across all queries, giving
+near-flat scaling (compute-bound). GEMV loads K/V per row → linear.
+
+γ sweep with cuBLAS verify (50 prompts × 64 tokens):
+
+| γ | acceptance | speedup_e2e |
+|---|------------|-------------|
+| 1 (single-decode) | 29.8% | 0.95× |
+| **2** | **16.9%** | **1.10×** ← best |
+| 3 | 11.6% | 1.06× |
+| 4 | 8.9% | 1.02× |
+| 5 | 7.2% | 0.96× |
+| 6 | 6.0% | 0.93× |
+| 8 | 4.5% | 0.86× |
+
+Tradeoff: `matched=false`. cuBLAS GEMM at m>1 rounds BF16 differently
+from the custom GEMV at m=1. The K/V bytes written by verify differ from
+what a per-token decode would write, so downstream token choices diverge
+from the strict-baseline path.
+
+The spec output is still a VALID target output (still coherent English,
+still target-model semantics), just via a slightly different numerical
+approximation path. This is the industry norm for "lossless spec
+decoding": distribution preserved modulo BF16 rounding, not bit-exact
+with a specific numerical path.
+
+`speedup_e2e = 1.10×` is a real, measurable win at γ=2 on 50×64 prompts.
+Higher γ gives diminishing returns because acceptance drops faster than
+verify saves time (the speedup already peaks at γ=2). To push higher, we'd
+need a better draft (tree decoding, larger EAGLE head, or different EAGLE weights).
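
Note on the `matched=false` tradeoff above: a minimal, self-contained sketch of how two kernels can disagree after BF16 rounding even when they compute the same mathematical sum. It assumes the divergence comes from a different accumulation order, which the epilogue does not state explicitly. The helper names (`round_to_bf16`, `sum_forward`, `sum_reverse`, `example_terms`) are hypothetical and not part of the codebase, and the rounding helper ignores NaN and infinity.

```rust
/// Round an f32 to the nearest bfloat16 value (ties to even) and return it
/// widened back to f32. Illustrative only: NaN/infinity are not handled.
fn round_to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    // Lowest bit that survives truncation; used for round-to-nearest-even.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb) & 0xFFFF_0000;
    f32::from_bits(rounded)
}

/// Accumulate left to right in f32.
fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, &x| acc + x)
}

/// Accumulate right to left in f32 (stand-in for a different reduction order).
fn sum_reverse(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0_f32, |acc, &x| acc + x)
}

/// Terms chosen so the exact sum sits just above a bf16 rounding midpoint.
fn example_terms() -> [f32; 6] {
    let tiny = 1.0_f32 / 33_554_432.0; // 2^-25, a quarter of an f32 ulp at 1.0
    [1.0, 0.00390625, tiny, tiny, tiny, tiny] // 0.00390625 = 2^-8
}

fn main() {
    let terms = example_terms();
    let forward = sum_forward(&terms);
    let reverse = sum_reverse(&terms);
    // Forward order absorbs each tiny term, landing exactly on the midpoint;
    // reverse order first combines the tiny terms into one f32 ulp.
    println!("forward: f32 bits = {:#010x}, bf16 = {}", forward.to_bits(), round_to_bf16(forward));
    println!("reverse: f32 bits = {:#010x}, bf16 = {}", reverse.to_bits(), round_to_bf16(reverse));
}
```

In this constructed case the two orders differ by a single f32 ulp, yet they round to adjacent BF16 values (1.0 and 1.0078125). A difference of that size written into the K/V cache is the kind of perturbation that could later flip a token choice, which is consistent with the divergence from the strict-baseline path described above.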