Adds infrastructure for γ≥2 EAGLE speculative decoding:
qwen3.rs:
- New forward_verify_paged_decode_attention_with_hidden: same as the
existing verify but also captures target hidden states at 3 hook
layers, one per verify position. Needed to seed next round's EAGLE.
eagle3.rs:
- step split into step (unchanged public API) + step_with_aux (also
returns final hidden state) + step_recursive (takes fused_h directly,
no fc+3-hidden combine). This mirrors the EAGLE3 paper: γ=1 uses
target hooks + fc; γ≥2 uses previous EAGLE aux as fused_h for
subsequent drafts, approximating target hidden.
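A minimal sketch of the recursive drafting loop described above, not the actual eagle3.rs code. The trait `DraftHead`, the function `draft_recursive`, the toy `ToyHead`, and every signature here are hypothetical stand-ins. The seed fused state is assumed to have been computed already from the target hooks + fc.

```rust
/// Hypothetical stand-in for the EAGLE draft head; not the real eagle3.rs API.
trait DraftHead {
    /// Takes a fused hidden state and the last token; returns the drafted
    /// token plus the head's own final hidden state (the "aux" output).
    fn step_with_aux(&mut self, fused_h: &[f32], token: u32) -> (u32, Vec<f32>);
}

/// Draft `gamma` tokens. The first step consumes the seed fused state
/// (assumed to come from target hooks + fc); each later step reuses the
/// previous step's aux output as fused_h, approximating the target hidden.
fn draft_recursive<D: DraftHead>(
    head: &mut D,
    seed_fused_h: &[f32],
    prev_token: u32,
    gamma: usize,
) -> Vec<u32> {
    let mut fused_h = seed_fused_h.to_vec();
    let mut token = prev_token;
    let mut drafts = Vec::with_capacity(gamma);
    for _ in 0..gamma {
        let (next, aux) = head.step_with_aux(&fused_h, token);
        drafts.push(next);
        fused_h = aux; // feed the draft head's own hidden state forward
        token = next;
    }
    drafts
}

/// Toy head used only to exercise the loop: next token = token + 1.
struct ToyHead;

impl DraftHead for ToyHead {
    fn step_with_aux(&mut self, fused_h: &[f32], token: u32) -> (u32, Vec<f32>) {
        let aux: Vec<f32> = fused_h.iter().map(|x| x + 1.0).collect();
        (token + 1, aux)
    }
}
```

With the toy head, draft_recursive(&mut ToyHead, &[0.0], 5, 3) yields three consecutive tokens starting after prev_token; the real head would return model predictions instead.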
bench-eagle3.rs:
- New run_eagle_gamma_multi function with --gamma CLI (default 2).
- Per round: draft γ tokens recursively with EAGLE, verify
[prev_token, d0..d_{γ-1}] in one target forward, accept the longest
matching prefix, then produce the correction with 1 more target decode.
- max_seqs bumped to 16 in the paged cache so verify can batch up to
16 rows.
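A minimal sketch of the "accept longest prefix" step in the per-round flow above, assuming greedy (argmax) verification. The function name `accept_longest_prefix` and the slice layout are hypothetical: `target_argmax[i]` is assumed to be the target's prediction for the token that follows verify row i, so it is compared against draft d_i.

```rust
/// Hypothetical helper, not the bench-eagle3.rs code.
/// `drafts` holds d0..d_{γ-1}; `target_argmax[i]` is the target's greedy
/// token after verify row i (row 0 is prev_token, row i+1 is d_i).
/// Returns how many leading drafts the target agrees with.
fn accept_longest_prefix(drafts: &[u32], target_argmax: &[u32]) -> usize {
    drafts
        .iter()
        .zip(target_argmax.iter())
        .take_while(|(d, t)| d == t) // stop at the first disagreement
        .count()
}
```

Everything after the returned count is discarded; in this commit the replacement token comes from the separate correction decode, not from this helper.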
γ=2 test result (5 prompts × 32 tokens, dash5):
matched=false — sequences diverge
acceptance_rate = 29.8% at γ=2 (~1.1 tokens accepted per draft)
speedup_e2e = 0.52x (SLOWER than baseline)
The divergence is suspected to come from the verify pass re-writing
prev_token's K/V at position round_pos-1. In principle
matmul_batched_gemv at row 0 should be bit-exact with the seed decode's
launch_gemv_bf16, but the output sequences still diverge. Investigation
pending; other candidates are the correction decode step and the
seed_hooks position offset.
γ=1 path still works correctly (matched=true, acceptance 20%,
speedup 0.95x) from the previous commit. The γ≥2 path is scaffolded
but not yet correct — next step is to debug the verify-write path,
then measure real speedup.