bench-eagle3.rs runs the full loop: prefill → for each output token, one
EAGLE draft + one target decode with hidden state hook. Measures
acceptance rate and speedup vs pure target decode.

First numbers on dash5 (10 prompts × 32 tokens, γ=1):

  matched=true (10/10)
  acceptance_rate=1.3% (4/300)     ← should be ~60-70% per EAGLE3 paper
  speedup_e2e=0.95×                ← below 1 because γ=1 does 1 target decode
                                     per output token regardless of acceptance
  target_steps=320 for 320 tokens
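
For reference, the headline numbers can be reproduced from the raw counts. This is an illustrative sketch, not the code in bench-eagle3.rs: all function names are hypothetical, and draft_overhead_ratio assumes the only extra cost over the baseline is the draft step, i.e. speedup = t_target / (t_target + t_draft).

```rust
/// Fraction of drafted tokens that equal the target's greedy token.
/// Returns 0.0 when nothing was drafted, to avoid dividing by zero.
fn acceptance_rate(accepted: usize, drafted: usize) -> f64 {
    if drafted == 0 {
        0.0
    } else {
        accepted as f64 / drafted as f64
    }
}

/// End-to-end speedup: baseline wall time divided by speculative wall time.
fn speedup_e2e(baseline_secs: f64, spec_secs: f64) -> f64 {
    baseline_secs / spec_secs
}

/// ASSUMPTION: every output token costs one target decode plus one draft
/// step, so speedup = t_target / (t_target + t_draft). Solving for the
/// ratio t_draft / t_target gives 1 / speedup - 1.
fn draft_overhead_ratio(speedup: f64) -> f64 {
    1.0 / speedup - 1.0
}

fn main() {
    // Counts taken from the run above.
    let rate = acceptance_rate(4, 300);
    println!("acceptance_rate={:.1}%", rate * 100.0);

    // Hypothetical timings chosen only to reproduce the 0.95x figure.
    let speedup = speedup_e2e(9.5, 10.0);
    println!("speedup_e2e={:.2}x", speedup);
    println!(
        "implied draft cost={:.1}% of a target decode",
        draft_overhead_ratio(speedup) * 100.0
    );
}
```

Under that assumption, 0.95× corresponds to a draft step costing roughly 5% of a target decode, which is consistent with γ=1 never skipping a target step in this bench.
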
Positive: the plumbing is correct. Target and EAGLE both run without error,
output sequences match baseline, and all shapes/dtypes check out. The
earlier sanity check showed that EAGLE's top-5 contains thematically
plausible tokens (Paris/Tokyo/Madrid for "capital of France is").
Negative: 1.3% acceptance means EAGLE's top-1 almost never matches the
target's greedy top-1. Root causes to investigate:
1. Token/hook pairing convention. Paper uses (h_that_produced_t_i, t_i)
→ predicts t_{i+1}. My bench does the same, but the earlier sanity check
suggested the pairing might be off by one.
2. Missing "training-time test" projection: EAGLE was trained to feed
its own prev output as fused_h for the next step (γ>1 chaining).
Currently we always use target hooks, which is what pairing A/B do
for γ=1, but may not be aligned with training-time behavior.
3. Hook site: I capture x AFTER the residual+MLP. Paper may want x
BEFORE, or the "hidden_states" as used by the final norm+lm_head.
Currently the same tensor feeds into the final norm during the target
forward, so what I have is the post-residual tensor. This needs to be
confirmed against the reference Python impl.
4. Weight loading: transposes assume [in,out] → [out,in]. Need to
validate at least one output layer's shape against the expected shape.
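
A cheap check for cause 1 is to score the drafts against the target sequence at several offsets. This is a minimal sketch with hypothetical names and toy data: it assumes draft[i] is EAGLE's top-1 when fed hook i and token t_i, so under the paper's convention it should match target[i + 1]. If another offset scores far higher than offset 1, the pairing is shifted.

```rust
/// Counts how many drafts equal the target token `offset` positions ahead.
/// Returns (hits, comparisons); positions past the end of `target` are skipped.
fn matches_at_offset(draft: &[u32], target: &[u32], offset: usize) -> (usize, usize) {
    let mut hits = 0;
    let mut total = 0;
    for (i, &d) in draft.iter().enumerate() {
        if let Some(&t) = target.get(i + offset) {
            total += 1;
            if d == t {
                hits += 1;
            }
        }
    }
    (hits, total)
}

fn main() {
    // Toy token ids (not real data): the drafts here are deliberately
    // shifted so that draft[i] predicts target[i + 2] instead of target[i + 1].
    let target = [10u32, 11, 12, 13, 14, 15];
    let draft = [12u32, 13, 14, 15];

    for offset in 0..=2 {
        let (hits, total) = matches_at_offset(&draft, &target, offset);
        println!("offset {}: {}/{} match", offset, hits, total);
    }
}
```

On real bench output, a peak at offset 0 or 2 would point at the pairing; uniformly low scores at every offset would point at causes 2 to 4 instead.
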
Next step (deferred to another session): download the AngelSlim reference
inference code, run the same prompt through it, and compare intermediate
activations at each stage to isolate the discrepancy.
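
The stage-by-stage comparison could look roughly like the sketch below. It assumes activations from both implementations have already been dumped as flat f32 buffers of equal length; the stage names, the tolerance, and all function names are hypothetical, and loading the reference dump is left out.

```rust
/// Largest element-wise absolute difference between two activation buffers.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "activation shapes differ");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Cosine similarity; useful when the two sides differ only by scale.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "activation shapes differ");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Name of the first stage whose max abs diff exceeds `tol`, if any.
fn first_divergence<'a>(
    stages: &[(&'a str, Vec<f32>, Vec<f32>)],
    tol: f32,
) -> Option<&'a str> {
    for (name, ours, reference) in stages {
        if max_abs_diff(ours, reference) > tol {
            return Some(*name);
        }
    }
    None
}

fn main() {
    // Toy values standing in for (stage, ours, reference) dumps.
    let stages = vec![
        ("embed", vec![0.1, 0.2], vec![0.1, 0.2]),
        ("fused_h", vec![1.0, 2.0], vec![1.0, 2.5]),
        ("logits", vec![3.0, 4.0], vec![0.0, 9.0]),
    ];
    for (name, ours, reference) in &stages {
        println!(
            "{}: max_abs_diff={} cos={}",
            name,
            max_abs_diff(ours, reference),
            cosine_similarity(ours, reference)
        );
    }
    println!("first divergence: {:?}", first_divergence(&stages, 1e-3));
}
```

Reporting the first diverging stage rather than only the final logits keeps the search local: everything upstream of that stage can be treated as matching the reference.
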