Two fixes to bring the EAGLE3 forward pass in line with vLLM's
llama_eagle3.py reference:

1. Residual chain: previously, the residual added in
   post_attention_layernorm was the token embedding, which is wrong.
   The reference follows the _norm_after_residual path:
       residual      = fused_h  (pre-norm)
       hidden_states = hidden_norm(fused_h)
   post_attention_layernorm is then a fused add_rmsnorm(attn_out, residual),
   and the final norm is another add_rmsnorm(mlp_out, residual_after_attn).
   Neither residual carries the embedding; both carry fused_h forward.
2. KV cache: previously, attention was approximated as "output = V"
   because seq_len=1 (no cache), which effectively gave EAGLE no history.
   Add a real per-Eagle3Head KV cache (1 layer × [1, num_kv_heads,
   max_seq_len, head_dim], BF16) that grows with each step() call. Each
   step runs the existing decode_attention kernel over a fresh contiguous
   slice of the cache. reset() resets current_len for a new sequence.
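
For reference, a minimal NumPy sketch of both fixes as described above. It
is not the code in this commit: rms_norm, add_rms_norm, draft_layer,
DraftKVCache and decode_attention_ref are hypothetical names, attn_fn and
mlp_fn are stand-ins for the real attention and MLP, and float32 is used
because NumPy has no BF16.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Plain RMSNorm over the last axis.
    var = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(var + eps) * weight

def add_rms_norm(x, residual, weight, eps=1e-6):
    # Fused add + RMSNorm: returns (norm(x + residual), x + residual).
    summed = x + residual
    return rms_norm(summed, weight, eps), summed

def draft_layer(fused_h, attn_fn, mlp_fn, w_hidden, w_post_attn, w_final):
    # Fix 1: the residual starts as the pre-norm fused_h, never the embedding.
    residual = fused_h
    hidden = rms_norm(fused_h, w_hidden)
    attn_out = attn_fn(hidden)
    hidden, residual_after_attn = add_rms_norm(attn_out, residual, w_post_attn)
    mlp_out = mlp_fn(hidden)
    out, _ = add_rms_norm(mlp_out, residual_after_attn, w_final)
    return out

class DraftKVCache:
    # Fix 2: one layer, shape [1, num_kv_heads, max_seq_len, head_dim].
    def __init__(self, num_kv_heads, max_seq_len, head_dim, dtype=np.float32):
        shape = (1, num_kv_heads, max_seq_len, head_dim)
        self.k = np.zeros(shape, dtype=dtype)
        self.v = np.zeros(shape, dtype=dtype)
        self.max_seq_len = max_seq_len
        self.current_len = 0

    def append(self, k_new, v_new):
        # k_new, v_new: [num_kv_heads, head_dim] for the token of this step.
        if self.current_len >= self.max_seq_len:
            raise RuntimeError("draft KV cache is full")
        pos = self.current_len
        self.k[0, :, pos, :] = k_new
        self.v[0, :, pos, :] = v_new
        self.current_len = pos + 1
        # Fresh contiguous copies of the filled prefix for the kernel.
        k = np.ascontiguousarray(self.k[:, :, : self.current_len, :])
        v = np.ascontiguousarray(self.v[:, :, : self.current_len, :])
        return k, v

    def reset(self):
        # Start a new sequence; old entries are overwritten on later appends.
        self.current_len = 0

def decode_attention_ref(q, k, v):
    # q: [num_heads, head_dim]; k, v: [1, num_kv_heads, seq_len, head_dim].
    num_heads, head_dim = q.shape
    group = num_heads // k.shape[1]
    k = np.repeat(k[0], group, axis=0)  # expand KV heads for GQA
    v = np.repeat(v[0], group, axis=0)
    scores = np.einsum("hd,hsd->hs", q, k) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hs,hsd->hd", probs, v)
```

With a single cached token the softmax is trivially 1, so the output
collapses to V. That is the old approximation; history only starts to
matter from the second step() onward.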

Result on 10 prompts × 32 tokens (γ=1, no batched verify yet):
    matched=true across all prompts
    acceptance_rate = 20.0% (was 4.7% before the residual fix, 1.3% originally)
    - Prompt 00 "The capital of France is": 60% (18/30), best case
    - Other prompts: 10-25%, consistent with the EAGLE paper's observation
      that structured/factual prompts get higher acceptance

Sanity check (check-eagle3) on the Paris prompt now shows:
    EAGLE top-5 pairing A: "." / " is" / "," / " Paris" / ".\n"
    MATCH: EAGLE agrees with target on next token.

speedup_e2e is still 0.95x because γ=1 costs one target decode per token
regardless of acceptance. A real speedup requires γ≥2 with a single
batched target-verify pass covering all γ draft tokens; that is the next step.
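
As a rough guide to what batched verify could buy, the standard
speculative-decoding estimate of tokens emitted per target pass is
1 + α + ... + α^γ. The sketch below assumes every draft position is
accepted independently with the same rate α, which is an idealization:
the 20% above was measured at γ=1 and may not hold at deeper positions.
The function name is hypothetical and draft cost is ignored.

```python
def expected_tokens_per_verify(alpha, gamma):
    # Expected tokens emitted per batched target pass, assuming each of the
    # gamma draft tokens is accepted independently with probability alpha.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    # Geometric sum: the i-th term is the chance the first i drafts all pass.
    return sum(alpha ** i for i in range(gamma + 1))

for gamma in (1, 2, 4):
    print(gamma, expected_tokens_per_verify(0.2, gamma))
```

Under that assumption, raising γ helps little while α stays near 0.2, so
acceptance rate and γ need to improve together.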