Gahow Wang a24621fa6a eagle3: proper residual chain + stateful KV cache
Two fixes to bring the EAGLE3 forward pass in line with vLLM's
llama_eagle3.py reference:

1. Residual chain: previously the residual added into post_attention_layernorm
   was the token embedding (wrong). Reference uses _norm_after_residual:
     residual = fused_h (pre-norm)
     hidden_states = hidden_norm(fused_h)
   Then post_attention_layernorm is a fused add_rmsnorm(attn_out, residual),
   and the final norm is another add_rmsnorm(mlp_out, residual_after_attn).
   Neither residual carries the embedding — both carry fused_h forward.
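
   The corrected residual flow can be sketched roughly as follows. This is a
   minimal illustration on plain f32 vectors, not the code in this crate:
   rms_norm, add_rms_norm, and draft_layer are hypothetical names, the
   attention and MLP blocks are passed in as closures, and the embedding
   input to attention is omitted. The point is only which tensor each
   residual carries.

```rust
/// RMSNorm over one hidden vector: x / sqrt(mean(x^2) + eps) * weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

/// Fused add + RMSNorm. Returns (rms_norm(x + residual), x + residual),
/// so the un-normalized sum can be carried forward as the next residual.
fn add_rms_norm(
    x: &[f32],
    residual: &[f32],
    weight: &[f32],
    eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    let sum: Vec<f32> = x.iter().zip(residual).map(|(a, b)| a + b).collect();
    let normed = rms_norm(&sum, weight, eps);
    (normed, sum)
}

/// One draft layer with the residual chain described above (hypothetical
/// signature). `attn` and `mlp` stand in for the real blocks.
fn draft_layer(
    fused_h: &[f32],
    hidden_norm_w: &[f32],
    post_attn_norm_w: &[f32],
    final_norm_w: &[f32],
    eps: f32,
    attn: impl Fn(&[f32]) -> Vec<f32>,
    mlp: impl Fn(&[f32]) -> Vec<f32>,
) -> Vec<f32> {
    // The residual is the pre-norm fused hidden state, not the token embedding.
    let residual = fused_h.to_vec();
    let hidden = rms_norm(fused_h, hidden_norm_w, eps);

    let attn_out = attn(&hidden);
    // post_attention_layernorm: add the residual, normalize, keep the sum.
    let (hidden, residual_after_attn) =
        add_rms_norm(&attn_out, &residual, post_attn_norm_w, eps);

    let mlp_out = mlp(&hidden);
    // Final norm: add the post-attention residual, then normalize.
    let (out, _) = add_rms_norm(&mlp_out, &residual_after_attn, final_norm_w, eps);
    out
}
```

   With attention and MLP stubbed out to return zeros and all norm weights
   set to 1, the output reduces to the RMS-normalized fused_h, which is a
   quick way to confirm that only fused_h flows through the residual path.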

2. KV cache: previously the attention was approximated as "output = V"
   because seq_len=1 (no cache), effectively giving EAGLE no history.
   Add a real per-Eagle3Head KV cache (1 layer × [1, num_kv_heads,
   max_seq_len, head_dim] BF16) that grows as we call step(). Use the
   existing decode_attention kernel with a fresh contiguous slice of the
   cache each step. reset() clears current_len for a new sequence.
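
   A minimal sketch of such a single-layer cache, under stated assumptions:
   the KvCache type and its methods are hypothetical, f32 stands in for
   BF16, the batch dimension of 1 is dropped, and the decode_attention call
   itself is left out. With a [num_kv_heads, max_seq_len, head_dim] layout,
   the valid prefix for each head is already contiguous, so a step only
   needs to append one row per head and hand out a slice of length
   current_len * head_dim.

```rust
/// Single-layer KV cache with layout [num_kv_heads, max_seq_len, head_dim].
struct KvCache {
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
    current_len: usize,
    k: Vec<f32>,
    v: Vec<f32>,
}

impl KvCache {
    fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let n = num_kv_heads * max_seq_len * head_dim;
        KvCache {
            num_kv_heads,
            max_seq_len,
            head_dim,
            current_len: 0,
            k: vec![0.0; n],
            v: vec![0.0; n],
        }
    }

    /// Append one position. `k_new` and `v_new` are [num_kv_heads, head_dim].
    fn append(&mut self, k_new: &[f32], v_new: &[f32]) -> Result<(), String> {
        let step = self.num_kv_heads * self.head_dim;
        if k_new.len() != step || v_new.len() != step {
            return Err(format!("expected {} values per step", step));
        }
        if self.current_len == self.max_seq_len {
            return Err("KV cache is full".to_string());
        }
        for h in 0..self.num_kv_heads {
            // Row `current_len` of head `h` in the cache.
            let dst = (h * self.max_seq_len + self.current_len) * self.head_dim;
            let src = h * self.head_dim;
            self.k[dst..dst + self.head_dim]
                .copy_from_slice(&k_new[src..src + self.head_dim]);
            self.v[dst..dst + self.head_dim]
                .copy_from_slice(&v_new[src..src + self.head_dim]);
        }
        self.current_len += 1;
        Ok(())
    }

    /// Contiguous keys for one head, shape [current_len, head_dim].
    fn keys(&self, head: usize) -> &[f32] {
        let start = head * self.max_seq_len * self.head_dim;
        &self.k[start..start + self.current_len * self.head_dim]
    }

    /// Contiguous values for one head, shape [current_len, head_dim].
    fn values(&self, head: usize) -> &[f32] {
        let start = head * self.max_seq_len * self.head_dim;
        &self.v[start..start + self.current_len * self.head_dim]
    }

    /// Start a new sequence. Old rows are simply overwritten later.
    fn reset(&mut self) {
        self.current_len = 0;
    }
}
```

   Resetting only current_len keeps reset() cheap: stale rows are never
   read because every slice is bounded by current_len.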

Result on 10 prompts × 32 tokens (γ=1, no batched verify yet):
  matched=true across all prompts
  acceptance_rate = 20.0% (was 4.7% before residual fix, 1.3% originally)
    - Prompt 00 "The capital of France is": 60% (18/30) — best case
    - Other prompts: 10-25% — matches EAGLE paper's observation that
      structured/factual prompts get higher acceptance

Sanity check (check-eagle3) on Paris prompt now shows:
  EAGLE top-5 pairing A: "." / " is" / "," / " Paris" / ".\n"
  MATCH: EAGLE agrees with target on next token.

speedup_e2e is still 0.95x because γ=1 costs one target decode per token
regardless of acceptance, so the draft pass is pure overhead. Real speedup
requires γ≥2 with a single batched target-verify covering all γ draft
tokens; that's the next step.
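
As a rough illustration of why batching matters, the sketch below estimates
tokens emitted per target pass once all γ drafts are verified in one batch.
It assumes each draft token is accepted independently with probability
alpha, which is a simplification and not something measured in this commit;
expected_tokens_per_verify is a hypothetical helper. The current unbatched
path corresponds to 1.0 tokens per target decode.

```rust
/// Expected tokens emitted per batched target-verify pass, ASSUMING each of
/// the `gamma` draft tokens is accepted independently with probability
/// `alpha`. Drafts are accepted as a prefix, and the target always supplies
/// one token itself, giving sum_{k=0..=gamma} alpha^k.
fn expected_tokens_per_verify(alpha: f64, gamma: u32) -> f64 {
    (0..=gamma).map(|k| alpha.powi(k as i32)).sum()
}
```

Under that assumption, alpha = 0.2 and γ = 2 give 1 + 0.2 + 0.04 = 1.24
tokens per target pass, so end-to-end gains at this acceptance rate would
depend on the draft step being cheap relative to a target decode.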
2026-07-01 17:50:49 +08:00