Two subtle bugs found and fixed in the γ≥2 speculative loop:
1. Wrong position handling: cache.truncate_sequence(round_pos - 1) was
dropping the K/V of pending_prev, then verify OVERWROTE that slot with
the wrong token. Removed the truncate: verify now starts at
cache.seq_len (== position of pending_prev) and writes γ+1 tokens
forward. Also fixed EAGLE draft positions: pending_prev is at position
p, so step 0 uses position=p (not p+1).
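
For illustration, the position bookkeeping after this fix can be sketched as below. This is a minimal sketch, not the actual loop: `verify_positions` and `draft_position` are hypothetical helper names, and the assumption that draft step i uses position p+i for i>0 is inferred from the step-0 rule rather than stated above.

```rust
/// Positions written by one verify pass: pending_prev plus γ drafts.
/// `seq_len` is the target cache length before the round, which is also
/// the position of pending_prev. No truncate happens before verify.
fn verify_positions(seq_len: usize, gamma: usize) -> Vec<usize> {
    // γ+1 consecutive slots, starting at pending_prev's position.
    (seq_len..=seq_len + gamma).collect()
}

/// Position fed to the EAGLE head at draft step `step` (0-based).
/// Step 0 consumes pending_prev, which sits at position `p`, not `p + 1`.
/// ASSUMPTION: later steps simply advance by one position each.
fn draft_position(p: usize, step: usize) -> usize {
    p + step
}
```

With seq_len=10 and γ=2, verify writes positions 10, 11, 12, and the first draft step also uses position 10.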
2. EAGLE KV cache accumulated rejected drafts' K/V: each round writes γ
entries to EAGLE's cache regardless of how many drafts were accepted.
Added Eagle3Head::truncate_to(new_len) and an Eagle3Head::current_len()
getter. After each round, truncate to eagle_len_before + k + 1
(pending_prev + k accepted drafts), where k is the number of drafts
accepted in that round.
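
A rough sketch of the per-round truncation, using a toy cache that tracks only its length. `ToyEagleCache`, `append`, and `finish_round` are hypothetical names for illustration; the real Eagle3Head stores K/V tensors, and clamping with `min` when all γ drafts are accepted is an assumption of this sketch.

```rust
/// Toy stand-in for the EAGLE head's KV cache; only the length is modeled.
struct ToyEagleCache {
    len: usize,
}

impl ToyEagleCache {
    fn current_len(&self) -> usize {
        self.len
    }

    /// Append `n` entries (one draft round writes γ of them).
    fn append(&mut self, n: usize) {
        self.len += n;
    }

    /// Drop everything past `new_len`. ASSUMPTION: never grows the cache.
    fn truncate_to(&mut self, new_len: usize) {
        self.len = self.len.min(new_len);
    }
}

/// One round of bookkeeping: write γ entries, then keep only
/// pending_prev plus the `accepted` drafts. Returns the new length.
fn finish_round(cache: &mut ToyEagleCache, gamma: usize, accepted: usize) -> usize {
    let len_before = cache.current_len();
    cache.append(gamma);
    cache.truncate_to(len_before + accepted + 1);
    cache.current_len()
}
```

Without the truncate, every round would leave γ entries behind regardless of how many drafts were accepted, which is the accumulation described in item 2.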
Additionally: return the PRE-norm hidden state as aux (matching vllm's
llama_eagle3.py default `norm_output=False`). Previously the normed
version was returned.
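
To make the pre-norm vs normed distinction concrete, here is a small sketch on plain vectors. It assumes the final norm is an RMSNorm (as in Llama-style models); `rms_norm` and `final_layer_outputs` are hypothetical names, not the functions touched by this change.

```rust
/// RMSNorm over one hidden vector. ASSUMPTION: the final norm is RMSNorm;
/// `weight` and `eps` are illustrative parameters.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

/// Returns (normed, aux). `normed` feeds the LM head; `aux` is the
/// PRE-norm hidden state handed to the EAGLE head.
fn final_layer_outputs(hidden: &[f32], weight: &[f32], eps: f32) -> (Vec<f32>, Vec<f32>) {
    let normed = rms_norm(hidden, weight, eps);
    // The earlier behavior corresponds to returning `normed.clone()` as aux.
    (normed, hidden.to_vec())
}
```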
Result: matched=true across the full γ sweep. speedup_e2e remains <1:
  γ=1 (single-decode verify): accept=22.7%, speedup=0.95x
  γ=1 (batched verify):       accept=20.6%, speedup=0.75x
  γ=2:                        accept=12.6%, speedup=0.59x
  γ=4:                        accept=7.6%,  speedup=0.41x
  γ=8:                        accept=4.1%,  speedup=0.27x
Per-slot diagnostic shows d[0]≈15%, d[1]≈8%, d[2..γ-1] varies. d[0] is
lower than γ=1's 20% because batched verify introduces small numerical
differences vs single-token decode.
Larger γ hurts because:
- verify_cost scales roughly linearly with γ+1 (a batched matmul at
  batch=γ+1 costs ~(γ+1)× a single decode).
- accepted tokens per round grow sub-linearly (recursive EAGLE degrades).
- speedup ≈ (1 + accepted_avg) / verify_cost → below 1 across the sweep.
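
The cost model in the last bullet can be written out as a tiny helper. This is only a sketch of that approximation: `estimated_speedup` and `break_even_accepted` are hypothetical names, verify_cost is assumed to be expressed in units of one single-token decode, and the values in the usage note are made up, not measurements from the sweep.

```rust
/// speedup ≈ (1 + accepted_avg) / verify_cost.
/// `accepted_avg`: average accepted drafts per round.
/// `verify_cost`: cost of one verify pass, in units of a single decode.
fn estimated_speedup(accepted_avg: f64, verify_cost: f64) -> f64 {
    (1.0 + accepted_avg) / verify_cost
}

/// Average accepted drafts per round needed for speedup == 1
/// under the same model.
fn break_even_accepted(verify_cost: f64) -> f64 {
    verify_cost - 1.0
}
```

For example, a hypothetical verify_cost of 3.0 needs 2.0 accepted drafts per round on average just to break even, which is why the speedup drops as γ grows while acceptance stays low.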
Getting speedup > 1 requires EITHER: (a) faster batched verify (closer
to single-decode cost per query row via better GPU utilization), OR
(b) better draft accuracy (tree-based drafting to explore multiple
candidates per position, a larger EAGLE head, or a differently-trained
EAGLE variant).