xserv

Files

Gahow Wang d2c55c47b2 eagle3: γ≥2 correctness fixes + per-slot diagnostic

Two subtle bugs found and fixed in the γ≥2 speculative loop:

1. Wrong position handling: cache.truncate_sequence(round_pos - 1) was
   dropping the K/V of pending_prev, then verify OVERWROTE that slot with
   the wrong token. Removed the truncate: verify now starts at
   cache.seq_len (== position of pending_prev) and writes γ+1 tokens
   forward. Also fixed EAGLE draft positions: pending_prev is at position
   p, so step 0 uses position=p (not p+1).

2. EAGLE KV cache accumulated rejected drafts' K/V: each round writes γ
   entries to EAGLE's cache regardless of how many drafts were accepted.
   Added eagle.truncate_to(new_len) API. After each round, truncate to
   eagle_len_before + k + 1 (pending_prev + k accepted drafts).

Also expose Eagle3Head::current_len() getter and Eagle3Head::truncate_to().
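
The rollback in fix 2 can be sketched with a toy cache. This is a minimal
illustration, not the Eagle3Head code from this commit: `ToyDraftCache` and
`draft_round` are hypothetical names, a `Vec<u32>` of token ids stands in for
the K/V rows, and it assumes the γ entries written per round are the inputs
to the γ draft steps (pending_prev, then every draft except the last).

```rust
/// Toy stand-in for a draft-model KV cache: one entry per cached position.
/// Hypothetical type for illustration only.
struct ToyDraftCache {
    entries: Vec<u32>, // token id per position, standing in for K/V rows
}

impl ToyDraftCache {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Number of positions currently cached.
    fn current_len(&self) -> usize {
        self.entries.len()
    }

    fn push(&mut self, token: u32) {
        self.entries.push(token);
    }

    /// Drop every entry at position >= new_len.
    /// Vec::truncate is a no-op when new_len >= current length.
    fn truncate_to(&mut self, new_len: usize) {
        self.entries.truncate(new_len);
    }
}

/// One speculative round with gamma = drafts.len() (assumed >= 1).
/// `accepted` is k, the number of drafts the verifier accepted.
fn draft_round(cache: &mut ToyDraftCache, pending_prev: u32, drafts: &[u32], accepted: usize) {
    let len_before = cache.current_len();

    // Step 0 consumes pending_prev; step i consumes drafts[i - 1].
    // That is gamma writes in total, whatever the verifier later decides.
    cache.push(pending_prev);
    for &d in drafts.iter().take(drafts.len().saturating_sub(1)) {
        cache.push(d);
    }

    // Keep pending_prev plus the k accepted drafts; discard the rest.
    cache.truncate_to(len_before + accepted + 1);
}
```

With 3 cached entries, γ=4 and k=1, the round writes 4 entries and the
truncate keeps 3 + 1 + 1 = 5, so the K/V of the rejected drafts never leaks
into the next round.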

Additionally: return the PRE-norm hidden state as aux (matching vllm's
llama_eagle3.py default `norm_output=False`). Previously the normed
version was returned.

Result: matched=true across the full γ sweep. speedup_e2e remains <1:

  γ=1 (single-decode verify): accept=22.7%, speedup=0.95x
  γ=1 (batched verify):       accept=20.6%, speedup=0.75x
  γ=2:                         accept=12.6%, speedup=0.59x
  γ=4:                         accept=7.6%,  speedup=0.41x
  γ=8:                         accept=4.1%,  speedup=0.27x

Per-slot diagnostic shows d[0]≈15%, d[1]≈8%, and varying rates for
d[2..γ-1]. d[0] is lower than the ≈20% acceptance at γ=1 because batched
verify introduces small numerical differences vs single-token decode.
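
One way such per-slot rates could be tallied is sketched below. It assumes
prefix acceptance (slot i counts as a hit only when more than i drafts were
accepted in that round) and that d[i] is hits divided by rounds. The commit
does not spell out the exact definition, so both points are assumptions, and
`per_slot_rates` is a hypothetical helper.

```rust
/// Per-slot acceptance rates from the accepted-draft count of each round.
/// Assumption: slot i is a hit in a round when that round accepted more
/// than i drafts (prefix acceptance), and d[i] = hits[i] / rounds.
fn per_slot_rates(accepted_per_round: &[usize], gamma: usize) -> Vec<f64> {
    let mut hits = vec![0usize; gamma];
    for &k in accepted_per_round {
        // A round that accepted k drafts hits slots 0..k (capped at gamma).
        for slot in hits.iter_mut().take(k.min(gamma)) {
            *slot += 1;
        }
    }
    // Avoid dividing by zero when no rounds were recorded.
    let rounds = accepted_per_round.len().max(1) as f64;
    hits.into_iter().map(|h| h as f64 / rounds).collect()
}
```

Under this definition the rates can only decrease with the slot index, which
is the shape seen for d[0] and d[1] above.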

Larger γ hurts because:
- verify_cost scales roughly linearly with γ+1 (a batched matmul at
  batch=γ+1 costs ~(γ+1)× a single decode).
- accepted tokens per round grow sub-linearly (recursive EAGLE drafting
  degrades with depth).
- speedup ≈ (1 + accepted_avg) / verify_cost, with verify_cost in units
  of one single decode → below 1 across the sweep.
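
The estimate above can be written out as a small sketch. Both function names
are hypothetical, verify_cost is taken in units of one single-token decode,
and draft cost is ignored exactly as in the formula, so this is a rough model
rather than a measurement.

```rust
/// Rough speedup model: each round emits 1 + accepted_avg tokens and costs
/// verify_cost single decodes. Draft cost is ignored, as in the formula.
fn estimated_speedup(accepted_avg: f64, verify_cost: f64) -> f64 {
    (1.0 + accepted_avg) / verify_cost
}

/// Accepted drafts per round needed for speedup == 1 under the same model:
/// (1 + a) / c = 1  =>  a = c - 1.
fn breakeven_accepted(verify_cost: f64) -> f64 {
    verify_cost - 1.0
}
```

For illustration only: if a verify pass cost 3 single decodes, a round would
need more than 2 accepted drafts on average to break even, and 0.5 accepted
drafts would give an estimated 0.5x.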

Getting speedup > 1 requires either (a) faster batched verify (closer
to single-decode cost per query row via better GPU utilization) or
(b) better draft accuracy (tree-based drafting to explore multiple
candidates per position, a larger EAGLE head, or a differently-trained
EAGLE variant).

2026-07-01 19:16:31 +08:00

xserv-cuda

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-distributed

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-kernels

speculative: batched-GEMV kernel for verify path (Phase 24 step 1)

2026-07-01 16:13:37 +08:00

xserv-model

eagle3: γ≥2 correctness fixes + per-slot diagnostic