eagle3: fix EAGLE_HOOK_LAYERS to [2, 18, 33] for Qwen3-8B

The initial [11, 23, 35] (equally spaced) guess was wrong: EAGLE3 heads are trained against specific target layer indices, and using different ones at inference gives wrong outputs.

Correct values come from vLLM speculators' training config for Qwen3-8B:

  https://github.com/vllm-project/speculators/blob/main/examples/train/dflash_qwen3_8b_sharegpt_online_5k.sh

which pins target_layer_ids to "2 18 33".

Re-running check-eagle3 with the fix produces coherent top-5 for "The capital of France is":

  Old ([11,23,35]): "," / " Paris" / " Madrid" / "." / " Berlin"
  New ([2,18,33]):  " Paris" / " Tokyo" / " Madrid" / "," / "."

Top-1 still differs from target's next token, but that's because EAGLE compares (state_that_produced_prev, prev_token) → next, and the exact pairing convention may need one more offset check when integrated into the full speculative loop.
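
Since the failure mode described above is a silent mismatch between the hook layers and the training-time selection, a startup check along the following lines might catch it early. This is only a sketch and not part of this commit: `parse_target_layer_ids`, `check_hook_layers`, and `NUM_TARGET_LAYERS` are hypothetical names, and it assumes the `target_layer_ids` value is available as a whitespace-separated string (as in "2 18 33") and that the indices are 0-based.

```rust
use std::num::ParseIntError;

/// Hook layers as set by this commit.
pub const EAGLE_HOOK_LAYERS: [usize; 3] = [2, 18, 33];
/// Qwen3-8B depth, per the doc comment in the diff (hypothetical constant name).
pub const NUM_TARGET_LAYERS: usize = 36;

/// Parse a whitespace-separated `target_layer_ids` value such as "2 18 33".
/// (Hypothetical helper; the string format is an assumption.)
pub fn parse_target_layer_ids(value: &str) -> Result<Vec<usize>, ParseIntError> {
    value.split_whitespace().map(|t| t.parse::<usize>()).collect()
}

/// Return an error if the hook layers are out of range for the target model
/// or differ from the layers the EAGLE3 head was trained against.
pub fn check_hook_layers(
    hooks: &[usize],
    training_value: &str,
    num_layers: usize,
) -> Result<(), String> {
    let trained = parse_target_layer_ids(training_value).map_err(|e| e.to_string())?;
    // Assumes 0-based layer indices, so valid values are 0..num_layers.
    if let Some(bad) = hooks.iter().find(|&&l| l >= num_layers) {
        return Err(format!("hook layer {bad} is out of range for {num_layers} layers"));
    }
    if hooks != trained.as_slice() {
        return Err(format!(
            "hook layers {hooks:?} differ from training-time {trained:?}"
        ));
    }
    Ok(())
}
```

With this sketch, `check_hook_layers(&EAGLE_HOOK_LAYERS, "2 18 33", NUM_TARGET_LAYERS)` returns `Ok(())`, while the old `[11, 23, 35]` guess would be rejected. A range check alone would not have caught the original bug, because `[11, 23, 35]` is a valid selection for a 36-layer model.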
2026-07-01 17:29:00 +08:00
parent e04a8ffb18
commit 8f11d6e5cd
1 changed files with 5 additions and 1 deletions
--- a/crates/xserv-model/src/eagle3.rs
+++ b/crates/xserv-model/src/eagle3.rs
@@ -15,7 +15,11 @@ use std::path::Path;
 use xserv_kernels::*;
 use xserv_tensor::{DType, Device, Tensor};

-pub const EAGLE_HOOK_LAYERS: [usize; 3] = [11, 23, 35];
+/// Target layers to hook for EAGLE3 auxiliary hidden states, for Qwen3-8B
+/// (36 layers). Value comes from AngelSlim/vLLM speculators training config
+/// `dflash_qwen3_8b_sharegpt_online_5k.sh` which specifies target_layer_ids
+/// = "2 18 33". Must match training-time selection or EAGLE outputs are wrong.
+pub const EAGLE_HOOK_LAYERS: [usize; 3] = [2, 18, 33];
 const DRAFT_VOCAB_SIZE: usize = 32000;

 fn matmul_2d(a: &Tensor, b: &Tensor) -> Tensor {