eagle3: fix EAGLE_HOOK_LAYERS to [2, 18, 33] for Qwen3-8B

The initial equally spaced guess [11, 23, 35] was wrong: EAGLE3 heads
are trained against specific target layer indices, and hooking
different ones at inference produces wrong outputs. The correct values
come from the vLLM speculators training config for Qwen3-8B:

  https://github.com/vllm-project/speculators/blob/main/examples/train/dflash_qwen3_8b_sharegpt_online_5k.sh

which pins target_layer_ids to "2 18 33". Re-running check-eagle3 with
the fix produces a coherent top-5 for "The capital of France is":

  Old ([11,23,35]): "," / " Paris" / " Madrid" / "." / " Berlin"
  New ([2,18,33]):  " Paris" / " Tokyo" / " Madrid" / "," / "."
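
For illustration only, a minimal sketch of how hooked layers select the
auxiliary hidden states. This is not the xserv implementation: `Layer`,
`validate_hook_layers` and `forward_with_hooks` are hypothetical names,
layers are modeled as plain functions over a Vec<f32>, and the indices
are assumed to be 0-based with the state captured after the layer runs.

```rust
/// Hypothetical stand-in for a decoder layer: maps a hidden state to a new one.
type Layer = fn(&[f32]) -> Vec<f32>;

/// Checks that hook indices are strictly increasing and within the model depth.
fn validate_hook_layers(hooks: &[usize], num_layers: usize) -> Result<(), String> {
    for pair in hooks.windows(2) {
        if pair[0] >= pair[1] {
            return Err(format!("hook layers must be strictly increasing: {:?}", hooks));
        }
    }
    match hooks.iter().find(|&&l| l >= num_layers) {
        Some(l) => Err(format!("hook layer {} out of range for {} layers", l, num_layers)),
        None => Ok(()),
    }
}

/// Runs the layers in order and copies the hidden state after each hooked layer.
/// Assumption: index i means "output of the i-th layer, counting from 0".
fn forward_with_hooks(
    layers: &[Layer],
    input: &[f32],
    hooks: &[usize],
) -> (Vec<f32>, Vec<Vec<f32>>) {
    let mut h = input.to_vec();
    let mut aux = Vec::with_capacity(hooks.len());
    for (i, layer) in layers.iter().enumerate() {
        h = layer(&h);
        if hooks.contains(&i) {
            aux.push(h.clone());
        }
    }
    (h, aux)
}
```

Note that a range check like this accepts both [11, 23, 35] and
[2, 18, 33] for a 36-layer model, so it cannot catch a mismatch with
the training config. Only comparing against the trained values can.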

Top-1 still differs from the target's next token, but that's because
EAGLE predicts next from the pair (state_that_produced_prev,
prev_token), and the exact pairing convention may need one more offset
check once this is integrated into the full speculative loop.
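
To make the offset question concrete, here is a small sketch of the
pairing under one assumed convention. `DraftExample`, `build_pairs` and
`state_offset` are hypothetical names, not part of xserv. It assumes
the state at position i is the one computed with tokens[i] as input,
so it is the state that produced tokens[i + 1]. Whether offset 1 is
what the trained head expects is exactly what still needs checking.

```rust
/// One (state, prev_token) -> next example under the assumed convention.
#[derive(Debug, PartialEq)]
struct DraftExample {
    /// Position whose target hidden state is fed to the draft head.
    state_pos: usize,
    /// Token paired with that state.
    prev_token: u32,
    /// Token the draft head is expected to predict.
    next_token: u32,
}

/// Builds draft examples from a token sequence.
/// `state_offset = 1` pairs the state at position i with tokens[i + 1],
/// i.e. with the token that state produced; `0` pairs it with tokens[i].
fn build_pairs(tokens: &[u32], state_offset: usize) -> Vec<DraftExample> {
    // Both tokens[i + state_offset] and tokens[i + state_offset + 1] must exist.
    let count = tokens.len().saturating_sub(state_offset + 1);
    (0..count)
        .map(|i| DraftExample {
            state_pos: i,
            prev_token: tokens[i + state_offset],
            next_token: tokens[i + state_offset + 1],
        })
        .collect()
}
```

Running both offsets against the target's own next tokens is one cheap
way to see which convention a given head was trained with.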
2026-07-01 17:29:00 +08:00
parent e04a8ffb18
commit 8f11d6e5cd

@@ -15,7 +15,11 @@ use std::path::Path;
 use xserv_kernels::*;
 use xserv_tensor::{DType, Device, Tensor};
-pub const EAGLE_HOOK_LAYERS: [usize; 3] = [11, 23, 35];
+/// Target layers to hook for EAGLE3 auxiliary hidden states, for Qwen3-8B
+/// (36 layers). Value comes from AngelSlim/vLLM speculators training config
+/// `dflash_qwen3_8b_sharegpt_online_5k.sh` which specifies target_layer_ids
+/// = "2 18 33". Must match training-time selection or EAGLE outputs are wrong.
+pub const EAGLE_HOOK_LAYERS: [usize; 3] = [2, 18, 33];
 const DRAFT_VOCAB_SIZE: usize = 32000;
 fn matmul_2d(a: &Tensor, b: &Tensor) -> Tensor {