- Split Qwen3::forward_decode_paged into decode_prepare (host-side
block allocation + table upload) and decode_core (pure-GPU compute
reading token ids and positions from device buffers via
embedding_device_ids + rope_inplace_device_pos). This makes the
entire Qwen3 decode step CUDA-graph-capturable, mirroring the
gpt_oss.rs architecture.
- Add qwen3_graph.rs: Qwen3DecodeGraph + GraphedQwen3Decoder, a port
of the gpt_oss_graph.rs whole-step capture pattern. Lazy policy: the
first decode runs eagerly (warms pool + cuBLAS), the second captures
the graph, and the rest replay it. Batch>1 always falls back to eager.
- Wire GraphedQwen3Decoder into bench-speculative's draft decode path;
all 4 draft.forward_decode_paged call sites + replay_draft_tokens
now route through the graphed decoder. Per-benchmark caches persist
across prompts for graph reuse.
- Gamma sweep result (10 prompts × 32 tokens, --use-verify-logits):
γ=1 → 0.57×, γ=2 → 0.57×, γ=4 → 0.49×, γ=6 → 0.41×, γ=8 → 0.36×.
All matched=true, verify_decode_mismatches=0.
Acceptance drops sharply as γ grows (66% → 40% → 25%) because
Qwen3-0.6B is too inaccurate a draft for Qwen3-8B. Speedup is still
<1× at every γ tested.
Current ceiling analysis: verify costs ~13ms (same as one target
decode), so a round only breaks even once it emits
(draft_cost + verify_cost) / baseline_decode tokens; speculative
decoding wins only if acceptance × (tokens/round) is well above that
ratio. With this draft model, the crossover requires either (a) a much
smaller verify cost (batch-GEMM path, which trades correctness), or
(b) a fundamentally better drafter (EAGLE-style heads, or n-gram
lookup).
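
The ceiling analysis above can be turned into a rough break-even
calculator. The sketch below is a simplified model, not the benchmark's
code. It assumes each draft token is accepted independently with
probability alpha, that a round costs gamma × draft_ms + verify_ms, and
that every round emits at least the one token the target supplies. The
function names and the draft_ms parameter are hypothetical; only the
~13ms verify / target-decode cost comes from the measurements above.

```rust
/// Expected tokens emitted per speculative round.
/// ASSUMPTION: each of the `gamma` draft tokens is accepted independently
/// with probability `alpha`; the count includes the one token the target
/// model contributes in every round.
fn expected_tokens_per_round(alpha: f64, gamma: u32) -> f64 {
    if (1.0 - alpha).abs() < 1e-12 {
        // Every draft token accepted: gamma drafts plus one target token.
        return gamma as f64 + 1.0;
    }
    // Geometric series 1 + alpha + ... + alpha^gamma.
    (1.0 - alpha.powi(gamma as i32 + 1)) / (1.0 - alpha)
}

/// Modelled speedup over plain target decoding.
/// `draft_ms` is the cost of ONE draft decode step (hypothetical input,
/// to be measured); `verify_ms` and `baseline_ms` are per-call costs.
fn estimated_speedup(
    alpha: f64,
    gamma: u32,
    draft_ms: f64,
    verify_ms: f64,
    baseline_ms: f64,
) -> f64 {
    let round_ms = gamma as f64 * draft_ms + verify_ms;
    expected_tokens_per_round(alpha, gamma) * baseline_ms / round_ms
}
```

Calling estimated_speedup(alpha, gamma, 0.0, 13.0, 13.0) gives the upper
bound set by verify cost alone; plugging in a measured per-step draft
cost shows how far acceptance must rise before the result exceeds 1.0.
The acceptance percentages reported above may be defined differently
from alpha here, so treat the output as a trend rather than a prediction.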
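
The lazy capture policy of GraphedQwen3Decoder (first decode eager,
second captures, rest replay, batch>1 always eager) can be sketched as
a small state machine. This is an illustration only: LazyGraphPolicy,
State, and Action are hypothetical names, not the types in
qwen3_graph.rs, and the sketch assumes that batch>1 steps do not advance
the warm-up state.

```rust
/// What the decoder should do for the current step.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Eager,   // run kernels directly (warm-up or fallback)
    Capture, // record the whole decode step into a CUDA graph
    Replay,  // launch the previously captured graph
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Cold,     // no batch-1 decode has run yet
    Warmed,   // one eager batch-1 decode has run
    Captured, // a graph is available for replay
}

/// Hypothetical policy object; it only decides, it does not touch CUDA.
struct LazyGraphPolicy {
    state: State,
}

impl LazyGraphPolicy {
    fn new() -> Self {
        Self { state: State::Cold }
    }

    fn next_action(&mut self, batch_size: usize) -> Action {
        // Batch > 1 always falls back to eager.
        // ASSUMPTION: such steps leave the warm-up state unchanged.
        if batch_size > 1 {
            return Action::Eager;
        }
        match self.state {
            State::Cold => {
                self.state = State::Warmed;
                Action::Eager
            }
            State::Warmed => {
                self.state = State::Captured;
                Action::Capture
            }
            State::Captured => Action::Replay,
        }
    }
}
```

Keeping the decision separate from the capture call makes the policy
easy to unit-test without a GPU.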