xserv

Files

Gahow Wang a77239c0c8 speculative: Qwen3 decode graph + gamma sweep (Phase 24 step 2)

- Split Qwen3::forward_decode_paged into decode_prepare (host-side
  block allocation + table upload) and decode_core (pure-GPU compute
  reading token ids and positions from device buffers via
  embedding_device_ids + rope_inplace_device_pos). This makes the
  entire Qwen3 decode step CUDA-graph-capturable, mirroring the
  gpt_oss.rs architecture.
- Add qwen3_graph.rs: Qwen3DecodeGraph + GraphedQwen3Decoder, a port
  of the gpt_oss_graph.rs whole-step capture pattern. Lazy policy:
  first decode eager (warms pool + cuBLAS), second captures, rest
  replay. Batch>1 always falls back to eager.
- Wire GraphedQwen3Decoder into bench-speculative's draft decode path;
  all 4 draft.forward_decode_paged call sites + replay_draft_tokens
  now route through the graphed decoder. Per-benchmark caches persist
  across prompts for graph reuse.
- Gamma sweep result (10 prompts × 32 tokens, --use-verify-logits):
  γ=1 → 0.57×, γ=2 → 0.57×, γ=4 → 0.49×, γ=6 → 0.41×, γ=8 → 0.36×.
  All matched=true, verify_decode_mismatches=0.
  Acceptance drops sharply with γ (66% → 40% → 25%) because Qwen3-0.6B
  is too inaccurate a draft for Qwen3-8B. Speedup still <1.

Current ceiling analysis: verify costs ~13ms (same as one target decode)
so speculative decoding only wins if acceptance × (tokens/round) >>
(draft_cost + verify_cost) / baseline_decode. With this draft model,
the crossover requires either (a) a much smaller verify cost (batch-GEMM
path, which trades correctness), or (b) a fundamentally better drafter
(EAGLE-style heads, or n-gram lookup).

2026-07-01 16:32:17 +08:00

src

speculative: Qwen3 decode graph + gamma sweep (Phase 24 step 2)

2026-07-01 16:32:17 +08:00

Cargo.toml

fix(xserv-chat): UTF-8/CJK-aware line input

2026-05-29 11:36:54 +08:00