Files
xserv/csrc/attention
Gahow Wang fd392f7fbb attention: tree-aware paged_decode_attention_tree kernel + wrapper
New CUDA kernel paged_decode_attention_tree_bf16_kernel: same as base
paged_decode_attention but with a per-query mask over the newly-written
K/V region. `tree_mask[i][j] != 0` iff query i attends to newly-written
K/V at slot j. Positions before `tree_start` are always attended.

Motivation: speculative decoding with tree drafting needs siblings at
the same target position to attend to their own branch's history, not
each other's K/V.

Rust binding: paged_decode_attention_tree(...) mirrors
paged_decode_attention plus tree_mask_ptr, tree_start, tree_len.

Forward path: Qwen3::forward_verify_paged_decode_attention_tree_with_hidden
takes explicit positions, kv_lens, and a flattened [N*N] tree_mask.

Sanity check: bench-eagle3's γ_multi path now routes through the tree
kernel with a causal mask (mask[i][j]=1 iff j<=i), producing bit-
equivalent output to the non-tree variant. matched=false pattern +
acceptance rate + speedup all identical to previous run within noise
(11.3% acceptance, 1.00× speedup with the mask-check overhead).

--tree CLI flag is parsed but reserved. Real tree drafting (siblings
sharing a target position) is blocked by KV cache position rigidity:
paged_cache stores K/V at cache-position ≡ target-position, so an
accepted sibling at target position P+1 has its K/V physically at
cache position P+2 (its unique slot in the batched write). Continuing
decode at P+1 would see the WRONG K/V (top-1 sibling's, not accepted
top-2 sibling's). Fix requires either KV-slot remap on acceptance or
a virtual position layer.

Infrastructure is in place, next step is tackling that remap.
2026-07-01 20:45:55 +08:00
..