docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker

2026-07-01 20:46:28 +08:00
parent fd392f7fbb
commit 40d8a29e33


@@ -237,3 +237,64 @@ with a specific numerical path.
Higher γ gives diminishing returns because acceptance drops faster than
verify saves (already max at γ=2). To push higher, we'd need a better
draft (tree decoding, a larger EAGLE head, or different EAGLE weights).
---
## Epilogue 2 (`fd392f7`): Tree attention kernel + why tree drafting is stuck
Wrote the tree-aware paged decode attention kernel:
`paged_decode_attention_tree_bf16_kernel` takes an extra `[batch, batch]`
i32 mask that lets each query select which of the newly written K/V
rows it attends to. Positions before `tree_start` are always attended.
The Rust wrapper `paged_decode_attention_tree` and the forward variant
`Qwen3::forward_verify_paged_decode_attention_tree_with_hidden` (takes
explicit positions, kv_lens, tree_mask) have all landed.
Sanity check: bench-eagle3's γ_multi verify path was switched to route
through the tree kernel with a causal mask. The matched=false pattern is
identical, acceptance is ~identical, and speedup is within noise of the
non-tree version. The kernel is correct under a causal mask.
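The mask semantics described above can be sketched on the CPU. This is a minimal reference, not the kernel's code: it assumes the mask is row-major, that a nonzero entry at `[q, k]` means new row `q` attends new row `k`, and that every node attends itself and its ancestors. The helper names (`mask_from_parents`, `attends`, `causal_mask`) and the parent-pointer encoding are hypothetical.

```rust
/// Build a row-major [n, n] i32 mask from parent pointers.
/// mask[q * n + k] == 1 means new row q may attend new row k.
/// ASSUMPTION: parents appear before their children in row order.
fn mask_from_parents(parents: &[Option<usize>]) -> Vec<i32> {
    let n = parents.len();
    let mut mask = vec![0i32; n * n];
    for q in 0..n {
        // Walk from q up to the root, marking q itself and every ancestor.
        let mut node = Some(q);
        while let Some(k) = node {
            mask[q * n + k] = 1;
            node = parents[k];
        }
    }
    mask
}

/// Visibility rule: cache positions before `tree_start` are always attended;
/// positions at or after it are gated by the query's mask row.
fn attends(mask: &[i32], n: usize, tree_start: usize, q: usize, key_pos: usize) -> bool {
    if key_pos < tree_start {
        return true;
    }
    let k = key_pos - tree_start;
    k < n && mask[q * n + k] != 0
}

/// Lower-triangular mask: row q attends rows 0..=q.
fn causal_mask(n: usize) -> Vec<i32> {
    (0..n * n).map(|i| if i % n <= i / n { 1 } else { 0 }).collect()
}
```

Under these assumptions a pure chain (`[None, Some(0), Some(1), Some(2)]`) yields exactly the causal mask, which is why routing the existing verify path through the tree kernel with a causal mask is a meaningful regression check.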
### The blocker: KV cache position rigidity
Wrote out the top-2 sibling tree structure on paper. Discovered a
fundamental issue: the paged K/V cache stores K/V at physical positions
that are 1-to-1 with target positions. If verify writes 4 K/V rows at
cache positions `[P, P+1, P+2, P+3]` corresponding to
`[pending_prev, d0_top1, d0_top2, d1_chain_from_top1]`, then:
- If `d0_top1` is accepted: its K/V is at physical slot P+1, matching
target position P+1. Continuing decode from position P+1 reads the
right K/V. ✓
- If `d0_top2` is accepted: its K/V is at physical slot P+2, but its
semantic target position is P+1. Continuing decode at target position
P+2 would find d0_top2's K/V in physical slot P+2, whereas slot P+1
(which still holds d0_top1's K/V) should hold d0_top2's K/V, and slot
P+2 should hold whatever comes after d0_top2 (unknown). Continuing
decode reads the wrong K/V. ✗
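The two cases above can be checked mechanically. The sketch below encodes the four-row layout from the text as offsets from P and flags accepted nodes whose physical slot differs from their target position. The `Node` type and the `misplaced` helper are hypothetical, and target position is assumed to equal depth in the tree.

```rust
/// One verified tree node, with slot and target position as offsets from P.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Node {
    name: &'static str,
    slot: usize,       // physical cache slot the K/V row was written to
    target_pos: usize, // ASSUMPTION: semantic position = depth in the tree
}

/// The four-row layout [pending_prev, d0_top1, d0_top2, d1_chain_from_top1].
fn top2_layout() -> [Node; 4] {
    [
        Node { name: "pending_prev", slot: 0, target_pos: 0 },
        Node { name: "d0_top1", slot: 1, target_pos: 1 },
        Node { name: "d0_top2", slot: 2, target_pos: 1 },
        Node { name: "d1_chain_from_top1", slot: 3, target_pos: 2 },
    ]
}

/// Accepted nodes whose K/V sits in a slot that does not match their position.
fn misplaced(accepted: &[Node]) -> Vec<Node> {
    accepted
        .iter()
        .copied()
        .filter(|n| n.slot != n.target_pos)
        .collect()
}
```

Accepting `[pending_prev, d0_top1]` returns an empty list, while accepting `[pending_prev, d0_top2]` returns `d0_top2`. If target position really is tree depth, the same rule would also flag `d1_chain_from_top1` (slot P+3, depth 2), a case the notes above do not discuss.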
Fixing this requires one of:
1. **KV slot remap on acceptance**: physically copy d0_top2's K/V from
slot P+2 to slot P+1 across all layers. Costs one full-layer memcpy
per acceptance of a non-top-1 sibling. Doable but adds ~2ms per event.
2. **Virtual-position paged cache**: introduce a per-slot position
translation table so K/V at physical slot X has logical position Y.
Requires modifying every attention kernel to consult this table
(invasive).
3. **Restart top-2 branches from a decode**: if top-2 is accepted, discard
the tree K/V past pending_prev and run a full single-token target
decode with d0_top2 to properly write its K/V at target position P+1.
Costs ~1 full decode per accepted top-2, which likely eats the win.
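Option 1 is the easiest to picture in code. The sketch below uses a toy CPU cache, not the real paged GPU cache: `ToyKvCache`, its flat per-layer layout, and `remap_slot` are all hypothetical, and a real implementation would presumably issue device-side copies per layer and per page instead.

```rust
/// Toy stand-in for the K/V cache: per layer, one K row and one V row per slot.
/// ASSUMPTION: flat CPU buffers; the real cache is paged and lives on the GPU.
struct ToyKvCache {
    row_len: usize,
    k: Vec<Vec<f32>>, // k[layer] holds num_slots * row_len values
    v: Vec<Vec<f32>>, // v[layer] has the same shape
}

impl ToyKvCache {
    fn new(layers: usize, slots: usize, row_len: usize) -> Self {
        ToyKvCache {
            row_len,
            k: vec![vec![0.0; slots * row_len]; layers],
            v: vec![vec![0.0; slots * row_len]; layers],
        }
    }

    /// Copy the K/V row at slot `src` over the row at slot `dst` in every layer.
    fn remap_slot(&mut self, src: usize, dst: usize) {
        let n = self.row_len;
        for layer in 0..self.k.len() {
            self.k[layer].copy_within(src * n..(src + 1) * n, dst * n);
            self.v[layer].copy_within(src * n..(src + 1) * n, dst * n);
        }
    }
}
```

For the top-2 case this would be `remap_slot(2, 1)` relative to P, after which the logical KV length can be truncated to P+2 so the next decode overwrites the stale rows. Whether K rows can be moved verbatim depends on how positions are baked into them; the forward variant takes explicit positions, so this sketch assumes the copied row already encodes target position P+1.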
Given that (1) is the least invasive but still complex, and (3) may not
yield a net speedup, this exceeds the scope of a single session.
**Concluding numbers on xserv main**:
- Best speedup: **1.10×** at γ=2 (cuBLAS-GEMM verify, no tree).
- Tree kernel + wrapper ready and correctness-verified.
- Full tree drafting requires KV remap work (Phase 27+ scope).
Everything lands cleanly on `main`. Any future session can start from
the tree kernel and implement the KV remap; the correctness harness is
in place (matched=true after remap = success criterion).