docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker

2026-07-01 20:46:28 +08:00
parent fd392f7fbb
commit 40d8a29e33
1 changed files with 61 additions and 0 deletions
--- a/docs/26-eagle3-bug-hunt.md
+++ b/docs/26-eagle3-bug-hunt.md
@@ -237,3 +237,64 @@ with a specific numerical path.
 Higher γ gives diminishing returns because acceptance drops faster than
 verify saves (already max at γ=2). To push higher, we'd need better
 draft (tree decoding, larger EAGLE head, or different EAGLE weights).
+
+---
+
+## Epilogue 2 (`fd392f7`): Tree attention kernel + why tree drafting is stuck
+
+Wrote the tree-aware paged decode attention kernel.
+`paged_decode_attention_tree_bf16_kernel` takes an extra `[batch, batch]`
+i32 mask that lets each query select which of the newly written K/V
+rows it attends to. Positions before `tree_start` are always attended.
+
+The Rust wrapper `paged_decode_attention_tree` and the forward variant
+`Qwen3::forward_verify_paged_decode_attention_tree_with_hidden` (which
+takes explicit positions, kv_lens, and tree_mask) have both landed.
+
+Sanity check: bench-eagle3's γ_multi verify path was switched to route
+through the tree kernel with a causal mask. The matched=false pattern is
+identical, acceptance is ~identical, and speedup is within noise of the
+non-tree version. The kernel is correct for the causal case.
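+
+A minimal sketch of how such a mask could be built on the host. It
+assumes `mask[q * n + k] == 1` means query row `q` may attend to newly
+written row `k`, and that each row attends to itself and its tree
+ancestors. The `parents` encoding and the `build_tree_mask` helper are
+hypothetical, not part of the landed wrapper. With a pure chain the
+result is the lower-triangular causal mask used in the sanity check.
+
+```rust
+/// Hypothetical helper: build a row-major [n, n] i32 mask from parent links.
+/// parents[i] is the index of node i's parent among the new rows, or None
+/// for the root. Assumes parents come before children (parents[i] < i).
+fn build_tree_mask(parents: &[Option<usize>]) -> Vec<i32> {
+    let n = parents.len();
+    let mut mask = vec![0i32; n * n];
+    for q in 0..n {
+        // Walk from q up to the root, marking q itself and every ancestor.
+        let mut cur = Some(q);
+        while let Some(k) = cur {
+            mask[q * n + k] = 1;
+            cur = parents[k];
+        }
+    }
+    mask
+}
+
+fn main() {
+    // Chain: each row's parent is the previous row -> causal mask.
+    let chain = build_tree_mask(&[None, Some(0), Some(1), Some(2)]);
+    // Top-2 sibling tree: [pending_prev, d0_top1, d0_top2, d1_chain_from_top1].
+    let tree = build_tree_mask(&[None, Some(0), Some(0), Some(1)]);
+    for (c, t) in chain.chunks(4).zip(tree.chunks(4)) {
+        println!("chain {:?}  tree {:?}", c, t);
+    }
+}
+```
+
+In the tree case the row for `d0_top2` does not attend to `d0_top1`,
+which is the only difference from the causal mask for the two siblings.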
+
+### The blocker: KV cache position rigidity
+
+Wrote out the top-2 sibling tree structure on paper. Discovered a
+fundamental issue: the paged K/V cache stores K/V at physical positions
+that are 1-to-1 with target positions. If verify writes 4 K/V rows at
+cache positions `[P, P+1, P+2, P+3]` corresponding to
+`[pending_prev, d0_top1, d0_top2, d1_chain_from_top1]`, then:
+
+- If `d0_top1` is accepted: its K/V is at physical slot P+1, matching
+  target position P+1. Continuing decode from position P+1 reads the
+  right K/V. ✓
+- If `d0_top2` is accepted: its K/V is at physical slot P+2, but its
+  semantic target position is P+1. Continuing decode from target
+  position P+2 would look at physical slot P+2 and find d0_top2's K/V
+  there. Semantically, position P+1 should hold d0_top2's K/V, and
+  position P+2 should hold whatever comes after d0_top2 (not yet
+  known). Slot P+1 instead holds the rejected d0_top1's K/V, so
+  continuing decode reads the wrong K/V. ✗
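+
+The mismatch can be checked mechanically. The sketch below assumes a
+node's logical target position is P plus its depth in the draft tree,
+while its physical slot is P plus its write order. The `Node` struct
+and `misplaced` helper are illustrative only. Under that assumption
+`d1_chain_from_top1` (slot P+3, depth 2) is flagged as well.
+
+```rust
+/// One row written by verify. Offsets are relative to P.
+/// This struct and its field names are illustrative only.
+struct Node {
+    name: &'static str,
+    slot_offset: usize, // physical slot = P + slot_offset (write order)
+    depth: usize,       // assumed logical target position = P + depth
+}
+
+/// Names of nodes whose physical slot differs from their logical position.
+fn misplaced(nodes: &[Node]) -> Vec<&'static str> {
+    nodes
+        .iter()
+        .filter(|n| n.slot_offset != n.depth)
+        .map(|n| n.name)
+        .collect()
+}
+
+fn main() {
+    let nodes = [
+        Node { name: "pending_prev", slot_offset: 0, depth: 0 },
+        Node { name: "d0_top1", slot_offset: 1, depth: 1 },
+        Node { name: "d0_top2", slot_offset: 2, depth: 1 },
+        Node { name: "d1_chain_from_top1", slot_offset: 3, depth: 2 },
+    ];
+    for name in misplaced(&nodes) {
+        println!("{name}: physical slot != logical position");
+    }
+}
+```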
+
+Fixing this requires one of:
+1. **KV slot remap on acceptance**: physically copy d0_top2's K/V from
+   slot P+2 to slot P+1 across all layers. Costs one full-layer memcpy
+   per acceptance of a non-top-1 sibling. Doable but adds ~2ms per event.
+2. **Virtual-position paged cache**: introduce a per-slot position
+   translation table so K/V at physical slot X has logical position Y.
+   Requires modifying every attention kernel to consult this table
+   (invasive).
+3. **Re-run accepted top-2 branches through a decode**: if top-2 is
+   accepted, discard the tree K/V past pending_prev and run a full
+   single-token target decode with d0_top2 so its K/V is written at
+   target position P+1. Costs ~1 full decode per accepted top-2, which
+   likely eats the win.
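+
+Option (1) can be sketched on a toy host-side cache. This is not the
+xserv cache layout: `LayerCache`, `copy_slot`, and `remap_on_accept`
+are hypothetical names, the real cache is paged and lives on the GPU,
+and a real remap would handle K and V and use device copies.
+
+```rust
+/// Toy stand-in for one layer's K (or V) cache: rows of `row_len`
+/// values stored contiguously, one row per slot.
+struct LayerCache {
+    row_len: usize,
+    data: Vec<f32>,
+}
+
+impl LayerCache {
+    /// Copy the row at slot `src` over the row at slot `dst`.
+    fn copy_slot(&mut self, src: usize, dst: usize) {
+        let start = src * self.row_len;
+        self.data
+            .copy_within(start..start + self.row_len, dst * self.row_len);
+    }
+}
+
+/// Hypothetical remap: move an accepted sibling's row to its logical
+/// slot in every layer.
+fn remap_on_accept(layers: &mut [LayerCache], src: usize, dst: usize) {
+    for layer in layers.iter_mut() {
+        layer.copy_slot(src, dst);
+    }
+}
+
+fn main() {
+    // Two layers, 4 slots (P..P+3 with P = 0), 2 values per row.
+    let mut layers: Vec<LayerCache> = (0..2)
+        .map(|l| LayerCache {
+            row_len: 2,
+            data: (0..8).map(|i| (l * 100 + i) as f32).collect(),
+        })
+        .collect();
+    // d0_top2 accepted: copy slot P+2 into slot P+1.
+    remap_on_accept(&mut layers, 2, 1);
+    println!("{:?}", layers[0].data);
+}
+```
+
+After the copy, slot P+2 still holds a stale duplicate of d0_top2's
+row; the next decode at target position P+2 would overwrite it.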
+
+(1) is the least invasive option but still complex, and (3) may not
+yield a net speedup, so this exceeds a single-session scope.
+
+**Concluding numbers on xserv main**:
+- Best speedup: **1.10×** at γ=2 (cuBLAS-GEMM verify, no tree).
+- Tree kernel + wrapper ready and correctness-verified.
+- Full tree drafting requires KV remap work (Phase 27+ scope).
+
+Everything lands cleanly on `main`. Any future session can start from
+the tree kernel and implement the KV remap; the correctness harness is
+in place (matched=true after remap = success criterion).