docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker
This commit is contained in:
@@ -237,3 +237,64 @@ with a specific numerical path.
|
||||
Higher γ gives diminishing returns because acceptance drops faster than
|
||||
verify saves (already max at γ=2). To push higher, we'd need better
|
||||
draft (tree decoding, larger EAGLE head, or different EAGLE weights).
|
||||
|
||||
---
|
||||
|
||||
## Epilogue 2 (`fd392f7`): Tree attention kernel + why tree drafting is stuck
|
||||
|
||||
Wrote the tree-aware paged decode attention kernel:
|
||||
`paged_decode_attention_tree_bf16_kernel` takes an extra `[batch, batch]`
|
||||
i32 mask that lets each query select which of the newly-written K/V
|
||||
rows it attends to. Positions before `tree_start` always attended.
|
||||
|
||||
Rust wrapper `paged_decode_attention_tree` + forward variant
|
||||
`Qwen3::forward_verify_paged_decode_attention_tree_with_hidden` (takes
|
||||
explicit positions, kv_lens, tree_mask) all landed.
|
||||
|
||||
Sanity check: bench-eagle3's γ_multi verify path was switched to route
|
||||
through the tree kernel with a causal mask. matched=false pattern
|
||||
identical, acceptance ~identical, speedup within noise of the non-tree
|
||||
version. Kernel is correct.
|
||||
|
||||
### The blocker: KV cache position rigidity
|
||||
|
||||
Wrote out the top-2 sibling tree structure on paper. Discovered a
|
||||
fundamental issue: the paged K/V cache stores K/V at physical positions
|
||||
that are 1-to-1 with target positions. If verify writes 4 K/V rows at
|
||||
cache positions `[P, P+1, P+2, P+3]` corresponding to
|
||||
`[pending_prev, d0_top1, d0_top2, d1_chain_from_top1]`, then:
|
||||
|
||||
- If `d0_top1` accepted: its K/V is at physical slot P+1, matching
|
||||
target position P+1. Continuing decode from position P+1 reads the
|
||||
right K/V. ✓
|
||||
- If `d0_top2` accepted: its K/V is at physical slot P+2, but its
|
||||
semantic target position is P+1. Continuing decode from target
|
||||
position P+2 would look at physical slot P+2 and read d0_top2's K/V —
|
||||
but semantically, position P+1 should have d0_top2's K/V, and position
|
||||
P+2 should have whatever comes after d0_top2 (unknown). Continuing
|
||||
decode reads the wrong K/V. ✗
|
||||
|
||||
Fixing this requires one of:
|
||||
1. **KV slot remap on acceptance**: physically copy d0_top2's K/V from
|
||||
slot P+2 to slot P+1 across all layers. Costs one full-layer memcpy
|
||||
per acceptance of a non-top-1 sibling. Doable but adds ~2ms per event.
|
||||
2. **Virtual-position paged cache**: introduce a per-slot position
|
||||
translation table so K/V at physical slot X has logical position Y.
|
||||
Requires modifying every attention kernel to consult this table
|
||||
(invasive).
|
||||
3. **Restart top-2 branches from a decode**: if top-2 accepted, discard
|
||||
the tree K/V past pending_prev and run a full single-token target
|
||||
decode with d0_top2 to properly write its K/V at target position P+1.
|
||||
Costs ~1 full decode per accepted top-2, which likely eats the win.
|
||||
|
||||
Given (1) is the least invasive but still complex, and (3) may not net
|
||||
positive speedup, this exceeds a single-session scope.
|
||||
|
||||
**Concluding numbers on xserv main**:
|
||||
- Best speedup: **1.10×** at γ=2 (cuBLAS-GEMM verify, no tree).
|
||||
- Tree kernel + wrapper ready and correctness-verified.
|
||||
- Full tree drafting requires KV remap work (Phase 27+ scope).
|
||||
|
||||
Everything lands cleanly on `main`. Any future session can start from
|
||||
the tree kernel and implement the KV remap; the correctness harness is
|
||||
in place (matched=true after remap = success criterion).
|
||||
|
||||
Reference in New Issue
Block a user