Phase 22 lands a correctness-only speculative decoding loop for a Qwen3
target plus a small Qwen3 draft (batch=1, greedy, gamma=4). Phase 23
makes the verify logits the authoritative acceptance signal, so a
mirror decode per accepted token is no longer needed.
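
For orientation, the greedy acceptance rule that verify logits typically
drive can be sketched as below. This is a minimal illustration in plain
Python, not the code in this change: the function names, the list-of-lists
logits layout, and the assumption that the verify pass returns gamma+1
rows (one per draft token plus one bonus row) are all hypothetical.

```python
def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_accept(draft_tokens, verify_logits):
    """Return (accepted_prefix, next_token) under greedy verification.

    Assumed layout (hypothetical): verify_logits[i] holds the target
    model's logits for the position where draft_tokens[i] was proposed,
    and the final extra row scores the position after the last draft token.
    """
    if len(verify_logits) != len(draft_tokens) + 1:
        raise ValueError("expected one logits row per draft token plus one")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_tok = argmax(verify_logits[i])
        if target_tok != tok:
            # First disagreement: keep the prefix, emit the target's token.
            return accepted, target_tok
        accepted.append(tok)
    # Every draft token matched: the extra row yields a bonus token.
    return accepted, argmax(verify_logits[len(draft_tokens)])


# Toy example with a 4-token vocabulary and gamma=4.
example_logits = [
    [0.1, 0.2, 0.9, 0.0],  # target picks 2 (matches draft)
    [0.8, 0.1, 0.0, 0.1],  # target picks 0 (matches draft)
    [0.0, 0.3, 0.1, 0.6],  # target picks 3 (draft proposed 1)
    [0.5, 0.1, 0.2, 0.2],  # not reached
    [0.2, 0.5, 0.2, 0.1],  # not reached
]
example_result = greedy_accept([2, 0, 1, 3], example_logits)
```

With the logits as the acceptance signal, each round produces the
accepted prefix plus one target token from a single verify pass, which
is why no extra decode per accepted token is required in this sketch.
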
- paged_kv_cache: truncate_sequence(slot, new_len) shrinks a registered
sequence, frees every whole physical block that is no longer
reachable, and leaves the slot registered. Covered by a CUDA-gated
unit test.
- qwen3: forward_verify_paged_decode_attention writes the draft window
into the target cache, runs the same paged decode attention kernel per
draft token, and uses matmul_rows_gemv so linear layers follow the
single-token decode BF16 rounding path.
- bench-speculative: new bench binary drives the state machine with
--gamma / --gen-tokens / --prompts / --use-verify-logits /
--verify-path flash|paged-decode / --dump-verify-mismatches, then
compares baseline and speculative token sequences and reports TPOT /
tok/s / speedup.
- docs/22 records the decode-authoritative v0 result and dash5 numbers
(matched=true, speedup_e2e ~0.29x, verify_decode_mismatches>0 under
--use-verify-logits).
- docs/23 records the paged-decode verify path (matched=true,
verify_decode_mismatches=0, 50x64 speedup_e2e ~0.44x) and the
next-step performance TODO.
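
The truncate_sequence(slot, new_len) behaviour described above (free whole
blocks past the new length, keep the slot registered) can be illustrated
with a toy host-side model. This is a sketch under stated assumptions,
not the real cache: ToyPagedKvCache, register, append_tokens, and the
free-list layout are hypothetical names, and no GPU memory is involved.

```python
class ToyPagedKvCache:
    """Toy block-table bookkeeping; stores no actual K/V data."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # slot -> list of physical block ids
        self.seq_lens = {}      # slot -> number of tokens stored

    def _blocks_needed(self, length):
        # Ceiling division: a partially filled block still counts.
        return -(-length // self.block_size)

    def register(self, slot):
        if slot in self.block_tables:
            raise ValueError("slot already registered")
        self.block_tables[slot] = []
        self.seq_lens[slot] = 0

    def append_tokens(self, slot, count):
        new_len = self.seq_lens[slot] + count
        table = self.block_tables[slot]
        while len(table) < self._blocks_needed(new_len):
            table.append(self.free_blocks.pop())
        self.seq_lens[slot] = new_len

    def truncate_sequence(self, slot, new_len):
        if not 0 <= new_len <= self.seq_lens[slot]:
            raise ValueError("new_len must be within [0, current length]")
        table = self.block_tables[slot]
        # Release only blocks that hold no token below new_len.
        while len(table) > self._blocks_needed(new_len):
            self.free_blocks.append(table.pop())
        self.seq_lens[slot] = new_len  # the slot itself stays registered


cache = ToyPagedKvCache(num_blocks=8, block_size=16)
cache.register(0)
cache.append_tokens(0, 40)      # 40 tokens occupy 3 blocks of 16
cache.truncate_sequence(0, 33)  # still needs 3 blocks, nothing freed
blocks_after_33 = len(cache.block_tables[0])
cache.truncate_sequence(0, 32)  # exactly 2 full blocks, 1 block freed
blocks_after_32 = len(cache.block_tables[0])
```

In a speculative loop, a truncate of this shape would let the target
cache roll back rejected draft positions after verification without
re-registering the sequence.
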