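BF16 greedy decode was sensitive to inter-block scheduling: atomicAdd
accumulates in whatever order blocks and warps finish, and
floating-point addition is not associative, so when logits were close
the argmax could flip between runs. That broke speculative-decoding
verify-vs-decode parity.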
- gemv.cu: write per-K-block partials to a scratch buffer, then reduce
them in fixed block order in a second kernel instead of using atomicAdd
across K-blocks. The scratch buffer now holds
n * ceil(k / GEMV_TILE_K) elements; gemv_scratch_elems() exposes this
size to callers, and decode_graph.rs sizes
fp32_hidden/q/kv/intermediate/vocab from it.
- paged_attention.cu: replace the atomicAdd merge of warp outputs with
per-warp partials in shared memory, reduced in warp-id order, in both
the base and sinks kernels.
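
For illustration only, not the kernel code: a host-side Rust sketch of
the two ideas above, namely the scratch-size formula and why reducing
partials in a fixed order gives reproducible sums. The GEMV_TILE_K value
of 4 and the reduce_in_order helper are hypothetical, and the
gemv_scratch_elems signature shown here is an assumption that only
mirrors the formula in this message.

```rust
// Hypothetical tile width; the real GEMV_TILE_K value is not stated here.
const GEMV_TILE_K: usize = 4;

/// Scratch elements for an n x k GEMV: n * ceil(k / GEMV_TILE_K).
/// Assumed signature; only the formula comes from the commit message.
fn gemv_scratch_elems(n: usize, k: usize) -> usize {
    n * ((k + GEMV_TILE_K - 1) / GEMV_TILE_K)
}

/// Sum per-block partials in the given block order (hypothetical helper).
/// A fixed order stands in for the second reduction kernel; a varying
/// order stands in for atomicAdd under different block schedules.
fn reduce_in_order(partials: &[f32], order: &[usize]) -> f32 {
    order.iter().fold(0.0f32, |acc, &i| acc + partials[i])
}

fn main() {
    // Three partials whose f32 sum depends on accumulation order.
    let partials = [1.0e8f32, 1.0, -1.0e8];

    let fixed_a = reduce_in_order(&partials, &[0, 1, 2]);
    let fixed_b = reduce_in_order(&partials, &[0, 1, 2]);
    let reordered = reduce_in_order(&partials, &[0, 2, 1]);

    // Same order, same bits: this is the property the change relies on.
    assert_eq!(fixed_a.to_bits(), fixed_b.to_bits());
    // A different order yields a different f32 result for this input.
    assert_ne!(fixed_a, reordered);

    println!("fixed order: {fixed_a}, reordered: {reordered}");
    println!("scratch elems for n=3, k=10: {}", gemv_scratch_elems(3, 10));
}
```

The fixed-order path costs an extra pass over the partials, but its
result no longer depends on which block finishes first.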