The rollout long-pole fix deferred from M2a: decode the G samples of one prompt
in lockstep (one forward per step over the group → G× fewer kernel launches).
- rope_pos(x, positions[]): RoPE with a per-row absolute position (new
forward-only kernel), needed because the G rows of a decode step share one
position. Gate: bit-identical to full rope for positions [0..n], and to
rope_at(P) per row for uniform P.
- generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step.
decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G)
broadcasts per group. No finished-mask or ragged-prompt support yet
(perf-only / next).
- Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single-
sequence decode (8 query / 2 kv heads → exercises repeat_kv batching).
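
A minimal CPU sketch of the two pieces named in the list above: per-row RoPE
with its bit-identity gate, and repeat_kv over a batched head axis. This is an
illustration under stated assumptions, not the kernels in this commit: the
signatures, the flat row-major layout, the adjacent-pair rotation, the base of
10000, and the helpers rope_row and bits are all hypothetical.

```rust
// Illustrative sketch only; signatures, layout, and the RoPE pairing
// convention are assumptions, not the repository's kernels.

const BASE: f32 = 10000.0; // assumed RoPE base

/// Rotate one row (even length) in place for absolute position `pos`,
/// pairing adjacent elements (2i, 2i+1).
fn rope_row(row: &mut [f32], pos: usize) {
    let hd = row.len();
    for i in 0..hd / 2 {
        let theta = pos as f32 * BASE.powf(-2.0 * i as f32 / hd as f32);
        let (s, c) = theta.sin_cos();
        let (a, b) = (row[2 * i], row[2 * i + 1]);
        row[2 * i] = a * c - b * s;
        row[2 * i + 1] = a * s + b * c;
    }
}

/// Full-sequence RoPE: row t is rotated for position t.
fn rope(x: &mut [f32], hd: usize) {
    for (t, row) in x.chunks_exact_mut(hd).enumerate() {
        rope_row(row, t);
    }
}

/// Single-position RoPE: every row is rotated for position p.
fn rope_at(x: &mut [f32], hd: usize, p: usize) {
    for row in x.chunks_exact_mut(hd) {
        rope_row(row, p);
    }
}

/// Per-row RoPE: row r is rotated for positions[r].
fn rope_pos(x: &mut [f32], hd: usize, positions: &[usize]) {
    assert_eq!(x.len(), positions.len() * hd);
    for (row, &p) in x.chunks_exact_mut(hd).zip(positions) {
        rope_row(row, p);
    }
}

/// Raw bit patterns, so comparisons are bit-exact rather than approximate.
fn bits(v: &[f32]) -> Vec<u32> {
    v.iter().map(|f| f.to_bits()).collect()
}

/// Expand one time step of KV heads, [batch * num_kv, hd], to query heads,
/// [batch * nh, hd]. Assumes consecutive query heads share a KV head.
fn repeat_kv(kv: &[f32], hd: usize, num_kv: usize, nh: usize, batch: usize) -> Vec<f32> {
    assert_eq!(nh % num_kv, 0);
    assert_eq!(kv.len(), batch * num_kv * hd);
    let rep = nh / num_kv;
    let mut out = Vec::with_capacity(batch * nh * hd);
    for g in 0..batch {
        for h in 0..nh {
            // Query head h of sample g reads KV head h / rep of the same sample.
            let src = g * num_kv + h / rep;
            out.extend_from_slice(&kv[src * hd..(src + 1) * hd]);
        }
    }
    out
}
```

In this sketch the gate amounts to comparing bits(...) of the outputs:
rope_pos with positions [0, 1, 2] against rope, and rope_pos with [7, 7, 7]
against rope_at(.., 7). For repeat_kv, a call with batch = G should equal the
concatenation of G calls with batch = 1, which is the property the 8 query /
2 kv head test configuration exercises.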

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>