Commit Graph

4 Commits

Author SHA1 Message Date
3a3425960c post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift
Device-resident KV cache: keep K/V on the GPU as [bh,T,hd], grow by one token
per step via a new cat_seq kernel (concat along seq) — removes the M2a/M2b
per-layer host round-trip (to_cpu/from_slice/re-upload) AND the transpose_3d01.
Both single-seq and batched decode refactored to it; cache is Option<Tensor>
per layer (cleaner than the host Vec + rebuild).

Gates all hold: cat_seq == host concat; decode_kv single-seq + decode_batch
G-way both still TOKEN-IDENTICAL; GQA training path unaffected.
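
The "cat_seq == host concat" gate compares the kernel against a host-side concatenation. A minimal sketch of what such a reference could look like follows; it assumes a row-major [bh, T, hd] cache and a [bh, 1, hd] slab for the new token. `cat_seq_ref` is a hypothetical name for illustration, not the repository's kernel or test helper.

```rust
/// Hypothetical CPU reference: append one token's K (or V) to a cache
/// stored row-major as [bh, t, hd]; `new_tok` is [bh, 1, hd].
/// Returns a fresh buffer laid out as [bh, t + 1, hd].
fn cat_seq_ref(cache: &[f32], new_tok: &[f32], bh: usize, t: usize, hd: usize) -> Vec<f32> {
    assert_eq!(cache.len(), bh * t * hd, "cache must be [bh, t, hd]");
    assert_eq!(new_tok.len(), bh * hd, "new token must be [bh, 1, hd]");
    let mut out = Vec::with_capacity(bh * (t + 1) * hd);
    for b in 0..bh {
        // Existing t rows of this (batch, head) slice, then the new row.
        out.extend_from_slice(&cache[b * t * hd..(b + 1) * t * hd]);
        out.extend_from_slice(&new_tok[b * hd..(b + 1) * hd]);
    }
    out
}
```

Because seq is the middle dimension, the new row has to be interleaved per (batch, head) slice rather than appended at the end of the buffer, which is why a dedicated concat step is needed for this layout.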

Honest measurement (the point): removing the host round-trip buys ~10% on pure
single-seq decode (133 → 147 tok/s at max_new=128) but does NOT move the GRPO
step (~8.5 s/step, unchanged). After M2b batching, the rollout is no longer the
step's bottleneck: the per-sample per_token_logp captures and the PG-update
forwards/backwards (model.forward, full-seq) now dominate. Measure-first lesson
(cf. T11/T17/M2a): the long pole shifted to the training-side forwards; the next
decode lever (ragged batched prefill) targets those, not the cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:38:16 +08:00
2c9b58cb3b post-train: M2b — batched KV-cache decode (G-way, token-identical)
The rollout long-pole fix deferred from M2a: decode the G samples of one prompt
in lockstep (one forward per step over the group → G× fewer kernel launches).

- rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward-
  only kernel) — G rows share one decode position. Gate: == full rope for
  [0..n], == rope_at(P) per row for uniform P (bit-identical).
- generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step.
  decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G)
  broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next).
- Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single-
  sequence decode (8 query / 2 kv heads → exercises repeat_kv batching).
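
The rope_pos gate above can be illustrated on the CPU. This is a sketch only: it assumes the interleaved-pair RoPE convention and passes the base explicitly, neither of which the commit message specifies, and `rope_row` / `rope_pos_ref` are hypothetical names rather than the repository's kernels.

```rust
/// Rotate one row of even length `hd` in place by absolute position `pos`.
/// Assumption: pairs are (x[2i], x[2i+1]) with frequency base^(-2i/hd).
fn rope_row(row: &mut [f32], pos: usize, base: f32) {
    let hd = row.len();
    assert!(hd % 2 == 0, "head dim must be even");
    for i in 0..hd / 2 {
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / hd as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (row[2 * i], row[2 * i + 1]);
        row[2 * i] = a * cos - b * sin;
        row[2 * i + 1] = a * sin + b * cos;
    }
}

/// Hypothetical CPU analogue of rope_pos: `x` is [rows, hd] row-major and
/// row r is rotated by positions[r].
fn rope_pos_ref(x: &mut [f32], positions: &[usize], hd: usize, base: f32) {
    assert_eq!(x.len(), positions.len() * hd);
    for (row, &pos) in x.chunks_mut(hd).zip(positions) {
        rope_row(row, pos, base);
    }
}
```

With positions = [0, 1, ..., n-1] this reduces to ordinary full-sequence RoPE, and with a uniform position P every row gets the same rotation as a single-position call, which is the shape of the two equalities the gate checks.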

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:18:54 +08:00
7fb3b32fd9 post-train: M4 — GRPO actor-learner loop + cached temperature rollout
train_grpo: the online, critic-free RL loop — per step sample B prompts, roll
out G completions each, score with the rule-based checker (reward 0/1), compute
group-relative advantage A=(r−mean)/(std+ε), then K inner clipped_pg_loss
epochs with a KL leash to the frozen reference. Reward = pure 0/1 correctness
(KL is the format protector, the M3 collapse lesson). Tracks mean rollout reward
(the falsifiable "it learns" signal). Periodic checkpoint save.
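
The advantage formula A=(r−mean)/(std+ε) and a clipped policy-gradient term can be sketched as below. Assumptions, since the message does not spell them out: the population standard deviation is used, the clipped term follows the usual PPO-style surrogate, and the KL penalty is omitted. `group_advantages` and `clipped_pg_term` are hypothetical names, not the functions in train_grpo.

```rust
/// Group-relative advantage for one prompt's G rewards:
/// A_i = (r_i - mean) / (std + eps).
/// Assumption: population (biased) standard deviation.
fn group_advantages(rewards: &[f32], eps: f32) -> Vec<f32> {
    assert!(!rewards.is_empty(), "need at least one reward");
    let n = rewards.len() as f32;
    let mean = rewards.iter().sum::<f32>() / n;
    let var = rewards.iter().map(|&r| (r - mean).powi(2)).sum::<f32>() / n;
    let std = var.sqrt();
    rewards.iter().map(|&r| (r - mean) / (std + eps)).collect()
}

/// PPO-style clipped surrogate for one token, as a loss to minimize:
/// -min(ratio * A, clamp(ratio, 1 - clip, 1 + clip) * A).
/// Assumption: this is the general shape of a clipped PG loss, not
/// necessarily the exact form used by clipped_pg_loss.
fn clipped_pg_term(logp_new: f32, logp_old: f32, adv: f32, clip: f32) -> f32 {
    let ratio = (logp_new - logp_old).exp();
    let clipped = ratio.clamp(1.0 - clip, 1.0 + clip);
    -(ratio * adv).min(clipped * adv)
}
```

With 0/1 rewards, a group in which every completion is correct (or every one is wrong) has zero variance, so all advantages are zero and that prompt contributes no policy-gradient signal for the step.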

decode: generate_cached adds temperature sampling to the KV-cache engine (M2) —
single-row [1,vocab] logits per step vs the naive sampler's [seq,vocab], far
lighter on the caching allocator (the naive sampler fragments it over a long
rollout). generate_greedy_cached now routes through it (temp 0); decode_kv
token-identical gate still passes.
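
Temperature sampling over a single [1,vocab] logits row, with temperature 0 falling back to greedy, can be sketched as follows. This is an illustration under assumptions: `sample_token` is a hypothetical name, and the uniform draw `u` is passed in by the caller so the sketch needs no RNG crate.

```rust
/// Pick a token from one row of logits. `temperature <= 0.0` means greedy
/// argmax; otherwise sample from softmax(logits / temperature) using `u`,
/// a uniform draw in [0, 1) supplied by the caller.
fn sample_token(logits: &[f32], temperature: f32, u: f32) -> usize {
    assert!(!logits.is_empty());
    // First index of the maximum logit (ties resolve to the lowest index).
    let argmax = logits
        .iter()
        .enumerate()
        .fold(0, |best, (i, &v)| if v > logits[best] { i } else { best });
    if temperature <= 0.0 {
        return argmax;
    }
    // Subtract the max before exponentiating for numerical stability.
    let max = logits[argmax];
    let weights: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    // Inverse-CDF sampling: walk the cumulative mass until it passes u * total.
    let mut acc = 0.0_f32;
    for (i, &w) in weights.iter().enumerate() {
        acc += w;
        if u * total < acc {
            return i;
        }
    }
    logits.len() - 1 // guard against rounding at the tail
}
```

Routing greedy decode through the same function at temperature 0 keeps a single code path, so a token-identical gate on the greedy case also exercises the sampler's plumbing.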

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 16:59:05 +08:00
eff26a0898 post-train: M2a — KV-cache incremental decode engine (token-identical)
Single-sequence KV-cache decode (xtrain-model/src/decode.rs): per-layer K/V
cache + single-token incremental forward (prefill = first prompt.len() decode
steps, one code path). Mirrors model::block_forward at the raw-Tensor level (no
autograd tape — inference needs no grads), using rope_at + decode_attention.
Cache is host-accumulated token-major f32, rebuilt per step (the honest M2a
baseline; M2b moves it device-side + batched ragged).

Gate (the M2 centerpiece): KV-cache greedy decode is TOKEN-IDENTICAL to the
naive full-recompute greedy — tests/decode_kv.rs (small GQA model, F32, 24
tokens) and corroborated on the v12 1.05B SFT checkpoint (cached eval =
naive eval byte-for-byte: format 100/100, correct 8/100).
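
Why cached and naive decode can agree exactly is easiest to see on a toy. The sketch below is a single attention head with stand-in projections, no RoPE and no layers; every name in it (`attend`, `proj_k`, `proj_v`, `step_naive`, `KvCache`) is hypothetical and unrelated to the code in decode.rs.

```rust
/// Toy single-head attention for the newest token only:
/// softmax(q · K^T / sqrt(d)) · V.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(&a, &b)| a * b).sum::<f32>() * scale)
        .collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let total: f32 = exps.iter().sum();
    let mut out = vec![0.0_f32; q.len()];
    for (&w, v) in exps.iter().zip(values) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += w / total * x;
        }
    }
    out
}

// Stand-ins for the learned K and V projections.
fn proj_k(x: &[f32]) -> Vec<f32> { x.iter().map(|&v| 0.5 * v).collect() }
fn proj_v(x: &[f32]) -> Vec<f32> { x.iter().map(|&v| 2.0 * v + 1.0).collect() }

/// Naive path: re-project every token seen so far, then attend.
fn step_naive(tokens: &[Vec<f32>]) -> Vec<f32> {
    let keys: Vec<Vec<f32>> = tokens.iter().map(|t| proj_k(t)).collect();
    let values: Vec<Vec<f32>> = tokens.iter().map(|t| proj_v(t)).collect();
    attend(tokens.last().expect("non-empty"), &keys, &values)
}

/// Cached path: keep K/V from earlier steps and project only the new token.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn step(&mut self, x: &[f32]) -> Vec<f32> {
        self.keys.push(proj_k(x));
        self.values.push(proj_v(x));
        attend(x, &self.keys, &self.values)
    }
}
```

Both paths feed the same K/V rows to the same arithmetic in the same order, so the outputs match exactly in this toy; in a real engine, a token-identical gate is what confirms that the cached path has not drifted from the full recompute.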

eval_arith --cached A/Bs the two paths + reports decode tok/s. Measured on v12
(1.05B, batch 1, F32): the cache win is sequence-length-dependent —
  max_new   naive tok/s   cached tok/s   speedup
  32        108           111            ~1.0x (overhead-bound)
  128       69            133            ~1.9x
  256       OOM           129            n/a
Cached throughput stays ~constant (O(1)/token) while naive decays (O(t)/token,
O(seq^2) graph → OOM at length). Short eval prompts are overhead-bound, so the
cache matters for long rollouts (DPO/GRPO), not the arithmetic eval itself.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:00:03 +08:00