xtrain

Gahow Wang 3a3425960c post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift

Device-resident KV cache: keep K/V on the GPU as [bh,T,hd], grow by one token
per step via a new cat_seq kernel (concat along seq) — removes the M2a/M2b
per-layer host round-trip (to_cpu/from_slice/re-upload) AND the transpose_3d01.
Both single-seq and batched decode refactored to it; cache is Option<Tensor>
per layer (cleaner than the host Vec + rebuild).

Gates all hold: cat_seq == host concat; decode_kv single-seq + decode_batch
G-way both still TOKEN-IDENTICAL; GQA training path unaffected.
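
For illustration only, a host-side reference of the kind the `cat_seq == host concat` gate could compare against, assuming row-major f32 buffers shaped [bh,T,hd]. The names `cat_seq_host`, `LayerCache`, and `append_token` are hypothetical and are not taken from the repo; the actual kernel and Tensor type may differ.

```rust
/// Hypothetical host reference: concatenate along the sequence axis.
/// `a` is [bh, t_a, hd] and `b` is [bh, t_b, hd], both row-major.
/// Returns a buffer shaped [bh, t_a + t_b, hd].
fn cat_seq_host(a: &[f32], b: &[f32], bh: usize, t_a: usize, t_b: usize, hd: usize) -> Vec<f32> {
    assert_eq!(a.len(), bh * t_a * hd, "a must be [bh, t_a, hd]");
    assert_eq!(b.len(), bh * t_b * hd, "b must be [bh, t_b, hd]");
    let mut out = Vec::with_capacity(bh * (t_a + t_b) * hd);
    for i in 0..bh {
        // For each (batch * head) row: existing tokens first, then the new ones.
        out.extend_from_slice(&a[i * t_a * hd..(i + 1) * t_a * hd]);
        out.extend_from_slice(&b[i * t_b * hd..(i + 1) * t_b * hd]);
    }
    out
}

/// Hypothetical per-layer cache mirroring the Option-per-layer idea:
/// `None` until the first token arrives, then grown by one token per step.
struct LayerCache {
    k: Option<Vec<f32>>, // [bh, t, hd] once populated
    t: usize,            // tokens cached so far
}

/// Append one token's keys ([bh, 1, hd]) to the cache.
fn append_token(cache: &mut LayerCache, new_k: &[f32], bh: usize, hd: usize) {
    let grown = match cache.k.take() {
        None => new_k.to_vec(),
        Some(old) => cat_seq_host(&old, new_k, bh, cache.t, 1, hd),
    };
    cache.k = Some(grown);
    cache.t += 1;
}
```

A device kernel would write the same layout without leaving the GPU; a gate of this kind would then check the downloaded result element by element against `cat_seq_host`.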

Honest measurement (the point): removing the host round-trip buys ~10% on pure
single-seq decode (133 → 147 tok/s @128) but does NOT move the GRPO step
(~8.5 s/step, unchanged). After M2b batching the rollout is no longer the
step's bottleneck: the per-sample per_token_logp captures and the PG-update
forwards/backwards (model.forward, full-seq) now dominate. Measure-first lesson
(cf. T11/T17/M2a): the long pole shifted to the training-side forwards, so the
next decode lever (ragged batched prefill) targets those, not the cache.
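
A rough Amdahl-style sketch of why a ~10% decode speedup can leave the step time flat. The function name `step_time_after` and the decode fraction used in the usage note are hypothetical; the commit does not state what share of the step is decode.

```rust
/// Hypothetical estimate: step time after speeding up only the decode part.
/// `decode_fraction` is the share of the step spent in decode (an assumption),
/// `decode_speedup` is new_throughput / old_throughput for decode alone.
fn step_time_after(step_s: f64, decode_fraction: f64, decode_speedup: f64) -> f64 {
    step_s * ((1.0 - decode_fraction) + decode_fraction / decode_speedup)
}
```

With the measured decode speedup of 147/133 and an assumed decode share of 10%, `step_time_after(8.5, 0.10, 147.0 / 133.0)` comes out within about 0.1 s of the original 8.5 s, which is easily lost in step-to-step noise.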

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 17:38:16 +08:00

src

post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift

2026-06-30 17:38:16 +08:00

tests

T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)

2026-06-15 14:42:43 +08:00

build.rs

gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests)

2026-06-18 01:37:37 +08:00

Cargo.toml

T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)

2026-06-15 14:42:43 +08:00