xtrain

Files

Gahow Wang eff26a0898 post-train: M2a — KV-cache incremental decode engine (token-identical)

Single-sequence KV-cache decode (xtrain-model/src/decode.rs): a per-layer K/V
cache plus a single-token incremental forward. Prefill is just the first
prompt.len() decode steps, so there is one code path. The forward mirrors
model::block_forward at the raw-Tensor level (no autograd tape, since inference
needs no grads) and uses rope_at + decode_attention. The cache is accumulated
on the host as token-major f32 and rebuilt each step (the honest M2a baseline;
M2b moves it device-side and adds ragged batching).
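
A minimal sketch of the idea behind a host-side, token-major f32 cache with
single-query attention over it. This is an illustration only: the KvCache type
and its methods are hypothetical names, it covers one head with no RoPE or GQA,
and it is not the code in decode.rs.

```rust
/// Hypothetical single-head KV cache, token-major: row t of `k` / `v` holds
/// the key / value vector of token t. Illustration only, not decode.rs.
pub struct KvCache {
    head_dim: usize,
    k: Vec<f32>, // tokens * head_dim
    v: Vec<f32>, // tokens * head_dim
}

impl KvCache {
    pub fn new(head_dim: usize) -> Self {
        assert!(head_dim > 0);
        Self { head_dim, k: Vec::new(), v: Vec::new() }
    }

    /// Number of tokens cached so far.
    pub fn len(&self) -> usize {
        self.k.len() / self.head_dim
    }

    pub fn is_empty(&self) -> bool {
        self.k.is_empty()
    }

    /// Append the K/V vectors of the token just processed.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        self.k.extend_from_slice(k);
        self.v.extend_from_slice(v);
    }

    /// Softmax attention of one query over every cached token.
    /// Causality is implicit: the cache only holds tokens up to "now".
    pub fn attend(&self, q: &[f32]) -> Vec<f32> {
        assert_eq!(q.len(), self.head_dim);
        assert!(!self.is_empty());
        let hd = self.head_dim;
        let scale = 1.0 / (hd as f32).sqrt();
        let scores: Vec<f32> = self
            .k
            .chunks_exact(hd)
            .map(|kt| q.iter().zip(kt).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Subtract the max before exp for numerical stability.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        let mut out = vec![0.0f32; hd];
        for (e, vt) in exps.iter().zip(self.v.chunks_exact(hd)) {
            let w = e / denom;
            for (o, x) in out.iter_mut().zip(vt) {
                *o += w * x;
            }
        }
        out
    }
}
```

Each decode step appends one K/V row and runs one query against the rows
accumulated so far, so earlier tokens are never pushed through the model again.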

Gate (the M2 centerpiece): KV-cache greedy decode is TOKEN-IDENTICAL to naive
full-recompute greedy decode. This is checked in tests/decode_kv.rs (small GQA
model, F32, 24 tokens) and corroborated on the v12 1.05B SFT checkpoint, where
cached eval matches naive eval byte-for-byte (format 100/100, correct 8/100).
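
The shape of such a gate can be sketched with a toy stand-in for the model.
Everything here is hypothetical (logits_from_state, greedy_naive,
greedy_cached, the 7-token vocabulary); it is not tests/decode_kv.rs, and the
toy uses integer logits so equality is exact by construction.

```rust
const VOCAB: usize = 7;

/// Toy "model": next-token logits depend only on a running state, here the
/// wrapping sum of all tokens seen so far. Hypothetical stand-in.
fn logits_from_state(state: u32) -> [i32; VOCAB] {
    let mut logits = [0i32; VOCAB];
    for (i, x) in logits.iter_mut().enumerate() {
        *x = ((state as u64 * 3 + i as u64 * 5) % 11) as i32;
    }
    logits
}

/// Greedy pick; the first maximum wins so ties break deterministically.
fn argmax(logits: &[i32]) -> u32 {
    let mut best = 0;
    for i in 1..logits.len() {
        if logits[i] > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Naive path: recompute the state from the whole sequence at every step.
pub fn greedy_naive(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut seq = prompt.to_vec();
    for _ in 0..max_new {
        let state = seq.iter().fold(0u32, |s, &t| s.wrapping_add(t));
        seq.push(argmax(&logits_from_state(state)));
    }
    seq[prompt.len()..].to_vec()
}

/// Cached path: prefill feeds the prompt one token at a time through the
/// same update used during generation, then each step touches one token.
pub fn greedy_cached(prompt: &[u32], max_new: usize) -> Vec<u32> {
    let mut state = 0u32;
    for &t in prompt {
        state = state.wrapping_add(t);
    }
    let mut out = Vec::with_capacity(max_new);
    for _ in 0..max_new {
        let next = argmax(&logits_from_state(state));
        out.push(next);
        state = state.wrapping_add(next);
    }
    out
}
```

The gate is then a plain equality check on the two token vectors, for example
assert_eq!(greedy_naive(&[1, 2, 3], 24), greedy_cached(&[1, 2, 3], 24)).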

eval_arith --cached A/Bs the two paths and reports decode tok/s. Measured on
v12 (1.05B, batch 1, F32), the cache win depends on sequence length:
  max_new=32   naive 108 vs cached 111 tok/s  (~1.0x; overhead-bound)
  max_new=128  naive  69 vs cached 133 tok/s  (~1.9x)
  max_new=256  naive OOM     vs cached 129 tok/s
Cached throughput stays roughly constant (O(1) forward work per token) while
naive throughput decays (O(t) per token, with an O(seq^2) graph that runs out
of memory at length). Short eval generations are overhead-bound, so the cache
matters for long rollouts (DPO/GRPO), not for the arithmetic eval itself.
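
A rough counting sketch of why the gap widens with length. It only counts
token positions pushed through the model and does not predict the measured
tok/s above; the function names are hypothetical, and any prompt length fed to
them is an assumed input rather than a figure from the eval.

```rust
/// Token positions processed to greedily generate `max_new` tokens from a
/// `prompt_len`-token prompt when every step recomputes the whole prefix.
pub fn naive_positions(prompt_len: u64, max_new: u64) -> u64 {
    // Step i (0-based) re-runs all prompt_len + i positions.
    (0..max_new).map(|i| prompt_len + i).sum()
}

/// Same count with a KV cache: each position is processed exactly once.
/// The last generated token is never fed back, hence the "- 1".
pub fn cached_positions(prompt_len: u64, max_new: u64) -> u64 {
    if max_new == 0 {
        return 0;
    }
    prompt_len + max_new - 1
}
```

With an assumed 8-token prompt, naive_positions(8, 32) is 752 against
cached_positions(8, 32) = 39, and at max_new = 128 the counts are 9152 against
135: the naive total grows quadratically while the cached total grows linearly.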

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 12:00:03 +08:00

xtrain-autodiff

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-cuda

post-train: M2 — decode primitives (rope_at + decode_attention)

2026-06-30 12:00:03 +08:00

xtrain-distributed

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-model

post-train: M2a — KV-cache incremental decode engine (token-identical)