xtrain

Gahow Wang 7fb3b32fd9 post-train: M4 — GRPO actor-learner loop + cached temperature rollout

train_grpo: the online, critic-free RL loop. Each step samples B prompts, rolls
out G completions per prompt, scores them with the rule-based checker (reward
0/1), computes the group-relative advantage A = (r − mean)/(std + ε), then runs
K inner clipped_pg_loss epochs with a KL leash to the frozen reference. The
reward is pure 0/1 correctness; the KL term is the format protector (the M3
collapse lesson). Tracks mean rollout reward (the falsifiable "it learns"
signal) and saves checkpoints periodically.
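
The advantage and clipped loss described above can be sketched in plain Python. This is an illustration only, not the repo's implementation: the function names, the population-variance choice, the sequence-level log-probs, and the default `eps` and `clip_eps` values are all assumptions, and the KL term is left out because the commit does not say how it is estimated or weighted.

```python
import math

def group_advantages_sketch(rewards, eps=1e-6):
    """A = (r - mean) / (std + eps) over the G rewards of one prompt.

    Hypothetical helper; population variance (divide by G) is an assumption.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_pg_loss_sketch(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over samples.

    Hypothetical stand-in for the clipped_pg_loss op; it works on one
    log-prob per completion, which may differ from the real op.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) objective, negate to get a loss.
        total += -min(ratio * a, clipped * a)
    return total / len(advantages)

# One prompt, G = 4 completions, two judged correct by the checker.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages_sketch(rewards)

# A group where every completion gets the same reward yields zero advantage,
# so it contributes nothing to the policy gradient.
flat = group_advantages_sketch([1.0, 1.0, 1.0, 1.0])
```

With 0/1 rewards, only groups that mix correct and incorrect completions produce a learning signal under this formula.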

decode: generate_cached adds temperature sampling to the KV-cache engine (M2).
Each step produces single-row [1,vocab] logits instead of the naive sampler's
[seq,vocab], which is far lighter on the caching allocator (the naive sampler
fragments it over a long rollout). generate_greedy_cached now routes through
generate_cached with temp 0; the decode_kv token-identical gate still passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 16:59:05 +08:00

xtrain-autodiff

post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op)

2026-06-30 14:07:02 +08:00

xtrain-cuda

post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op)

2026-06-30 14:07:02 +08:00

xtrain-distributed

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-model

post-train: M4 — GRPO actor-learner loop + cached temperature rollout