xtrain

Gahow Wang 361c5290fa post-train: M4 — use M2b batched rollout in GRPO (~1.7× step)

train_grpo rolls out a prompt's G samples with one generate_cached_batch call
instead of G sequential generate_cached calls. Measured on v12 1.05B (G=6, B=6,
easy task): ~8.5 s/step vs ~14-16 s/step for single-sequence cached rollout,
about 1.7× faster per step with rollout time included. The gain falls short of
G× because per_token_logp and the PG update also take time, and the M2a host
round-trip remains. Memory use is also more stable: one batched forward per
step instead of G allocations that fragment the caching allocator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 17:18:54 +08:00
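
The commit message above says the step gain stays below G× because only the rollout is batched, while per_token_logp and the PG update cost the same as before. A minimal sketch of that reasoning using Amdahl's law follows. The function names are hypothetical and not part of the repository. The sketch assumes the batched rollout is an ideal G× faster, which the commit does not claim (the M2a host round-trip remains), so the numbers are illustrative only.

```python
def step_speedup(rollout_fraction: float, rollout_speedup: float) -> float:
    """Amdahl's law: overall step speedup when only the rollout part gets faster.

    rollout_fraction: share of the old step time spent in rollout (0..1).
    rollout_speedup:  how much faster the rollout alone becomes (> 0).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)


def implied_rollout_fraction(step_gain: float, rollout_speedup: float) -> float:
    """Invert Amdahl's law: rollout share that would produce the observed step gain."""
    if rollout_speedup <= 1.0 or not 1.0 <= step_gain <= rollout_speedup:
        raise ValueError("need 1 <= step_gain <= rollout_speedup and rollout_speedup > 1")
    return (1.0 - 1.0 / step_gain) / (1.0 - 1.0 / rollout_speedup)


G = 6            # group size from the commit message
STEP_GAIN = 1.7  # approximate measured step gain from the commit message

# Assumption: the rollout itself becomes exactly G times faster (idealized).
share = implied_rollout_fraction(STEP_GAIN, G)
print(f"implied rollout share of the old step: {share:.2f}")
print(f"check, step speedup at that share: {step_speedup(share, G):.2f}x")
```

Under this idealized model, a 1.7× step gain at G=6 corresponds to rollout taking roughly half of the old step time. If the batched rollout is less than G× faster in practice, the real rollout share would be higher than this estimate.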

xtrain-autodiff

post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op)

2026-06-30 14:07:02 +08:00

xtrain-cuda

post-train: M2b — batched KV-cache decode (G-way, token-identical)

2026-06-30 17:18:54 +08:00

xtrain-distributed

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-model

post-train: M2b — batched KV-cache decode (G-way, token-identical)