xtrain

Gahow Wang 86de6bfb51 distributed: T21-for-proc — wire --dropout into the process-per-GPU launcher

T21 fixed --dropout under thread-per-GPU (train_ddp): added the flag, set
cfg.dropout, and made train_rank re-assert model.train() each step so the
training forward stays live across periodic eval flips. The process-per-GPU
launcher (train_ddp_mp) was left out: it never parsed --dropout, so cfg.dropout
stayed at Config::from_arch's 0.0 default, and the worker's model was built
with dropout permanently disabled, silently, regardless of what the user passed.
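
A toy sketch of why the per-step model.train() matters, assuming the periodic
eval leaves the model in eval mode. The Model struct, run_steps, and its
parameters are hypothetical stand-ins for illustration, not the xtrain types.

```rust
/// Hypothetical stand-in for a model with a train/eval mode switch.
pub struct Model {
    pub training: bool,
    pub dropout: f32,
}

impl Model {
    pub fn train(&mut self) {
        self.training = true;
    }

    pub fn eval(&mut self) {
        self.training = false;
    }

    /// Dropout only fires in training mode with p > 0.
    pub fn dropout_active(&self) -> bool {
        self.training && self.dropout > 0.0
    }
}

/// Run `steps` training steps with an eval every `eval_every` steps.
/// Returns, per step, whether dropout was live during the training forward.
/// `reassert` toggles the per-step model.train() call described above.
pub fn run_steps(model: &mut Model, steps: usize, eval_every: usize, reassert: bool) -> Vec<bool> {
    let mut active = Vec::with_capacity(steps);
    model.train();
    for step in 0..steps {
        if reassert {
            model.train(); // re-assert training mode before every forward
        }
        active.push(model.dropout_active());
        if eval_every > 0 && (step + 1) % eval_every == 0 {
            model.eval(); // assumption: eval flips the mode and does not restore it
        }
    }
    active
}
```

Without the re-assert, every step after the first eval runs with dropout off;
with it, every training forward sees dropout on.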

This is the same class of launcher-wiring gap that V9-PILOT caught: op-level
and single-GPU tests pass, the DDP-thread T21 regression test passes, but the
proc-per-GPU launcher path was never exercised end-to-end with dropout > 0.

Mirror bin/train_ddp exactly: parse --dropout (default 0, so runs without the
flag stay bit-identical), set cfg.dropout before build_model, and print an ON
banner on rank 0.
train_rank's per-step model.train() from T21 is reused unchanged (proc-per-GPU
uses the same train_rank).
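
A minimal sketch of the wiring, not the actual launcher code. Config here is a
one-field stand-in, and parse_dropout, apply_dropout, the accepted flag forms,
the [0, 1) range check, and the banner text are all assumptions made for
illustration.

```rust
/// Hypothetical stand-in for the launcher config; only the field at issue.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub dropout: f32,
}

impl Config {
    /// Mirrors the 0.0 default the commit attributes to Config::from_arch.
    pub fn from_arch() -> Self {
        Config { dropout: 0.0 }
    }
}

/// Parse `--dropout <p>` or `--dropout=<p>` from an argument list.
/// Returns 0.0 when the flag is absent, so default runs are unchanged.
pub fn parse_dropout(args: &[String]) -> Result<f32, String> {
    let mut p = 0.0f32;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        let raw: &str = if arg == "--dropout" {
            it.next()
                .ok_or_else(|| "--dropout needs a value".to_string())?
                .as_str()
        } else if let Some(v) = arg.strip_prefix("--dropout=") {
            v
        } else {
            continue; // not our flag
        };
        p = raw
            .parse::<f32>()
            .map_err(|e| format!("bad --dropout value {:?}: {}", raw, e))?;
    }
    // Assumed valid range; NaN also fails this check.
    if !(0.0..1.0).contains(&p) {
        return Err(format!("--dropout must be in [0, 1), got {}", p));
    }
    Ok(p)
}

/// Set cfg.dropout (call this before the model is built) and return the
/// banner that rank 0 should print, if any.
pub fn apply_dropout(cfg: &mut Config, p: f32, rank: usize) -> Option<String> {
    cfg.dropout = p;
    if rank == 0 && p > 0.0 {
        Some(format!("dropout ON (p={})", p))
    } else {
        None
    }
}
```

The ordering is the point: the value has to reach cfg before the model is
constructed, otherwise the worker keeps the 0.0 default no matter what was
passed on the command line.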

Follow-up test that exercises this wiring end-to-end (GATE B loss-trace
divergence between p=0 and p=0.2 under process-per-GPU) lands in the next
commit.
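
One way such a loss-trace gate could be expressed, as a sketch only: the
helper names, the per-step comparison, and the min_gap threshold are
assumptions, not the GATE B implementation.

```rust
/// Largest absolute per-step gap between two loss traces.
/// Returns None if the traces are empty or differ in length.
pub fn max_abs_gap(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0f32, f32::max),
    )
}

/// True when the traces differ by more than `min_gap` at some step.
/// If dropout were silently disabled, the p=0 and p=0.2 traces would be
/// expected to match and this would return false.
pub fn traces_diverge(a: &[f32], b: &[f32], min_gap: f32) -> bool {
    matches!(max_abs_gap(a, b), Some(g) if g > min_gap)
}
```

A gate built this way fails when the p=0.2 run reproduces the p=0 trace, which
is the symptom of the flag never reaching the worker's model.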

2026-07-01 13:51:17 +08:00

xtrain-autodiff

test: M2d — ragged-forward + batched-op equivalence gates + throughput bench

2026-06-30 23:03:09 +08:00

xtrain-cuda

post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift

2026-06-30 17:38:16 +08:00

xtrain-distributed

distributed: T21-for-proc — wire --dropout into the process-per-GPU launcher

2026-07-01 13:51:17 +08:00

xtrain-model

test: M2d — ragged-forward + batched-op equivalence gates + throughput bench