xtrain

Files

Gahow Wang 7a4f69e430 model: add per-head QK-norm (Qwen3-compat) for xserv export

xserv's Qwen3 forward unconditionally applies per-head RMSNorm to Q and K
(q_norm/k_norm, shape [head_dim]) before RoPE; even with gamma=1 this is a
real RMS divide, not an identity. xtrain never had this step, so an exact
xserv<->xtrain round trip was structurally impossible. Add it, reusing the 2D
rms_norm op on the [seq*nh, hd] head rows and inserting it between reshape and
rope to mirror qwen3.rs's order, so the trained model is genuinely
Qwen3-compatible.
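
As a rough illustration of the per-head step described above, the following
Rust sketch normalizes each [hd] row of a flattened, row-major [seq*nh, hd]
buffer. The function name `rms_norm_rows`, the `eps` parameter, and the
in-place f32 layout are assumptions made for this example; they are not the
actual signature of xtrain's rms_norm op or of qwen3.rs.

```rust
/// RMS-normalize every row of a row-major [rows, hd] buffer in place.
/// `gamma` has shape [hd] and is shared by all rows (i.e. by all heads).
/// `eps` is an assumed stabilizer; the real op may use a different value.
fn rms_norm_rows(x: &mut [f32], hd: usize, gamma: &[f32], eps: f32) {
    assert_eq!(gamma.len(), hd, "gamma must have one entry per head dim");
    assert_eq!(x.len() % hd, 0, "buffer must hold whole [hd] rows");

    for row in x.chunks_mut(hd) {
        // Mean of squares over the head dimension only.
        let mean_sq = row.iter().map(|v| v * v).sum::<f32>() / hd as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();

        // Divide by the RMS, then apply the learned per-channel scale.
        for (v, g) in row.iter_mut().zip(gamma) {
            *v = *v * inv_rms * g;
        }
    }
}
```

With gamma set to all ones, the row [3, 4] becomes roughly [0.849, 1.131],
which is why an all-ones q_norm/k_norm still changes Q and K and cannot be
skipped when matching xserv's forward pass.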

params() inserts q_norm and k_norm after wv; num_params() counts them; the
PyTorch parity refs (parity.py / adamw_parity.py) and their name lists add the
same step so the dumps stay self-consistent.
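
The ordering and counting change can be sketched as follows. Only the
placement of q_norm and k_norm after wv and their [head_dim] shape come from
the description above; the other names (`wq`, `wk`, `wo`), the `layers.N.`
prefix, and both function names are hypothetical and assume one q_norm/k_norm
pair per layer.

```rust
/// Hypothetical attention parameter names for one layer, with q_norm and
/// k_norm placed directly after wv. All names except q_norm, k_norm, and wv
/// are illustrative assumptions.
fn attn_param_names(layer: usize) -> Vec<String> {
    ["wq", "wk", "wv", "q_norm", "k_norm", "wo"]
        .iter()
        .map(|n| format!("layers.{layer}.{n}"))
        .collect()
}

/// Extra scalar parameters introduced by QK-norm: two [head_dim] gamma
/// vectors per layer (assuming one pair per layer).
fn qk_norm_extra_params(n_layers: usize, head_dim: usize) -> usize {
    n_layers * 2 * head_dim
}
```

Keeping the order identical in the Rust parameter list and in the PyTorch
reference name lists matters because positional dumps are only comparable
when both sides enumerate tensors in the same sequence.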

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 17:33:19 +08:00

src

perf: GPU AdamW + grad-norm

2026-06-15 16:53:09 +08:00

tests

model: add per-head QK-norm (Qwen3-compat) for xserv export

2026-06-15 17:33:19 +08:00

build.rs

data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip

2026-06-15 16:29:32 +08:00

Cargo.toml

train: loop + checkpoint save/load + sampler + train binary

2026-06-15 16:29:58 +08:00