xtrain

Files

Gahow Wang 734e119db3 run: v4 archive + export (dim768, 8-GPU DDP, val 1.17)

v4 scaling run finished: dim768/18L, 127.43M core params (204.63M total),
trained on 720.9M tokens (~1.54 epochs) on 8x RTX 5090 with DDP in fp32 at
~145K tok/s for ~84 min; best val loss 1.1690. Checkpoint archived to the
registry (~/projects/tiny-models/v4-tinystories-dim768/) and exported to xserv
as HF Qwen3 safetensors (201 tensors, BF16); xserv serves it and matches
xtrain's greedy output token-for-token on all 3 fixed prompts (40 tokens each).

Add `greedy_sample` bin: load a trained ckpt with its arch flags and print
xtrain's own greedy continuations for the fixed run prompts, so they can be
diffed against xserv's greedy on the exported weights (the per-run token-match
check). Same model/config/init scheme as bin/train.rs + bin/export_safetensors.rs.
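
The per-run token-match check described above amounts to comparing two greedy
token sequences and reporting the first position where they diverge. A minimal
sketch, assuming both sides are available as `u32` token IDs; the helper names
`first_mismatch` and `match_report` are hypothetical and are not taken from
xtrain or xserv:

```rust
// Sketch only: `first_mismatch` and `match_report` are hypothetical helpers,
// not functions from xtrain or xserv.

/// Returns the index of the first position where the two token sequences
/// differ, or `None` when they are identical in length and content.
fn first_mismatch(a: &[u32], b: &[u32]) -> Option<usize> {
    let common = a.len().min(b.len());
    // Compare the shared prefix token by token.
    if let Some(i) = (0..common).find(|&i| a[i] != b[i]) {
        return Some(i);
    }
    // Equal prefix: a length difference counts as a mismatch at `common`.
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}

/// Formats a one-line verdict for a single fixed prompt.
fn match_report(prompt_idx: usize, xtrain: &[u32], xserv: &[u32]) -> String {
    match first_mismatch(xtrain, xserv) {
        None => format!("prompt {}: match ({} tokens)", prompt_idx, xtrain.len()),
        Some(i) => format!("prompt {}: first mismatch at token {}", prompt_idx, i),
    }
}
```

Reporting the first diverging index rather than a plain boolean makes it easier
to tell an early divergence (likely a weight or config mismatch) from a late
one (possibly numerical drift between fp32 and BF16).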

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 13:14:28 +08:00

xtrain-autodiff

autograd: batch dim for ops (flatten linears, batched attention)

2026-06-16 00:44:15 +08:00

xtrain-cuda

cuda: device caching allocator (pool GpuBuffer alloc)

2026-06-16 11:04:02 +08:00

xtrain-distributed

test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8

2026-06-16 11:04:11 +08:00

xtrain-model

model: batched forward [B,S]