xtrain

Author	SHA1	Message	Date
Gahow Wang	264660527f	docs: run v1 — TinyStories full, dim256 docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table. v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M). Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:09:46 +08:00
Gahow Wang	8981cf7982	docs: T9 verification results (xserv == xtrain, dash5) Capture the closed-loop run: train (loss 10.84->3.59) -> export (47 tensors, BF16) -> xserv dump-logits + greedy. Top-1 + top-11 token order identical, logits within ~1e-2 (BF16-vs-f32 drift), greedy generation token-for-token identical across two prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:37:46 +08:00
Gahow Wang	18c2229b4b	docs: Phase T9 — export to xserv Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the QK-norm structural decision + BF16 acceptance criterion, the tensor-name + layout mapping table, and the dash5 closed-loop verification recipe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:32 +08:00
Gahow Wang	0131f05b26	docs: Phase T8 — distributed data parallel Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same init/state), batch sharding math vs single-GPU, verification plan + scaling table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:49 +08:00
Gahow Wang	5e8add2a41	docs: Phase T7 — performance Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd (row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step param/grad roundtrip), drop per-op sync + device memset. Includes the verification table (regression suite green + tok/s 2770→8220 ~3x), the deferred bf16/recompute follow-up rationale, and the T8 all-reduce note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:00:29 +08:00
Gahow Wang	29b4d30b6c	docs: Phase T6 — training loop Design doc for the T6 training stack: Goal / Module Layout / Key Design Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse, sampler) / Verification Method (AdamW parity, checkpoint round-trip, real training, host unit tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:14 +08:00
Gahow Wang	8565565647	docs: Phase T5 — tiny transformer Goal / Module Layout / Key Design Decisions (multi-head layout via reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add, x@W convention, causal mask, params API, overfit methodology) / Verification Method with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	777f3c7949	docs: Phase T4 — autograd engine Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	dde2fde297	docs: Phase T3 — GEMM fwd/bwd + finite-diff Design doc covering the tiled forward, the dA/dB math + how transpose is handled (materialize + reuse forward), the cuBLAS row-major reference, and the finite-diff harness design + how T4 reuses it per-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:27:03 +08:00
Gahow Wang	8557a289a2	docs: Phase T2 — tensor abstraction Design doc for the minimal tensor layer: DType/shape/Storage/Tensor, host↔device copy, and one elementwise kernel (scale) wired end-to-end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00
Gahow Wang	c1b204296b	docs: backfill T1 build-chain T1 shipped without a design doc; capture the Rust↔CUDA build chain (build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00

11 Commits