xtrain

Author	SHA1	Message	Date
Gahow Wang	5c27493a90	docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed: - v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain still incremental, greedy repetition remains. - v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814. Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:18:48 +08:00
Gahow Wang	db70abe450	docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases) Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main: README.md - Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases" (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five deferred systems-stack features T14–T18). - Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU launcher in distributed). - Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18 prose with a structured "## Phase 2" summary (5 features + honest results: flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem, dropout × recompute bit-exact, T17 throughput-neutral falsification). - Engineering lessons: T17 added as the THIRD profile-first falsification; reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9. - Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED, KI-4 accepted tradeoff). docs/evolution.md - New "3.5 Phase 2 systems-depth synthesis": ties the 5 features into the per-axis (algorithm/architecture/infra/data) narrative + the two integration notes. docs/known-issues.md - KI-4 reframed as a deliberately-accepted modeling tradeoff (preserves the xserv closed loop; T19 DROPPED), not "open". - New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock); (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid determinism gate is export re-determinism, not fresh-train reproduction. - (process-per-GPU item was already CLOSED=measured no-op in T17.) Docs-only; no code touched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 18:11:47 +08:00
Gahow Wang	71b0a1621f	docs: T17 process-per-GPU results — measured throughput-neutral Records the key empirical finding: process-per-GPU is statistically identical to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all 8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the old KI-5/T11 note speculated — measurement falsifies that hypothesis (same methodology as T11 falsifying "bucket the all-reduce"). Correctness all green: proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5 b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op); default training path unchanged. evolution.md Infra row + README T17 row + known-issues entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 18:03:14 +08:00
Gahow Wang	2ff4573a31	docs: T15 GQA results + evolution row (model architecture) + README build-journey row Backfill docs/14-gqa.md gate table (dash5 numbers); add T15 evolution row + cumulative model-architecture line; README build-journey T15 row + Phase 2 prose + doc index range (00..14). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:44:58 +08:00
Gahow Wang	f26db882e5	Merge t16-grad-accum into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # docs/evolution.md	2026-06-18 00:37:11 +08:00
Gahow Wang	8bd7db16e1	docs: T16 grad-accum results — evolution row + README build-journey dash5-verified gate numbers: accum=N bit-close to N× big batch (loss 8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big → 7.2GB accum, −74%), xserv closed loop md5-identical + token-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:52:32 +08:00
Gahow Wang	9064ced4c2	docs: T14 flash-attention results + evolution/README rows Fill in the design doc's measured results (grad-check, flash==composed, PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to evolution.md (algorithm/infra) and the README build-journey table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:34:10 +08:00
Gahow Wang	31cc2bf745	docs: capstone README — full-stack + scaling study (v0-v8) writeup Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:17:26 +08:00
Gahow Wang	92acf9f413	T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test) Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:42:43 +08:00
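
The T1 commit above describes a build script that skips CUDA compilation when nvcc is absent and sets a `no_cuda` cfg so host-side `cargo check` still works. A minimal sketch of that fallback decision, using only the standard library, might look like the following. The helper names `nvcc_available` and `cargo_directives` are hypothetical, probing with `nvcc --version` is an assumed detection method, and the real compile step (which the commit says goes through the cc crate targeting sm_120) is left as a comment rather than reproduced.

```rust
use std::process::Command;

/// Hypothetical helper: true if an `nvcc` executable can be launched from PATH.
/// Assumption: probing with `nvcc --version`; the real build.rs may detect it differently.
fn nvcc_available() -> bool {
    Command::new("nvcc")
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Hypothetical helper: the Cargo directives a build script would print in each case.
fn cargo_directives(has_nvcc: bool) -> Vec<String> {
    // Rebuild whenever the kernel source named in the commit changes.
    let mut lines = vec!["cargo:rerun-if-changed=csrc/test/vecadd.cu".to_string()];
    if has_nvcc {
        // A real script would compile the .cu file here (per the commit: cc crate, sm_120),
        // then ask Cargo to link the CUDA runtime.
        lines.push("cargo:rustc-link-lib=cudart".to_string());
    } else {
        // No compiler: skip CUDA entirely and expose a cfg that GPU-only code can gate on.
        lines.push("cargo:rustc-cfg=no_cuda".to_string());
    }
    lines
}
```

In a `build.rs`, `main` would print each string returned by `cargo_directives(nvcc_available())` on its own line, and GPU-dependent tests could then be gated with `#[cfg(not(no_cuda))]`.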

9 Commits