Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B tokens): moving-tail val 2.8816;
fixed-eval set v1 across runs v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.
Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrite the xtrain conclusion as TWO phases now that Phase 2 (T14–T18) is merged on main:
README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
(Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
prose with a structured "## Phase 2" summary (5 features + honest results:
flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
KI-4 accepted tradeoff).
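For readers unfamiliar with the "GQA group-sum backward" item above, here is a minimal CPU sketch of the idea, not the xtrain kernel. It assumes a flat `[heads, dim]` layout and the common mapping where query head `h` reads KV head `h / group`. The function names and signatures are hypothetical.

```rust
/// Forward: expand [n_kv_heads, dim] to [n_kv_heads * group, dim].
/// Assumed mapping: expanded head h copies KV head h / group.
fn repeat_kv(kv: &[f32], n_kv_heads: usize, dim: usize, group: usize) -> Vec<f32> {
    assert_eq!(kv.len(), n_kv_heads * dim);
    let mut out = Vec::with_capacity(n_kv_heads * group * dim);
    for h in 0..n_kv_heads * group {
        let src = h / group;
        out.extend_from_slice(&kv[src * dim..(src + 1) * dim]);
    }
    out
}

/// Backward: each KV head receives the SUM of the gradients of the
/// `group` expanded heads that were copied from it.
fn repeat_kv_backward(grad_out: &[f32], n_kv_heads: usize, dim: usize, group: usize) -> Vec<f32> {
    assert_eq!(grad_out.len(), n_kv_heads * group * dim);
    let mut grad_kv = vec![0.0f32; n_kv_heads * dim];
    for h in 0..n_kv_heads * group {
        let dst = h / group;
        for i in 0..dim {
            grad_kv[dst * dim + i] += grad_out[h * dim + i];
        }
    }
    grad_kv
}
```

With 2 KV heads, `dim = 2` and `group = 2`, the forward pass duplicates each KV head twice, and the backward pass folds the four incoming head gradients back into two by summation.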
docs/evolution.md
- New section "三·五、Phase 2 systems-depth synthesis" (section 3.5): ties the 5 features
into the per-axis (algorithm/architecture/infra/data) narrative + the two integration notes.
docs/known-issues.md
- KI-4 reframed as a deliberately accepted modeling tradeoff (preserves the xserv
closed loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
(b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)
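Why reduction order alone can change a checkpoint's md5: floating-point addition is not associative, so an atomicAdd-style reduction that accumulates the same terms in a different order can round differently. A minimal host-side illustration follows (plain Rust, not the CUDA path; `sum_in_order` is a hypothetical helper).

```rust
/// Accumulate `values` into an f32 in the given visiting order,
/// mimicking a reduction whose arrival order is not fixed.
fn sum_in_order(values: &[f32], order: &[usize]) -> f32 {
    order.iter().fold(0.0f32, |acc, &i| acc + values[i])
}

/// Same three terms, two arrival orders.
fn demo() -> (f32, f32) {
    let values = [1.0e8f32, -1.0e8, 1.0];
    // (1e8 + -1e8) + 1.0: the large terms cancel first, the 1.0 survives.
    let a = sum_in_order(&values, &[0, 1, 2]);
    // (-1e8 + 1.0) + 1e8: the 1.0 is below f32 resolution at 1e8 and is lost.
    let b = sum_in_order(&values, &[1, 2, 0]);
    (a, b)
}
```

The two sums differ even though the inputs are identical, which is why byte-level reproduction of a fresh training run is not a usable gate when the reduction order is not fixed.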
Docs-only; no code touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc and thread losses agree to 1.5e-7, cross-rank losses to 1.2e-7; full
regression passes and xserv md5 b04fc9f9 is identical. Closes the
process-per-GPU backlog item (measured no-op); default training path unchanged.
Docs touched: evolution.md Infra row, README T17 row, known-issues entry.
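The arithmetic behind the "<1% noise" and "~5.3x@8" statements, as a small sketch. The helper names are hypothetical; the only inputs are the two speedups quoted above.

```rust
/// Parallel efficiency: measured speedup divided by the GPU count.
fn parallel_efficiency(speedup: f64, n_gpus: u32) -> f64 {
    speedup / f64::from(n_gpus)
}

/// Relative gap between two measurements, taken against the smaller one.
fn relative_gap(a: f64, b: f64) -> f64 {
    (a - b).abs() / a.min(b)
}

/// Returns (thread efficiency, process efficiency, relative gap) at 8 GPUs.
fn t17_summary() -> (f64, f64, f64) {
    let (thread, process) = (5.27, 5.31);
    (
        parallel_efficiency(thread, 8),
        parallel_efficiency(process, 8),
        relative_gap(thread, process),
    )
}
```

Both launchers land at roughly 66% parallel efficiency on 8 GPUs, and the gap between them is about 0.76%, which is consistent with the "<1% noise" reading.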
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), and add the T14 row to
evolution.md (algorithm/Infra) and the README build-journey table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's
csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA
Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu
via the cc crate targeting sm_120 (RTX 5090) and links cudart.
A gated integration test runs the vector-add kernel on the GPU and asserts the
result. When nvcc is absent (e.g. on a local GPU-less machine), build.rs skips
CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works.
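A rough sketch of the nvcc-gating logic described above, under stated assumptions: the commit does not say how nvcc is detected, so probing `nvcc --version` is a guess, and `nvcc_available` and `cuda_directives` are hypothetical helpers. The actual cc-crate compile step is omitted so the sketch stays dependency-free.

```rust
use std::process::Command;

/// Assumed probe: nvcc counts as present if `nvcc --version` can be spawned.
fn nvcc_available() -> bool {
    Command::new("nvcc").arg("--version").output().is_ok()
}

/// Cargo directives a build script would print for each case.
fn cuda_directives(nvcc_found: bool) -> Vec<String> {
    let mut out = vec!["cargo:rerun-if-changed=csrc/test/vecadd.cu".to_string()];
    if nvcc_found {
        // The real build.rs would compile vecadd.cu via the cc crate here.
        out.push("cargo:rustc-link-lib=cudart".to_string());
    } else {
        // No toolchain: expose a cfg so GPU-only code can be compiled out.
        out.push("cargo:rustc-cfg=no_cuda".to_string());
    }
    out
}
```

In a build script, `main` would print each string from `cuda_directives(nvcc_available())` on its own line, and GPU-only items would be guarded with `#[cfg(not(no_cuda))]`.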
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>