- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix +
regression test), with the meta-lesson that op/single-GPU unit tests can
miss launcher-level integration gaps — only the V9-PILOT end-to-end run on
the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap
and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:
README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
(Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
prose with a structured "## Phase 2" summary (5 features + honest results:
flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
KI-4 accepted tradeoff).
docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis" section (3.5): ties the 5 features into
the per-axis (algorithm/architecture/Infra/data) narrative + the two integration notes.
docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (preserves the xserv
closed loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
(b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)
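To illustrate note (b): floating-point addition is not associative, so a
reduction whose accumulation order varies from run to run (as with atomicAdd)
can produce different bits from identical inputs. A minimal Rust sketch, not
xtrain code; `sum_in_order` is a hypothetical helper and the values are chosen
only to make the rounding visible:

```rust
/// Sum f32 values strictly left to right, in the order given.
/// Hypothetical helper for illustration; not part of xtrain.
fn sum_in_order(values: &[f32]) -> f32 {
    // Each step rounds to the nearest f32, so the order of the
    // addends decides which low-order bits get dropped.
    values.iter().fold(0.0_f32, |acc, &v| acc + v)
}
```

Calling it on [1.0e8, -1.0e8, 1.0] and on [1.0, 1.0e8, -1.0e8] sums the same
three addends yet returns 1.0 and 0.0 respectively, because 1.0 is smaller than
the f32 spacing (8) near 1e8 and is lost when added to the large term first.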
Docs-only; no code touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5
b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op);
default training path unchanged. evolution.md Infra row + README T17 row +
known-issues entry.
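The speedup figures above can be read as parallel efficiency and as a relative
gap between the two launch modes. A small Rust sketch of that arithmetic,
assuming efficiency is defined as speedup divided by GPU count; both function
names are hypothetical and not part of the repo:

```rust
/// Parallel efficiency: measured speedup over the ideal linear speedup.
/// Hypothetical helper for illustration.
fn parallel_efficiency(speedup: f64, gpus: u32) -> f64 {
    speedup / f64::from(gpus)
}

/// Gap between two measurements as a fraction of the first one.
fn relative_gap(a: f64, b: f64) -> f64 {
    (b - a).abs() / a
}
```

With the numbers in this message, `parallel_efficiency(5.27, 8)` is about 0.66
and `relative_gap(5.27, 5.31)` is about 0.008, consistent with the "<1% noise"
reading.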
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to
evolution.md (algorithm/Infra) and the README build-journey table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale
dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute.
8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h.
Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 /
v8 2.9801. Capacity helps: v8 (1.05 ep) beats v6 at the same ~1 ep by 0.085
AND beats v7 (smaller model, 1.45 ep of the same data) by 0.035 ⇒ v6/v7 were
partly capacity-limited, scaling capacity > repeating old data. But the gain
is only ~3% (same magnitude as the data-axis single-step lever), and v8's
val was still descending at the end (not saturated).
Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6,
capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress,
scale capacity AND data together (Chinchilla, reproduced at toy scale).
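The headline percentages above follow from simple ratios of the reported
numbers. A hedged Rust sketch for re-deriving them; `relative_change` and
`train_hours` are hypothetical helpers, and the inputs are the rounded figures
quoted in this message, so the outputs are approximate:

```rust
/// Fractional change going from `from` to `to` (0.10 means +10%).
/// Hypothetical helper for illustration.
fn relative_change(from: f64, to: f64) -> f64 {
    (to - from) / from
}

/// Wall-clock hours to process `tokens` at a steady `tok_per_s`.
fn train_hours(tokens: f64, tok_per_s: f64) -> f64 {
    tokens / tok_per_s / 3600.0
}
```

`relative_change(127.0, 226.0)` is about 0.78 (the +78% core-parameter growth),
`train_hours(2.36e9, 129_000.0)` is about 5.1 (the ~5h run), and
`relative_change(3.0652, 2.9801)` is about -0.028 (the ~3% val gain over v6).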
- docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples
- docs/runs/README.md: +v8 row, v9 proposal
- docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained on the SAME 2.255B-token FineWeb-edu subset for 1.45 epochs (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).
Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.
Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>