Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:
README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
(Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
prose with a structured "## Phase 2" summary (5 features + honest results:
flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
KI-4 accepted tradeoff).
docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the
per-axis (算法/架构/Infra/数据) narrative + the two integration notes.
docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed
loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
(b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)
Docs-only; no code touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>