docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:

README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases" (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18 prose with a structured "## Phase 2" summary (5 features + honest results: flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem, dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification; reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED, KI-4 accepted tradeoff).

docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis" (section 3.5): ties the 5 features into the per-axis (algorithm/architecture/infra/data) narrative + the two integration notes.

docs/known-issues.md
- KI-4 reframed as a deliberately accepted modeling tradeoff (keeps the xserv closed loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock); (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED = measured no-op in T17.)

Docs-only; no code touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -156,6 +156,20 @@ _(KI-3 was FIXED in T13; see Fixed above. This section currently has no pending items.)_
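
 ## Modeling notes

-### KI-4 · Oversized embedding share from the large vocabulary
+### KI-4 · Oversized embedding share from the large vocabulary: `accepted modeling tradeoff (user's call, T19 DROPPED)`
 - With gpt2 `vocab=50257`, embed+lm_head dominate the parameter count at small dim: v1 25.7M/34M, v2 38.6M/66.9M; the core transformer is what actually does the learning.
-- Possible follow-ups: a smaller vocab better fitted to TinyStories (at the cost of xserv gpt2-tokenizer reuse); or a larger dim, where the core naturally becomes the bulk (continued scaling alone eases the share).
+- **Decision (not open; an accepted tradeoff)**: T19 had been planned to train a smaller, corpus-fitted vocab to shrink the embed share, but **the user decided to DROP it**: keeping the **xserv gpt2-tokenizer closed loop** (the project's crown jewel: exported weights fed back into xserv match token for token) is worth more than cleaning up the embedding share. A high embed share at small dim is the **known, accepted cost** of reusing the large gpt2 vocabulary.
+- A mitigation path remains: at larger dim the core naturally becomes the bulk (continued scaling alone dilutes the share; at v8, dim1024, the 226M core already dominates). The small-vocab work can be restarted later if giving up this version of the closed loop becomes acceptable (see T19 in [`~/toc/projects/xtrain.md`](../../toc/projects/xtrain.md)).
+
+---
+
+## Integration / testing notes (pre-existing, not regressions, recorded for bookkeeping)
+
+### The three DDP tests deadlock contending for GPUs when run in parallel → `--test-threads=1`
+- If cargo runs the three DDP tests in `xtrain-distributed` (thread-per-GPU correctness / scaling, process-per-GPU `ddp_proc`) **in parallel**, they contend for GPU/NCCL resources on the two shared GPUs and **deadlock** (a test-scheduling problem, not a numerical bug).
+- **How to run**: `cargo test ... -- --test-threads=1` (or mark the DDP tests serial). Run serially, everything passes. All Phase-2 full-regression captures were taken under `--test-threads=1`.
+
+### fresh-train md5 varies run to run → the valid determinism gate is "export re-determinism", not "fresh-train reproduction"
+- **Symptom**: a ckpt produced by training from scratch from random initialization (fresh-train) has an md5 that **does not reproduce bit for bit from run to run**.
+- **Root cause**: several `atomicAdd` reductions in the backward pass (e.g. cross-row dK/dV, fan-out accumulation) have a **non-deterministic floating-point addition order** (GPU atomics complete in no fixed order) → last-place ULP jitter accumulates step by step → the ckpt bytes differ. This is known floating-point non-determinism on this machine/version, **not a correctness regression** (the loss trajectory still converges to the same order of magnitude).
+- **The project's hard determinism gate is therefore defined as "export re-determinism"**: re-exporting HF-safetensors from **the same pinned ckpt** yields an md5 that matches the registry **bit for bit** (`b04fc9f9`, throughout both phases). This gate is deterministic and load-bearing; byte-level reproduction of fresh-train is **not required**.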
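
The root cause given for the fresh-train md5 drift comes down to floating-point addition not being associative: the same contributions summed in a different order can differ in the last bits. A minimal Python sketch of that effect follows. It uses CPU doubles rather than GPU `atomicAdd`, and the values and the helper name `sum_in_order` are illustrative assumptions, not taken from xtrain.

```python
import struct


def sum_in_order(values):
    """Accumulate left to right, like a serial chain of additions."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three contributions, arriving in two different orders.
contributions = [0.1, 0.2, 0.3]
a = sum_in_order(contributions)        # (0.1 + 0.2) + 0.3
b = sum_in_order(contributions[::-1])  # (0.3 + 0.2) + 0.1

# The two sums agree to within a few ULP, yet their byte encodings differ,
# which is enough to change any checksum taken over the stored values.
same_bytes = struct.pack("<d", a) == struct.pack("<d", b)
print(a, b, same_bytes)
```

Over many training steps such last-bit differences feed into later updates, which is consistent with the note's description of checkpoint bytes diverging while the loss trajectory stays comparable.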
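
The export re-determinism gate described in the note amounts to hashing a re-exported file and comparing the digest with a registered value. The sketch below shows one way such a check could look. The function names, the temporary stand-in file, and the choice to compare against a short prefix (the note quotes only `b04fc9f9`) are assumptions for illustration, not the project's actual tooling.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file through md5 so large exports need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_gate_passes(exported_path, registry_md5_prefix):
    """True when the exported file's md5 starts with the registered value."""
    return md5_of_file(exported_path).startswith(registry_md5_prefix)


# Demo on a stand-in file; a real check would point at the re-exported
# safetensors file and use the value recorded in the registry.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.safetensors")
    with open(path, "wb") as f:
        f.write(b"stand-in bytes, not a real export")
    first = md5_of_file(path)
    second = md5_of_file(path)
    gate_ok = export_gate_passes(path, first[:8])

print(first == second, gate_ok)
```

Hashing fixed bytes is deterministic by construction, which is why a gate of this shape can be load-bearing while fresh-train reproduction cannot.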
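
The embedding shares quoted under KI-4 can be sanity-checked with simple arithmetic. The sketch below assumes untied, bias-free embed and lm_head tables and dim values of 256 (v1) and 384 (v2). Those dims are inferred from the quoted totals and are not stated in the note, so treat them as hypothetical; whether v8 uses the same untied layout is likewise an assumption.

```python
VOCAB = 50257  # gpt2 vocabulary size, as quoted in the note


def embed_lm_head_params(dim, vocab=VOCAB, tied=False):
    """Parameter count of the token embedding plus lm_head (no bias assumed)."""
    tables = 1 if tied else 2
    return tables * vocab * dim


# Assumed dims (not from the note): chosen because they reproduce 25.7M / 38.6M.
v1 = embed_lm_head_params(256)
v2 = embed_lm_head_params(384)
v8 = embed_lm_head_params(1024)

print(f"v1 embed+lm_head: {v1 / 1e6:.1f}M of 34M total -> {v1 / 34e6:.0%}")
print(f"v2 embed+lm_head: {v2 / 1e6:.1f}M of 66.9M total -> {v2 / 66.9e6:.0%}")
# For v8 the note gives only the core size (226M), so compare against that.
print(f"v8 embed+lm_head: {v8 / 1e6:.1f}M vs core 226M")
```

Because the embedding tables grow linearly in dim while a transformer core grows roughly quadratically, the share shrinks as dim increases, which matches the mitigation path the note describes.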