docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)

Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:

README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
  (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
  deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
  in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
  launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
  prose with a structured "## Phase 2" summary (5 features + honest results:
  flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
  dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
  reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
  KI-4 accepted tradeoff).

docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the
  per-axis (算法/架构/Infra/数据) narrative + the two integration notes.

docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed
  loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
  (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
  determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)

Docs-only; no code touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-18 18:11:47 +08:00
parent 71b0a1621f
commit db70abe450
3 changed files with 105 additions and 38 deletions

109
README.md
View File

@@ -6,10 +6,14 @@ inference side). A learning project: hand-write the entire training-systems stac
gradient checkpointing), then use it to run a multi-version **scaling study** that maps
the data-vs-capacity frontier for a tiny model.
> **Status: complete.** From-scratch full stack (phases T1T13) + an 8-version scaling
> ladder (v0v8). Trains a Qwen3-compatible LM whose weights load into **xserv** and
> generate **token-identical** output. This README is the capstone; per-topic detail
> lives in [`docs/`](docs/).
> **Status: complete — two phases.**
> **Phase 1** = the from-scratch full stack (T1T13) + an 8-version scaling study (v0v8):
> hand-write the whole training-systems stack, then map the data-vs-capacity frontier.
> **Phase 2** = systems-stack depth (T14T18): hand-write the five deferred training-stack
> features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP,
> dropout. Trains a Qwen3-compatible LM whose weights load into **xserv** and generate
> **token-identical** output — the closed loop held byte-for-byte across both phases. This
> README is the capstone; per-topic detail lives in [`docs/`](docs/).
---
@@ -22,20 +26,22 @@ borrows, the rest hand-written CUDA + Rust:
|---|---|
| `xtrain-cuda` | CUDA Runtime FFI, RAII `GpuBuffer`, **caching/pool allocator**, cuBLAS (sgemm + bf16 GemmEx) bindings |
| `xtrain-tensor` | tensor (dtype/shape/strides/storage), elementwise + transpose + embedding kernels |
| `xtrain-autodiff` | **tape autograd engine** (grad accumulation), per-op backward, finite-diff grad-check, **checkpoint** (recompute) primitive |
| `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward |
| `xtrain-autodiff` | **tape autograd engine** (grad accumulation), per-op backward, finite-diff grad-check, **checkpoint** (recompute) primitive, **fused flash-attention** (online-softmax) fwd/bwd, **`repeat_kv`** broadcast (GQA), **`dropout`** (counter-based device RNG + mask) |
| `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward, **GQA** (`num_kv_heads<num_heads`), residual/MLP **dropout** |
| `xtrain-optim` | hand-written **AdamW** (host + GPU kernels) |
| `xtrain-train` | training loop, LR schedule, grad clip, checkpoint, BPE corpus + cache, samplers, safetensors export |
| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU, all-reduce) |
| `xtrain-train` | training loop, LR schedule, grad clip, **gradient accumulation**, checkpoint, BPE corpus + cache, samplers, safetensors export |
| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU launcher / cross-process `ncclUniqueId`, all-reduce) |
Every op's backward is verified against **finite differences** and against **PyTorch**
(forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
load into xserv (Qwen3, BF16) producing token-identical greedy output — the closed loop.
## The build journey — phases T1T13
## The build journey — Phase 1 (T1T13) + Phase 2 (T14T18)
Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`](docs/) and
[`docs/evolution.md`](docs/evolution.md) for the per-axis changelog).
[`docs/evolution.md`](docs/evolution.md) for the per-axis changelog). **Phase 1 (T1T13)**
hand-built the stack and fixed the four real bottlenecks; **Phase 2 (T14T18)** went back to
hand-write five deferred training-stack features — see the Phase-2 summary below the table.
| phase | what | result |
|---|---|---|
@@ -57,22 +63,48 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
| **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe |
The four performance fixes (T10T13) each removed a real bottleneck — see
[`docs/known-issues.md`](docs/known-issues.md). **Phase 2 (systems-stack depth, T14)**
revisits hand-writing deferred training-stack features: T14 = the fused
flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md));
T15 = real grouped-query attention ([`docs/14-gqa.md`](docs/14-gqa.md), `num_kv_heads <
num_heads` via a `repeat_kv` broadcast op whose backward sums each kv head's query-head
group — feeding both SDPA paths unchanged, default MHA bit-identical);
T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)),
which decouples the effective batch from activation memory (memory tracks the micro-batch,
not N×); T17 = torchrun-style process-per-GPU DDP
([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md), one process + CUDA context per
GPU, launcher-minted `ncclUniqueId` via env injection, reusing the T8 training step
unchanged) — which **measured** that, at this scale, separate contexts give no throughput
gain over thread-per-GPU (the residual ~5.3×@8 is the NCCL/PCIe communication wall, not
single-context serialization as the old KI-5 note speculated); T18 = dropout
([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based device RNG + mask, inverted
scaling, train/eval switch).
[`docs/known-issues.md`](docs/known-issues.md) — which is where **Phase 1** closed.
## Phase 2 — systems-stack depth (T14T18)
Phase 1 fixed bottlenecks; Phase 2 went back to hand-write the five training-stack features
that had been **explicitly deferred** earlier (project's actual goal = learn the whole stack).
Each is opt-in, kept the default path **bit-identical**, and held a **hard correctness gate**:
- **T14 · fused flash-attention** ([`docs/13-flash-attention.md`](docs/13-flash-attention.md)) —
a single hand-written kernel: **online (streaming) softmax, tiled over KV, never materializes
the `N×N` scores**; flash-style backward recomputes scores + the `D=ΣdO·O` Jacobian
simplification for dQ/dK/dV. Opt-in `--flash`, default off. **The win is memory, not
wall-clock**: peak activation **16%@seq1024 / 23%@seq2048** (grows with seq, since the
`N×N` never lands), but **~2.3× slower** at head-dim 64 (a hand kernel can't beat cuBLAS
tensor-cores on a small head). Gate: flash == composed (loss rel `0.0`, grad `4.4e-5`),
PyTorch B>1 `7.9e-6`.
- **T15 · real GQA** ([`docs/14-gqa.md`](docs/14-gqa.md)) — `num_kv_heads < num_heads` via a new
`repeat_kv` **broadcast op** that copies K/V `group = nh/num_kv` times to feed **both**
(composed + flash) SDPA paths **unchanged**; its **backward is a deterministic group-sum**
(no atomics) collapsing each kv head's query-head group. Gate: `repeat_kv` grad-check +
**group=1 bit-identical to MHA** (regression guard); **xserv closed loop with real
`num_key_value_heads`** token-identical.
- **T16 · gradient accumulation** ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)) — N
micro-steps scaled by `1/N` accumulate on the tape, then one AdamW step; DDP **all-reduces
only at the accumulation boundary**. Decouples effective batch from activation memory: same
effective batch 64, big-batch **27.7GB (OOM)** → accum 4×16 **7.2GB (74%)**. Gate: `accum=N`
≡ one N× batch (grad `3.8e-5`); `accum=1` bit-identical.
- **T18 · dropout** ([`docs/17-dropout.md`](docs/17-dropout.md)) — a **stateless counter-based
device RNG** (Philox-style bit-mix) → Bernoulli mask, inverted `1/(1p)` scaling in train,
identity in eval; wired at the two residual sites (attn-out, mlp-out). Stateless RNG is what
makes it **compose bit-exactly with T13 activation recompute** — the backward re-run
regenerates the *same* mask from `(seed, index)`. Gate: fixed-seed grad-check; **p=0
bit-identical**.
- **T17 · process-per-GPU** ([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md)) — a
torchrun-style launcher: one worker process + CUDA context per GPU, the launcher mints one
`ncclUniqueId` and **hex-injects it into each child's env** (no shared FS/TCP, no race); the
worker reuses the T8 `train_rank` **unchanged**. Built and **correct** (proc vs thread loss
`1.5e-7`, cross-rank `1.2e-7`, xserv md5 identical) — but **measured throughput-neutral**:
8-GPU thread **491K (5.27×)** vs proc **493K (5.31×)**, `<1%`. This **falsifies** the
long-standing KI-5/T11 hypothesis that thread-per-GPU's shared CUDA context caused the
residual ~5×@8; with all 8 GPUs at 9599% util, the residual is the **NCCL all-reduce + PCIe
topology wall**, not context serialization. The third profile-first falsification (see below).
## The scaling study — v0 → v8
@@ -122,14 +154,19 @@ versions — a fixed-MFU estimate is off by up to ~100× for the early launch-bo
## Engineering lessons
- **Profile before optimizing.** Two "known" perf fixes were *falsified by measurement*
before being shipped: "bigger batch fixes DDP scaling" (real cause: single-seq
launch-bound T10) and "bucket the all-reduce" (real cause: per-op `cudaMalloc`
serialization T11 caching allocator). Both would have been no-ops; both got reverted +
re-diagnosed instead of shipped.
- **Honest correctness.** QK-norm was *added* to match xserv's Qwen3 (not faked); every perf
change kept a hard correctness gate (recompute grads bit-identical; bf16 keeps the fp32
path untouched; the full grad-check / PyTorch / DDP / xserv suite must stay green).
- **Profile before optimizing.** *Three* "known" fixes were *falsified by measurement*: (1)
"bigger batch fixes DDP scaling" (real cause: single-seq launch-bound T10); (2) "bucket the
all-reduce" (real cause: per-op `cudaMalloc` serialization T11 caching allocator); and (3)
"process-per-GPU would fix the residual ~5×@8" (T17 built the torchrun-style launcher and
measured it **throughput-neutral**: the residual is the NCCL/PCIe communication wall, not
shared-context serialization). All three would have been no-ops; each got measured and either
reverted or recorded as a deliberate negative result instead of shipped on faith.
- **Honest correctness.** QK-norm was *added* to match xserv's Qwen3 (not faked); every change
kept a hard correctness gate, and **no tolerance was ever loosened to go green**. Phase 2 held
the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient
accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute
bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5
byte-identical (`b04fc9f9`) throughout both phases**.
- **The closed loop matters.** Exporting to xserv and checking token-identical greedy output
caught real bugs and proved the whole stack end-to-end.
@@ -156,5 +193,5 @@ cargo test --workspace # autograd grad-checks, PyTorch parity, DDP, e
- [`docs/evolution.md`](docs/evolution.md) per-milestone changes across algorithm / architecture / infra / dataset.
- [`docs/runs/README.md`](docs/runs/README.md) the v0v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) per-run detail.
- [`docs/00-*` … `14-*`](docs/) per-phase design docs (build chain tensor autograd transformer training perf distributed export batched allocator bf16 recompute flash-attention GQA).
- [`docs/known-issues.md`](docs/known-issues.md) perf backlog (KI-1/2/3/5 fixed; KI-4 + process-per-GPU open).
- [`docs/00-*` … `17-*`](docs/) per-phase design docs (build chain tensor autograd transformer training perf distributed export batched allocator bf16 recompute flash-attention GQA grad-accum process-per-GPU dropout).
- [`docs/known-issues.md`](docs/known-issues.md) perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff).