Commit Graph

124 Commits

Author SHA1 Message Date
db70abe450 docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:

README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
  (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
  deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
  in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
  launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
  prose with a structured "## Phase 2" summary (5 features + honest results:
  flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
  dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
  reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
  KI-4 accepted tradeoff).

docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the
  per-axis (算法/架构/Infra/数据) narrative + the two integration notes.

docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed
  loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
  (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
  determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)

Docs-only; no code touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:11:47 +08:00
71b0a1621f docs: T17 process-per-GPU results — measured throughput-neutral
Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5
b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op);
default training path unchanged. evolution.md Infra row + README T17 row +
known-issues entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:03:14 +08:00
4abb17383a test: process-per-GPU DDP correctness (ddp_proc.rs)
Self-launching test: worker mode (XTRAIN_RANK set) trains on synthetic corpus
and dumps loss+params; launcher mode runs single-GPU baseline + thread-per-GPU
launch + spawns 2 worker processes, then asserts (a) proc loss == single-GPU
<1e-3, (b) cross-rank params <1e-6 (KI-5 ULP), (c) proc loss == thread-per-GPU
<1e-3. Run with --test-threads=1 (distributed harness property).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:52 +08:00
a188c8a277 distributed: train_ddp_mp bin (process-per-GPU launcher/worker)
Dual-mode binary self-detecting via XTRAIN_RANK: launcher spawns one worker
per visible GPU forwarding full argv; worker rebuilds config from argv and runs
run_worker. CLI flags identical to train_ddp (thread-per-GPU, kept), so it
doubles as the before->after throughput driver. thread-per-GPU path untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:52 +08:00
ffd548b80b distributed: process-per-GPU launcher + worker (proc.rs)
torchrun-style process-per-GPU: launch_processes spawns one worker process per
GPU (re-exec current_exe with XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID} env),
mints the ncclUniqueId once in the launcher and hex-injects it via env (no
shared FS/TCP, race-free). worker_env/run_worker read the env, bind the device
(own CUDA context), DdpContext::init + build_model + train_rank reused from T8
UNCHANGED. hex_encode/decode_unique_id are host-testable pure fns.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:43 +08:00
c470c627a7 docs: Phase T17 — process-per-GPU DDP design
torchrun-style: launcher spawns N worker processes, each with its own CUDA
context; cross-process ncclUniqueId distributed via launcher-minted hex env
injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused
unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped
(user scope decision).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:44:38 +08:00
2ff4573a31 docs: T15 GQA results + evolution row (模型架构) + README build-journey row
Backfill docs/14-gqa.md gate table (dash5 numbers); add T15 evolution row +
cumulative 模型架构 line; README build-journey T15 row + Phase 2 prose + doc
index range (00..14).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:44:58 +08:00
39df0b40c1 gqa: fix kv-proj shape test param indices (embed,attn_norm precede wq)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:38:42 +08:00
830d06ad01 gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests)
- repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each
  kv head sums its group of query-head grads; no atomics) + Tensor/ops node.
- Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim;
  attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed
  & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical.
- --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export
  writes real num_key_value_heads (xserv repeat_kv grouping aligned).
- Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs
  (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape);
  parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:37:37 +08:00
62b1cb5dc7 docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:30:34 +08:00
4b6d3e0a79 test: flash+dropout cross-feature grad-check (Phase-2 integration)
Add flash_plus_dropout_grad_check_fp32 to xtrain-model dropout tests: the two
orthogonal Phase-2 features (T14 flash-attn, T18 dropout) in the same model must
still grad-check. Both models run train-mode p=0.2 (identical masks, seed is
flash-independent) so the only delta is the SDPA reduction order — checked against
the flash-vs-composed tolerance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:43:54 +08:00
c36cdf74d1 Merge t18-dropout into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	crates/xtrain-autodiff/tests/autograd.rs
#	crates/xtrain-model/src/model.rs
#	crates/xtrain-train/src/bin/train.rs
#	crates/xtrain-train/src/train_loop.rs
#	docs/evolution.md
2026-06-18 00:41:41 +08:00
f26db882e5 Merge t16-grad-accum into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	docs/evolution.md
2026-06-18 00:37:11 +08:00
9e958cb0f9 Merge t14-flash-attention into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:35:46 +08:00
80fafa1914 docs: T18 evolution row + README build-journey row (dropout)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:06:06 +08:00
e625aa05dd dropout: wire into model (residual sites) + train/eval switch + flag (T18)
Config.dropout (default 0). TinyTransformer gets a Cell<bool> training switch
(train()/eval()/with_training, default eval = safe) + a Cell<u64> step_seed bumped
once per training forward. forward_batched derives a per-layer block_seed (pure fn
of step_seed×layer) and block_forward derives two per-site seeds, inserting
ops::dropout at the attn and ffn sub-block outputs (before each residual). The
seed is a pure function of (step_seed, layer, site) so the checkpoint (T13)
recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout
node → graph bit-identical to pre-T18.

train_loop: model.train() per step (restored after eval flips to eval); eval_loss
runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in
eval (default), so exported weights are dropout-free (xserv closed loop unaffected).

Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads);
eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout
grads match non-recompute (fp32 + bf16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:32 +08:00
5eb27783f8 dropout: autodiff op + fixed-seed grad-check (T18)
ops::dropout(x,p,seed): fwd runs Tensor::dropout, caches the mask in the backward
closure, bwd pushes dx=d⊙mask. p==0 returns x.clone() (no node) so the default
graph is unchanged. Tests in autograd.rs: fixed-seed finite-diff grad-check (mask
held constant across the ± perturbation — dropout is a fixed elementwise linear
map of x); E[out]≈input + keep-rate≈1-p over a seed sweep; p=0 kernel identity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:32 +08:00
1fdd0c5002 dropout: device RNG kernel + Tensor fwd/bwd (T18)
csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32
uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer
(per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16
activation variants (mask fp32 in both; uniform is dtype-independent so masks
match across precisions). Stateless → re-run with same seed = same mask (T13
recompute-safe). Registered in build.rs + FFI decls.

Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the
launches (contiguous F32/BF16, default stream, per-op sync via the kernels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:18 +08:00
6b8c1e4e0f docs: Phase T18 — dropout design (device RNG + mask)
Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p)
scaling at train, identity at eval. New autodiff `dropout` op (fwd generates +
applies mask, bwd applies the SAME cached mask). Wired at the two residual-path
sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused
SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic
seed (so T13 recompute reproduces the same mask), train/eval switch, p=0
bit-identity, and the acceptance gates.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:08 +08:00
8bd7db16e1 docs: T16 grad-accum results — evolution row + README build-journey
dash5-verified gate numbers: accum=N bit-close to N× big batch (loss
8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches
single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big →
7.2GB accum, −74%), xserv closed loop md5-identical + token-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:52:32 +08:00
b06b553f99 test: drop unused Var import in grad_accum
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:49:04 +08:00
abe5ceb913 test: grad-accum equivalence + accum=1 bit-identity + DDP+accum
- grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch;
  accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop
  with accum tracks a big-batch baseline over 20 AdamW steps.
- ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of
  the same effective size (loss + cross-rank + vs-baseline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:45:40 +08:00
7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps)
Accumulate grads over N micro-batches, then one AdamW step + zero_grad,
for an effective batch of N×micro at one micro-batch's activation cost.
Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates
the scaled grads) so the boundary grad equals a single step over an N×
batch. accum==1 skips the scale → bit-identical to the pre-T16 path.

DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary
(intermediate micro-steps are local-only, no NCCL); the /world average is
orthogonal to the per-micro 1/N, so the boundary grad is the effective
global-batch mean. New --accum-steps flag in both train binaries; effective
batch is printed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:45:33 +08:00
d01fec6639 docs: Phase T16 — gradient accumulation design
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:41:17 +08:00
9064ced4c2 docs: T14 flash-attention results + evolution/README rows
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to
evolution.md (算法/Infra) and the README build-journey table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:34:10 +08:00
d217f4fbd3 perf: spread flash bwd dK/dV atomics across all threads
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:27:33 +08:00
4d7b69f8d4 perf: cache softmax weights in shared mem (drop hd× redundant expf)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:24:56 +08:00
9b05f4f93f test: flash==composed bf16 uses robust mean/p99 metric (repo convention)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:19:08 +08:00
c0f0b67510 test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:17:44 +08:00
80602099dc test: scale Q/K in flash grad-check for well-conditioned grads
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:17:04 +08:00
f38beb0346 test: flash finite-diff grad-check uses single-tile clean regime
Match the trusted composed grad-check dims (seq=5<FA_TILE); the multi-tile
online-softmax path is gated by flash_bwd_matches_composed_bwd (seq=40),
sharper than finite-diff on the near-zero grads a long softmax produces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:16:20 +08:00
01fb22d114 test: flash bwd vs composed bwd (sharper than finite-diff)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:12:30 +08:00
5f3b81ac96 test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag
autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile)
+ flash_matches_composed_fwd. model/tests/flash.rs: flash==composed
on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump:
XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle
(PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:39 +08:00
0e20821633 autodiff+model: flash-attention op + --flash opt-in wiring
ops::flash_attention autograd node (fwd caches O(N) logsumexp instead of
O(N²) probs; bwd via Tensor::flash_attention_backward). Model gets a
use_flash bool + with_flash(bool) builder; the SDPA core in attention()
picks ops::flash_attention vs ops::attention. flash threads through
block_forward so the recompute (T13) segment also runs flash. Default
off = composed path, graph unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:32 +08:00
326a6fadfe cuda: fused flash-attention kernel (fwd + flash-style bwd)
csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per
query row, streams KV in tiles of 32, online softmax — running max/sum
+ rescaled V accumulator, causal mask inlined, never materializes the
[bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N),
saved for backward). flash-style bwd: recompute scores from Q/K/V + L,
collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV
atomicAdd across rows. Tensor::flash_attention / flash_attention_backward
wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax
policy as composed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:25 +08:00
65a2264227 docs: Phase T14 — fused flash-attention design
Design doc for the hand-written single fused flash-attention kernel:
online softmax tiled over KV, NEVER materializing the [bh,S,S] score
matrix; flash-style backward (recompute scores from saved logsumexp +
D=ΣdO·O, dQ/dK/dV). Opt-in --flash; composed T10 path stays default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:16 +08:00
31cc2bf745 docs: capstone README — full-stack + scaling study (v0-v8) writeup
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:17:26 +08:00
511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98)
v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale
dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute.
8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h.

Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 /
v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085
AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were
partly capacity-limited, scaling capacity > repeating old data. But the gain
is only ~3% (same magnitude as the data-axis single-step lever), and v8's
val was still descending at the end (not saturated).

Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6,
capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress,
scale capacity AND data together (Chinchilla, reproduced at toy scale).

- docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples
- docs/runs/README.md: +v8 row, v9 proposal
- docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 15:12:01 +08:00
0150263055 perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K
Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16,
batch32 seq256, steady-state):

- Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL —
  fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within
  tol", exactly equal). Full suite green with recompute on/off; DDP loss-match
  5.67e-7; DDP+recompute 2-rank descends 11.079→6.010.
- dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s
  39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band).
- dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607
  MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges.
  → KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed.

Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:50:29 +08:00
69c5f07359 docs: Phase T13 — activation recompute
Design doc for per-block gradient checkpointing (KI-3): the no-tape forward +
recompute-on-backward design, the `checkpoint` primitive, per-block wrapping,
the exactness/correctness argument (same kernels + inputs → identical grads),
composition with bf16+DDP+batched, and the verification plan (on-vs-off grad
gate + memory/throughput before→after, dim1024-fits). Bench table left as TBD
to fill after the dash5 run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:45:16 +08:00
f202351be5 model: per-block activation recompute (--recompute)
Wrap each transformer block's forward in the checkpoint primitive when
recompute is enabled (Phase T13 / KI-3). To make the block forward a pure
segment fn (no `&self` borrow, so it can re-run in the backward closure),
extract the block body + its helpers (linear / norm_gamma / attention /
swiglu_mlp) into free functions parameterised by (cfg, compute_dtype) and add
`Block::block_params()` (the 11 leaves in the params() per-block order). The
non-recompute path calls `block_forward` directly — identical graph to before.

- `TinyTransformer::with_recompute(bool)` builder (opt-in; default off keeps the
  unchanged tape / bit-identical numerics).
- `--recompute` flag wired into bin/train and bin/train_ddp (DDP: each rank
  checkpoints independently).

Correctness gate: tests/recompute.rs builds two identical models (recompute
on/off), runs the same batched loss+backward, and asserts the forward logits,
the loss, and EVERY parameter grad match within tight fp tol — parameterised
over fp32 and bf16 (T12 composition).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:42:42 +08:00
c396b39483 autodiff: checkpoint primitive (recompute-on-backward)
Add `xtrain_autodiff::checkpoint::checkpoint(segment_fn, input, params)`, a
higher-order autograd node (à la torch.utils.checkpoint) for activation
recomputation (Phase T13 / KI-3):

- forward: run `segment_fn` on detached leaves so its internal ops are NOT
  recorded on the outer tape; keep only the output value (the local sub-tape —
  and thus the segment's intermediate activations — drops immediately). The
  checkpoint node's parents are [input, ..params].
- backward: re-run `segment_fn` from the saved input + (unchanged) param values
  into a fresh local tape, seed the recomputed output with the upstream grad,
  backprop, then push the recovered input/param grads to the real parents. Local
  tape drops at the end → recomputed activations freed.

Exact by construction (same deterministic kernels, same inputs) → grads match
the non-checkpointed path. Composes with bf16 (T12, same path on recompute) and
DDP (T8, per-rank).

Supporting change: `Var::backward_seeded(seed)` — backward from an explicit
non-scalar upstream grad (the segment output is generally not a scalar);
`backward()` is now the scalar wrapper that seeds ones.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:42:31 +08:00
9c557f0609 docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01)
v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).

Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.

Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 03:55:47 +08:00
b4bb426d48 docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution)
第一版脱离 TinyStories:纯 FineWeb-edu 真实网页文本(2.255B 语料),架构同
v4/v5(dim768/18L, core 127.43M),8 卡 DDP bf16,2.29B tok/1.02ep,~1.9h
@218K tok/s。train 11.03→3.14,best/final FineWeb val 3.0652。

方法论:FineWeb val(3.07) 与 v0–v5 的 TinyStories val(~1.1) 不可比——真实
网页熵高,~3.0 是预期非回退;判据是采样质量 + transfer eval。

- 新增 docs/runs/06-v6-fineweb-edu-dim768.md:数据管线(scripts/fineweb_to_txt.py)
  / 架构(同 v4/v5,隔离数据变量) / 超参 / 结果(val 单调降无走平=未饱和) /
  方法论说明 / transfer eval(v6→TinyStories val 2.75 vs v5 native 1.11,纯通用
  数据对窄分布有代价) / v5-vs-v6 同提示词采样对比(v6 写真实说明文 vs v5 一律
  掉进小故事)
- README 对比表加 v6 行(val 单独标注分布) + 换轴说明 + v7 提案
- evolution.md scaling 表 v6 行定稿 + 数据轴 TinyStories→FineWeb-edu 毕业说明

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 22:21:43 +08:00
88bec270af docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:52 +08:00
7e5ea9976b data: FineWeb-edu parquet->txt prep script (Scaling v6)
v6 broadens data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu
sample/10BT) while freezing the v4/v5 arch. scripts/fineweb_to_txt.py streams
the parquet text column row-group by row-group and joins docs with
<|endoftext|> so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles
it unchanged. Corpus .txt/.parquet/.u16.bin stay dash5-only (gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:04:45 +08:00
579365f4a0 docs: run v5 — TinyStories saturation at dim768 (val 1.11)
设计文档 05-v5-tinystories-dim768.md(中文,xserv 风格):数据 2.49B tok/5.33ep、
架构同 v4(净测数据变量)、bf16 8 卡 global 256、train 11.07→1.06 best val 1.1102。
核心发现「数据天花板」:v4(1.54ep)1.169→v5(5.33ep)1.110 仅 ↓5% 且末段 val 走平
⇒ TinyStories 在 dim768/127M-core 近饱和,v6 该换轴(更大模型/更广语料,非更多 TinyStories)。
xserv BF16 服务 3/3 prompt 逐 token 一致。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
8a1e29543b run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11)
v0–v5 对比表加 v5 行 + tokens-trained / epoch 两列,让 TinyStories 数据饱和可见
(v4→v5 同 arch 数据 ×3.5 仅 val ↓5% 且末段走平)。下一档提案改为 v6 换轴。
导出 201 tensors + RUN.md 存入 dash5 registry v5-tinystories-dim768(checkpoint/safetensors 不入库)。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
5b7dde1736 test: bf16 test reads f32-cast logits (forward now returns bf16)
The `keep bf16 logits` change made forward_batched return bf16 logits
in bf16 mode; the bf16 test's host read must cast to f32 first.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:29:24 +08:00
320c1ae4fb perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K
bf16 mixed precision (fp32 master) solves the v4 dim768 fp32 batch-32
OOM and speeds up the now-compute-bound dim768 GEMMs (dash5 1× RTX
5090 32GB, dim768/18L/24h×32 ffn2048 seq256, steady-state):

  config          batch  peak mem   tok/s   fits 32GB
  fp32            16      27.2 GB    31.5K   yes
  bf16            16      19.3 GB    35.5K   yes   (-29% mem / +13% tok/s)
  fp32            32      —          —       OOM
  bf16            32      31.1 GB    40.8K   yes   (+29% vs fp32-b16)

Verified on dash5: fp32 suite green at tight tol + xserv export md5
bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99
6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs
3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly.

Mark KI-2 FIXED; fill docs/11 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:28:20 +08:00