59 Commits

Author SHA1 Message Date
4379868f2d docs: M2d — ragged-batching lever, 9× measured, step bottleneck → rollout
Records the M2d lever (batch the GRPO training-side forwards), the right-pad-is-free
insight, both exact gates, the end-to-end no-OOM smoke, and the 9× throughput.

The honest decomposition correction: M2c claimed the training forwards "dominate" the
step; the clean per-component bench falsifies the strong form — they were ~2.5 s of
the ~8.5 s step (~30%), worth the 9×, but the rollout (~6 s) was always the larger
share. After M2d the step is ~95% rollout, so the next step-level lever is full B×G
rollout batching (today only the G samples of each prompt decode in lockstep; the B
prompts are still sequential). Same measure-first lesson, once more.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 23:03:28 +08:00
41d46208a6 docs: M2c — device KV cache + the bottleneck-shift finding
Implementation log (docs/18) + Phase-3 row (evolution.md): cat_seq device cache,
gates hold (token-identical), and the profile-first finding — ~10% single-seq
decode but no GRPO-step change because the long pole shifted to the per-sample
logp/PG forwards after M2b batching. Names ragged batched prefill as the next
decode lever.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:39:10 +08:00
0f76c0fdb0 docs: M2b — batched decode results (token-identical + ~1.7x rollout, device-cache next)
Implementation log (docs/18) + Phase-3 row (evolution.md): rope_pos primitive +
gate, the batched engine (decode_attention/repeat_kv reused), the token-
identical batch gate, and the measured ~1.7x rollout-inclusive step speedup +
memory stabilization. Closes the M2 decode engine (M2a single-seq + M2b
batched); names the device-side cache as the remaining lever.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:20:01 +08:00
096e45b845 docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)
Implementation log (docs/18) + Phase-3 row (evolution.md): the clipped_pg_loss
op + gates, the actor-learner loop, the easy-task SFT baseline (held-out 18.7%,
plateaus → no generalization), the two systems walls the design doc flagged
(two 1B models OOM the 32GB box → β=0; naive rollout fragments the allocator →
cached temperature sampling, rollout still the long pole), and the result:
format holds, held-out 20.0% (+1.3pp, statistically flat) — the same wall as
DPO. Closes the SFT→KV-cache→DPO→GRPO post-training arc with honest limits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:01:22 +08:00
99090465bf docs: M3 — DPO results (infra correct, held-out correctness flat, over-optimization collapse)
Implementation log (docs/18) + Phase-3 row (evolution.md): the two ops + gates,
pair-gen (gold chosen / sampled-wrong rejected), reference-logprob caching, the
training loop, and the honest finding — reward margin + pref-acc rise but
held-out arithmetic correctness stays ~5-8% (flat within std-error) and
over-optimizes to collapse (margin +34 → 0% format). DPO reweights, it does not
install the capability; motivates M4 GRPO (optimize the verifiable reward online).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:38:06 +08:00
b39e6e7110 docs: M2a — KV-cache decode engine results (token-identical + length-dependent speedup)
Implementation log (docs/18) + Phase-3 row (evolution.md): the two decode
primitives and their gates, the engine design (host-cache baseline), the
token-identical centerpiece gate, and the measured throughput baseline showing
the cache win is sequence-length-dependent (~1.0x@32, ~1.9x@128, naive OOM@256).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:01:10 +08:00
1574e21d89 post-train: M1 — verifiable-arith eval scorer + SFT format-baseline result
eval_arith: load ckpt, greedy-generate per held-out prompt, parse \boxed{}
via the shared task checker, report format(boxed) + correctness pass-rates.
Reused as the verifiable-eval harness for M3 (DPO) / M4 (GRPO).

M1 result (100 held-out prompts, v12 1.05B base): SFT moves answer-format
adherence 0% -> 100%, arithmetic correctness 8% -- the intended split (SFT
buys the format; correctness is the verifiable-reward job of M3/M4). Logged
in docs/18 implementation log + a Phase-3 row in docs/evolution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 11:13:19 +08:00
9c70e99ae4 post-train: M1 — verifiable arithmetic task + SFT data generator
First post-training milestone (docs/18). Lands the verifiable task + its data
pipeline, all verified host-side (no CUDA); the SFT run itself reuses the
existing --sft-tsv path on the GPU box.

- task.rs: the shared task spec — two-operand integer arithmetic, answer in
  \boxed{N}, with parse_boxed_answer + check_answer (exact-match rule-based
  reward). One module reused by M1 (SFT data), M3 (DPO pairs), M4 (GRPO reward).
- gen_arith_task bin: writes arith_sft.tsv (--sft-tsv format) + held-out
  arith_eval_prompts.txt (greedy_sample format) + arith_eval_gold.txt; train
  deduped, eval disjoint from train.
- data.rs: extract assistant-only masking into a pure, testable sft_row()
  (behavior-preserving; single-turn bit-identical to fbf4ac2).

Gate (verified locally, no_cuda): cargo test -p xtrain-train --lib = 9/9 pass
(masking, SFT-target self-consistency over 2000 samples, parser edges, seed
determinism); a 200/50 gen run = clean 2-col TSV, correct gold incl. negatives,
0 train/eval leakage. SFT training run + format-eval pending on dash5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 22:52:25 +08:00
ab32168dcc docs: post-training stack design — SFT → KV-cache → DPO → GRPO (docs/18)
Design doc for a from-scratch post-training infra on top of xtrain. Ladder:
SFT (have it) → DPO → reward model (optional) → GRPO, each rung one new
post-training systems concept + a hard correctness gate (grad-check, PyTorch
parity, degenerate checks, a falsifiable 'it learns' signal).

Decisions aligned with the user (D1-D4):
- D1 scope: DPO → GRPO, reward model optional.
- D2 reward: rule-based / verifiable first; learned RM deferred.
- D3 rollout: build the KV-cache incremental-decode engine UP FRONT (not
  naive-first) as the foundational milestone before DPO/GRPO.
- D4 task: a verifiable task (arithmetic/format) with deterministic exact-match
  reward, for a clean RL signal.

Locked milestone order: M1 SFT task baseline → M2 KV-cache decode engine
(token-identical gate) → M3 DPO → M4 GRPO → M5 optional reward model. Status:
design only, no implementation yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 22:44:25 +08:00
7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check
- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:19:12 +08:00
5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00
a1370446fe docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)
- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix +
  regression test), with the meta-lesson that op/single-GPU unit tests can
  miss launcher-level integration gaps — only the V9-PILOT end-to-end run on
  the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap
  and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
db70abe450 docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:

README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
  (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
  deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
  in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
  launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
  prose with a structured "## Phase 2" summary (5 features + honest results:
  flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
  dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
  reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
  KI-4 accepted tradeoff).

docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the
  per-axis (算法/架构/Infra/数据) narrative + the two integration notes.

docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed
  loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
  (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
  determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)

Docs-only; no code touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:11:47 +08:00
71b0a1621f docs: T17 process-per-GPU results — measured throughput-neutral
Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5
b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op);
default training path unchanged. evolution.md Infra row + README T17 row +
known-issues entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:03:14 +08:00
c470c627a7 docs: Phase T17 — process-per-GPU DDP design
torchrun-style: launcher spawns N worker processes, each with its own CUDA
context; cross-process ncclUniqueId distributed via launcher-minted hex env
injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused
unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped
(user scope decision).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:44:38 +08:00
2ff4573a31 docs: T15 GQA results + evolution row (模型架构) + README build-journey row
Backfill docs/14-gqa.md gate table (dash5 numbers); add T15 evolution row +
cumulative 模型架构 line; README build-journey T15 row + Phase 2 prose + doc
index range (00..14).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:44:58 +08:00
62b1cb5dc7 docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:30:34 +08:00
c36cdf74d1 Merge t18-dropout into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	crates/xtrain-autodiff/tests/autograd.rs
#	crates/xtrain-model/src/model.rs
#	crates/xtrain-train/src/bin/train.rs
#	crates/xtrain-train/src/train_loop.rs
#	docs/evolution.md
2026-06-18 00:41:41 +08:00
f26db882e5 Merge t16-grad-accum into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	docs/evolution.md
2026-06-18 00:37:11 +08:00
80fafa1914 docs: T18 evolution row + README build-journey row (dropout)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:06:06 +08:00
6b8c1e4e0f docs: Phase T18 — dropout design (device RNG + mask)
Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p)
scaling at train, identity at eval. New autodiff `dropout` op (fwd generates +
applies mask, bwd applies the SAME cached mask). Wired at the two residual-path
sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused
SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic
seed (so T13 recompute reproduces the same mask), train/eval switch, p=0
bit-identity, and the acceptance gates.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:08 +08:00
8bd7db16e1 docs: T16 grad-accum results — evolution row + README build-journey
dash5-verified gate numbers: accum=N bit-close to N× big batch (loss
8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches
single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big →
7.2GB accum, −74%), xserv closed loop md5-identical + token-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:52:32 +08:00
d01fec6639 docs: Phase T16 — gradient accumulation design
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:41:17 +08:00
9064ced4c2 docs: T14 flash-attention results + evolution/README rows
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to
evolution.md (算法/Infra) and the README build-journey table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:34:10 +08:00
65a2264227 docs: Phase T14 — fused flash-attention design
Design doc for the hand-written single fused flash-attention kernel:
online softmax tiled over KV, NEVER materializing the [bh,S,S] score
matrix; flash-style backward (recompute scores from saved logsumexp +
D=ΣdO·O, dQ/dK/dV). Opt-in --flash; composed T10 path stays default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:16 +08:00
511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98)
v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale
dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute.
8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h.

Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 /
v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085
AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were
partly capacity-limited, scaling capacity > repeating old data. But the gain
is only ~3% (same magnitude as the data-axis single-step lever), and v8's
val was still descending at the end (not saturated).

Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6,
capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress,
scale capacity AND data together (Chinchilla, reproduced at toy scale).

- docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples
- docs/runs/README.md: +v8 row, v9 proposal
- docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 15:12:01 +08:00
0150263055 perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K
Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16,
batch32 seq256, steady-state):

- Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL —
  fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within
  tol", exactly equal). Full suite green with recompute on/off; DDP loss-match
  5.67e-7; DDP+recompute 2-rank descends 11.079→6.010.
- dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s
  39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band).
- dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607
  MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges.
  → KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed.

Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:50:29 +08:00
69c5f07359 docs: Phase T13 — activation recompute
Design doc for per-block gradient checkpointing (KI-3): the no-tape forward +
recompute-on-backward design, the `checkpoint` primitive, per-block wrapping,
the exactness/correctness argument (same kernels + inputs → identical grads),
composition with bf16+DDP+batched, and the verification plan (on-vs-off grad
gate + memory/throughput before→after, dim1024-fits). Bench table left as TBD
to fill after the dash5 run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:45:16 +08:00
9c557f0609 docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01)
v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).

Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.

Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 03:55:47 +08:00
b4bb426d48 docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution)
第一版脱离 TinyStories:纯 FineWeb-edu 真实网页文本(2.255B 语料),架构同
v4/v5(dim768/18L, core 127.43M),8 卡 DDP bf16,2.29B tok/1.02ep,~1.9h
@218K tok/s。train 11.03→3.14,best/final FineWeb val 3.0652。

方法论:FineWeb val(3.07) 与 v0–v5 的 TinyStories val(~1.1) 不可比——真实
网页熵高,~3.0 是预期非回退;判据是采样质量 + transfer eval。

- 新增 docs/runs/06-v6-fineweb-edu-dim768.md:数据管线(scripts/fineweb_to_txt.py)
  / 架构(同 v4/v5,隔离数据变量) / 超参 / 结果(val 单调降无走平=未饱和) /
  方法论说明 / transfer eval(v6→TinyStories val 2.75 vs v5 native 1.11,纯通用
  数据对窄分布有代价) / v5-vs-v6 同提示词采样对比(v6 写真实说明文 vs v5 一律
  掉进小故事)
- README 对比表加 v6 行(val 单独标注分布) + 换轴说明 + v7 提案
- evolution.md scaling 表 v6 行定稿 + 数据轴 TinyStories→FineWeb-edu 毕业说明

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 22:21:43 +08:00
88bec270af docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:52 +08:00
579365f4a0 docs: run v5 — TinyStories saturation at dim768 (val 1.11)
设计文档 05-v5-tinystories-dim768.md(中文,xserv 风格):数据 2.49B tok/5.33ep、
架构同 v4(净测数据变量)、bf16 8 卡 global 256、train 11.07→1.06 best val 1.1102。
核心发现「数据天花板」:v4(1.54ep)1.169→v5(5.33ep)1.110 仅 ↓5% 且末段 val 走平
⇒ TinyStories 在 dim768/127M-core 近饱和,v6 该换轴(更大模型/更广语料,非更多 TinyStories)。
xserv BF16 服务 3/3 prompt 逐 token 一致。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
8a1e29543b run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11)
v0–v5 对比表加 v5 行 + tokens-trained / epoch 两列,让 TinyStories 数据饱和可见
(v4→v5 同 arch 数据 ×3.5 仅 val ↓5% 且末段走平)。下一档提案改为 v6 换轴。
导出 201 tensors + RUN.md 存入 dash5 registry v5-tinystories-dim768(checkpoint/safetensors 不入库)。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
320c1ae4fb perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K
bf16 mixed precision (fp32 master) solves the v4 dim768 fp32 batch-32
OOM and speeds up the now-compute-bound dim768 GEMMs (dash5 1× RTX
5090 32GB, dim768/18L/24h×32 ffn2048 seq256, steady-state):

  config          batch  peak mem   tok/s   fits 32GB
  fp32            16      27.2 GB    31.5K   yes
  bf16            16      19.3 GB    35.5K   yes   (-29% mem / +13% tok/s)
  fp32            32      —          —       OOM
  bf16            32      31.1 GB    40.8K   yes   (+29% vs fp32-b16)

Verified on dash5: fp32 suite green at tight tol + xserv export md5
bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99
6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs
3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly.

Mark KI-2 FIXED; fill docs/11 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:28:20 +08:00
30db62d8f2 docs: Phase T12 — bf16 mixed precision design
docs/11-bf16-mixed-precision.md: the AMP split (bf16 linears +
activations, fp32 master / norms / softmax / RoPE / CE, no loss
scaling), the cast-op bridge, module layout, and the dual
verification gate (fp32 unchanged + bf16 looser-tol + convergence +
mem/throughput). Memory/throughput before->after to be filled from
the dash5 bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:15:02 +08:00
511ceebbb3 docs: KI-2 trigger — dim768 fp32 batch-32 OOM
v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32
(global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (halve activation
mem) would restore the batch-256 sweet spot. Record it on KI-2; keep KI-2 as
the backlog item it is (still deferred).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:42 +08:00
ff79fee3c5 docs: run v4 — TinyStories, dim768, val 1.17
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:37 +08:00
f85bd4d276 perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8
Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):

  single-GPU: 40226 -> 92638 tok/s  (~2.3x)
  DDP scaling (global batch 32*world):
    world  before        after
      1    39801 1.00x    92385 1.00x
      2    47229 1.19x   146821 1.59x
      4    52854 1.33x   269867 2.92x
      8    48996 1.23x   461270 4.99x

8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.

Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:15:02 +08:00
4c3f332f64 docs: Phase T11 — caching allocator
Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size
classes, per-device + thread-safety, Drop->return, transparency/correctness
argument, why skip-memset uninit is deferred, dual verification gates). Before/
after numbers filled after dash5 measurement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:04:11 +08:00
d422c68704 docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky)
The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the
original ungrouped all-reduce is itself run-to-run nondeterministic on
this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7,
1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here
(passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP,
numerically benign; loss-match stays ~6e-7). Noted as a follow-up to
loosen the assertion to a tight tolerance; coalescing was reverted purely
because it gives ~0 scaling benefit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:42:13 +08:00
84092fb28d docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11)
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:

  - grad all-reduce is only ~6-7% of each step;
  - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
    SAME per-rank workload) and dominates;
  - coalescing the ~150 per-tensor all-reduces into one grouped/flat
    launch gives ~0 scaling gain AND breaks cross-rank bit-identity
    (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
    the coalescing commit (b8b5821) was reverted.

Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:40:45 +08:00
a78502e0f0 docs: run v3 — TinyStories, dim512, val 1.30
Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:45 +08:00
64b2a8c09e run: v3 archive + export (dim512, single-GPU batched, val 1.30)
v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch),
single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry
~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json +
model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in
xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd
diverges on BF16 drift as in v1/v2).

best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out
1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward)
validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU
avoids KI-5. Update docs/runs/README.md comparison table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:36 +08:00
9a25616a30 docs: Phase T10 — batched forward
docs/09-batched-forward.md: the launch-bound diagnosis recap, the
[B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position +
causal masking inline in softmax), the attention forward/backward via
strided-batched GEMM, autograd implications, the looped-split/merge dead-end
post-mortem (1127 tok/s, host round-trips), verification methods + before→after
throughput, and the v3 recommendation (per-rank batch 16-32, single/small world
until KI-5 bucketed all-reduce lands).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:50 +08:00
4ccab0fb42 perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:

  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after  (batched):    25627 tok/s (batch16) / 40263 (batch32),
                       util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
d2a585c5cb docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:

  global_batch 32 (8/rank)  → 3163 tok/s
  global_batch 256 (64/rank)→ 3200 tok/s   (8× batch, +1.2%, within noise)

8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:20:26 +08:00
bf679f6f1f docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h
SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M
tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4
RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58
and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the
dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts
(3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat
(global batch too small → all-reduce dominates) → links docs/known-issues
KI-1; v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:38:31 +08:00
c87a0bc44e docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:56:58 +08:00
264660527f docs: run v1 — TinyStories full, dim256
docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table.
v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M).
Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent
stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical
in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:09:46 +08:00
8981cf7982 docs: T9 verification results (xserv == xtrain, dash5)
Capture the closed-loop run: train (loss 10.84->3.59) -> export (47 tensors,
BF16) -> xserv dump-logits + greedy. Top-1 + top-11 token order identical,
logits within ~1e-2 (BF16-vs-f32 drift), greedy generation token-for-token
identical across two prompts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:37:46 +08:00