v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).
Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.
Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/11-bf16-mixed-precision.md: the AMP split (bf16 linears +
activations, fp32 master / norms / softmax / RoPE / CE, no loss
scaling), the cast-op bridge, module layout, and the dual
verification gate (fp32 unchanged + bf16 looser-tol + convergence +
mem/throughput). Memory/throughput before->after to be filled from
the dash5 bench.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32
(global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (halve activation
mem) would restore the batch-256 sweet spot. Record it on KI-2; keep KI-2 as
the backlog item it is (still deferred).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):
single-GPU: 40226 -> 92638 tok/s (~2.3x)
DDP scaling (global batch 32*world):
world before after
1 39801 1.00x 92385 1.00x
2 47229 1.19x 146821 1.59x
4 52854 1.33x 269867 2.92x
8 48996 1.23x 461270 4.99x
8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.
Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the
original ungrouped all-reduce is itself run-to-run nondeterministic on
this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7,
1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here
(passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP,
numerically benign; loss-match stays ~6e-7). Noted as a follow-up to
loosen the assertion to a tight tolerance; coalescing was reverted purely
because it gives ~0 scaling benefit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:
- grad all-reduce is only ~6-7% of each step;
- per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
SAME per-rank workload) and dominates;
- coalescing the ~150 per-tensor all-reduces into one grouped/flat
launch gives ~0 scaling gain AND breaks cross-rank bit-identity
(max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
the coalescing commit (b8b5821) was reverted.
Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/09-batched-forward.md: the launch-bound diagnosis recap, the
[B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position +
causal masking inline in softmax), the attention forward/backward via
strided-batched GEMM, autograd implications, the looped-split/merge dead-end
post-mortem (1127 tok/s, host round-trips), verification methods + before→after
throughput, and the v3 recommendation (per-rank batch 16-32, single/small world
until KI-5 bucketed all-reduce lands).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:
before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
after (batched): 25627 tok/s (batch16) / 40263 (batch32),
util 37% mean / 54% peak, ~10 GB
→ single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.
A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:
global_batch 32 (8/rank) → 3163 tok/s
global_batch 256 (64/rank)→ 3200 tok/s (8× batch, +1.2%, within noise)
8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h
SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M
tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4
RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58
and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the
dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts
(3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat
(global batch too small → all-reduce dominates) → links docs/known-issues
KI-1; v3 proposal applies KI-1's fix (much larger global batch).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the
QK-norm structural decision + BF16 acceptance criterion, the tensor-name +
layout mapping table, and the dash5 closed-loop verification recipe.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped
CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-
local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps
its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same
init/state), batch sharding math vs single-GPU, verification plan + scaling
table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd
(row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step
param/grad roundtrip), drop per-op sync + device memset. Includes the
verification table (regression suite green + tok/s 2770→8220 ~3x), the
deferred bf16/recompute follow-up rationale, and the T8 all-reduce note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for the T6 training stack: Goal / Module Layout / Key Design
Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with
batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse,
sampler) / 验证方法 (AdamW parity, checkpoint round-trip, real training, host
unit tests).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc covering the tiled forward, the dA/dB math + how transpose is
handled (materialize + reuse forward), the cuBLAS row-major reference, and
the finite-diff harness design + how T4 reuses it per-op.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for the minimal tensor layer: DType/shape/Storage/Tensor,
host↔device copy, and one elementwise kernel (scale) wired end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T1 shipped without a design doc; capture the Rust↔CUDA build chain
(build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>