xtrain

Author	SHA1	Message	Date
Gahow Wang	9c557f0609	docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01) v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256), trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's 1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to registry v7-fineweb-edu-dim768, serves in xserv (coherent expository English, ~v6 quality). Key finding: more epochs of the SAME subset gave only ~0.05 val drop and the curve flattened (~step 44000) with no sampling quality gain → the 2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's TinyStories data-volume saturation: repeating old data has thin margins; true further gains need FRESH shards (more diverse tokens), as v6's corpus-swap (which raised the ceiling) showed. Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis ceiling note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 03:55:47 +08:00
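Gahow Wang	b4bb426d48	docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution) First version to leave TinyStories: pure FineWeb-edu real web text (2.255B-token corpus), same architecture as v4/v5 (dim768/18L, core 127.43M), 8-GPU DDP bf16, 2.29B tok / 1.02 ep, ~1.9h @ 218K tok/s. train 11.03→3.14, best/final FineWeb val 3.0652. Methodology: FineWeb val (3.07) is not comparable with the TinyStories val (~1.1) of v0–v5; real web text has higher entropy, so ~3.0 is expected, not a regression; the criteria are sampling quality + transfer eval. - Add docs/runs/06-v6-fineweb-edu-dim768.md: data pipeline (scripts/fineweb_to_txt.py) / architecture (same as v4/v5, isolating the data variable) / hyperparameters / results (val falls monotonically with no plateau = not saturated) / methodology notes / transfer eval (v6→TinyStories val 2.75 vs v5 native 1.11; pure general-purpose data has a cost on a narrow distribution) / v5-vs-v6 same-prompt sampling comparison (v6 writes real expository text vs v5 always falling into short stories) - README comparison table: add v6 row (val's distribution labeled separately) + axis-change note + v7 proposal - evolution.md: finalize v6 row in the scaling table + note on the data axis graduating from TinyStories to FineWeb-edu Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 22:21:43 +08:00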
Gahow Wang	88bec270af	docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:30:52 +08:00
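Gahow Wang	579365f4a0	docs: run v5 — TinyStories saturation at dim768 (val 1.11) Design doc 05-v5-tinystories-dim768.md (in Chinese, xserv style): data 2.49B tok / 5.33 ep, same architecture as v4 (cleanly isolating the data variable), bf16 on 8 GPUs with global batch 256, train 11.07→1.06, best val 1.1102. Key finding, the "data ceiling": v4 (1.54 ep) 1.169 → v5 (5.33 ep) 1.110 is only a 5% drop and val flattens at the end ⇒ TinyStories is near saturation at dim768/127M-core, so v6 should change axis (larger model / broader corpus, not more TinyStories). xserv BF16 serving matches token-for-token on 3/3 prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00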
Gahow Wang	8a1e29543b	run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11) Add a v5 row to the v0–v5 comparison table plus two columns, tokens-trained and epoch, to make TinyStories data saturation visible (v4→v5: same arch, ×3.5 data, val drops only 5% and flattens at the end). Change the next-rung proposal to v6 changing axis. Export 201 tensors + RUN.md into dash5 registry v5-tinystories-dim768 (checkpoint/safetensors not committed to the repo). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00
Gahow Wang	320c1ae4fb	perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K bf16 mixed precision (fp32 master) solves the v4 dim768 fp32 batch-32 OOM and speeds up the now-compute-bound dim768 GEMMs (dash5 1× RTX 5090 32GB, dim768/18L/24h×32 ffn2048 seq256, steady-state). Columns: config | batch | peak mem | tok/s | fits 32GB. Rows: fp32 | 16 | 27.2 GB | 31.5K | yes; bf16 | 16 | 19.3 GB | 35.5K | yes (-29% mem / +13% tok/s); fp32 | 32 | — | — | OOM; bf16 | 32 | 31.1 GB | 40.8K | yes (+29% vs fp32-b16). Verified on dash5: fp32 suite green at tight tol + xserv export md5 bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99 6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs 3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly. Mark KI-2 FIXED; fill docs/11 results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:28:20 +08:00
Gahow Wang	30db62d8f2	docs: Phase T12 — bf16 mixed precision design docs/11-bf16-mixed-precision.md: the AMP split (bf16 linears + activations, fp32 master / norms / softmax / RoPE / CE, no loss scaling), the cast-op bridge, module layout, and the dual verification gate (fp32 unchanged + bf16 looser-tol + convergence + mem/throughput). Memory/throughput before->after to be filled from the dash5 bench. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:15:02 +08:00
Gahow Wang	511ceebbb3	docs: KI-2 trigger — dim768 fp32 batch-32 OOM v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32 (global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (halve activation mem) would restore the batch-256 sweet spot. Record it on KI-2; keep KI-2 as the backlog item it is (still deferred). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:42 +08:00
Gahow Wang	ff79fee3c5	docs: run v4 — TinyStories, dim768, val 1.17 Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep / arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128 per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8 DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU / v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32 batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table to v0/v1/v2/v3/v4 and the next-rung proposal to v5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:37 +08:00
Gahow Wang	f85bd4d276	perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8 Device caching/pool allocator removes the per-op cudaMalloc serialization that was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX 5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s): single-GPU: 40226 -> 92638 tok/s (~2.3x). DDP scaling (global batch 32*world), columns world | before | after: 1 | 39801 (1.00x) | 92385 (1.00x); 2 | 47229 (1.19x) | 146821 (1.59x); 4 | 52854 (1.33x) | 269867 (2.92x); 8 | 48996 (1.23x) | 461270 (4.99x). 8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it. Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the ~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever if v4 needs higher linearity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:15:02 +08:00
Gahow Wang	4c3f332f64	docs: Phase T11 — caching allocator Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size classes, per-device + thread-safety, Drop->return, transparency/correctness argument, why skip-memset uninit is deferred, dual verification gates). Before/after numbers filled after dash5 measurement. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00
Gahow Wang	d422c68704	docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky) The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the original ungrouped all-reduce is itself run-to-run nondeterministic on this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here (passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7). Noted as a follow-up to loosen the assertion to a tight tolerance; coalescing was reverted purely because it gives ~0 scaling benefit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:42:13 +08:00
Gahow Wang	84092fb28d	docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11) T11 set out to coalesce/overlap the gradient all-reduce per the original KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch 32, seq 256) falsifies that hypothesis: - grad all-reduce is only ~6-7% of each step; - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the SAME per-rank workload) and dominates; - coalescing the ~150 per-tensor all-reduces into one grouped/flat launch gives ~0 scaling gain AND breaks cross-rank bit-identity (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate, so the coalescing commit (`b8b5821`) was reverted. Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy at a time; CPU not starved; per-thread default stream doesn't help): single-process thread-per-GPU ranks serialize on the single CUDA context's per-op cudaMalloc / driver calls. Fix direction (out of T11 scope): a caching/pool allocator, or process-per-GPU. Recorded in docs/known-issues.md with the measured table; KI-5 stays Open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:40:45 +08:00
Gahow Wang	a78502e0f0	docs: run v3 — TinyStories, dim512, val 1.30 Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:45 +08:00
Gahow Wang	64b2a8c09e	run: v3 archive + export (dim512, single-GPU batched, val 1.30) v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch), single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry ~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json + model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd diverges on BF16 drift as in v1/v2). best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out 1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward) validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU avoids KI-5. Update docs/runs/README.md comparison table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:36 +08:00
Gahow Wang	9a25616a30	docs: Phase T10 — batched forward docs/09-batched-forward.md: the launch-bound diagnosis recap, the [B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position + causal masking inline in softmax), the attention forward/backward via strided-batched GEMM, autograd implications, the looped-split/merge dead-end post-mortem (1127 tok/s, host round-trips), verification methods + before→after throughput, and the v3 recommendation (per-rank batch 16-32, single/small world until KI-5 bucketed all-reduce lands). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:50 +08:00
Gahow Wang	4ccab0fb42	perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x) Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling") FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU, back-to-back A/B: before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB; after (batched): 25627 tok/s (batch16) / 40263 (batch32), util 37% mean / 54% peak, ~10 GB → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%. A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x. The v3 falsification history (larger batch doesn't help a single-seq design) is kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching exposes (eager all-reduce of all params each step) → recorded as KI-5 (bucketed/overlapped all-reduce), out of T10 scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:43 +08:00
Gahow Wang	d2a585c5cb	docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling v3 tested the documented mitigation (raise global_batch to amortize the per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L, seq256: global_batch 32 (8/rank) → 3163 tok/s; global_batch 256 (64/rank) → 3200 tok/s (8× batch, +1.2%, within noise). 8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is launch-bound: the single-sequence model design (each sequence its own tiny forward/backward, per-op kernel launches) starves the GPU, and batching only adds proportionally more serial launches. Real fix is batched (multi-sequence) forward so GEMMs fill the GPU, which is a T4/T5 autograd/model change, not a batch knob. Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is fixed). KI-1 kept Open with the corrected root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:20:26 +08:00
Gahow Wang	bf679f6f1f	docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71) Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58 and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts (3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) → links docs/known-issues KI-1; v3 proposal applies KI-1's fix (much larger global batch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:38:31 +08:00
Gahow Wang	c87a0bc44e	docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1 single-GPU. Root cause + proposed fixes recorded; also consolidates deferred T7 items (bf16, activation recompute) and the large-vocab modeling note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:56:58 +08:00
Gahow Wang	264660527f	docs: run v1 — TinyStories full, dim256 docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table. v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M). Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:09:46 +08:00
Gahow Wang	8981cf7982	docs: T9 verification results (xserv == xtrain, dash5) Capture the closed-loop run: train (loss 10.84->3.59) -> export (47 tensors, BF16) -> xserv dump-logits + greedy. Top-1 + top-11 token order identical, logits within ~1e-2 (BF16-vs-f32 drift), greedy generation token-for-token identical across two prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:37:46 +08:00
Gahow Wang	18c2229b4b	docs: Phase T9 — export to xserv Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the QK-norm structural decision + BF16 acceptance criterion, the tensor-name + layout mapping table, and the dash5 closed-loop verification recipe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:32 +08:00
Gahow Wang	0131f05b26	docs: Phase T8 — distributed data parallel Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same init/state), batch sharding math vs single-GPU, verification plan + scaling table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:49 +08:00
Gahow Wang	5e8add2a41	docs: Phase T7 — performance Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd (row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step param/grad roundtrip), drop per-op sync + device memset. Includes the verification table (regression suite green + tok/s 2770→8220 ~3x), the deferred bf16/recompute follow-up rationale, and the T8 all-reduce note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:00:29 +08:00
Gahow Wang	29b4d30b6c	docs: Phase T6 — training loop Design doc for the T6 training stack: Goal / Module Layout / Key Design Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse, sampler) / Verification Methods (AdamW parity, checkpoint round-trip, real training, host unit tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:14 +08:00
Gahow Wang	8565565647	docs: Phase T5 — tiny transformer Goal / Module Layout / Key Design Decisions (multi-head layout via reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add, x@W convention, causal mask, params API, overfit methodology) / Verification Methods with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	777f3c7949	docs: Phase T4 — autograd engine Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	dde2fde297	docs: Phase T3 — GEMM fwd/bwd + finite-diff Design doc covering the tiled forward, the dA/dB math + how transpose is handled (materialize + reuse forward), the cuBLAS row-major reference, and the finite-diff harness design + how T4 reuses it per-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:27:03 +08:00
Gahow Wang	8557a289a2	docs: Phase T2 — tensor abstraction Design doc for the minimal tensor layer: DType/shape/Storage/Tensor, host↔device copy, and one elementwise kernel (scale) wired end-to-end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00
Gahow Wang	c1b204296b	docs: backfill T1 build-chain T1 shipped without a design doc; capture the Rust↔CUDA build chain (build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00

31 Commits