59 Commits

Author SHA1 Message Date
18c2229b4b docs: Phase T9 — export to xserv
Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the
QK-norm structural decision + BF16 acceptance criterion, the tensor-name +
layout mapping table, and the dash5 closed-loop verification recipe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:33:32 +08:00
0131f05b26 docs: Phase T8 — distributed data parallel
Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped
CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-
local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps
its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same
init/state), batch sharding math vs single-GPU, verification plan + scaling
table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:49 +08:00
5e8add2a41 docs: Phase T7 — performance
Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd
(row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step
param/grad roundtrip), drop per-op sync + device memset. Includes the
verification table (regression suite green + tok/s 2770→8220 ~3x), the
deferred bf16/recompute follow-up rationale, and the T8 all-reduce note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:00:29 +08:00
29b4d30b6c docs: Phase T6 — training loop
Design doc for the T6 training stack: Goal / Module Layout / Key Design
Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with
batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse,
sampler) / 验证方法 (AdamW parity, checkpoint round-trip, real training, host
unit tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:30:14 +08:00
8565565647 docs: Phase T5 — tiny transformer
Goal / Module Layout / Key Design Decisions (multi-head layout via
reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add,
x@W convention, causal mask, params API, overfit methodology) / 验证方法
with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:09:30 +08:00
777f3c7949 docs: Phase T4 — autograd engine
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
dde2fde297 docs: Phase T3 — GEMM fwd/bwd + finite-diff
Design doc covering the tiled forward, the dA/dB math + how transpose is
handled (materialize + reuse forward), the cuBLAS row-major reference, and
the finite-diff harness design + how T4 reuses it per-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:27:03 +08:00
8557a289a2 docs: Phase T2 — tensor abstraction
Design doc for the minimal tensor layer: DType/shape/Storage/Tensor,
host↔device copy, and one elementwise kernel (scale) wired end-to-end.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:12:55 +08:00
c1b204296b docs: backfill T1 build-chain
T1 shipped without a design doc; capture the Rust↔CUDA build chain
(build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:12:55 +08:00