Commit Graph

144 Commits

Author SHA1 Message Date
2c9b58cb3b post-train: M2b — batched KV-cache decode (G-way, token-identical)
The rollout long-pole fix deferred from M2a: decode the G samples of one prompt
in lockstep (one forward per step over the group → G× fewer kernel launches).

- rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward-
  only kernel) — G rows share one decode position. Gate: == full rope for
  [0..n], == rope_at(P) per row for uniform P (bit-identical).
- generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step.
  decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G)
  broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next).
- Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single-
  sequence decode (8 query / 2 kv heads → exercises repeat_kv batching).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:18:54 +08:00
096e45b845 docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)
Implementation log (docs/18) + Phase-3 row (evolution.md): the clipped_pg_loss
op + gates, the actor-learner loop, the easy-task SFT baseline (held-out 18.7%,
plateaus → no generalization), the two systems walls the design doc flagged
(two 1B models OOM the 32GB box → β=0; naive rollout fragments the allocator →
cached temperature sampling, rollout still the long pole), and the result:
format holds, held-out 20.0% (+1.3pp, statistically flat) — the same wall as
DPO. Closes the SFT→KV-cache→DPO→GRPO post-training arc with honest limits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 17:01:22 +08:00
7fb3b32fd9 post-train: M4 — GRPO actor-learner loop + cached temperature rollout
train_grpo: the online, critic-free RL loop — per step sample B prompts, roll
out G completions each, score with the rule-based checker (reward 0/1), compute
group-relative advantage A=(r−mean)/(std+ε), then K inner clipped_pg_loss
epochs with a KL leash to the frozen reference. Reward = pure 0/1 correctness
(KL is the format protector, the M3 collapse lesson). Tracks mean rollout reward
(the falsifiable "it learns" signal). Periodic checkpoint save.

decode: generate_cached adds temperature sampling to the KV-cache engine (M2) —
single-row [1,vocab] logits per step vs the naive sampler's [seq,vocab], far
lighter on the caching allocator (the naive sampler fragments it over a long
rollout). generate_greedy_cached now routes through it (temp 0); decode_kv
token-identical gate still passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 16:59:05 +08:00
aaa77082ef post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op)
The GRPO (M4) token-level loss op + the one primitive it needs:

- scale_rows(x[r,c], s[r]): per-row scale (new ~5-line CUDA kernel). The
  clipped-PG backward scales each completion token's row of (probs − onehot) by
  its own per-token coefficient, which cross_entropy_backward's single scalar
  scale can't express.
- clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token
  ρ_t = exp(logπθ_t − logp_old_t), L = −mean min(ρA, clip(ρ,1±ε)A) + β·mean KL
  (k3 estimator), masked to completion tokens. Backward reuses the CE machinery
  (probs − onehot) + scale_rows. Gates: grad-check the active PG path + the A=0
  (KL-only) path; degenerate value checks ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 14:07:02 +08:00
99090465bf docs: M3 — DPO results (infra correct, held-out correctness flat, over-optimization collapse)
Implementation log (docs/18) + Phase-3 row (evolution.md): the two ops + gates,
pair-gen (gold chosen / sampled-wrong rejected), reference-logprob caching, the
training loop, and the honest finding — reward margin + pref-acc rise but
held-out arithmetic correctness stays ~5-8% (flat within std-error) and
over-optimizes to collapse (margin +34 → 0% format). DPO reweights, it does not
install the capability; motivates M4 GRPO (optimize the verifiable reward online).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:38:06 +08:00
2f827fd6d8 post-train: M3 — DPO pair-gen + training loop (verifiable arithmetic)
gen_dpo_pairs: chosen = gold answer, rejected = the SFT model's own greedy
(KV-cache engine, M2a) completion when it's a format-valid WRONG boxed answer —
a hard negative from the model's distribution. ~8% of prompts skipped (greedy
correct). Writes question<TAB>chosen<TAB>rejected (bare, SFT-framed at train).

train_dpo: loads the SFT ckpt as policy AND frozen reference; precomputes the
reference logprobs ONCE (policy==ref) and caches them (one resident model). Each
step forwards the policy on chosen+rejected, seq_logprob each, minimises
dpo_loss; the two forwards share params so backward accumulates both branches.
Tracks reward margin + preference accuracy (the doc-13 "don't trust loss alone"
health signal). Loss starts at exactly log2 (Δ=0 at init) — a built-in check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:37:01 +08:00
f3c764ce95 post-train: M3 — seq_logprob + dpo_loss autograd ops
Two new ops for DPO (M3), both reusing existing kernels (no new CUDA):

- seq_logprob(logits, target): Σ log πθ(target) over non-ignored (target≥0)
  positions — the per-sequence logprob DPO compares between policy and
  reference. = −Σ per_row of cross_entropy (ignored rows already 0, like SFT
  masking); backward = cross_entropy_backward(probs, target, −upstream) (sum,
  no mean division). Gate: finite-diff grad-check with a -100 completion mask.

- dpo_loss(lpθ_chosen, lpθ_rejected, lpref_chosen, lpref_rejected, β): scalar
  L = −log σ(Δ) = softplus(−Δ) with the two policy logprobs as parents (ref
  logprobs constant). Gate: grad-check both parents + degenerate points
  (policy==ref ⇒ Δ=0, L=log2, grads ∓β/2; β=0 ⇒ grads 0). Same formula as TRL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:11:01 +08:00
b39e6e7110 docs: M2a — KV-cache decode engine results (token-identical + length-dependent speedup)
Implementation log (docs/18) + Phase-3 row (evolution.md): the two decode
primitives and their gates, the engine design (host-cache baseline), the
token-identical centerpiece gate, and the measured throughput baseline showing
the cache win is sequence-length-dependent (~1.0x@32, ~1.9x@128, naive OOM@256).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:01:10 +08:00
eff26a0898 post-train: M2a — KV-cache incremental decode engine (token-identical)
Single-sequence KV-cache decode (xtrain-model/src/decode.rs): per-layer K/V
cache + single-token incremental forward (prefill = first prompt.len() decode
steps, one code path). Mirrors model::block_forward at the raw-Tensor level (no
autograd tape — inference needs no grads), using rope_at + decode_attention.
Cache is host-accumulated token-major f32, rebuilt per step (the honest M2a
baseline; M2b moves it device-side + batched ragged).

Gate (the M2 centerpiece): KV-cache greedy decode is TOKEN-IDENTICAL to the
naive full-recompute greedy — tests/decode_kv.rs (small GQA model, F32, 24
tokens) and corroborated on the v12 1.05B SFT checkpoint (cached eval =
naive eval byte-for-byte: format 100/100, correct 8/100).

eval_arith --cached A/Bs the two paths + reports decode tok/s. Measured on v12
(1.05B, batch 1, F32): the cache win is sequence-length-dependent —
  max_new=32   naive 108 vs cached 111 tok/s  (~1.0x; overhead-bound)
  max_new=128  naive  69 vs cached 133 tok/s  (~1.9x)
  max_new=256  naive OOM     vs cached 129 tok/s
Cached throughput stays ~constant (O(1)/token) while naive decays (O(t)/token,
O(seq^2) graph → OOM at length). Short eval prompts are overhead-bound, so the
cache matters for long rollouts (DPO/GRPO), not the arithmetic eval itself.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:00:03 +08:00
c88e2ab88c post-train: M2 — decode primitives (rope_at + decode_attention)
Two forward-only Tensor primitives the KV-cache decode engine is built on,
each gated by an isolated correctness test:

- rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no
  modulo) for a single decode token, vs the training rope_k (pos = row %
  period) left untouched. New forward-only CUDA kernel, no training-path risk.
  Gate: bit-identical to the full-sequence rope's corresponding row.
- decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from
  the existing strided batched GEMM + plain (non-causal) softmax — no new
  kernel. Gate: equals the full causal attention's last query row (max |Δ| 6e-8).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:00:03 +08:00
1574e21d89 post-train: M1 — verifiable-arith eval scorer + SFT format-baseline result
eval_arith: load ckpt, greedy-generate per held-out prompt, parse \boxed{}
via the shared task checker, report format(boxed) + correctness pass-rates.
Reused as the verifiable-eval harness for M3 (DPO) / M4 (GRPO).

M1 result (100 held-out prompts, v12 1.05B base): SFT moves answer-format
adherence 0% -> 100%, arithmetic correctness 8% -- the intended split (SFT
buys the format; correctness is the verifiable-reward job of M3/M4). Logged
in docs/18 implementation log + a Phase-3 row in docs/evolution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 11:13:19 +08:00
cb64604496 post-train: M1 fix — enlarge arith key space + saturation guard
The default operand ranges (max_add=99, max_mul=12) gave only ~20k unique
problems, so 'gen_arith_task --n 20000 --eval 500' (a) made train dedup
pathologically slow near saturation and (b) made the disjoint-eval loop never
terminate. A background run stalled after ~10k train rows with no eval files.

Fix (root cause, not a workaround):
- enlarge default ranges to max_add=999, max_mul=99 (~2.01M key space) so 20k+
  requests are a tiny fraction and dedup stays trivial;
- add unique_space() + a generator guard that errors clearly when n+eval exceeds
  80% of the key space, instead of looping forever.

Verified: cargo test 10/10; full 20000/500 gen now 0.2s, all 3 files, 0
train/eval leakage; guard panics on an oversized (--max-add 99) request.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 23:28:25 +08:00
9c70e99ae4 post-train: M1 — verifiable arithmetic task + SFT data generator
First post-training milestone (docs/18). Lands the verifiable task + its data
pipeline, all verified host-side (no CUDA); the SFT run itself reuses the
existing --sft-tsv path on the GPU box.

- task.rs: the shared task spec — two-operand integer arithmetic, answer in
  \boxed{N}, with parse_boxed_answer + check_answer (exact-match rule-based
  reward). One module reused by M1 (SFT data), M3 (DPO pairs), M4 (GRPO reward).
- gen_arith_task bin: writes arith_sft.tsv (--sft-tsv format) + held-out
  arith_eval_prompts.txt (greedy_sample format) + arith_eval_gold.txt; train
  deduped, eval disjoint from train.
- data.rs: extract assistant-only masking into a pure, testable sft_row()
  (behavior-preserving; single-turn bit-identical to fbf4ac2).

Gate (verified locally, no_cuda): cargo test -p xtrain-train --lib = 9/9 pass
(masking, SFT-target self-consistency over 2000 samples, parser edges, seed
determinism); a 200/50 gen run = clean 2-col TSV, correct gold incl. negatives,
0 train/eval leakage. SFT training run + format-eval pending on dash5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 22:52:25 +08:00
ab32168dcc docs: post-training stack design — SFT → KV-cache → DPO → GRPO (docs/18)
Design doc for a from-scratch post-training infra on top of xtrain. Ladder:
SFT (have it) → DPO → reward model (optional) → GRPO, each rung one new
post-training systems concept + a hard correctness gate (grad-check, PyTorch
parity, degenerate checks, a falsifiable 'it learns' signal).

Decisions aligned with the user (D1-D4):
- D1 scope: DPO → GRPO, reward model optional.
- D2 reward: rule-based / verifiable first; learned RM deferred.
- D3 rollout: build the KV-cache incremental-decode engine UP FRONT (not
  naive-first) as the foundational milestone before DPO/GRPO.
- D4 task: a verifiable task (arithmetic/format) with deterministic exact-match
  reward, for a clean RL signal.

Locked milestone order: M1 SFT task baseline → M2 KV-cache decode engine
(token-identical gate) → M3 DPO → M4 GRPO → M5 optional reward model. Status:
design only, no implementation yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 22:44:25 +08:00
7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check
- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:19:12 +08:00
fbf4ac2917 sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval
Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path
used by the v12 SFT runs:

- cross_entropy ignores negative targets (-100 ignore-index), normalizing by
  valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu).
- Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is
  formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens
  masked to -100 while answer+EOS are supervised; i32 label cache alongside the
  u16 token cache; sample() retries windows that are fully masked; eval uses
  target_window so masking applies to val loss too (data.rs, train_loop.rs).
- train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues
  training from a base checkpoint.
- greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt
  generation eval.

Test fixtures updated for the new Corpus.labels field; dropout.rs carries
incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout);
correctness rests on the documented v12 base+SFT runs on the GPU box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:19:02 +08:00
5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00
a1370446fe docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)
- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix +
  regression test), with the meta-lesson that op/single-GPU unit tests can
  miss launcher-level integration gaps — only the V9-PILOT end-to-end run on
  the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap
  and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
980605474b test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)
Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher
path (DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step — the deterministic scope): the p=0 FORWARD is
  byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the
  step loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce
  short-circuits and one step has no optimizer-state compounding; the only
  residual non-determinism is the engine's atomicAdd backward-reduction
  order (the documented fresh-train md5 caveat — dropout-independent), so the
  post-step params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
  run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible
  on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 —
  orders of magnitude above every noise floor here (~3e-2 observed). On the
  pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and
  the trace would match p=0 at the noise floor — this gate fails. (Verified by
  simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true,
  direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).

eval_every < steps so a periodic eval fires mid-run (flipping to eval mode),
exercising the per-step model.train() restore discipline the pilot called out.
Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
81f3cf59e5 distributed: T21 — wire dropout into the DDP path (--dropout + model.train())
V9-PILOT caught a launcher-level integration gap: T18 wired dropout into
the single-GPU bin/train, but the DDP path never did. train_ddp had no
--dropout flag and never set cfg.dropout, and ddp.rs::train_rank never
called model.train() — so under DDP every forward ran in the default eval
mode and dropout was a silent identity, regardless of config.

Fix, mirroring the single-GPU train/eval discipline:
- train_ddp.rs: add a --dropout <p> flag (default 0 = off, matching the
  prior behavior) and set cfg.dropout from it; log it when on.
- ddp.rs::train_rank: call model.train() at the start of each step (before
  the micro-batch loop). eval_loss() flips the model to eval mode and does
  not restore it, so re-asserting train() each step keeps dropout live
  across eval boundaries.

--dropout 0 (default) is bit-identical to the prior DDP path: cfg.dropout
stays 0 and ops::dropout(p=0) is a clone no-op regardless of training mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:08:17 +08:00
db70abe450 docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main:

README.md
- Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases"
  (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five
  deferred systems-stack features T14–T18).
- Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout
  in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU
  launcher in distributed).
- Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18
  prose with a structured "## Phase 2" summary (5 features + honest results:
  flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem,
  dropout × recompute bit-exact, T17 throughput-neutral falsification).
- Engineering lessons: T17 added as the THIRD profile-first falsification;
  reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9.
- Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED,
  KI-4 accepted tradeoff).

docs/evolution.md
- New "三·五、Phase 2 systems-depth synthesis": ties the 5 features into the
  per-axis (算法/架构/Infra/数据) narrative + the two integration notes.

docs/known-issues.md
- KI-4 reframed as a deliberately-accepted modeling tradeoff (保 xserv closed
  loop; T19 DROPPED), not "open".
- New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock);
  (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid
  determinism gate is export re-determinism, not fresh-train reproduction.
- (process-per-GPU item was already CLOSED=measured no-op in T17.)

Docs-only; no code touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:11:47 +08:00
71b0a1621f docs: T17 process-per-GPU results — measured throughput-neutral
Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5
b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op);
default training path unchanged. evolution.md Infra row + README T17 row +
known-issues entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 18:03:14 +08:00
4abb17383a test: process-per-GPU DDP correctness (ddp_proc.rs)
Self-launching test: worker mode (XTRAIN_RANK set) trains on synthetic corpus
and dumps loss+params; launcher mode runs single-GPU baseline + thread-per-GPU
launch + spawns 2 worker processes, then asserts (a) proc loss == single-GPU
<1e-3, (b) cross-rank params <1e-6 (KI-5 ULP), (c) proc loss == thread-per-GPU
<1e-3. Run with --test-threads=1 (distributed harness property).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:52 +08:00
a188c8a277 distributed: train_ddp_mp bin (process-per-GPU launcher/worker)
Dual-mode binary self-detecting via XTRAIN_RANK: launcher spawns one worker
per visible GPU forwarding full argv; worker rebuilds config from argv and runs
run_worker. CLI flags identical to train_ddp (thread-per-GPU, kept), so it
doubles as the before->after throughput driver. thread-per-GPU path untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:52 +08:00
ffd548b80b distributed: process-per-GPU launcher + worker (proc.rs)
torchrun-style process-per-GPU: launch_processes spawns one worker process per
GPU (re-exec current_exe with XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID} env),
mints the ncclUniqueId once in the launcher and hex-injects it via env (no
shared FS/TCP, race-free). worker_env/run_worker read the env, bind the device
(own CUDA context), DdpContext::init + build_model + train_rank reused from T8
UNCHANGED. hex_encode/decode_unique_id are host-testable pure fns.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:48:43 +08:00
c470c627a7 docs: Phase T17 — process-per-GPU DDP design
torchrun-style: launcher spawns N worker processes, each with its own CUDA
context; cross-process ncclUniqueId distributed via launcher-minted hex env
injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused
unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped
(user scope decision).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:44:38 +08:00
2ff4573a31 docs: T15 GQA results + evolution row (模型架构) + README build-journey row
Backfill docs/14-gqa.md gate table (dash5 numbers); add T15 evolution row +
cumulative 模型架构 line; README build-journey T15 row + Phase 2 prose + doc
index range (00..14).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:44:58 +08:00
39df0b40c1 gqa: fix kv-proj shape test param indices (embed,attn_norm precede wq)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:38:42 +08:00
830d06ad01 gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests)
- repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each
  kv head sums its group of query-head grads; no atomics) + Tensor/ops node.
- Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim;
  attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed
  & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical.
- --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export
  writes real num_key_value_heads (xserv repeat_kv grouping aligned).
- Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs
  (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape);
  parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:37:37 +08:00
62b1cb5dc7 docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 01:30:34 +08:00
4b6d3e0a79 test: flash+dropout cross-feature grad-check (Phase-2 integration)
Add flash_plus_dropout_grad_check_fp32 to xtrain-model dropout tests: the two
orthogonal Phase-2 features (T14 flash-attn, T18 dropout) in the same model must
still grad-check. Both models run train-mode p=0.2 (identical masks, seed is
flash-independent) so the only delta is the SDPA reduction order — checked against
the flash-vs-composed tolerance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:43:54 +08:00
c36cdf74d1 Merge t18-dropout into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	crates/xtrain-autodiff/tests/autograd.rs
#	crates/xtrain-model/src/model.rs
#	crates/xtrain-train/src/bin/train.rs
#	crates/xtrain-train/src/train_loop.rs
#	docs/evolution.md
2026-06-18 00:41:41 +08:00
f26db882e5 Merge t16-grad-accum into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	docs/evolution.md
2026-06-18 00:37:11 +08:00
9e958cb0f9 Merge t14-flash-attention into main
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:35:46 +08:00
80fafa1914 docs: T18 evolution row + README build-journey row (dropout)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:06:06 +08:00
e625aa05dd dropout: wire into model (residual sites) + train/eval switch + flag (T18)
Config.dropout (default 0). TinyTransformer gets a Cell<bool> training switch
(train()/eval()/with_training, default eval = safe) + a Cell<u64> step_seed bumped
once per training forward. forward_batched derives a per-layer block_seed (pure fn
of step_seed×layer) and block_forward derives two per-site seeds, inserting
ops::dropout at the attn and ffn sub-block outputs (before each residual). The
seed is a pure function of (step_seed, layer, site) so the checkpoint (T13)
recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout
node → graph bit-identical to pre-T18.

train_loop: model.train() per step (restored after eval flips to eval); eval_loss
runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in
eval (default), so exported weights are dropout-free (xserv closed loop unaffected).

Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads);
eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout
grads match non-recompute (fp32 + bf16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:32 +08:00
5eb27783f8 dropout: autodiff op + fixed-seed grad-check (T18)
ops::dropout(x,p,seed): fwd runs Tensor::dropout, caches the mask in the backward
closure, bwd pushes dx=d⊙mask. p==0 returns x.clone() (no node) so the default
graph is unchanged. Tests in autograd.rs: fixed-seed finite-diff grad-check (mask
held constant across the ± perturbation — dropout is a fixed elementwise linear
map of x); E[out]≈input + keep-rate≈1-p over a seed sweep; p=0 kernel identity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:32 +08:00
1fdd0c5002 dropout: device RNG kernel + Tensor fwd/bwd (T18)
csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32
uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer
(per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16
activation variants (mask fp32 in both; uniform is dtype-independent so masks
match across precisions). Stateless → re-run with same seed = same mask (T13
recompute-safe). Registered in build.rs + FFI decls.

Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the
launches (contiguous F32/BF16, default stream, per-op sync via the kernels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:18 +08:00
6b8c1e4e0f docs: Phase T18 — dropout design (device RNG + mask)
Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p)
scaling at train, identity at eval. New autodiff `dropout` op (fwd generates +
applies mask, bwd applies the SAME cached mask). Wired at the two residual-path
sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused
SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic
seed (so T13 recompute reproduces the same mask), train/eval switch, p=0
bit-identity, and the acceptance gates.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 00:05:08 +08:00
8bd7db16e1 docs: T16 grad-accum results — evolution row + README build-journey
dash5-verified gate numbers: accum=N bit-close to N× big batch (loss
8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches
single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big →
7.2GB accum, −74%), xserv closed loop md5-identical + token-identical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:52:32 +08:00
b06b553f99 test: drop unused Var import in grad_accum
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:49:04 +08:00
abe5ceb913 test: grad-accum equivalence + accum=1 bit-identity + DDP+accum
- grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch;
  accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop
  with accum tracks a big-batch baseline over 20 AdamW steps.
- ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of
  the same effective size (loss + cross-rank + vs-baseline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:45:40 +08:00
7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps)
Accumulate grads over N micro-batches, then one AdamW step + zero_grad,
for an effective batch of N×micro at one micro-batch's activation cost.
Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates
the scaled grads) so the boundary grad equals a single step over an N×
batch. accum==1 skips the scale → bit-identical to the pre-T16 path.

DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary
(intermediate micro-steps are local-only, no NCCL); the /world average is
orthogonal to the per-micro 1/N, so the boundary grad is the effective
global-batch mean. New --accum-steps flag in both train binaries; effective
batch is printed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:45:33 +08:00
d01fec6639 docs: Phase T16 — gradient accumulation design
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:41:17 +08:00
9064ced4c2 docs: T14 flash-attention results + evolution/README rows
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to
evolution.md (算法/Infra) and the README build-journey table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:34:10 +08:00
d217f4fbd3 perf: spread flash bwd dK/dV atomics across all threads
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:27:33 +08:00
4d7b69f8d4 perf: cache softmax weights in shared mem (drop hd× redundant expf)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:24:56 +08:00
9b05f4f93f test: flash==composed bf16 uses robust mean/p99 metric (repo convention)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:19:08 +08:00
c0f0b67510 test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:17:44 +08:00
80602099dc test: scale Q/K in flash grad-check for well-conditioned grads
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:17:04 +08:00