xtrain

gahow/xtrain

Commit Graph

Author

SHA1

Message

Date

Gahow Wang

a1370446fe

docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix +
  regression test), with the meta-lesson that op/single-GPU unit tests can
  miss launcher-level integration gaps — only the V9-PILOT end-to-end run on
  the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap
  and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-18 21:22:49 +08:00

Gahow Wang

6b8c1e4e0f

docs: Phase T18 — dropout design (device RNG + mask)

Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p)
scaling at train, identity at eval. New autodiff `dropout` op (fwd generates +
applies mask, bwd applies the SAME cached mask). Wired at the two residual-path
sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused
SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic
seed (so T13 recompute reproduces the same mask), train/eval switch, p=0
bit-identity, and the acceptance gates.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-18 00:05:08 +08:00
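
The dropout design summarised in commit 6b8c1e4e0f can be illustrated with a small CPU-only sketch. It is not the repository's implementation: the commit does not say which counter-based generator is used, so a SplitMix64-style mixer stands in here, and the names `splitmix64`, `keep_mask`, `dropout_forward`, `dropout_backward`, `site_id`, and `step` are hypothetical. The sketch only shows the properties the message lists: a mask that depends purely on (seed, site, step, index), inverted 1/(1-p) scaling in training, identity at eval and at p=0, and a backward pass that reuses the cached mask.

```python
MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """Stateless 64-bit mixer (stand-in RNG): same input, same output."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def keep_mask(n: int, p: float, seed: int, site_id: int, step: int) -> list:
    """Bernoulli(keep=1-p) mask computed only from (seed, site_id, step, index)."""
    key = splitmix64(splitmix64(splitmix64(seed) ^ site_id) ^ step)
    mask = []
    for i in range(n):
        # Top 53 bits of the hash -> uniform float in [0, 1).
        u = (splitmix64(key ^ i) >> 11) / float(1 << 53)
        mask.append(u >= p)  # keep with probability 1 - p
    return mask


def dropout_forward(x, p, training, seed, site_id, step):
    """Return (y, cache). Identity (no scaling, no mask) at eval or p == 0."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(x), None
    mask = keep_mask(len(x), p, seed, site_id, step)
    scale = 1.0 / (1.0 - p)  # inverted dropout scaling
    y = [xi * scale if m else 0.0 for xi, m in zip(x, mask)]
    return y, (mask, scale)


def dropout_backward(grad_y, cache):
    """Apply the SAME cached mask and scale to the incoming gradient."""
    if cache is None:
        return list(grad_y)
    mask, scale = cache
    return [g * scale if m else 0.0 for g, m in zip(grad_y, mask)]
```

Because the mask is a pure function of its inputs, calling `dropout_forward` twice with the same `(seed, site_id, step)` yields the same output, which is the property a recompute pass relies on. Giving each wiring site its own `site_id` is one simple way to get a per-site deterministic seed under these assumptions.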

2 Commits