Gahow Wang

gahow pushed to t16-grad-accum at gahow/xtrain

2026-06-17 15:49:18 +00:00

b06b553f99 test: drop unused Var import in grad_accum

gahow pushed to t16-grad-accum at gahow/xtrain

2026-06-17 15:45:48 +00:00

abe5ceb913 test: grad-accum equivalence + accum=1 bit-identity + DDP+accum

7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps)

d01fec6639 docs: Phase T16 — gradient accumulation design

gahow created branch t16-grad-accum in gahow/xtrain

2026-06-17 15:45:48 +00:00

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:34:17 +00:00

9064ced4c2 docs: T14 flash-attention results + evolution/README rows

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:27:34 +00:00

d217f4fbd3 perf: spread flash bwd dK/dV atomics across all threads

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:24:58 +00:00

4d7b69f8d4 perf: cache softmax weights in shared mem (drop hd× redundant expf)

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:19:09 +00:00

9b05f4f93f test: flash==composed bf16 uses robust mean/p99 metric (repo convention)

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:17:46 +00:00

c0f0b67510 test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term)

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:17:07 +00:00

80602099dc test: scale Q/K in flash grad-check for well-conditioned grads

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:16:22 +00:00

f38beb0346 test: flash finite-diff grad-check uses single-tile clean regime

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:12:30 +00:00

01fb22d114 test: flash bwd vs composed bwd (sharper than finite-diff)

gahow pushed to t14-flash-attention at gahow/xtrain

2026-06-17 15:10:47 +00:00

5f3b81ac96 test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag

0e20821633 autodiff+model: flash-attention op + --flash opt-in wiring

326a6fadfe cuda: fused flash-attention kernel (fwd + flash-style bwd)

65a2264227 docs: Phase T14 — fused flash-attention design

gahow created branch t14-flash-attention in gahow/xtrain

2026-06-17 15:10:47 +00:00

gahow pushed to feat/fig18-real-output-lca-substrate at gahow/aituner

2026-06-17 14:11:53 +00:00

a1b804f879 Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes)

gahow pushed to feat/fig18-real-output-lca-substrate at gahow/aituner

2026-06-17 09:24:06 +00:00

0c23285f39 Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline

gahow created branch feat/fig18-real-output-lca-substrate in gahow/aituner

2026-06-17 09:24:06 +00:00

gahow pushed to main at gahow/xtrain

2026-06-17 08:17:27 +00:00

31cc2bf745 docs: capstone README — full-stack + scaling study (v0-v8) writeup

gahow pushed to main at gahow/xtrain

2026-06-17 07:12:05 +00:00

511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98)

gahow pushed to main at gahow/aituner

2026-06-17 05:03:27 +00:00

816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic

gahow pushed to main at gahow/aituner

2026-06-17 02:05:46 +00:00

97d2ddabb1 Ablation driver: force direct LLM connection (codex proxy is dash0-local)