-
6465a2d5ce
test: T21-for-proc — clear ENV_DROPOUT across tests to sever ordering coupling
main
Gahow Wang
2026-07-01 14:09:42 +08:00
-
33a1aee9ec
test: T21-for-proc — dropout-live regression under process-per-GPU
Gahow Wang
2026-07-01 13:51:31 +08:00
-
86de6bfb51
distributed: T21-for-proc — wire --dropout into the process-per-GPU launcher
Gahow Wang
2026-07-01 13:51:17 +08:00
-
4379868f2d
docs: M2d — ragged-batching lever, 9× measured, step bottleneck → rollout
Gahow Wang
2026-06-30 23:03:28 +08:00
-
0e82b2438e
test: M2d — ragged-forward + batched-op equivalence gates + throughput bench
Gahow Wang
2026-06-30 23:03:09 +08:00
-
c2ebf62ae1
post-train: M2d — batch the GRPO training-side forwards (op + module + wiring)
Gahow Wang
2026-06-30 23:02:56 +08:00
-
41d46208a6
docs: M2c — device KV cache + the bottleneck-shift finding
Gahow Wang
2026-06-30 17:39:10 +08:00
-
3a3425960c
post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift
Gahow Wang
2026-06-30 17:38:16 +08:00
-
0f76c0fdb0
docs: M2b — batched decode results (token-identical + ~1.7x rollout, device-cache next)
Gahow Wang
2026-06-30 17:20:01 +08:00
-
361c5290fa
post-train: M4 — use M2b batched rollout in GRPO (~1.7× step)
Gahow Wang
2026-06-30 17:18:54 +08:00
-
2c9b58cb3b
post-train: M2b — batched KV-cache decode (G-way, token-identical)
Gahow Wang
2026-06-30 17:18:54 +08:00
-
096e45b845
docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)
Gahow Wang
2026-06-30 17:01:22 +08:00
-
7fb3b32fd9
post-train: M4 — GRPO actor-learner loop + cached temperature rollout
Gahow Wang
2026-06-30 16:59:05 +08:00
-
aaa77082ef
post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op)
Gahow Wang
2026-06-30 14:07:02 +08:00
-
99090465bf
docs: M3 — DPO results (infra correct, held-out correctness flat, over-optimization collapse)
Gahow Wang
2026-06-30 12:38:06 +08:00
-
2f827fd6d8
post-train: M3 — DPO pair-gen + training loop (verifiable arithmetic)
Gahow Wang
2026-06-30 12:37:01 +08:00
-
f3c764ce95
post-train: M3 — seq_logprob + dpo_loss autograd ops
Gahow Wang
2026-06-30 12:11:01 +08:00
-
b39e6e7110
docs: M2a — KV-cache decode engine results (token-identical + length-dependent speedup)
Gahow Wang
2026-06-30 12:01:10 +08:00
-
eff26a0898
post-train: M2a — KV-cache incremental decode engine (token-identical)
Gahow Wang
2026-06-30 12:00:03 +08:00
-
c88e2ab88c
post-train: M2 — decode primitives (rope_at + decode_attention)
Gahow Wang
2026-06-30 12:00:03 +08:00
-
1574e21d89
post-train: M1 — verifiable-arith eval scorer + SFT format-baseline result
Gahow Wang
2026-06-30 11:13:19 +08:00
-
cb64604496
post-train: M1 fix — enlarge arith key space + saturation guard
Gahow Wang
2026-06-29 23:28:25 +08:00
-
9c70e99ae4
post-train: M1 — verifiable arithmetic task + SFT data generator
Gahow Wang
2026-06-29 22:52:25 +08:00
-
ab32168dcc
docs: post-training stack design — SFT → KV-cache → DPO → GRPO (docs/18)
Gahow Wang
2026-06-29 22:44:25 +08:00
-
7a1fba95b5
docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check
Gahow Wang
2026-06-29 16:19:12 +08:00
-
fbf4ac2917
sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval
Gahow Wang
2026-06-29 16:19:02 +08:00
-
5c27493a90
docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Gahow Wang
2026-06-29 16:18:48 +08:00
-
a1370446fe
docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)
Gahow Wang
2026-06-18 21:22:49 +08:00
-
980605474b
test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)
Gahow Wang
2026-06-18 21:22:49 +08:00
-
81f3cf59e5
distributed: T21 — wire dropout into the DDP path (--dropout + model.train())
Gahow Wang
2026-06-18 21:08:17 +08:00
-
db70abe450
docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases)
Gahow Wang
2026-06-18 18:11:47 +08:00
-
71b0a1621f
docs: T17 process-per-GPU results — measured throughput-neutral
Gahow Wang
2026-06-18 18:03:14 +08:00
-
4abb17383a
test: process-per-GPU DDP correctness (ddp_proc.rs)
Gahow Wang
2026-06-18 17:48:52 +08:00
-
a188c8a277
distributed: train_ddp_mp bin (process-per-GPU launcher/worker)
Gahow Wang
2026-06-18 17:48:52 +08:00
-
ffd548b80b
distributed: process-per-GPU launcher + worker (proc.rs)
Gahow Wang
2026-06-18 17:48:43 +08:00
-
c470c627a7
docs: Phase T17 — process-per-GPU DDP design
Gahow Wang
2026-06-18 17:44:38 +08:00
-
2ff4573a31
docs: T15 GQA results + evolution row (model architecture) + README build-journey row
Gahow Wang
2026-06-18 01:44:58 +08:00
-
39df0b40c1
gqa: fix kv-proj shape test param indices (embed,attn_norm precede wq)
Gahow Wang
2026-06-18 01:38:42 +08:00
-
830d06ad01
gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests)
Gahow Wang
2026-06-18 01:37:16 +08:00
-
62b1cb5dc7
docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum)
Gahow Wang
2026-06-18 01:30:34 +08:00
-
4b6d3e0a79
test: flash+dropout cross-feature grad-check (Phase-2 integration)
Gahow Wang
2026-06-18 00:43:54 +08:00
-
c36cdf74d1
Merge t18-dropout into main
Gahow Wang
2026-06-18 00:41:41 +08:00
-
f26db882e5
Merge t16-grad-accum into main
Gahow Wang
2026-06-18 00:37:11 +08:00
-
9e958cb0f9
Merge t14-flash-attention into main
Gahow Wang
2026-06-18 00:35:46 +08:00
-
80fafa1914
docs: T18 evolution row + README build-journey row (dropout)
Gahow Wang
2026-06-18 00:06:06 +08:00
-
e625aa05dd
dropout: wire into model (residual sites) + train/eval switch + flag (T18)
Gahow Wang
2026-06-18 00:05:32 +08:00
-
5eb27783f8
dropout: autodiff op + fixed-seed grad-check (T18)
Gahow Wang
2026-06-18 00:05:32 +08:00
-
1fdd0c5002
dropout: device RNG kernel + Tensor fwd/bwd (T18)
Gahow Wang
2026-06-18 00:05:18 +08:00
-
6b8c1e4e0f
docs: Phase T18 — dropout design (device RNG + mask)
Gahow Wang
2026-06-18 00:05:08 +08:00
-
8bd7db16e1
docs: T16 grad-accum results — evolution row + README build-journey
Gahow Wang
2026-06-17 23:52:32 +08:00
-
b06b553f99
test: drop unused Var import in grad_accum
Gahow Wang
2026-06-17 23:49:04 +08:00
-
abe5ceb913
test: grad-accum equivalence + accum=1 bit-identity + DDP+accum
Gahow Wang
2026-06-17 23:45:40 +08:00
-
7a03b0054a
train+ddp: micro-batch gradient accumulation (--accum-steps)
Gahow Wang
2026-06-17 23:45:33 +08:00
-
d01fec6639
docs: Phase T16 — gradient accumulation design
Gahow Wang
2026-06-17 23:41:17 +08:00
-
9064ced4c2
docs: T14 flash-attention results + evolution/README rows
Gahow Wang
2026-06-17 23:34:10 +08:00
-
d217f4fbd3
perf: spread flash bwd dK/dV atomics across all threads
Gahow Wang
2026-06-17 23:27:33 +08:00
-
4d7b69f8d4
perf: cache softmax weights in shared mem (drop hd× redundant expf)
Gahow Wang
2026-06-17 23:24:56 +08:00
-
9b05f4f93f
test: flash==composed bf16 uses robust mean/p99 metric (repo convention)
Gahow Wang
2026-06-17 23:19:08 +08:00
-
c0f0b67510
test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term)
Gahow Wang
2026-06-17 23:17:44 +08:00
-
80602099dc
test: scale Q/K in flash grad-check for well-conditioned grads
Gahow Wang
2026-06-17 23:17:04 +08:00
-
f38beb0346
test: flash finite-diff grad-check uses single-tile clean regime
Gahow Wang
2026-06-17 23:16:20 +08:00
-
01fb22d114
test: flash bwd vs composed bwd (sharper than finite-diff)
Gahow Wang
2026-06-17 23:12:30 +08:00
-
5f3b81ac96
test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag
Gahow Wang
2026-06-17 23:10:39 +08:00
-
0e20821633
autodiff+model: flash-attention op + --flash opt-in wiring
Gahow Wang
2026-06-17 23:10:32 +08:00
-
326a6fadfe
cuda: fused flash-attention kernel (fwd + flash-style bwd)
Gahow Wang
2026-06-17 23:10:25 +08:00
-
65a2264227
docs: Phase T14 — fused flash-attention design
Gahow Wang
2026-06-17 23:10:16 +08:00
-
31cc2bf745
docs: capstone README — full-stack + scaling study (v0-v8) writeup
Gahow Wang
2026-06-17 16:17:26 +08:00
-
511f35d40c
docs: run v8 — dim1024 capacity helps (val 2.98)
Gahow Wang
2026-06-17 15:12:01 +08:00
-
0150263055
perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K
Gahow Wang
2026-06-17 09:50:29 +08:00
-
69c5f07359
docs: Phase T13 — activation recompute
Gahow Wang
2026-06-17 09:43:56 +08:00
-
f202351be5
model: per-block activation recompute (--recompute)
Gahow Wang
2026-06-17 09:42:42 +08:00
-
c396b39483
autodiff: checkpoint primitive (recompute-on-backward)
Gahow Wang
2026-06-17 09:42:31 +08:00
-
9c557f0609
docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01)
Gahow Wang
2026-06-17 03:55:47 +08:00
-
b4bb426d48
docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution)
Gahow Wang
2026-06-16 22:21:43 +08:00
-
88bec270af
docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes
Gahow Wang
2026-06-16 19:30:52 +08:00
-
7e5ea9976b
data: FineWeb-edu parquet->txt prep script (Scaling v6)
Gahow Wang
2026-06-16 19:04:45 +08:00
-
579365f4a0
docs: run v5 — TinyStories saturation at dim768 (val 1.11)
Gahow Wang
2026-06-16 17:56:25 +08:00
-
8a1e29543b
run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11)
Gahow Wang
2026-06-16 17:56:25 +08:00
-
5b7dde1736
test: bf16 test reads f32-cast logits (forward now returns bf16)
Gahow Wang
2026-06-16 14:29:24 +08:00
-
320c1ae4fb
perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K
Gahow Wang
2026-06-16 14:28:20 +08:00
-
48922cb628
perf: keep bf16 logits (no persistent fp32 logits buffer)
Gahow Wang
2026-06-16 14:20:48 +08:00
-
30db62d8f2
docs: Phase T12 — bf16 mixed precision design
Gahow Wang
2026-06-16 14:15:02 +08:00
-
0a2a4dcaa8
train: --bf16 flag (fp32-master AMP) + bf16 correctness test
Gahow Wang
2026-06-16 14:14:55 +08:00
-
b0086b5214
autodiff: bf16 mixed-precision path (fp32 master via cast op)
Gahow Wang
2026-06-16 14:14:48 +08:00
-
d05115ddf3
cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels
Gahow Wang
2026-06-16 14:14:39 +08:00
-
511ceebbb3
docs: KI-2 trigger — dim768 fp32 batch-32 OOM
Gahow Wang
2026-06-16 13:14:42 +08:00
-
ff79fee3c5
docs: run v4 — TinyStories, dim768, val 1.17
Gahow Wang
2026-06-16 13:14:37 +08:00
-
734e119db3
run: v4 archive + export (dim768, 8-GPU DDP, val 1.17)
Gahow Wang
2026-06-16 13:14:28 +08:00
-
f85bd4d276
perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8
Gahow Wang
2026-06-16 11:15:02 +08:00
-
4c3f332f64
docs: Phase T11 — caching allocator
Gahow Wang
2026-06-16 11:04:11 +08:00
-
b7104e2cb7
test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8
Gahow Wang
2026-06-16 11:04:11 +08:00
-
28801fbfe5
cuda: device caching allocator (pool GpuBuffer alloc)
Gahow Wang
2026-06-16 11:04:02 +08:00
-
d422c68704
docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky)
Gahow Wang
2026-06-16 09:42:13 +08:00
-
84092fb28d
docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11)
Gahow Wang
2026-06-16 09:40:45 +08:00
-
88c2c15768
Revert "dist: coalesce grads into buckets for all-reduce (KI-5)"
Gahow Wang
2026-06-16 09:39:38 +08:00
-
b8b58212dc
dist: coalesce grads into buckets for all-reduce (KI-5)
Gahow Wang
2026-06-16 09:09:44 +08:00
-
a78502e0f0
docs: run v3 — TinyStories, dim512, val 1.30
Gahow Wang
2026-06-16 03:37:45 +08:00
-
64b2a8c09e
run: v3 archive + export (dim512, single-GPU batched, val 1.30)
Gahow Wang
2026-06-16 03:37:36 +08:00
-
9a25616a30
docs: Phase T10 — batched forward
Gahow Wang
2026-06-16 00:44:50 +08:00
-
4ccab0fb42
perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Gahow Wang
2026-06-16 00:44:43 +08:00