Commit Graph

  • 6465a2d5ce test: T21-for-proc — clear ENV_DROPOUT across tests to sever ordering coupling main Gahow Wang 2026-07-01 14:09:42 +08:00
  • 33a1aee9ec test: T21-for-proc — dropout-live regression under process-per-GPU Gahow Wang 2026-07-01 13:51:31 +08:00
  • 86de6bfb51 distributed: T21-for-proc — wire --dropout into the process-per-GPU launcher Gahow Wang 2026-07-01 13:51:17 +08:00
  • 4379868f2d docs: M2d — ragged-batching lever, 9× measured, step bottleneck → rollout Gahow Wang 2026-06-30 23:03:28 +08:00
  • 0e82b2438e test: M2d — ragged-forward + batched-op equivalence gates + throughput bench Gahow Wang 2026-06-30 23:03:09 +08:00
  • c2ebf62ae1 post-train: M2d — batch the GRPO training-side forwards (op + module + wiring) Gahow Wang 2026-06-30 23:02:56 +08:00
  • 41d46208a6 docs: M2c — device KV cache + the bottleneck-shift finding Gahow Wang 2026-06-30 17:39:10 +08:00
  • 3a3425960c post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift Gahow Wang 2026-06-30 17:38:16 +08:00
  • 0f76c0fdb0 docs: M2b — batched decode results (token-identical + ~1.7x rollout, device-cache next) Gahow Wang 2026-06-30 17:20:01 +08:00
  • 361c5290fa post-train: M4 — use M2b batched rollout in GRPO (~1.7× step) Gahow Wang 2026-06-30 17:18:54 +08:00
  • 2c9b58cb3b post-train: M2b — batched KV-cache decode (G-way, token-identical) Gahow Wang 2026-06-30 17:18:54 +08:00
  • 096e45b845 docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result) Gahow Wang 2026-06-30 17:01:22 +08:00
  • 7fb3b32fd9 post-train: M4 — GRPO actor-learner loop + cached temperature rollout Gahow Wang 2026-06-30 16:59:05 +08:00
  • aaa77082ef post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op) Gahow Wang 2026-06-30 14:07:02 +08:00
  • 99090465bf docs: M3 — DPO results (infra correct, held-out correctness flat, over-optimization collapse) Gahow Wang 2026-06-30 12:38:06 +08:00
  • 2f827fd6d8 post-train: M3 — DPO pair-gen + training loop (verifiable arithmetic) Gahow Wang 2026-06-30 12:37:01 +08:00
  • f3c764ce95 post-train: M3 — seq_logprob + dpo_loss autograd ops Gahow Wang 2026-06-30 12:11:01 +08:00
  • b39e6e7110 docs: M2a — KV-cache decode engine results (token-identical + length-dependent speedup) Gahow Wang 2026-06-30 12:01:10 +08:00
  • eff26a0898 post-train: M2a — KV-cache incremental decode engine (token-identical) Gahow Wang 2026-06-30 12:00:03 +08:00
  • c88e2ab88c post-train: M2 — decode primitives (rope_at + decode_attention) Gahow Wang 2026-06-30 12:00:03 +08:00
  • 1574e21d89 post-train: M1 — verifiable-arith eval scorer + SFT format-baseline result Gahow Wang 2026-06-30 11:13:19 +08:00
  • cb64604496 post-train: M1 fix — enlarge arith key space + saturation guard Gahow Wang 2026-06-29 23:28:25 +08:00
  • 9c70e99ae4 post-train: M1 — verifiable arithmetic task + SFT data generator Gahow Wang 2026-06-29 22:52:25 +08:00
  • ab32168dcc docs: post-training stack design — SFT → KV-cache → DPO → GRPO (docs/18) Gahow Wang 2026-06-29 22:44:25 +08:00
  • 7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check Gahow Wang 2026-06-29 16:19:12 +08:00
  • fbf4ac2917 sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval Gahow Wang 2026-06-29 16:19:02 +08:00
  • 5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases Gahow Wang 2026-06-29 16:18:48 +08:00
  • a1370446fe docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc) Gahow Wang 2026-06-18 21:22:49 +08:00
  • 980605474b test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical) Gahow Wang 2026-06-18 21:22:49 +08:00
  • 81f3cf59e5 distributed: T21 — wire dropout into the DDP path (--dropout + model.train()) Gahow Wang 2026-06-18 21:08:17 +08:00
  • db70abe450 docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases) Gahow Wang 2026-06-18 18:11:47 +08:00
  • 71b0a1621f docs: T17 process-per-GPU results — measured throughput-neutral Gahow Wang 2026-06-18 18:03:14 +08:00
  • 4abb17383a test: process-per-GPU DDP correctness (ddp_proc.rs) Gahow Wang 2026-06-18 17:48:52 +08:00
  • a188c8a277 distributed: train_ddp_mp bin (process-per-GPU launcher/worker) Gahow Wang 2026-06-18 17:48:52 +08:00
  • ffd548b80b distributed: process-per-GPU launcher + worker (proc.rs) Gahow Wang 2026-06-18 17:48:43 +08:00
  • c470c627a7 docs: Phase T17 — process-per-GPU DDP design Gahow Wang 2026-06-18 17:44:38 +08:00
  • 2ff4573a31 docs: T15 GQA results + evolution row (model architecture) + README build-journey row Gahow Wang 2026-06-18 01:44:58 +08:00
  • 39df0b40c1 gqa: fix kv-proj shape test param indices (embed,attn_norm precede wq) Gahow Wang 2026-06-18 01:38:42 +08:00
  • 830d06ad01 gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) Gahow Wang 2026-06-18 01:37:16 +08:00
  • 62b1cb5dc7 docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum) Gahow Wang 2026-06-18 01:30:34 +08:00
  • 4b6d3e0a79 test: flash+dropout cross-feature grad-check (Phase-2 integration) Gahow Wang 2026-06-18 00:43:54 +08:00
  • c36cdf74d1 Merge t18-dropout into main Gahow Wang 2026-06-18 00:41:41 +08:00
  • f26db882e5 Merge t16-grad-accum into main Gahow Wang 2026-06-18 00:37:11 +08:00
  • 9e958cb0f9 Merge t14-flash-attention into main Gahow Wang 2026-06-18 00:35:46 +08:00
  • 80fafa1914 docs: T18 evolution row + README build-journey row (dropout) Gahow Wang 2026-06-18 00:06:06 +08:00
  • e625aa05dd dropout: wire into model (residual sites) + train/eval switch + flag (T18) Gahow Wang 2026-06-18 00:05:32 +08:00
  • 5eb27783f8 dropout: autodiff op + fixed-seed grad-check (T18) Gahow Wang 2026-06-18 00:05:32 +08:00
  • 1fdd0c5002 dropout: device RNG kernel + Tensor fwd/bwd (T18) Gahow Wang 2026-06-18 00:05:18 +08:00
  • 6b8c1e4e0f docs: Phase T18 — dropout design (device RNG + mask) Gahow Wang 2026-06-18 00:05:08 +08:00
  • 8bd7db16e1 docs: T16 grad-accum results — evolution row + README build-journey Gahow Wang 2026-06-17 23:52:32 +08:00
  • b06b553f99 test: drop unused Var import in grad_accum Gahow Wang 2026-06-17 23:49:04 +08:00
  • abe5ceb913 test: grad-accum equivalence + accum=1 bit-identity + DDP+accum Gahow Wang 2026-06-17 23:45:40 +08:00
  • 7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps) Gahow Wang 2026-06-17 23:45:33 +08:00
  • d01fec6639 docs: Phase T16 — gradient accumulation design Gahow Wang 2026-06-17 23:41:17 +08:00
  • 9064ced4c2 docs: T14 flash-attention results + evolution/README rows Gahow Wang 2026-06-17 23:34:10 +08:00
  • d217f4fbd3 perf: spread flash bwd dK/dV atomics across all threads Gahow Wang 2026-06-17 23:27:33 +08:00
  • 4d7b69f8d4 perf: cache softmax weights in shared mem (drop hd× redundant expf) Gahow Wang 2026-06-17 23:24:56 +08:00
  • 9b05f4f93f test: flash==composed bf16 uses robust mean/p99 metric (repo convention) Gahow Wang 2026-06-17 23:19:08 +08:00
  • c0f0b67510 test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term) Gahow Wang 2026-06-17 23:17:44 +08:00
  • 80602099dc test: scale Q/K in flash grad-check for well-conditioned grads Gahow Wang 2026-06-17 23:17:04 +08:00
  • f38beb0346 test: flash finite-diff grad-check uses single-tile clean regime Gahow Wang 2026-06-17 23:16:20 +08:00
  • 01fb22d114 test: flash bwd vs composed bwd (sharper than finite-diff) Gahow Wang 2026-06-17 23:12:30 +08:00
  • 5f3b81ac96 test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag Gahow Wang 2026-06-17 23:10:39 +08:00
  • 0e20821633 autodiff+model: flash-attention op + --flash opt-in wiring Gahow Wang 2026-06-17 23:10:32 +08:00
  • 326a6fadfe cuda: fused flash-attention kernel (fwd + flash-style bwd) Gahow Wang 2026-06-17 23:10:25 +08:00
  • 65a2264227 docs: Phase T14 — fused flash-attention design Gahow Wang 2026-06-17 23:10:16 +08:00
  • 31cc2bf745 docs: capstone README — full-stack + scaling study (v0-v8) writeup Gahow Wang 2026-06-17 16:17:26 +08:00
  • 511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98) Gahow Wang 2026-06-17 15:12:01 +08:00
  • 0150263055 perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K Gahow Wang 2026-06-17 09:50:29 +08:00
  • 69c5f07359 docs: Phase T13 — activation recompute Gahow Wang 2026-06-17 09:43:56 +08:00
  • f202351be5 model: per-block activation recompute (--recompute) Gahow Wang 2026-06-17 09:42:42 +08:00
  • c396b39483 autodiff: checkpoint primitive (recompute-on-backward) Gahow Wang 2026-06-17 09:42:31 +08:00
  • 9c557f0609 docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01) Gahow Wang 2026-06-17 03:55:47 +08:00
  • b4bb426d48 docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution) Gahow Wang 2026-06-16 22:21:43 +08:00
  • 88bec270af docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes Gahow Wang 2026-06-16 19:30:52 +08:00
  • 7e5ea9976b data: FineWeb-edu parquet->txt prep script (Scaling v6) Gahow Wang 2026-06-16 19:04:45 +08:00
  • 579365f4a0 docs: run v5 — TinyStories saturation at dim768 (val 1.11) Gahow Wang 2026-06-16 17:56:25 +08:00
  • 8a1e29543b run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11) Gahow Wang 2026-06-16 17:56:25 +08:00
  • 5b7dde1736 test: bf16 test reads f32-cast logits (forward now returns bf16) Gahow Wang 2026-06-16 14:29:24 +08:00
  • 320c1ae4fb perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K Gahow Wang 2026-06-16 14:28:20 +08:00
  • 48922cb628 perf: keep bf16 logits (no persistent fp32 logits buffer) Gahow Wang 2026-06-16 14:20:48 +08:00
  • 30db62d8f2 docs: Phase T12 — bf16 mixed precision design Gahow Wang 2026-06-16 14:15:02 +08:00
  • 0a2a4dcaa8 train: --bf16 flag (fp32-master AMP) + bf16 correctness test Gahow Wang 2026-06-16 14:14:55 +08:00
  • b0086b5214 autodiff: bf16 mixed-precision path (fp32 master via cast op) Gahow Wang 2026-06-16 14:14:48 +08:00
  • d05115ddf3 cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels Gahow Wang 2026-06-16 14:14:39 +08:00
  • 511ceebbb3 docs: KI-2 trigger — dim768 fp32 batch-32 OOM Gahow Wang 2026-06-16 13:14:42 +08:00
  • ff79fee3c5 docs: run v4 — TinyStories, dim768, val 1.17 Gahow Wang 2026-06-16 13:14:37 +08:00
  • 734e119db3 run: v4 archive + export (dim768, 8-GPU DDP, val 1.17) Gahow Wang 2026-06-16 13:14:28 +08:00
  • f85bd4d276 perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8 Gahow Wang 2026-06-16 11:15:02 +08:00
  • 4c3f332f64 docs: Phase T11 — caching allocator Gahow Wang 2026-06-16 11:04:11 +08:00
  • b7104e2cb7 test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8 Gahow Wang 2026-06-16 11:04:11 +08:00
  • 28801fbfe5 cuda: device caching allocator (pool GpuBuffer alloc) Gahow Wang 2026-06-16 11:04:02 +08:00
  • d422c68704 docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky) Gahow Wang 2026-06-16 09:42:13 +08:00
  • 84092fb28d docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11) Gahow Wang 2026-06-16 09:40:45 +08:00
  • 88c2c15768 Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" Gahow Wang 2026-06-16 09:39:38 +08:00
  • b8b58212dc dist: coalesce grads into buckets for all-reduce (KI-5) Gahow Wang 2026-06-16 09:09:44 +08:00
  • a78502e0f0 docs: run v3 — TinyStories, dim512, val 1.30 Gahow Wang 2026-06-16 03:37:45 +08:00
  • 64b2a8c09e run: v3 archive + export (dim512, single-GPU batched, val 1.30) Gahow Wang 2026-06-16 03:37:36 +08:00
  • 9a25616a30 docs: Phase T10 — batched forward Gahow Wang 2026-06-16 00:44:50 +08:00
  • 4ccab0fb42 perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x) Gahow Wang 2026-06-16 00:44:43 +08:00