xtrain/docs at 2c9b58cb3b65f6f6b85ded888ec77832e3bd0b57

gahow/xtrain

Gahow Wang 096e45b845 docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)

Implementation log (docs/18) + Phase-3 row (evolution.md): the clipped_pg_loss
op + gates, the actor-learner loop, the easy-task SFT baseline (held-out 18.7%,
plateaus → no generalization), the two systems walls the design doc flagged
(two 1B models OOM the 32GB box → β=0; naive rollout fragments the allocator →
cached temperature sampling, rollout still the long pole), and the result:
format holds, held-out 20.0% (+1.3pp, statistically flat) — the same wall as
DPO. Closes the SFT→KV-cache→DPO→GRPO post-training arc with honest limits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 17:01:22 +08:00

docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check

2026-06-29 16:19:12 +08:00

00-build-chain.md

docs: backfill T1 build-chain

2026-06-15 15:12:55 +08:00

01-tensor.md

docs: Phase T2 — tensor abstraction

2026-06-15 15:12:55 +08:00

02-gemm-autodiff.md

docs: Phase T3 — GEMM fwd/bwd + finite-diff

2026-06-15 15:27:03 +08:00

03-autograd-engine.md

docs: Phase T4 — autograd engine

2026-06-15 15:53:55 +08:00

04-tiny-transformer.md

docs: Phase T5 — tiny transformer

2026-06-15 16:09:30 +08:00

05-training-loop.md

docs: Phase T6 — training loop

2026-06-15 16:30:14 +08:00

06-performance.md

docs: Phase T7 — performance

2026-06-15 17:00:29 +08:00

07-distributed.md

docs: Phase T8 — distributed data parallel

2026-06-15 17:15:49 +08:00

08-export-xserv.md

docs: T9 verification results (xserv == xtrain, dash5)

2026-06-15 17:37:46 +08:00

09-batched-forward.md

docs: Phase T10 — batched forward

2026-06-16 00:44:50 +08:00

10-caching-allocator.md

perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8

2026-06-16 11:15:02 +08:00

11-bf16-mixed-precision.md

perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K

2026-06-16 14:28:20 +08:00

12-activation-recompute.md

perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K

2026-06-17 09:50:29 +08:00

13-flash-attention.md

docs: T14 flash-attention results + evolution/README rows

2026-06-17 23:34:10 +08:00

14-gqa.md

docs: T15 GQA results + evolution row (model architecture) + README build-journey row

2026-06-18 01:44:58 +08:00

15-grad-accum.md

docs: Phase T16 — gradient accumulation design

2026-06-17 23:41:17 +08:00

16-process-per-gpu.md

docs: T17 process-per-GPU results — measured throughput-neutral

2026-06-18 18:03:14 +08:00

17-dropout.md

docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

2026-06-18 21:22:49 +08:00

18-post-training-rl-sft.md

docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)

2026-06-30 17:01:22 +08:00

evolution.md

docs: M4 — GRPO results (infra + memory/rollout walls + capability-wall negative result)

2026-06-30 17:01:22 +08:00

known-issues.md

docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

2026-06-18 21:22:49 +08:00