Gahow Wang
320c1ae4fb
perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K
bf16 mixed precision (fp32 master weights) resolves the v4 dim768 fp32
batch-32 OOM and speeds up the now-compute-bound dim768 GEMMs. Measured
at steady state on dash5 (1× RTX 5090 32GB; dim768/18L/24h×32, ffn2048,
seq256):
  config  batch  peak mem  tok/s  fits 32GB
  fp32    16     27.2 GB   31.5K  yes
  bf16    16     19.3 GB   35.5K  yes (-29% mem / +13% tok/s)
  fp32    32     -         -      OOM
  bf16    32     31.1 GB   40.8K  yes (+29% vs fp32-b16)
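
The relative figures in the table can be re-derived from the raw columns. The following is a minimal sketch of that arithmetic, not part of the repository; the helper name `pct_change` is hypothetical and the inputs are copied from the table above.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Numbers copied from the table (peak mem in GB, throughput in K tok/s).
FP32_B16_MEM, FP32_B16_TOKS = 27.2, 31.5
BF16_B16_MEM, BF16_B16_TOKS = 19.3, 35.5
BF16_B32_TOKS = 40.8

mem_b16 = pct_change(FP32_B16_MEM, BF16_B16_MEM)
toks_b16 = pct_change(FP32_B16_TOKS, BF16_B16_TOKS)
toks_b32 = pct_change(FP32_B16_TOKS, BF16_B32_TOKS)

print(f"bf16 b16 peak mem vs fp32 b16: {mem_b16:+.1f}%")   # -29.0%
print(f"bf16 b16 tok/s    vs fp32 b16: {toks_b16:+.1f}%")  # +12.7%
print(f"bf16 b32 tok/s    vs fp32 b16: {toks_b32:+.1f}%")  # +29.5%
```

The last value is computed from the rounded table entries, so it lands at about 29.5%, which the table reports as +29%.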
Verified on dash5: fp32 suite green at tight tol + xserv export md5
bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99
6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs
3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly.
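
A p99 check on logits of the kind mentioned above could look roughly like this. It is a stdlib-only sketch under stated assumptions: the function names are hypothetical, the nearest-rank percentile is one possible definition, and the commit does not say whether 6.8e-3 is a threshold or a measured deviation, so it is used as a threshold here purely for illustration.

```python
from typing import Sequence


def percentile_abs_err(ref: Sequence[float], test: Sequence[float], pct: int = 99) -> float:
    """Nearest-rank percentile of |ref - test| (pct is an integer in 1..100)."""
    if len(ref) != len(test) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    errs = sorted(abs(r - t) for r, t in zip(ref, test))
    # Nearest rank: ceil(pct * n / 100), computed with integers to avoid float rounding.
    rank = -(-pct * len(errs) // 100)
    return errs[rank - 1]


def logits_within_tol(ref: Sequence[float], test: Sequence[float], tol: float = 6.8e-3) -> bool:
    """True if the p99 absolute error does not exceed `tol` (assumed threshold)."""
    return percentile_abs_err(ref, test, 99) <= tol


# Toy data: fp32 reference logits and a slightly perturbed "bf16" copy.
fp32_logits = [0.0] * 100
bf16_logits = [i * 1e-5 for i in range(100)]
print(logits_within_tol(fp32_logits, bf16_logits))
```

Using a high percentile instead of the maximum keeps a few outlier logits from failing the whole comparison, which is one common reason to loosen checks this way for reduced precision.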
Mark KI-2 FIXED; fill docs/11 results.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:28:20 +08:00