Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1 single-GPU. Root cause + proposed fixes recorded; also consolidates deferred T7 items (bf16, activation recompute) and the large-vocab modeling note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>