The tested configs are extremely limited: the search abandons a knob entirely after a single failure. For example, after TP8 fails once it is never tried again, even though TP8 + EP does launch and performs well. This suggests codex does not yet have a full understanding of engine configuration.

```
- The real reason "TP=8 cannot be used" is only written out explicitly later: [codex_tuning_v2.jsonl (line 270)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) and [codex_tuning_v2.jsonl (line 272)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) explain that on this FP8 Qwen3-MoE checkpoint, TP=8 makes the TP-sharded MoE gate/up output size 192, while FP8 weight quantization requires a block size of 128. Since 192 is not divisible by 128, the model is incompatible at initialization.
- The same conclusion is repeated in the final summary: [codex_tuning_v2.jsonl (line 1154)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#).
```

| Rank | Exp | Config | Tput/GPU | TTFT p95 | SLO pass |
|---|---:|---|---:|---:|---:|
| 1 | 8 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724839 | 1.337592 | 97.58% |
| 2 | 13 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724767 | 1.328592 | 97.58% |
| 3 | 16 | tp4 b16 bt12288 s48 gm0.92 pc on ep off | 1448.722459 | 1.346786 | 97.75% |
| 4 | 14 | tp4 b32 bt16384 s64 gm0.92 pc on ep off | 1448.722451 | 1.382931 | 97.58% |
| 5 | 12 | tp4 b16 bt32768 s128 gm0.92 pc on ep off | 1448.719743 | 1.324819 | 97.58% |
| 6 | 11 | tp4 b16 bt8192 s32 gm0.92 pc on ep off | 1448.718885 | 1.314879 | 97.83% |
| 7 | 15 | tp4 b16 bt16384 s64 gm0.95 pc on ep off | 1448.715778 | 1.368400 | 97.50% |
| 8 | 17 | tp4 b16 bt16384 s64 gm0.92 pc on ep on | 1448.714795 | 1.864526 | 95.58% |
| 9 | 10 | tp4 b16 bt16384 s64 gm0.92 pc off ep off | 1448.437961 | 1.764754 | 95.50% |
| 10 | 9 | tp2 b16 bt16384 s64 gm0.92 pc on ep off | startup failed | - | - |
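The divisibility constraint quoted above can be sketched in a few lines. This is a hypothetical check, assuming `moe_intermediate_size = 1536` (consistent with the reported per-shard size of 192 at TP=8); the helper name `shard_ok` is illustrative, not an engine API:

```python
FP8_BLOCK = 128          # FP8 weight-quantization block size
MOE_INTERMEDIATE = 1536  # assumed: 192 * 8, matching the reported shard size

def shard_ok(tp: int) -> bool:
    # With ep off, each expert's gate/up projection is split across TP
    # ranks; the per-shard output dim must align to the FP8 block size.
    per_shard = MOE_INTERMEDIATE // tp
    return per_shard % FP8_BLOCK == 0

for tp in (1, 2, 4, 8):
    per_shard = MOE_INTERMEDIATE // tp
    print(f"TP={tp}: per-shard gate/up = {per_shard}, "
          f"{'OK' if shard_ok(tp) else 'INCOMPATIBLE'}")
```

TP=4 gives 384 (3 × 128, fine), while TP=8 gives 192 (not a multiple of 128), which is exactly the init-time failure logged in the jsonl.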
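The search-pruning failure mode described at the top can also be sketched: dropping a knob value (tp=8) after one failed combination discards combinations like tp=8 + ep=on that are still valid, since EP shards by expert and leaves per-expert weight shapes intact. All names and the validity rule here are illustrative, assuming the FP8 block constraint is the only failure cause:

```python
from itertools import product

def launch_ok(cfg):
    # Hypothetical launch check mirroring the FP8 block constraint:
    # only TP-sharded MoE weights (ep off) hit the divisibility issue.
    if not cfg["ep"] and (1536 // cfg["tp"]) % 128 != 0:
        return False
    return True

space = {"tp": [2, 4, 8], "ep": [False, True]}
configs = [dict(zip(space, v)) for v in product(*space.values())]

# Naive pruning: give up on a tp value after its first failed combo.
pruned_tp = set()
tested = []
for cfg in configs:
    if cfg["tp"] in pruned_tp:
        continue
    if launch_ok(cfg):
        tested.append(cfg)
    else:
        pruned_tp.add(cfg["tp"])  # tp=8 abandoned; tp=8 + ep never tried

survivors = [c for c in configs if launch_ok(c)]
print(len(tested), len(survivors))  # prints 4 5
```

The gap between `tested` and `survivors` is exactly the tp=8 + ep=on config the run never explored.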