obsidian/projects/auto-tuner/codex-problems.md


The set of configs tested is extremely limited: the tuner abandons a knob entirely after a single failure.

For example, after TP=8 fails once, it is never tried again, even though TP=8 + EP can run, and runs well. This suggests codex does not yet fully understand engine configuration.

- The actual reason TP=8 cannot be used is only spelled out explicitly later: [codex_tuning_v2.jsonl (line 270)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) and [codex_tuning_v2.jsonl (line 272)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) explain that on this FP8 Qwen3-MoE checkpoint, TP=8 makes the TP-sharded MoE gate/up output size 192, while FP8 weight quantization requires a block size of 128. Since 192 is not divisible by 128, the model fails at the initialization stage.
- The same conclusion is repeated in the final summary: [codex_tuning_v2.jsonl (line 1154)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#).
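The divisibility constraint above can be sketched as a quick pre-flight check. This is a minimal illustration, not the engine's actual validation code; the MoE intermediate size of 1536 is an assumed value chosen because it reproduces the 192-per-shard figure from the logs at TP=8, and `BLOCK_SIZE = 128` comes from the FP8 block size quoted above.

```python
# FP8 block quantization requires each TP shard of the MoE gate/up
# projection to be a multiple of the quantization block size.
BLOCK_SIZE = 128  # FP8 weight-quant block size (from the log above)

def tp_shard_ok(intermediate_size: int, tp: int, block_size: int = BLOCK_SIZE) -> bool:
    """True if the TP-sharded MoE gate/up output dim divides evenly
    into FP8 quantization blocks."""
    shard = intermediate_size // tp
    return shard % block_size == 0

# Assumed moe_intermediate_size = 1536, matching the logged numbers:
print(tp_shard_ok(1536, 8))  # 1536/8 = 192; 192 % 128 != 0 → False (init fails)
print(tp_shard_ok(1536, 4))  # 1536/4 = 384; 384 % 128 == 0 → True
```

A check like this would let the tuner distinguish "TP=8 is structurally impossible for this checkpoint" from "TP=8 failed once, stop exploring", which is exactly the gap the note points out.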
| Rank | Exp | Config | Tput/GPU | TTFT p95 (s) | SLO pass |
|---|---:|---|---:|---:|---:|
| 1 | 8 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724839 | 1.337592 | 97.58% |
| 2 | 13 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724767 | 1.328592 | 97.58% |
| 3 | 16 | tp4 b16 bt12288 s48 gm0.92 pc on ep off | 1448.722459 | 1.346786 | 97.75% |
| 4 | 14 | tp4 b32 bt16384 s64 gm0.92 pc on ep off | 1448.722451 | 1.382931 | 97.58% |
| 5 | 12 | tp4 b16 bt32768 s128 gm0.92 pc on ep off | 1448.719743 | 1.324819 | 97.58% |
| 6 | 11 | tp4 b16 bt8192 s32 gm0.92 pc on ep off | 1448.718885 | 1.314879 | 97.83% |
| 7 | 15 | tp4 b16 bt16384 s64 gm0.95 pc on ep off | 1448.715778 | 1.368400 | 97.50% |
| 8 | 17 | tp4 b16 bt16384 s64 gm0.92 pc on ep on | 1448.714795 | 1.864526 | 95.58% |
| 9 | 10 | tp4 b16 bt16384 s64 gm0.92 pc off ep off | 1448.437961 | 1.764754 | 95.50% |
| 10 | 9 | tp2 b16 bt16384 s64 gm0.92 pc on ep off | startup failed | - | - |