The tested configs are extremely limited: the search abandons a knob entirely after a single failure. For example, after TP8 fails once it is never tried again, even though TP8 + EP does launch and performs well. This suggests codex does not yet have a full understanding of engine configuration.

```
- The real reason "TP=8 cannot be used" is only written out explicitly later: [codex_tuning_v2.jsonl (line 270)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) and [codex_tuning_v2.jsonl (line 272)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) explain that on this FP8 Qwen3-MoE checkpoint, TP=8 makes the TP-sharded MoE gate/up output size 192, while FP8 weight quantization requires a block size of 128. Since 192 is not divisible by 128, the model is incompatible at initialization.
- The same conclusion is repeated in the final summary: [codex_tuning_v2.jsonl (line 1154)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#).
```

| Rank | Exp | Config | Tput/GPU | TTFT p95 | SLO pass |
|---|---:|---|---:|---:|---:|
| 1 | 8 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724839 | 1.337592 | 97.58% |
| 2 | 13 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724767 | 1.328592 | 97.58% |
| 3 | 16 | tp4 b16 bt12288 s48 gm0.92 pc on ep off | 1448.722459 | 1.346786 | 97.75% |
| 4 | 14 | tp4 b32 bt16384 s64 gm0.92 pc on ep off | 1448.722451 | 1.382931 | 97.58% |
| 5 | 12 | tp4 b16 bt32768 s128 gm0.92 pc on ep off | 1448.719743 | 1.324819 | 97.58% |
| 6 | 11 | tp4 b16 bt8192 s32 gm0.92 pc on ep off | 1448.718885 | 1.314879 | 97.83% |
| 7 | 15 | tp4 b16 bt16384 s64 gm0.95 pc on ep off | 1448.715778 | 1.368400 | 97.50% |
| 8 | 17 | tp4 b16 bt16384 s64 gm0.92 pc on ep on | 1448.714795 | 1.864526 | 95.58% |
| 9 | 10 | tp4 b16 bt16384 s64 gm0.92 pc off ep off | 1448.437961 | 1.764754 | 95.50% |
| 10 | 9 | tp2 b16 bt16384 s64 gm0.92 pc on ep off | startup failed | - | - |
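The divisibility constraint quoted above can be sketched in a few lines. This is a hypothetical check, assuming `moe_intermediate_size = 1536` (consistent with the reported per-shard size of 192 at TP=8); the helper name `shard_ok` is illustrative, not an engine API:

```python
FP8_BLOCK = 128          # FP8 weight-quantization block size
MOE_INTERMEDIATE = 1536  # assumed: 192 * 8, matching the reported shard size

def shard_ok(tp: int) -> bool:
    # With ep off, each expert's gate/up projection is split across TP
    # ranks; the per-shard output dim must align to the FP8 block size.
    per_shard = MOE_INTERMEDIATE // tp
    return per_shard % FP8_BLOCK == 0

for tp in (1, 2, 4, 8):
    per_shard = MOE_INTERMEDIATE // tp
    print(f"TP={tp}: per-shard gate/up = {per_shard}, "
          f"{'OK' if shard_ok(tp) else 'INCOMPATIBLE'}")
```

TP=4 gives 384 (3 × 128, fine), while TP=8 gives 192 (not a multiple of 128), which is exactly the init-time failure logged in the jsonl.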
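The search-pruning failure mode described at the top can also be sketched: dropping a knob value (tp=8) after one failed combination discards combinations like tp=8 + ep=on that are still valid, since EP shards by expert and leaves per-expert weight shapes intact. All names and the validity rule here are illustrative, assuming the FP8 block constraint is the only failure cause:

```python
from itertools import product

def launch_ok(cfg):
    # Hypothetical launch check mirroring the FP8 block constraint:
    # only TP-sharded MoE weights (ep off) hit the divisibility issue.
    if not cfg["ep"] and (1536 // cfg["tp"]) % 128 != 0:
        return False
    return True

space = {"tp": [2, 4, 8], "ep": [False, True]}
configs = [dict(zip(space, v)) for v in product(*space.values())]

# Naive pruning: give up on a tp value after its first failed combo.
pruned_tp = set()
tested = []
for cfg in configs:
    if cfg["tp"] in pruned_tp:
        continue
    if launch_ok(cfg):
        tested.append(cfg)
    else:
        pruned_tp.add(cfg["tp"])  # tp=8 abandoned; tp=8 + ep never tried

survivors = [c for c in configs if launch_ok(c)]
print(len(tested), len(survivors))  # prints 4 5
```

The gap between `tested` and `survivors` is exactly the tp=8 + ep=on config the run never explored.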