# Qwen235B prefill 2x2 progress - 2026-06-23
Snapshot: 2026-06-23 18:24 CST / 10:24 UTC.
This document summarizes the current progress of the Qwen235B prefill 2x2
experiment on dash1/dash2/dash3. The case is still running its strong-model arm,
so this is a progress report, not a final aggregate conclusion.
## Current remote state
| Host | Current state | Notes |
| --- | --- | --- |
| dash1 | running | `aituner-q235b-2x2-gpt55-20260623T010038Z` is still running; it is currently on trial-0004 of `gpt-5.5 + naive`, and 8 H20 GPUs are occupied by vLLM. |
| dash2 | idle | No tmux/GPU jobs; the most recently completed run is the harness-only validation `qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z`. |
| dash3 | idle | No tmux/GPU jobs; the `gpt-5.4-mini` 2x2 arm has completed and generated its report. |
Note: the three machines share `/home/admin/cpfs/wjh/aituner/aituner`, so
`.aituner` and `.aituner-reports` show the same set of artifacts on every dash node.
## Completed: gpt-5.4-mini 2x2 arm
Report:
```text
.aituner-reports/qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z/report.md
```
Aggregate:
| Arm | Kind | Trials | Final req/s/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `harness` | harness | 8 | 0.3217 | 1.0000 | 3 | 0.9483 | 0 | 1 |
| `naive` | naive | 8 | - | - | - | 0.0000 | 2 | 8 |
Interpretation:
- `gpt-5.4-mini + harness` found `0.3217 req/s/GPU`, matching this report's
reference best.
- `gpt-5.4-mini + naive` found no feasible config in any of its 8 trials; 2 of
them were engine launch failures.
- `Harness-vs-naive pass/checks: 0/1` in the report is the aggregator's
conservative handling of `best_naive_final_per_gpu = null`: because naive has no
feasible best, the final ratio cannot be computed, so pass is recorded as false.
In terms of actual tuning results, harness dominates naive in this arm.
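The conservative handling described above can be sketched as follows. This is a minimal illustration, not the aggregator's actual code: the function name `harness_vs_naive_check`, the returned dictionary, and the `ratio >= 1.0` pass criterion are all assumptions; only the field name `best_naive_final_per_gpu` comes from the report.

```python
from typing import Optional


def harness_vs_naive_check(
    best_harness_final_per_gpu: Optional[float],
    best_naive_final_per_gpu: Optional[float],
) -> dict:
    """Hypothetical sketch of a conservative harness-vs-naive check.

    If either side has no feasible best, the ratio is undefined, so the
    check is recorded as failed instead of being guessed.
    """
    if best_harness_final_per_gpu is None or not best_naive_final_per_gpu:
        return {"ratio": None, "passed": False}
    ratio = best_harness_final_per_gpu / best_naive_final_per_gpu
    # Assumed criterion: harness passes when it is at least as good as naive.
    return {"ratio": ratio, "passed": ratio >= 1.0}


# gpt-5.4-mini arm: harness reached 0.3217, naive had no feasible best.
mini_arm = harness_vs_naive_check(0.3217, None)
print(mini_arm)
```

Under these assumptions, a `null` naive best yields `passed: False` even though the harness is the only side with a feasible result, which is why the `0/1` count should be read alongside the trial data.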
Harness trajectory:
| Trial | Patch | req/s/GPU | Pass rate | Notes |
| ---: | --- | ---: | ---: | --- |
| 1 | `TP=8, DP=1` | 0.2879 | 0.9522 | Initial topology meets the SLO but does not reach the final best. |
| 2 | `TP=8, max-num-seqs=96` | 0.2879 | 0.9537 | Tuning `max-num-seqs` alone gives no clear improvement. |
| 3 | `TP=8, max-num-batched-tokens=16384, max-num-seqs=96` | 0.3085 | 0.9568 | Joint runtime probe improves throughput. |
| 4 | `TP=8, max-num-seqs=144, max-num-batched-tokens=32768` | 0.2879 | 0.9530 | Oversized batching/seq combination regresses. |
| 5 | `TP=4, DP=2` | - | - | No feasible best; a DP-heavy/mixed topology does not help this prefill path. |
| 6 | `TP=8, max-num-seqs=96, max-num-batched-tokens=24576` | 0.2708 | 0.9523 | Regresses as batching grows further. |
| 7 | `TP=4, DP=1, max-num-seqs=96, max-num-batched-tokens=16384` | 0.2338 | 0.9590 | TP4/DP1 on fewer GPUs is not better per GPU. |
| 8 | `TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384` | 0.3217 | 0.9508 | Current best. |
This result shows that, for the Qwen235B prefill case, the harness's value lies
not only in topology selection but also in running a constrained joint runtime
probe in the TTFT/prefill direction. The final best is
`TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384`.
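As a quick cross-check of the trajectory table, the final best can be recovered by taking the highest `req/s/GPU` among trials that produced a feasible point. This is only a sketch: `best_trial` is a hypothetical helper, and representing "no feasible best" as `None` is an assumption about how such a trial would be encoded.

```python
from typing import List, Optional, Tuple

# (trial number, req/s/GPU), copied from the gpt-5.4-mini harness table;
# None marks the trial with no feasible best.
MINI_HARNESS_TRIALS: List[Tuple[int, Optional[float]]] = [
    (1, 0.2879), (2, 0.2879), (3, 0.3085), (4, 0.2879),
    (5, None), (6, 0.2708), (7, 0.2338), (8, 0.3217),
]


def best_trial(
    trials: List[Tuple[int, Optional[float]]],
) -> Optional[Tuple[int, float]]:
    """Return the (trial, req/s/GPU) pair with the highest feasible throughput."""
    feasible = [(t, v) for t, v in trials if v is not None]
    if not feasible:
        return None  # e.g. an arm where every trial was infeasible
    return max(feasible, key=lambda pair: pair[1])


best = best_trial(MINI_HARNESS_TRIALS)
print(best)
```

The same helper returns `None` for an arm in which every trial is infeasible, which mirrors the `-` entries in the naive row of the aggregate table.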
## Running: gpt-5.5 2x2 arm
Session:
```text
tmux: aituner-q235b-2x2-gpt55-20260623T010038Z
driver log: .aituner/qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z.driver.log
```
Driver timeline:
```text
harness clean pair start 2026-06-23T01:00:40+00:00
harness clean pair done 2026-06-23T08:21:13+00:00
naive clean pair start 2026-06-23T08:21:13+00:00
```
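For budgeting the remaining naive side, the harness wall-clock time can be derived from the two timestamps above. The sketch below assumes the 8 harness trials account for the whole interval, so the per-trial figure is only a rough average.

```python
from datetime import datetime

# Timestamps copied from the driver timeline.
harness_start = datetime.fromisoformat("2026-06-23T01:00:40+00:00")
harness_done = datetime.fromisoformat("2026-06-23T08:21:13+00:00")

elapsed = harness_done - harness_start
# Assumption: the interval is spent entirely on the 8 harness trials.
avg_minutes_per_trial = elapsed.total_seconds() / 8 / 60

print(elapsed)  # 7:20:33
print(round(avg_minutes_per_trial, 1))
```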
Harness side has completed all 8 trials:
| Trial | Patch | req/s/GPU | Pass rate |
| ---: | --- | ---: | ---: |
| 1 | `TP=8, DP=1` | 0.2879 | 0.9522 |
| 2 | `TP=8, max-num-seqs=96` | 0.2879 | 0.9530 |
| 3 | `TP=8, max-num-batched-tokens=16384, max-num-seqs=96` | 0.3085 | 0.9561 |
| 4 | `TP=8, max-num-batched-tokens=32768, max-num-seqs=144` | 0.2783 | 0.9543 |
| 5 | `TP=8, DP=1, max-num-batched-tokens=24576, max-num-seqs=96` | 0.2654 | 0.9513 |
| 6 | `TP=4, DP=2, max-num-batched-tokens=16384, max-num-seqs=96` | - | - |
| 7 | `TP=8, DP=1, max-num-batched-tokens=16384, max-num-seqs=80` | 0.3156 | 0.9505 |
| 8 | `TP=8, max-num-batched-tokens=32768, max-num-seqs=120` | 0.2879 | 0.9508 |
Current harness best: `trial-0007`, `0.3156 req/s/GPU`.
Naive side is still running. Current state:
- Completed/recorded through trial-0003, with current best `0.2879 req/s/GPU`.
- trial-0004 is active with `TP=8, DP=1, max-num-batched-tokens=8192,
max-num-seqs=128`.
- trial-0004 probe history so far:
| threshold | request rate | req/s/GPU | pass rate | feasible | main failures |
| ---: | ---: | ---: | ---: | --- | --- |
| 0.0625 | 1.5750 | 0.1969 | 0.9651 | true | TTFT misses and TTFT threshold violations |
| 0.09375 | 2.3650 | 0.2956 | 0.7308 | false | `slo_pass_rate_unrecoverable`, TTFT violations |
| 0.078125 | 1.9567 | 0.2446 | 0.9591 | true | TTFT misses and TTFT threshold violations |
| 0.0859375 | 2.1667 | 0.2708 | 0.9546 | true | TTFT misses and TTFT threshold violations |
As of the snapshot, vLLM is still processing requests for trial-0004, so the naive
side has not produced its final result or report yet.
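The numbers in the probe table are consistent with two simple relations: `req/s/GPU` equals the request rate divided by 8 GPUs, and each new threshold is the midpoint between the highest feasible and the lowest infeasible threshold seen so far. The sketch below checks both. The bisection rule is inferred from the table rather than confirmed from the prober's source, and `per_gpu` and `next_threshold` are hypothetical helper names.

```python
from typing import List, Tuple

NUM_GPUS = 8  # assumption: TP=8, DP=1 uses all 8 H20 GPUs

# (threshold, request rate, feasible), copied from the trial-0004 probe table.
PROBE_HISTORY: List[Tuple[float, float, bool]] = [
    (0.0625, 1.5750, True),
    (0.09375, 2.3650, False),
    (0.078125, 1.9567, True),
    (0.0859375, 2.1667, True),
]


def per_gpu(request_rate: float, num_gpus: int = NUM_GPUS) -> float:
    """Convert an aggregate request rate into req/s/GPU."""
    return request_rate / num_gpus


def next_threshold(history: List[Tuple[float, float, bool]]) -> float:
    """Midpoint between the best feasible and the lowest infeasible threshold."""
    lo = max(t for t, _, ok in history if ok)
    hi = min(t for t, _, ok in history if not ok)
    return (lo + hi) / 2


for threshold, rate, feasible in PROBE_HISTORY:
    print(threshold, round(per_gpu(rate), 4), feasible)

# If the inferred rule holds, this is the threshold that would be probed next.
print(next_threshold(PROBE_HISTORY))
```

Applying `next_threshold` to the first two and first three rows reproduces the third and fourth thresholds in the table, which is the basis for the inferred rule.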
## Prior Qwen235B context
These earlier runs explain why the current 2x2 matters:
| Run | Result | What it showed |
| --- | --- | --- |
| `qwen235b-prefill-clean-gpt55-dash1-20260621T160712Z` | harness 0.2879, naive 0.3217 | Earlier harness stopped/refined too weakly; naive found better final config. |
| `qwen235b-prefill-seqguard-gpt55-dash1-20260622T064445Z` | harness 0.2879, naive 0.2577 | Seq guard prevented the worst early-stop failure but still did not reach the old naive best. |
| `qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z` | harness-only 0.3085 | Joint `max-num-batched-tokens + max-num-seqs` probe improved over seqguard. |
| `qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z` | harness 0.3217, naive no feasible | Weak model plus harness now reaches the old best and dominates weak naive. |
The current evidence points to the harness needing both:
1. topology discipline: stay on `TP=8, DP=1` for this prefill-heavy 235B setup;
2. runtime joint probing: tune `max-num-batched-tokens` and `max-num-seqs` together
instead of stopping after the first feasible TP8 result.
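The two requirements above can be illustrated by enumerating a joint search space with the topology pinned. This is a sketch only: the value lists are taken from patches seen in the trials, the full grid is illustrative rather than the harness's actual selection strategy (each arm ran only 8 trials), and `joint_candidates` is a hypothetical helper. The flag names are the standard vLLM engine arguments.

```python
from itertools import product
from typing import Iterator, List

# Values that appear in the trial patches above; the grid itself is illustrative.
BATCHED_TOKENS = [8192, 16384, 24576, 32768]
MAX_NUM_SEQS = [80, 96, 128, 144]


def joint_candidates(tp: int = 8, dp: int = 1) -> Iterator[List[str]]:
    """Yield vLLM-style argument lists with topology fixed and both
    batching knobs varied together."""
    for tokens, seqs in product(BATCHED_TOKENS, MAX_NUM_SEQS):
        yield [
            "--tensor-parallel-size", str(tp),
            "--data-parallel-size", str(dp),
            "--max-num-batched-tokens", str(tokens),
            "--max-num-seqs", str(seqs),
        ]


candidates = list(joint_candidates())
print(len(candidates))
print(" ".join(candidates[0]))
```

Keeping `tp` and `dp` as fixed defaults reflects the topology discipline point, while the product over both lists reflects tuning the two runtime knobs together rather than one at a time.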
## Open item
The final Qwen235B 2x2 conclusion is blocked on the still-running
`gpt-5.5 + naive` arm on dash1. Once it completes, generate an aggregate report
combining:
- `qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z`
- `qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z`
and then update this progress report into a final ablation report.