Document Qwen235B 2x2 progress
docs/harness-ablation/qwen235b-prefill-2x2-progress-20260623.md
@@ -0,0 +1,137 @@
# Qwen235B prefill 2x2 progress - 2026-06-23

Snapshot: 2026-06-23 18:22 CST / 10:22 UTC.

This document summarizes the current progress of the Qwen235B prefill 2x2
experiment on dash1/dash2/dash3. The case is still running its strong-model arm,
so this is a progress report, not a final aggregate conclusion.

## Current remote status

| Host | Status | Notes |
| --- | --- | --- |
| dash1 | running | `aituner-q235b-2x2-gpt55-20260623T010038Z` is still running, currently on trial-0004 of `gpt-5.5 + naive`; all 8 H20 GPUs are occupied by vLLM. |
| dash2 | idle | No tmux/GPU tasks; the most recently completed run is the harness-only validation `qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z`. |
| dash3 | idle | No tmux/GPU tasks; the `gpt-5.4-mini` 2x2 arm has completed and generated its report. |

Note: the three machines share `/home/admin/cpfs/wjh/aituner/aituner`, so `.aituner` and
`.aituner-reports` show the same set of artifacts on every dash node.

## Completed: gpt-5.4-mini 2x2 arm

Report:

```text
.aituner-reports/qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z/report.md
```

Aggregate:

| Arm | Kind | Trials | Final req/s/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `harness` | harness | 8 | 0.3217 | 1.0000 | 3 | 0.9483 | 0 | 1 |
| `naive` | naive | 8 | - | - | - | 0.0000 | 2 | 8 |

Interpretation:

- `gpt-5.4-mini + harness` reached `0.3217 req/s/GPU`, matching this report's
  reference best.
- `gpt-5.4-mini + naive` found no feasible config in any of its 8 trials; 2 of
  them were engine launch failures.
- `Harness-vs-naive pass/checks: 0/1` in the report reflects the aggregator's
  conservative handling of `best_naive_final_per_gpu = null`: naive has no feasible
  best, so the final ratio cannot be computed and the pass is recorded as false. In
  terms of actual tuning results, harness dominates naive in this arm.

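The conservative null handling described above can be illustrated with a minimal sketch. This is not the aggregator's actual code: the function name `harness_vs_naive_check` and the `min_ratio` parameter are hypothetical, and only the behaviour "missing naive best means the check fails" is taken from the report.

```python
from typing import Optional


def harness_vs_naive_check(
    harness_final: Optional[float],
    naive_final: Optional[float],
    min_ratio: float = 1.0,  # hypothetical pass threshold, not from the report
) -> dict:
    """Compare final per-GPU throughput; a missing value never passes."""
    if harness_final is None or naive_final is None or naive_final <= 0:
        # One side has no feasible best: the ratio is undefined, so fail conservatively.
        return {"ratio": None, "passed": False}
    ratio = harness_final / naive_final
    return {"ratio": ratio, "passed": ratio >= min_ratio}


# gpt-5.4-mini arm: harness reached 0.3217 req/s/GPU, naive had no feasible config.
result = harness_vs_naive_check(0.3217, None)
print(result)
```

Under these assumptions the check reports `passed: False` with no ratio, which is why a `0/1` count can coexist with a harness arm that clearly outperforms naive.
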
Harness trajectory:

| Trial | Patch | req/s/GPU | Pass rate | Notes |
| ---: | --- | ---: | ---: | --- |
| 1 | `TP=8, DP=1` | 0.2879 | 0.9522 | Initial topology meets the SLO but falls short of the final best. |
| 2 | `TP=8, max-num-seqs=96` | 0.2879 | 0.9537 | Tuning `max-num-seqs` alone gives no measurable gain. |
| 3 | `TP=8, max-num-batched-tokens=16384, max-num-seqs=96` | 0.3085 | 0.9568 | Joint runtime probe improves throughput. |
| 4 | `TP=8, max-num-seqs=144, max-num-batched-tokens=32768` | 0.2879 | 0.9530 | An overly large batching/seq combination regresses. |
| 5 | `TP=4, DP=2` | - | - | No feasible best; DP-heavy/mixed topology does not help this prefill path. |
| 6 | `TP=8, max-num-seqs=96, max-num-batched-tokens=24576` | 0.2708 | 0.9523 | Larger batching regresses again. |
| 7 | `TP=4, DP=1, max-num-seqs=96, max-num-batched-tokens=16384` | 0.2338 | 0.9590 | TP4/DP1 on fewer GPUs is not better per GPU. |
| 8 | `TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384` | 0.3217 | 0.9508 | Current best. |

This result shows that, for the Qwen235B prefill case, the harness adds value not
only through topology selection but also through constrained joint runtime probing
in the TTFT/prefill direction. The final best is
`TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384`.

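As a rough cross-check of the trajectory, the sketch below recomputes the running best and the final/ref ratio from the per-trial values in the table. The variable names are illustrative, and treating an infeasible trial as `0.0` is an assumption made here for convenience, not something the report specifies.

```python
from itertools import accumulate

# req/s/GPU per harness trial, copied from the trajectory table; None = no feasible result.
trajectory = [0.2879, 0.2879, 0.3085, 0.2879, None, 0.2708, 0.2338, 0.3217]
reference_best = 0.3217  # the report's reference best

# Assumption: an infeasible trial counts as 0.0, so it can never improve the running best.
scores = [x or 0.0 for x in trajectory]
running_best = list(accumulate(scores, max))

best_trial = max(range(len(scores)), key=lambda i: scores[i]) + 1  # 1-based trial index
final_over_ref = running_best[-1] / reference_best

print(running_best)
print(best_trial, round(final_over_ref, 4))
```

The running best only moves at trials 3 and 8, which matches the narrative that the joint runtime probes, rather than the topology changes, produced the gains.
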
## Running: gpt-5.5 2x2 arm

Session:

```text
tmux: aituner-q235b-2x2-gpt55-20260623T010038Z
driver log: .aituner/qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z.driver.log
```

Driver timeline:

```text
harness clean pair start 2026-06-23T01:00:40+00:00
harness clean pair done 2026-06-23T08:21:13+00:00
naive clean pair start 2026-06-23T08:21:13+00:00
```

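For a sense of scale, the elapsed times implied by the driver timeline and the snapshot time can be computed directly. This is only arithmetic on the timestamps quoted in this report; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamps copied from the driver timeline above.
harness_start = datetime.fromisoformat("2026-06-23T01:00:40+00:00")
harness_done = datetime.fromisoformat("2026-06-23T08:21:13+00:00")
naive_start = harness_done  # the naive pair starts at the same instant
# Snapshot time from the report header (10:22 UTC).
snapshot = datetime(2026, 6, 23, 10, 22, tzinfo=timezone.utc)

harness_elapsed = harness_done - harness_start
naive_elapsed = snapshot - naive_start

print(harness_elapsed)  # 7:20:33
print(naive_elapsed)    # 2:00:47
```

The harness pair took a little over seven hours for 8 trials, while the naive pair had been running for about two hours at the snapshot.
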
Harness side has completed all 8 trials:

| Trial | Patch | req/s/GPU | Pass rate |
| ---: | --- | ---: | ---: |
| 1 | `TP=8, DP=1` | 0.2879 | 0.9522 |
| 2 | `TP=8, max-num-seqs=96` | 0.2879 | 0.9530 |
| 3 | `TP=8, max-num-batched-tokens=16384, max-num-seqs=96` | 0.3085 | 0.9561 |
| 4 | `TP=8, max-num-batched-tokens=32768, max-num-seqs=144` | 0.2783 | 0.9543 |
| 5 | `TP=8, DP=1, max-num-batched-tokens=24576, max-num-seqs=96` | 0.2654 | 0.9513 |
| 6 | `TP=4, DP=2, max-num-batched-tokens=16384, max-num-seqs=96` | - | - |
| 7 | `TP=8, DP=1, max-num-batched-tokens=16384, max-num-seqs=80` | 0.3156 | 0.9505 |
| 8 | `TP=8, max-num-batched-tokens=32768, max-num-seqs=120` | 0.2879 | 0.9508 |

Current harness best: `trial-0007`, `0.3156 req/s/GPU`.

Naive side is still running. Current state:

- Completed/recorded through trial-0003, with current best `0.2879 req/s/GPU`.
- trial-0004 is active with
  `TP=8, DP=1, max-num-batched-tokens=8192, max-num-seqs=128`.
- trial-0004 probe history so far:

| threshold | request rate | req/s/GPU | pass rate | feasible | main failures |
| ---: | ---: | ---: | ---: | --- | --- |
| 0.0625 | 1.5750 | 0.1969 | 0.9651 | true | TTFT misses and TTFT threshold violations |
| 0.09375 | 2.3650 | 0.2956 | 0.7308 | false | `slo_pass_rate_unrecoverable`, TTFT violations |
| 0.078125 | 1.9567 | 0.2446 | 0.9591 | true | TTFT misses and TTFT threshold violations |

As of the snapshot, vLLM is still processing requests for trial-0004, so the naive
side has not produced its final result or report yet.

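Two things in the probe history can be checked numerically. The req/s/GPU column equals the request rate divided by the 8 GPUs on dash1, and the three thresholds look like a bisection between the last feasible and the first infeasible point. The bisection reading is an inference from the numbers, not something the driver log confirms, and the helper names below are hypothetical.

```python
NUM_GPUS = 8  # dash1 runs vLLM on 8 H20 GPUs

# (threshold, request rate in req/s, reported req/s/GPU, feasible) from the probe table.
probes = [
    (0.0625, 1.5750, 0.1969, True),
    (0.09375, 2.3650, 0.2956, False),
    (0.078125, 1.9567, 0.2446, True),
]


def per_gpu(request_rate: float, num_gpus: int = NUM_GPUS) -> float:
    """Normalize an aggregate request rate to a per-GPU rate."""
    return request_rate / num_gpus


def bisect_next(history) -> float:
    """Midpoint between the highest feasible and lowest infeasible threshold.

    Assumes a bisection-style search; the real probe strategy may differ.
    """
    lo = max(t for t, _, _, ok in history if ok)
    hi = min(t for t, _, _, ok in history if not ok)
    return (lo + hi) / 2


for threshold, rate, reported, _ in probes:
    # Each reported per-GPU value agrees with rate / 8 up to table rounding.
    assert abs(per_gpu(rate) - reported) < 1e-4

# The third probe is exactly the midpoint of the first two.
assert bisect_next(probes[:2]) == probes[2][0]

candidate = bisect_next(probes)
print(candidate)
```

If the search does continue by bisection, the printed value is where a further probe would land; treat it as a reading aid rather than a prediction of what the driver does.
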
## Prior Qwen235B context

These earlier runs explain why the current 2x2 matters:

| Run | Result | What it showed |
| --- | --- | --- |
| `qwen235b-prefill-clean-gpt55-dash1-20260621T160712Z` | harness 0.2879, naive 0.3217 | Earlier harness stopped/refined too weakly; naive found better final config. |
| `qwen235b-prefill-seqguard-gpt55-dash1-20260622T064445Z` | harness 0.2879, naive 0.2577 | Seq guard prevented the worst early-stop failure but still did not reach the old naive best. |
| `qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z` | harness-only 0.3085 | Joint `max-num-batched-tokens + max-num-seqs` probe improved over seqguard. |
| `qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z` | harness 0.3217, naive no feasible | Weak model plus harness now reaches the old best and dominates weak naive. |

The current evidence points to the harness needing both:

1. topology discipline: stay on `TP=8, DP=1` for this prefill-heavy 235B setup;
2. runtime joint probing: tune `max-num-batched-tokens` and `max-num-seqs` together
   instead of stopping after the first feasible TP8 result.

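The two requirements above can be sketched as a small candidate generator that pins the topology and varies the two runtime knobs jointly. This is an illustration under assumptions: the helper names and the candidate values are hypothetical, and the report does not show how the harness actually turns a patch into launch arguments. The flag names are the vLLM engine arguments that correspond to the patch keys.

```python
from itertools import product


def joint_candidates(batched_tokens_options, seqs_options, tp=8, dp=1):
    """Fix the topology (TP/DP) and enumerate both runtime knobs together."""
    return [
        {"tp": tp, "dp": dp, "max_num_batched_tokens": b, "max_num_seqs": s}
        for b, s in product(batched_tokens_options, seqs_options)
    ]


def vllm_flags(tp, dp, max_num_batched_tokens, max_num_seqs):
    """Render one candidate as vLLM engine flags (assumed mapping from patch keys)."""
    return [
        "--tensor-parallel-size", str(tp),
        "--data-parallel-size", str(dp),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--max-num-seqs", str(max_num_seqs),
    ]


# Illustrative grid around the values that appear in the trajectories above.
candidates = joint_candidates([16384, 24576], [80, 96, 128])
for cand in candidates:
    print(" ".join(vllm_flags(**cand)))
```

A full grid is the simplest way to express "tune both together". With one trial taking close to an hour, a real harness would need to pick a small subset of such candidates rather than sweep them all.
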
## Open item

The final Qwen235B 2x2 conclusion is blocked on the still-running
`gpt-5.5 + naive` arm on dash1. Once it completes, generate an aggregate report
combining:

- `qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z`
- `qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z`

and then update this progress report into a final ablation report.