Qwen235B prefill 2x2 progress - 2026-06-23

Snapshot: 2026-06-23 18:24 CST / 10:24 UTC.

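This document summarizes the current progress of the Qwen235B prefill 2x2 experiment on dash1/dash2/dash3. The strong-model arm of this case is still running, so this is a progress report, not a final aggregate conclusion.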

Current remote status

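| Host | Status | Notes |
| --- | --- | --- |
| dash1 | running | aituner-q235b-2x2-gpt55-20260623T010038Z is still running; it is currently on trial-0004 of gpt-5.5 + naive. 8 H20 GPUs are occupied by vLLM. |
| dash2 | idle | No tmux/GPU jobs; the most recently completed run is qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z, a harness-only validation. |
| dash3 | idle | No tmux/GPU jobs; the gpt-5.4-mini 2x2 arm has completed and generated its report. |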

Note: the three machines share /home/admin/cpfs/wjh/aituner/aituner, so .aituner and .aituner-reports show the same set of artifacts on every dash node.

Completed: gpt-5.4-mini 2x2 arm

Report:

.aituner-reports/qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z/report.md

Aggregate:

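| Arm | Kind | Trials | Final req/s/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| harness | harness | 8 | 0.3217 | 1.0000 | 3 | 0.9483 | 0 | 1 |
| naive | naive | 8 | - | - | - | 0.0000 | 2 | 8 |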

Interpretation:

  • gpt-5.4-mini + harness found 0.3217 req/s/GPU, matching the reference best of this report.
  • gpt-5.4-mini + naive found no feasible config in any of its 8 trials; 2 of them were engine launch failures.
  • The report's Harness-vs-naive pass/checks: 0/1 reflects the aggregator's conservative handling of best_naive_final_per_gpu = null: because naive has no feasible best, the final ratio cannot be computed, so pass is recorded as false. In terms of actual tuning results, harness dominates naive in this arm.
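The null handling described above can be illustrated with a minimal sketch. The function name, the return shape, and the `ratio >= 1.0` pass rule are assumptions made for illustration; the source only states that a null naive best makes the ratio uncomputable and that pass is then recorded as false.

```python
from typing import Dict, Optional


def harness_vs_naive_check(
    best_harness_final_per_gpu: Optional[float],
    best_naive_final_per_gpu: Optional[float],
) -> Dict[str, object]:
    """Hypothetical sketch of a conservative harness-vs-naive check."""
    # If either side has no feasible best, the ratio is undefined.
    # The conservative choice is to record the check as not passed.
    if best_harness_final_per_gpu is None or best_naive_final_per_gpu is None:
        return {"ratio": None, "passed": False}
    ratio = best_harness_final_per_gpu / best_naive_final_per_gpu
    # Assumed pass rule: harness is at least as good as naive.
    return {"ratio": ratio, "passed": ratio >= 1.0}


# gpt-5.4-mini arm: harness reached 0.3217, naive had no feasible config.
result = harness_vs_naive_check(0.3217, None)
print(result)
```

Under this rule a 0/1 check does not mean harness lost; it only means the comparison could not be computed.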

Harness trajectory:

| Trial | Patch | req/s/GPU | Pass rate | Notes |
| --- | --- | --- | --- | --- |
| 1 | TP=8, DP=1 | 0.2879 | 0.9522 | The initial topology meets the SLO but does not reach the final best. |
| 2 | TP=8, max-num-seqs=96 | 0.2879 | 0.9537 | Tuning max-num-seqs alone gives no clear gain. |
| 3 | TP=8, max-num-batched-tokens=16384, max-num-seqs=96 | 0.3085 | 0.9568 | The joint runtime probe improves throughput. |
| 4 | TP=8, max-num-seqs=144, max-num-batched-tokens=32768 | 0.2879 | 0.9530 | An overly large batching/seq combination regresses. |
| 5 | TP=4, DP=2 | - | - | No feasible best; the DP-heavy/mixed topology does not help this prefill path. |
| 6 | TP=8, max-num-seqs=96, max-num-batched-tokens=24576 | 0.2708 | 0.9523 | Increasing batching further regresses. |
| 7 | TP=4, DP=1, max-num-seqs=96, max-num-batched-tokens=16384 | 0.2338 | 0.9590 | TP4/DP1 with fewer GPUs is not better per GPU. |
| 8 | TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384 | 0.3217 | 0.9508 | Current best. |

This result shows that, on the Qwen235B prefill case, the harness adds value beyond topology selection: it also runs a constrained joint runtime probe in the TTFT/prefill direction. The final best is TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384.
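As a cross-check of the gpt-5.4-mini harness trajectory, the following sketch picks the best feasible trial and computes its uplift over the initial topology. The tuple layout and variable names are illustrative assumptions; the numbers are copied from the trajectory table.

```python
# (trial, req/s/GPU, pass rate); None marks a trial with no feasible best.
trials = [
    (1, 0.2879, 0.9522),
    (2, 0.2879, 0.9537),
    (3, 0.3085, 0.9568),
    (4, 0.2879, 0.9530),
    (5, None, None),
    (6, 0.2708, 0.9523),
    (7, 0.2338, 0.9590),
    (8, 0.3217, 0.9508),
]

# Keep only trials that produced a feasible per-GPU throughput.
feasible = [t for t in trials if t[1] is not None]

# Best trial is the one with the highest req/s/GPU.
best_trial, best_rate, best_pass = max(feasible, key=lambda t: t[1])

# Uplift of the best trial relative to the first (TP=8, DP=1) trial.
uplift = best_rate / trials[0][1]

print(f"best trial: {best_trial}, req/s/GPU: {best_rate}, uplift: {uplift:.4f}")
```

The uplift of roughly 12% over trial 1 comes entirely from runtime knobs, since both trials use TP=8, DP=1.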

Running: gpt-5.5 2x2 arm

Session:

tmux: aituner-q235b-2x2-gpt55-20260623T010038Z
driver log: .aituner/qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z.driver.log

Driver timeline:

harness clean pair start 2026-06-23T01:00:40+00:00
harness clean pair done  2026-06-23T08:21:13+00:00
naive clean pair start   2026-06-23T08:21:13+00:00

Harness side has completed all 8 trials:

| Trial | Patch | req/s/GPU | Pass rate |
| --- | --- | --- | --- |
| 1 | TP=8, DP=1 | 0.2879 | 0.9522 |
| 2 | TP=8, max-num-seqs=96 | 0.2879 | 0.9530 |
| 3 | TP=8, max-num-batched-tokens=16384, max-num-seqs=96 | 0.3085 | 0.9561 |
| 4 | TP=8, max-num-batched-tokens=32768, max-num-seqs=144 | 0.2783 | 0.9543 |
| 5 | TP=8, DP=1, max-num-batched-tokens=24576, max-num-seqs=96 | 0.2654 | 0.9513 |
| 6 | TP=4, DP=2, max-num-batched-tokens=16384, max-num-seqs=96 | - | - |
| 7 | TP=8, DP=1, max-num-batched-tokens=16384, max-num-seqs=80 | 0.3156 | 0.9505 |
| 8 | TP=8, max-num-batched-tokens=32768, max-num-seqs=120 | 0.2879 | 0.9508 |

Current harness best: trial-0007, 0.3156 req/s/GPU.

Naive side is still running. Current state:

  • Completed/recorded through trial-0003, with current best 0.2879 req/s/GPU.
  • trial-0004 is active with TP=8, DP=1, max-num-batched-tokens=8192, max-num-seqs=128.
  • trial-0004 probe history so far:
    | Threshold | Request rate | req/s/GPU | Pass rate | Feasible | Main failures |
    | --- | --- | --- | --- | --- | --- |
    | 0.0625 | 1.5750 | 0.1969 | 0.9651 | true | TTFT misses and TTFT threshold violations |
    | 0.09375 | 2.3650 | 0.2956 | 0.7308 | false | slo_pass_rate_unrecoverable, TTFT violations |
    | 0.078125 | 1.9567 | 0.2446 | 0.9591 | true | TTFT misses and TTFT threshold violations |
    | 0.0859375 | 2.1667 | 0.2708 | 0.9546 | true | TTFT misses and TTFT threshold violations |

As of the snapshot, vLLM is still processing requests for trial-0004, so the naive side has not produced its final result or report yet.
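The trial-0004 probe history is consistent with two simple relations, sketched below: req/s/GPU equals the request rate divided by 8 GPUs, and each new threshold is the midpoint between the highest feasible and the lowest infeasible threshold so far. Both are inferences from the recorded numbers, not confirmed tuner behavior, and the function names are hypothetical.

```python
NUM_GPUS = 8  # assumption: the 8 H20 GPUs occupied by vLLM on dash1

# (threshold, request rate, reported req/s/GPU, feasible), from the probe history.
probes = [
    (0.0625, 1.5750, 0.1969, True),
    (0.09375, 2.3650, 0.2956, False),
    (0.078125, 1.9567, 0.2446, True),
    (0.0859375, 2.1667, 0.2708, True),
]


def per_gpu(request_rate: float, num_gpus: int = NUM_GPUS) -> float:
    """Normalize an aggregate request rate to a per-GPU rate."""
    return request_rate / num_gpus


def next_threshold(history) -> float:
    """Bisection step: midpoint of the feasible/infeasible bracket."""
    lo = max(t for t, _rate, _reported, ok in history if ok)
    hi = min(t for t, _rate, _reported, ok in history if not ok)
    return (lo + hi) / 2


for threshold, rate, reported, feasible in probes:
    print(threshold, round(per_gpu(rate), 4), reported, feasible)

# If the tuner does bisect, this would be the next probe point.
candidate = next_threshold(probes)
print("next candidate threshold:", candidate)
```

If the bisection reading is right, the feasible bracket already sits between 0.2708 and 0.2956 req/s/GPU for this configuration.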

Prior Qwen235B context

These earlier runs explain why the current 2x2 matters:

| Run | Result | What it showed |
| --- | --- | --- |
| qwen235b-prefill-clean-gpt55-dash1-20260621T160712Z | harness 0.2879, naive 0.3217 | Earlier harness stopped/refined too weakly; naive found better final config. |
| qwen235b-prefill-seqguard-gpt55-dash1-20260622T064445Z | harness 0.2879, naive 0.2577 | Seq guard prevented the worst early-stop failure but still did not reach the old naive best. |
| qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z | harness-only 0.3085 | Joint max-num-batched-tokens + max-num-seqs probe improved over seqguard. |
| qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z | harness 0.3217, naive no feasible | Weak model plus harness now reaches the old best and dominates weak naive. |

The current evidence points to the harness needing both:

  1. topology discipline: stay on TP=8, DP=1 for this prefill-heavy 235B setup;
  2. runtime joint probing: tune max-num-batched-tokens and max-num-seqs together instead of stopping after the first feasible TP8 result.
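To make the two knob groups concrete, here is a minimal sketch that maps a trial patch to vLLM server flags. The flag names are vLLM CLI options; the patch dictionary keys and the `vllm_args` helper are hypothetical, and how aituner actually applies patches is not described in this report.

```python
from typing import Dict, List

# Hypothetical patch keys mapped to vLLM CLI flags.
FLAGS = {
    "tp": "--tensor-parallel-size",
    "dp": "--data-parallel-size",
    "max_num_seqs": "--max-num-seqs",
    "max_num_batched_tokens": "--max-num-batched-tokens",
}


def vllm_args(patch: Dict[str, int]) -> List[str]:
    """Translate a patch into CLI arguments, skipping unset knobs."""
    args: List[str] = []
    for key, flag in FLAGS.items():
        if key in patch:
            args += [flag, str(patch[key])]
    return args


# Best config from the gpt-5.4-mini harness arm.
best_patch = {"tp": 8, "dp": 1, "max_num_seqs": 128, "max_num_batched_tokens": 16384}
args = vllm_args(best_patch)
print(" ".join(args))
```

Keeping topology keys (`tp`, `dp`) and runtime keys in one patch mirrors the point above: the topology stays fixed while the two runtime knobs are varied together.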

Open item

The final Qwen235B 2x2 conclusion is blocked on the still-running gpt-5.5 + naive arm on dash1. Once it completes, generate an aggregate report combining:

  • qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z
  • qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z

and then update this progress report into a final ablation report.