Qwen235B prefill 2x2 progress - 2026-06-23

Snapshot: 2026-06-23 18:24 CST / 10:24 UTC.

This document summarizes the current progress of the Qwen235B prefill 2x2 experiment on dash1/dash2/dash3. The strong-model arm of this case is still running, so this is a progress report, not a final aggregate conclusion.

Current remote state

Host	Status	Notes
dash1	running	`aituner-q235b-2x2-gpt55-20260623T010038Z` is still running, currently on trial-0004 of `gpt-5.5 + naive`; all 8 H20 GPUs are occupied by vLLM.
dash2	idle	No tmux/GPU jobs; the most recently completed run is the `qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z` harness-only validation.
dash3	idle	No tmux/GPU jobs; the `gpt-5.4-mini` 2x2 arm has completed and generated its report.

Note: the three machines share /home/admin/cpfs/wjh/aituner/aituner, so .aituner and .aituner-reports show the same set of artifacts on every dash node.

Completed: gpt-5.4-mini 2x2 arm

Report:

.aituner-reports/qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z/report.md

Aggregate:

Arm	Kind	Trials	Final req/s/GPU	Final/ref	TTT	AUC	Failed	No feasible
`harness`	harness	8	0.3217	1.0000	3	0.9483	0	1
`naive`	naive	8	-	-	-	0.0000	2	8

Interpretation:

gpt-5.4-mini + harness found 0.3217 req/s/GPU, matching this report's reference best.
gpt-5.4-mini + naive found no feasible config in any of its 8 trials; 2 of them were engine launch failures.
The report's Harness-vs-naive pass/checks: 0/1 is the aggregator's conservative handling of best_naive_final_per_gpu = null: because naive has no feasible best, the final ratio cannot be computed, so pass is recorded as false. In terms of actual tuning results, harness dominates naive in this arm.
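
The conservative null handling described above can be sketched as follows. This is only an illustration, not the aggregator's actual code: the function name `harness_vs_naive_check`, the returned fields, and the pass threshold of 1.0 are all assumptions; only the behavior "no feasible naive best means no ratio and pass = false" comes from the report.

```python
from typing import Optional


def harness_vs_naive_check(
    best_harness_per_gpu: Optional[float],
    best_naive_per_gpu: Optional[float],
    min_ratio: float = 1.0,  # assumed pass threshold, not taken from the report
) -> dict:
    """Compare final per-GPU throughput of the two arms (hypothetical helper)."""
    # If either side has no feasible best, the ratio is undefined.
    # The conservative choice is to record the check as failed.
    if best_harness_per_gpu is None or best_naive_per_gpu is None:
        return {"ratio": None, "passed": False}
    ratio = best_harness_per_gpu / best_naive_per_gpu
    return {"ratio": ratio, "passed": ratio >= min_ratio}


# gpt-5.4-mini arm: harness reached 0.3217, naive had no feasible config.
print(harness_vs_naive_check(0.3217, None))
```

Under this reading, a 0/1 result says the comparison could not be evaluated, not that the harness lost it.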

Harness trajectory:

Trial	Patch	req/s/GPU	Pass rate	Notes
1	`TP=8, DP=1`	0.2879	0.9522	The initial topology meets the SLO but does not reach the final best.
2	`TP=8, max-num-seqs=96`	0.2879	0.9537	Tuning `max-num-seqs` alone gives no noticeable improvement.
3	`TP=8, max-num-batched-tokens=16384, max-num-seqs=96`	0.3085	0.9568	The joint runtime probe improves throughput.
4	`TP=8, max-num-seqs=144, max-num-batched-tokens=32768`	0.2879	0.9530	An overly large batching/seq combination regresses.
5	`TP=4, DP=2`	-	-	No feasible best, which suggests that a DP-heavy/mixed topology does not help this prefill path.
6	`TP=8, max-num-seqs=96, max-num-batched-tokens=24576`	0.2708	0.9523	Regresses when batching is increased further.
7	`TP=4, DP=1, max-num-seqs=96, max-num-batched-tokens=16384`	0.2338	0.9590	TP4/DP1 with fewer GPUs is not better per GPU.
8	`TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384`	0.3217	0.9508	Current best.

This result shows that, for the Qwen235B prefill case, the harness's value lies not only in topology selection but also in constrained joint runtime probing under the TTFT/prefill objective. The final best is TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384.
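
As a quick cross-check, the best trial and its gain over the initial topology can be recomputed from the trajectory table. The snippet below is a minimal sketch: the tuple layout and the helper `best_trial` are illustrative and not part of aituner; the numbers are copied from the table.

```python
# (trial, patch, req/s/GPU) copied from the gpt-5.4-mini harness trajectory.
# None marks a trial that produced no feasible result.
TRIALS = [
    (1, "TP=8, DP=1", 0.2879),
    (2, "TP=8, max-num-seqs=96", 0.2879),
    (3, "TP=8, max-num-batched-tokens=16384, max-num-seqs=96", 0.3085),
    (4, "TP=8, max-num-seqs=144, max-num-batched-tokens=32768", 0.2879),
    (5, "TP=4, DP=2", None),
    (6, "TP=8, max-num-seqs=96, max-num-batched-tokens=24576", 0.2708),
    (7, "TP=4, DP=1, max-num-seqs=96, max-num-batched-tokens=16384", 0.2338),
    (8, "TP=8, DP=1, max-num-seqs=128, max-num-batched-tokens=16384", 0.3217),
]


def best_trial(trials):
    """Return the feasible trial with the highest req/s/GPU, or None."""
    feasible = [t for t in trials if t[2] is not None]
    return max(feasible, key=lambda t: t[2]) if feasible else None


best = best_trial(TRIALS)
baseline = TRIALS[0][2]           # trial 1, the initial TP=8, DP=1 topology
gain = best[2] / baseline - 1.0   # relative gain from runtime tuning
print(f"best: trial {best[0]} ({best[1]}), gain over trial 1: {gain:.1%}")
```

The gain comes entirely from runtime parameters on the same TP=8, DP=1 topology, and works out to roughly 11.7% over trial 1.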

Running: gpt-5.5 2x2 arm

Session:

tmux: aituner-q235b-2x2-gpt55-20260623T010038Z
driver log: .aituner/qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z.driver.log

Driver timeline:

harness clean pair start 2026-06-23T01:00:40+00:00
harness clean pair done  2026-06-23T08:21:13+00:00
naive clean pair start   2026-06-23T08:21:13+00:00

Harness side has completed all 8 trials:

Trial	Patch	req/s/GPU	Pass rate
1	`TP=8, DP=1`	0.2879	0.9522
2	`TP=8, max-num-seqs=96`	0.2879	0.9530
3	`TP=8, max-num-batched-tokens=16384, max-num-seqs=96`	0.3085	0.9561
4	`TP=8, max-num-batched-tokens=32768, max-num-seqs=144`	0.2783	0.9543
5	`TP=8, DP=1, max-num-batched-tokens=24576, max-num-seqs=96`	0.2654	0.9513
6	`TP=4, DP=2, max-num-batched-tokens=16384, max-num-seqs=96`	-	-
7	`TP=8, DP=1, max-num-batched-tokens=16384, max-num-seqs=80`	0.3156	0.9505
8	`TP=8, max-num-batched-tokens=32768, max-num-seqs=120`	0.2879	0.9508

Current harness best: trial-0007, 0.3156 req/s/GPU.

Naive side is still running. Current state:

Completed/recorded through trial-0003, with current best 0.2879 req/s/GPU.
trial-0004 is active with TP=8, DP=1, max-num-batched-tokens=8192, max-num-seqs=128.
trial-0004 probe history so far:

threshold	request rate	req/s/GPU	pass rate	feasible	main failures
0.0625	1.5750	0.1969	0.9651	true	TTFT misses and TTFT threshold violations
0.09375	2.3650	0.2956	0.7308	false	`slo_pass_rate_unrecoverable`, TTFT violations
0.078125	1.9567	0.2446	0.9591	true	TTFT misses and TTFT threshold violations
0.0859375	2.1667	0.2708	0.9546	true	TTFT misses and TTFT threshold violations

As of the snapshot, vLLM is still processing requests for trial-0004, so the naive side has not produced its final result or report yet.
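
The probe table can be read as a bracketing search: each probe either raises the highest feasible threshold or lowers the lowest infeasible one. That reading is inferred from the threshold values (each new value is the midpoint of the current bracket) and is an assumption about the search strategy, not a confirmed description of the aituner probe loop. The sketch below replays the recorded probes and re-derives the req/s/GPU column from the request rate, assuming it is the request rate divided by the 8 GPUs in use.

```python
NUM_GPUS = 8  # trial-0004 occupies all 8 H20 GPUs on dash1

# (threshold, request_rate, feasible) copied from the probe history table.
PROBES = [
    (0.0625, 1.5750, True),
    (0.09375, 2.3650, False),
    (0.078125, 1.9567, True),
    (0.0859375, 2.1667, True),
]


def per_gpu(request_rate, num_gpus=NUM_GPUS):
    """Normalize an aggregate request rate to req/s/GPU (4 decimals)."""
    return round(request_rate / num_gpus, 4)


def bracket(probes):
    """Return (highest feasible probe, lowest infeasible probe)."""
    lo = max((p for p in probes if p[2]), key=lambda p: p[0], default=None)
    hi = min((p for p in probes if not p[2]), key=lambda p: p[0], default=None)
    return lo, hi


lo, hi = bracket(PROBES)
print("per-GPU rates:", [per_gpu(rate) for _, rate, _ in PROBES])
print("feasible up to:", per_gpu(lo[1]), "infeasible at:", per_gpu(hi[1]))
```

Under these assumptions, the feasibility boundary for trial-0004 so far lies between 0.2708 and 0.2956 req/s/GPU.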

Prior Qwen235B context

These earlier runs explain why the current 2x2 matters:

Run	Result	What it showed
`qwen235b-prefill-clean-gpt55-dash1-20260621T160712Z`	harness 0.2879, naive 0.3217	The earlier harness stopped too early or refined too weakly; naive found the better final config.
`qwen235b-prefill-seqguard-gpt55-dash1-20260622T064445Z`	harness 0.2879, naive 0.2577	Seq guard prevented the worst early-stop failure but still did not reach the old naive best.
`qwen235b-prefill-jointprobe-harness-dash2-20260622T132010Z`	harness-only 0.3085	Joint `max-num-batched-tokens + max-num-seqs` probe improved over seqguard.
`qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z`	harness 0.3217, naive no feasible	Weak model plus harness now reaches the old best and dominates weak naive.

The current evidence points to the harness needing both:

topology discipline: stay on TP=8, DP=1 for this prefill-heavy 235B setup;
runtime joint probing: tune max-num-batched-tokens and max-num-seqs together instead of stopping after the first feasible TP8 result.
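
To make the two requirements concrete, the sketch below pins the topology and enumerates `max-num-seqs` and `max-num-batched-tokens` as pairs rather than one knob at a time. It is an illustration only: the real harness proposes patches through a model and its logic is not shown in this report. The helper name and the candidate values are assumptions, with the values picked from settings that appear in the trial tables.

```python
from itertools import product

# Topology discipline: keep TP=8, DP=1 fixed for every candidate.
TOPOLOGY = {"TP": 8, "DP": 1}

# Illustrative value sets drawn from the trial tables in this report.
MAX_NUM_SEQS = [80, 96, 128]
MAX_NUM_BATCHED_TOKENS = [16384, 24576]


def joint_candidates(seqs, tokens, topology=TOPOLOGY):
    """Enumerate runtime patches that vary both knobs on a fixed topology."""
    return [
        {**topology, "max-num-seqs": s, "max-num-batched-tokens": t}
        for t, s in product(tokens, seqs)
    ]


candidates = joint_candidates(MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS)
for patch in candidates:
    print(patch)
```

A one-knob-at-a-time search starting from trial 1 would have stalled at trial 2, where changing `max-num-seqs` alone gave no improvement; the paired grid includes both the 0.3085 and 0.3217 configurations from the gpt-5.4-mini arm.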

Open item

The final Qwen235B 2x2 conclusion is blocked on the still-running gpt-5.5 + naive arm on dash1. Once it completes, generate an aggregate report combining:

qwen235b-prefill-2x2-gpt55-dash1-20260623T010038Z
qwen235b-prefill-2x2-gpt54mini-dash3-20260623T010038Z

and then update this progress report into a final ablation report.
