qwen235b Thinking Prefill Harness Ablation, 2026-05-10
Setup
- Host: dash0
- Engine: internal vLLM at /usr/local/bin/vllm
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Trace window: thinking_w20260327_1000
- Request mode: chat, with completion_tokens_override=1 for prefill-only measurement
- SLO: TTFT-only stepped p95 pass target, target pass rate 0.95
  - input tokens <=4096: 3000 ms
  - input tokens <=32768: 6000 ms
  - otherwise: 9000 ms
- Search: sampling_u in [0, 0.125], tolerance 0.001, max probes 6
- Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
- Store root: .aituner-prefill
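The stepped SLO above can be read as a per-request TTFT limit chosen by input length, plus a pass-rate threshold over all requests in a probe. The following is a minimal sketch of that reading, not AITuner's implementation: the function names, the (input_tokens, ttft_ms) record shape, and the inclusive comparisons are assumptions made for illustration.

```python
def ttft_limit_ms(input_tokens):
    """Return the TTFT limit for a request, using the stepped thresholds from Setup."""
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000


def probe_meets_slo(requests, target_pass_rate=0.95):
    """Check a probe against the stepped TTFT SLO.

    `requests` is a hypothetical list of (input_tokens, ttft_ms) pairs.
    A request passes if its TTFT is at or below the limit for its input size
    (inclusive comparison is an assumption). The probe is feasible if the
    fraction of passing requests reaches the target pass rate.
    """
    requests = list(requests)
    if not requests:
        return False  # assumption: an empty probe is treated as infeasible
    passed = sum(1 for tokens, ttft in requests if ttft <= ttft_limit_ms(tokens))
    return passed / len(requests) >= target_pass_rate
```

Under this reading, a probe where every short request takes 3500 ms fails even though 3500 ms would be acceptable for a longer input, because each request is judged against its own step.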
The two fresh specs were identical except study_id and llm.use_harness:
- no-harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json
- harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json
Both runs were launched through python3 -m aituner.cli study tune; no proposal or study state was edited manually during tuning.
Result
The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).
Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point in the measured search range, either because the engine/probe failed or because every sampled probe was infeasible.
Important caveat: these runs were produced before the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent sampling_u as the new search floor. If the config was infeasible above that floor, the old worker wrote NA without searching below the floor. Therefore the NA entries below are not complete Fig18-quality raw performance points; they are "no feasible point above inherited floor." A rerun with the fixed worker is required to fill their true lower-load performance.
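The inherited-floor behavior described in this caveat can be illustrated with a small sketch. It assumes the worker bisects on sampling_u by probing midpoints; the function names and the decision to spend a second probe budget on the fallback are hypothetical and are not taken from the AITuner code.

```python
def highest_feasible(is_feasible, low, high, max_probes):
    """Bisect for the highest feasible load in (low, high).

    Returns None when no probe is feasible, which corresponds to an NA entry.
    """
    best = None
    for _ in range(max_probes):
        mid = (low + high) / 2
        if is_feasible(mid):
            best, low = mid, mid   # feasible: remember it and search higher
        else:
            high = mid             # infeasible: search lower
    return best


def search_with_fallback(is_feasible, floor, low, high, max_probes):
    """Search above the inherited floor first, then fall back below it.

    Old behavior (as described in the caveat): only the first call, so an
    infeasible range above the floor yields NA.
    Assumed fixed behavior: retry in [low, floor] to recover the true
    lower-load operating point.
    """
    best = highest_feasible(is_feasible, floor, high, max_probes)
    if best is None and floor > low:
        best = highest_feasible(is_feasible, low, floor, max_probes)
    return best
```

For example, with a config that is only feasible up to sampling_u = 0.03 and an inherited floor of 0.1, the floor-only search returns None, while the fallback search returns a feasible value below 0.03. Under the same midpoint assumption, six all-feasible probes over [0, 0.125] end at 0.123046875, the value this report lists as the incumbent's highest feasible probe.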
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness raw perf[i] | 0.2029 | NA | NA | 0.3863 | NA | NA | NA | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness raw perf[i] | 0.2029 | NA | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
The raw no-harness curve is therefore not monotonic. The apparent monotonic 12-iter sequence comes only from plotting best-so-far rather than the measured performance of each proposal.
Per-trial details:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
Best-so-far curve, shown only to explain final incumbent selection:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |
For plotting raw perf[i], the failed/infeasible points should stay missing or be rendered as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
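As a sketch of the distinction between the two curves, the snippet below derives the best-so-far series from the raw no-harness values in the tables above and maps missing points to NaN, which most plotting libraries render as gaps rather than as zeros. The variable and function names are illustrative only.

```python
import math

# Raw per-iteration best_request_rate_per_gpu for no-harness; None marks NA.
NO_HARNESS_RAW = [0.2029, None, None, 0.3863, None, None, None,
                  0.3879, 0.3892, 0.3896, 0.3900, 0.3900]


def best_so_far(perf):
    """Running maximum that skips missing points (incumbent selection only)."""
    out, best = [], None
    for value in perf:
        if value is not None and (best is None or value > best):
            best = value
        out.append(best)
    return out


def raw_for_plot(perf):
    """Keep failed or infeasible trials as NaN so they are not forward-filled."""
    return [math.nan if value is None else value for value in perf]


incumbent_curve = best_so_far(NO_HARNESS_RAW)
raw_curve = raw_for_plot(NO_HARNESS_RAW)
```

Here incumbent_curve reproduces the no-harness row of the best-so-far table, while raw_curve keeps seven measured points and five gaps, which is the series intended for perf[i].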
Final best:
| Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
|---|---|---|---|---|---|---|
| no-harness | 12 | trial-0011/trial-0012 | TP8, DP1, EP off, max-num-batched-tokens 7936/8064 | 3.1200 | 0.3900 | baseline |
| harness | 3 | trial-0003 | TP8, DP1, EP off | 3.0900 | 0.3863 | -0.96% |
Harness reached 0.38625 req/s/GPU at iter3. No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from 0.38625 to 0.39000 req/s/GPU, an absolute gain of 0.00375 req/s/GPU or 0.97%.
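The percentages in the table and in the paragraph above use two different baselines, which is why -0.96% and 0.97% both appear. A short worked calculation, assuming the 8-GPU count implied by TP8/DP1 and by 3.12 / 0.39:

```python
NUM_GPUS = 8  # assumption: TP8 x DP1 occupies all 8 GPUs on the host

harness_per_gpu = 3.09 / NUM_GPUS      # 0.38625 req/s/GPU
no_harness_per_gpu = 3.12 / NUM_GPUS   # 0.39000 req/s/GPU

# Absolute gain from the eight extra no-harness trials.
abs_gain = no_harness_per_gpu - harness_per_gpu

# Gain relative to the harness result (the "0.97%" figure).
gain_vs_harness_pct = abs_gain / harness_per_gpu * 100

# Harness shortfall relative to the no-harness final (the "-0.96%" figure).
harness_vs_no_harness_pct = (harness_per_gpu - no_harness_per_gpu) / no_harness_per_gpu * 100

# Share of the no-harness final throughput retained by harness.
retained_pct = harness_per_gpu / no_harness_per_gpu * 100

print(f"{abs_gain:.5f} {gain_vs_harness_pct:.2f} "
      f"{harness_vs_no_harness_pct:.2f} {retained_pct:.1f}")
# prints: 0.00375 0.97 -0.96 99.0
```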
What the Harness Did
The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:
- incumbent highest feasible probe: sampling_u=0.123046875
- configured search.high: 0.125
- binary-search resolution: (0.125 - 0.0) / 2^6 = 0.001953125
- gap to search high: 0.001953125
Because the incumbent was feasible and within one configured search resolution of search.high, the harness emitted harness-stop-0004 before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing search.high; it is not a claim of global engine optimality.
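The saturation rule can be written down as a small predicate. This is a sketch of the rule as described in this section, with hypothetical function and parameter names; the harness's actual code may differ, for example in whether the comparison is inclusive.

```python
def search_resolution(low, high, max_probes):
    """Smallest load step a bisection with max_probes probes can resolve."""
    return (high - low) / 2 ** max_probes


def search_high_saturated(incumbent_u, incumbent_feasible, low, high, max_probes):
    """True when the incumbent is feasible and within one resolution step of
    search.high, so the study cannot measure a materially higher load."""
    if not incumbent_feasible:
        return False
    # Inclusive comparison is an assumption based on the "within one
    # configured search resolution" wording.
    return (high - incumbent_u) <= search_resolution(low, high, max_probes)


# Values from this study: the gap equals the resolution, so the rule fires.
stop = search_high_saturated(0.123046875, True, low=0.0, high=0.125, max_probes=6)
```

With the baseline's knee near sampling_u=0.03515625 the same predicate returns False, because the gap to search.high is far larger than one resolution step. That is consistent with the rule firing only after the TP8/DP1 trial.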
The harness context also made the LLM response more directed after failure:
- After baseline, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around sampling_u=0.03515625.
- The LLM first chose TP4/DP2 to use the idle 4 GPUs while preserving the validated TP4 shard shape. This failed with connection refused, matching the no-harness failure family.
- The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
- No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.
Conclusion
Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.
The limitation is that the generic search-high stop guard stopped before local runtime tuning of max-num-batched-tokens, which no-harness used to recover a small additional 0.97%. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise search.high or disable search-high early stop for a local-polish phase.