qwen235b Thinking Prefill Harness Ablation, 2026-05-10
Setup
- Host: dash0
- Engine: internal vLLM at /usr/local/bin/vllm
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Trace window: thinking_w20260327_1000
- Request mode: chat, with completion_tokens_override=1 for prefill-only measurement
- SLO: TTFT-only stepped p95 pass target, target pass rate 0.95 (sketched as a step function after this list)
  - input tokens <= 4096: 3000 ms
  - input tokens <= 32768: 6000 ms
  - otherwise: 9000 ms
- Search: sampling_u in [0, 0.125], tolerance 0.001, max probes 6
- Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
- Store root: .aituner-prefill
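Read as pseudocode, the stepped SLO is just a step function of input length. A minimal sketch; the function name ttft_p95_target_ms is ours for illustration, not an aituner symbol:

```python
def ttft_p95_target_ms(input_tokens: int) -> int:
    """Stepped TTFT-only p95 target from the SLO above.

    A workload level passes when at least 95% of requests meet the
    threshold for their input-length bucket (pass rate 0.95).
    """
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000
```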
The two fresh specs were identical except for study_id and llm.use_harness:
- no-harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json
- harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json
Both runs were launched through python3 -m aituner.cli study tune; no proposal or study state was edited manually during tuning.
Result
The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).
Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point under the SLO, either because the engine/probe failed or because every sampled probe was infeasible.
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness raw perf[i] | 0.2029 | NA | NA | 0.3863 | NA | NA | NA | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness raw perf[i] | 0.2029 | NA | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
The raw no-harness curve is therefore not monotonic. The apparent monotonic 12-iter sequence comes only from plotting best-so-far rather than the measured performance of each proposal.
Per-trial details:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
Best-so-far curve, shown only to explain final incumbent selection:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |
For plotting raw perf[i], the failed/infeasible points should stay missing or be rendered as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
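A minimal matplotlib sketch of that convention, using the perf[i] values from the raw table above; None marks the NA/stop points, and matplotlib leaves a gap for them instead of forward-filling from the incumbent:

```python
import matplotlib.pyplot as plt

iters = list(range(1, 13))
# None = "no feasible configuration under the configured SLO" (NA)
# or a trial the harness never launched (stop); matplotlib renders
# None as a gap, so infeasible trials stay visibly missing.
no_harness = [0.2029, None, None, 0.3863, None, None, None,
              0.3879, 0.3892, 0.3896, 0.3900, 0.3900]
harness = [0.2029, None, 0.3863] + [None] * 9

fig, ax = plt.subplots()
ax.plot(iters, no_harness, marker="o", label="no-harness raw perf[i]")
ax.plot(iters, harness, marker="s", label="harness raw perf[i]")
ax.set_xlabel("GPU trial (iteration)")
ax.set_ylabel("best_request_rate_per_gpu (req/s/GPU)")
ax.legend()
plt.show()
```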
Final best:
| Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
|---|---|---|---|---|---|---|
| no-harness | 12 | trial-0011/trial-0012 | TP8, DP1, EP off, max-num-batched-tokens 7936/8064 | 3.1200 | 0.3900 | baseline |
| harness | 3 | trial-0003 | TP8, DP1, EP off | 3.0900 | 0.3863 | -0.96% |
Harness reached 0.38625 req/s/GPU at iter3. No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from 0.38625 to 0.39000 req/s/GPU, an absolute gain of 0.00375 req/s/GPU or 0.97%.
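The 0.97% here and the table's -0.96% are the same absolute delta measured against different bases; a quick check:

```python
best_no_harness = 0.39000  # req/s/GPU, trial-0011/trial-0012
best_harness = 0.38625     # req/s/GPU, trial-0003

delta = best_no_harness - best_harness            # 0.00375 req/s/GPU
gain_over_harness = delta / best_harness          # no-harness's extra polish
deficit_vs_no_harness = -delta / best_no_harness  # harness "Final vs no-harness"
print(f"{gain_over_harness:+.2%} {deficit_vs_no_harness:+.2%}")  # +0.97% -0.96%
```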
What the Harness Did
The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:
- incumbent highest feasible probe: sampling_u=0.123046875
- configured search.high: 0.125
- binary-search resolution: (0.125 - 0.0) / 2^6 = 0.001953125
- gap to search.high: 0.001953125
Because the incumbent was feasible and within one configured search resolution of search.high, the harness emitted harness-stop-0004 before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing search.high; it is not a claim of global engine optimality.
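A minimal sketch of our reading of that saturation rule; should_stop_at_search_high is an illustrative name, not the aituner API:

```python
def should_stop_at_search_high(best_feasible_u: float,
                               low: float, high: float,
                               max_probes: int) -> bool:
    """Generic search-high saturation guard as described above.

    Stop once the incumbent's highest feasible probe sits within one
    binary-search resolution of search.high: the study can no longer
    measure a materially higher workload without raising search.high.
    """
    resolution = (high - low) / 2 ** max_probes
    return best_feasible_u >= high - resolution

# Numbers from this run: resolution = (0.125 - 0.0) / 2**6 = 0.001953125,
# and 0.123046875 >= 0.125 - 0.001953125, so the guard fires (harness-stop-0004).
assert should_stop_at_search_high(0.123046875, 0.0, 0.125, 6)
```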
The harness context also made the LLM response more directed after failure:
- After the baseline, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around sampling_u=0.03515625.
- The LLM first chose TP4/DP2 to use the four idle GPUs while preserving the validated TP4 shard shape. This failed with connection refused, matching the no-harness failure family.
- The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
- No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.
Conclusion
Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.
The limitation is that the generic search-high stop guard stopped before local runtime tuning of max-num-batched-tokens, which no-harness used to recover a small additional 0.97%. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise search.high or disable search-high early stop for a local-polish phase.