qwen235b Thinking Prefill Harness Ablation, 2026-05-10
Setup
- Host: dash0
- Engine: internal vLLM at /usr/local/bin/vllm
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Trace window: thinking_w20260327_1000
- Request mode: chat, with completion_tokens_override=1 for prefill-only measurement
- SLO: TTFT-only stepped p95 pass target, target pass rate 0.95 (sketched as a step function after this list)
  - input tokens <= 4096: 3000 ms
  - input tokens <= 32768: 6000 ms
  - otherwise: 9000 ms
- Search: sampling_u in [0, 0.125], tolerance 0.001, max probes 6
- Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
- Store root: .aituner-prefill
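Read as pseudocode, the stepped SLO is just a step function of input length. A minimal sketch; the function name ttft_p95_target_ms is ours for illustration, not an aituner symbol:

```python
def ttft_p95_target_ms(input_tokens: int) -> int:
    """Stepped TTFT-only p95 target from the SLO above.

    A workload level passes when at least 95% of requests meet the
    threshold for their input-length bucket (pass rate 0.95).
    """
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000
```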
The two fresh specs were identical except for study_id and llm.use_harness:
- no-harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json
- harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json
Both runs were launched through python3 -m aituner.cli study tune; no proposal or study state was edited manually during tuning.
Result
The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).
Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point under the SLO, either because the engine/probe failed or because every sampled probe was infeasible.
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness raw perf[i] | 0.2029 | NA | NA | 0.3863 | NA | NA | NA | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness raw perf[i] | 0.2029 | NA | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
The raw no-harness curve is therefore not monotonic. The apparent monotonic 12-iter sequence comes only from plotting best-so-far rather than the measured performance of each proposal.
Per-trial details:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |
Best-so-far curve, shown only to explain final incumbent selection:
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |
For plotting raw perf[i], the failed/infeasible points should stay missing or be rendered as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
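A minimal matplotlib sketch of that convention, using the perf[i] values from the raw table above; None marks the NA/stop points, and matplotlib leaves a gap for them instead of forward-filling from the incumbent:

```python
import matplotlib.pyplot as plt

iters = list(range(1, 13))
# None = "no feasible configuration under the configured SLO" (NA)
# or a trial the harness never launched (stop); matplotlib renders
# None as a gap, so infeasible trials stay visibly missing.
no_harness = [0.2029, None, None, 0.3863, None, None, None,
              0.3879, 0.3892, 0.3896, 0.3900, 0.3900]
harness = [0.2029, None, 0.3863] + [None] * 9

fig, ax = plt.subplots()
ax.plot(iters, no_harness, marker="o", label="no-harness raw perf[i]")
ax.plot(iters, harness, marker="s", label="harness raw perf[i]")
ax.set_xlabel("GPU trial (iteration)")
ax.set_ylabel("best_request_rate_per_gpu (req/s/GPU)")
ax.legend()
plt.show()
```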
Final best:
| Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
|---|---|---|---|---|---|---|
| no-harness | 12 | trial-0011/trial-0012 | TP8, DP1, EP off, max-num-batched-tokens 7936/8064 | 3.1200 | 0.3900 | baseline |
| harness | 3 | trial-0003 | TP8, DP1, EP off | 3.0900 | 0.3863 | -0.96% |
Harness reached 0.38625 req/s/GPU at iter3. No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from 0.38625 to 0.39000 req/s/GPU, an absolute gain of 0.00375 req/s/GPU or 0.97%.
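The 0.97% here and the table's -0.96% are the same absolute delta measured against different bases; a quick check:

```python
best_no_harness = 0.39000  # req/s/GPU, trial-0011/trial-0012
best_harness = 0.38625     # req/s/GPU, trial-0003

delta = best_no_harness - best_harness            # 0.00375 req/s/GPU
gain_over_harness = delta / best_harness          # no-harness's extra polish
deficit_vs_no_harness = -delta / best_no_harness  # harness "Final vs no-harness"
print(f"{gain_over_harness:+.2%} {deficit_vs_no_harness:+.2%}")  # +0.97% -0.96%
```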
What the Harness Did
The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:
- incumbent highest feasible probe: sampling_u=0.123046875
- configured search.high: 0.125
- binary-search resolution: (0.125 - 0.0) / 2^6 = 0.001953125
- gap to search.high: 0.001953125
Because the incumbent was feasible and within one configured search resolution of search.high, the harness emitted harness-stop-0004 before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing search.high; it is not a claim of global engine optimality.
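A minimal sketch of our reading of that saturation rule; should_stop_at_search_high is an illustrative name, not the aituner API:

```python
def should_stop_at_search_high(best_feasible_u: float,
                               low: float, high: float,
                               max_probes: int) -> bool:
    """Generic search-high saturation guard as described above.

    Stop once the incumbent's highest feasible probe sits within one
    binary-search resolution of search.high: the study can no longer
    measure a materially higher workload without raising search.high.
    """
    resolution = (high - low) / 2 ** max_probes
    return best_feasible_u >= high - resolution

# Numbers from this run: resolution = (0.125 - 0.0) / 2**6 = 0.001953125,
# and 0.123046875 >= 0.125 - 0.001953125, so the guard fires (harness-stop-0004).
assert should_stop_at_search_high(0.123046875, 0.0, 0.125, 6)
```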
The harness context also made the LLM response more directed after failure:
- After the baseline, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around sampling_u=0.03515625.
- The LLM first chose TP4/DP2 to use the four idle GPUs while preserving the validated TP4 shard shape. This failed with connection refused, matching the no-harness failure family.
- The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
- No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.
Conclusion
Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.
The limitation is that the generic search-high stop guard stopped before local runtime tuning of max-num-batched-tokens, which no-harness used to recover a small additional 0.97%. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise search.high or disable search-high early stop for a local-polish phase.