qwen235b Thinking Prefill Harness Ablation, 2026-05-10

Setup

Host: dash0
Engine: internal vLLM at /usr/local/bin/vllm
Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
Trace window: thinking_w20260327_1000
Request mode: chat, with completion_tokens_override=1 for prefill-only measurement
SLO: TTFT only, with a stepped per-request limit by input length and a target pass rate of 0.95 (p95)
- input tokens <=4096: 3000 ms
- input tokens <=32768: 6000 ms
- otherwise: 9000 ms
Search: sampling_u in [0, 0.125], tolerance 0.001, max probes 6
Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
Store root: .aituner-prefill
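
The stepped TTFT SLO listed in this setup can be sketched as follows. This is a minimal illustration, not AITuner's own code: the function names are hypothetical, and it assumes the pass rate is the fraction of requests whose TTFT is at or below the limit for their input length.

```python
from typing import Iterable, Tuple


def ttft_limit_ms(input_tokens: int) -> int:
    """Stepped TTFT limit in milliseconds, taken from the Setup list."""
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000


def slo_passes(requests: Iterable[Tuple[int, float]],
               target_pass_rate: float = 0.95) -> bool:
    """Check a probe against the SLO.

    requests holds (input_tokens, ttft_ms) pairs. Assumption: a request
    passes when ttft_ms <= its stepped limit, and the probe is feasible
    when the share of passing requests reaches target_pass_rate.
    """
    pairs = list(requests)
    if not pairs:
        return False  # no measurements means no evidence of feasibility
    passed = sum(1 for tokens, ttft in pairs if ttft <= ttft_limit_ms(tokens))
    return passed / len(pairs) >= target_pass_rate
```

For example, `slo_passes([(2048, 2500.0), (40000, 8800.0)])` evaluates both requests against different limits (3000 ms and 9000 ms).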

The two fresh specs were identical except study_id and llm.use_harness:

no-harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json
harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json

Both runs were launched through python3 -m aituner.cli study tune; no proposal or study state was edited manually during tuning.

Result

The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).

Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point in the measured search range, either because the engine/probe failed or because every sampled probe was infeasible.

Important caveat: these runs were produced before the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent sampling_u as the new search floor. If the config was infeasible above that floor, the old worker wrote NA without searching below the floor. Therefore the NA entries below are not complete Fig18-quality raw performance points; they are "no feasible point above inherited floor." A rerun with the fixed worker is required to fill their true lower-load performance.
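
The difference between the old and fixed worker behavior can be sketched as follows. This is a hypothetical illustration, not the AITuner worker code: the function names are invented, the search is assumed to be a plain bisection, and feasibility is assumed to be monotone in sampling_u (feasible below some knee, infeasible above it).

```python
from typing import Callable, Optional


def highest_feasible(feasible: Callable[[float], bool],
                     low: float, high: float,
                     max_probes: int) -> Optional[float]:
    """Bisect for the highest feasible sampling_u in (low, high).

    Returns None when no probe was feasible.
    """
    best = None
    lo, hi = low, high
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if feasible(mid):
            best, lo = mid, mid  # feasible: move the lower bound up
        else:
            hi = mid             # infeasible: move the upper bound down
    return best


def search_with_fallback(feasible: Callable[[float], bool],
                         low: float, high: float, max_probes: int,
                         inherited_floor: Optional[float] = None
                         ) -> Optional[float]:
    """Search above the inherited floor first, then fall back below it."""
    floor = low if inherited_floor is None else inherited_floor
    best = highest_feasible(feasible, floor, high, max_probes)
    if best is None and floor > low:
        # The old worker effectively stopped here and recorded NA.
        # The fallback searches the lower range instead.
        best = highest_feasible(feasible, low, floor, max_probes)
    return best
```

With a config whose knee sits below the inherited floor, the first search returns None (the NA case in the table), while the fallback still recovers a lower-load feasible point.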

Variant	iter1	iter2	iter3	iter4	iter5	iter6	iter7	iter8	iter9	iter10	iter11	iter12
no-harness raw `perf[i]`	0.2029	NA	NA	0.3863	NA	NA	NA	0.3879	0.3892	0.3896	0.3900	0.3900
harness raw `perf[i]`	0.2029	NA	0.3863	stop	stop	stop	stop	stop	stop	stop	stop	stop

The raw no-harness curve is therefore not monotonic. The apparent monotonic 12-iter sequence comes only from plotting best-so-far rather than the measured performance of each proposal.

Per-trial details:

Variant	iter1	iter2	iter3	iter4	iter5	iter6	iter7	iter8	iter9	iter10	iter11	iter12
no-harness, per-trial	0.2029	-	-	0.3863	-	-	-	0.3879	0.3892	0.3896	0.3900	0.3900
harness, per-trial	0.2029	-	0.3863	stop	stop	stop	stop	stop	stop	stop	stop	stop

Best-so-far curve, shown only to explain final incumbent selection:

Variant	iter1	iter2	iter3	iter4	iter5	iter6	iter7	iter8	iter9	iter10	iter11	iter12
no-harness	0.2029	0.2029	0.2029	0.3863	0.3863	0.3863	0.3863	0.3879	0.3892	0.3896	0.3900	0.3900
harness	0.2029	0.2029	0.3863	0.3863	0.3863	0.3863	0.3863	0.3863	0.3863	0.3863	0.3863	0.3863

For plotting raw perf[i], the failed/infeasible points should stay missing or be rendered as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
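
A minimal sketch of keeping the two curves separate, assuming the table cells are read as strings. The helper names are hypothetical; the values are the no-harness row from the raw table.

```python
from typing import List, Optional


def parse_perf(cells: List[str]) -> List[Optional[float]]:
    """Convert table cells to floats; NA, - and stop become None (missing)."""
    missing = {"NA", "-", "stop"}
    return [None if cell in missing else float(cell) for cell in cells]


def best_so_far(perf: List[Optional[float]]) -> List[Optional[float]]:
    """Running maximum over feasible points, kept separate from raw perf."""
    out: List[Optional[float]] = []
    best: Optional[float] = None
    for value in perf:
        if value is not None and (best is None or value > best):
            best = value
        out.append(best)
    return out


no_harness_raw = parse_perf(
    "0.2029 NA NA 0.3863 NA NA NA 0.3879 0.3892 0.3896 0.3900 0.3900".split()
)
no_harness_best = best_so_far(no_harness_raw)

# Raw points stay missing; only the derived curve is forward-filled.
print(no_harness_raw[1], no_harness_best[1])
```

Plotting libraries generally skip missing values if they are passed as NaN, so the raw series can be drawn with gaps or with explicit invalid-trial markers while the best-so-far series is drawn as a separate line.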

Final best:

Variant	GPU trials spent	Best trial	Best config summary	Best req/s	Best req/s/GPU	Final vs no-harness
no-harness	12	`trial-0011`/`trial-0012`	TP8, DP1, EP off, `max-num-batched-tokens` 7936/8064	3.1200	0.3900	baseline
harness	3	`trial-0003`	TP8, DP1, EP off	3.0900	0.3863	-0.96%

Harness reached 0.38625 req/s/GPU at iter3. No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from 0.38625 to 0.39000 req/s/GPU, an absolute gain of 0.00375 req/s/GPU or 0.97%.
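
The percentages quoted here and in the Final best table follow from the two per-GPU rates. A short arithmetic check, assuming 8 GPUs (inferred from the TP8/DP1 config; the variable names are illustrative):

```python
harness = 0.38625      # req/s/GPU at harness iter3
no_harness = 0.39000   # req/s/GPU at no-harness iter11/iter12
num_gpus = 8           # assumption: TP8 x DP1

abs_gain = no_harness - harness                    # extra req/s/GPU from 8 more trials
rel_gain_pct = abs_gain / harness * 100            # gain relative to the harness result
deficit_pct = (harness - no_harness) / no_harness * 100  # harness vs no-harness
retained_pct = harness / no_harness * 100          # share of final throughput kept

print(f"absolute gain: {abs_gain:.5f} req/s/GPU")
print(f"relative gain: {rel_gain_pct:.2f}%")
print(f"harness vs no-harness: {deficit_pct:.2f}%")
print(f"throughput retained: {retained_pct:.1f}%")
print(f"total req/s: {harness * num_gpus:.4f} vs {no_harness * num_gpus:.4f}")
```

The 0.97% and -0.96% figures differ only in which run is used as the denominator.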

What the Harness Did

The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:

incumbent highest feasible probe: sampling_u=0.123046875
configured search.high: 0.125
binary-search resolution: (0.125 - 0.0) / 2^6 = 0.001953125
gap to search high: 0.001953125

Because the incumbent was feasible and within one configured search resolution of search.high, the harness emitted harness-stop-0004 before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing search.high; it is not a claim of global engine optimality.
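
The saturation check described above can be sketched as follows. The function names are hypothetical and the rule is reconstructed from the numbers in this section, so treat it as an illustration rather than the harness implementation.

```python
def search_resolution(low: float, high: float, max_probes: int) -> float:
    """Smallest sampling_u step a bisection with max_probes probes can resolve."""
    return (high - low) / (2 ** max_probes)


def should_stop_at_search_high(incumbent_u: float, low: float,
                               high: float, max_probes: int) -> bool:
    """True when the feasible incumbent is within one resolution of search.high.

    Assumption: the caller has already confirmed the incumbent probe is feasible.
    """
    gap = high - incumbent_u
    return gap <= search_resolution(low, high, max_probes)


# Values from this study: search range [0, 0.125], max probes 6.
print(search_resolution(0.0, 0.125, 6))
print(should_stop_at_search_high(0.123046875, 0.0, 0.125, 6))
```

Under this rule the baseline knee at sampling_u=0.03515625 would not trigger a stop, because its gap to search.high is far larger than one resolution step.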

The harness context also made the LLM response more directed after failure:

- After the baseline trial, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around sampling_u=0.03515625.
- The LLM first chose TP4/DP2 to use the 4 idle GPUs while preserving the validated TP4 shard shape. This failed with connection refused, matching the no-harness failure family.
- The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
- No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.

Conclusion

Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.

The limitation is that the generic search-high stop guard fired before any local runtime tuning of max-num-batched-tokens, which no-harness used to recover an additional 0.97%. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling. If the goal is exact final throughput, the next study should raise search.high or disable the search-high early stop for a local-polish phase.
