# qwen235b Thinking Prefill Harness Ablation, 2026-05-10

## Setup

- Host: `dash0`
- Engine: internal vLLM at `/usr/local/bin/vllm`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Trace window: `thinking_w20260327_1000`
- Request mode: chat, with `completion_tokens_override=1` for prefill-only measurement
- SLO: TTFT-only stepped p95 pass target, target pass rate `0.95`
  - input tokens `<=4096`: `3000 ms`
  - input tokens `<=32768`: `6000 ms`
  - otherwise: `9000 ms`
- Search: `sampling_u` in `[0, 0.125]`, tolerance `0.001`, max probes `6`
- Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
- Store root: `.aituner-prefill`

The two fresh specs were identical except for `study_id` and `llm.use_harness`:

- no-harness: `.aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json`
- harness: `.aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json`

Both runs were launched through `python3 -m aituner.cli study tune`; no proposal or study state was edited manually during tuning.

## Result

The table below gives the raw per-iteration performance for a Fig18-style plot. Use this table as `perf[i]`; do not replace missing points with `max(perf[:i+1])`. Metric: `best_request_rate_per_gpu` from that trial's own `result.json`. `NA` means the proposed config did not produce a feasible point in the measured search range, either because the engine/probe failed or because every sampled probe was infeasible.

Important caveat: these runs were produced before the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent `sampling_u` as the new search floor. If the config was infeasible above that floor, the old worker wrote `NA` without searching below the floor.
Therefore the `NA` entries below are not complete Fig18-quality raw performance points; they mean "no feasible point above the inherited floor." A rerun with the fixed worker is required to fill in their true lower-load performance.

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness raw `perf[i]` | 0.2029 | NA | NA | 0.3863 | NA | NA | NA | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness raw `perf[i]` | 0.2029 | NA | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |

The raw no-harness curve is therefore not monotonic. The apparently monotonic 12-iteration sequence comes only from plotting best-so-far rather than the measured performance of each proposal.

Per-trial details:

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |

Best-so-far curve, shown only to explain the final incumbent selection:

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |

For plotting raw `perf[i]`, the failed/infeasible points should stay missing or be rendered as invalid trials.
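One way to keep the distinction explicit is to store `NA` as a true missing value and derive the best-so-far series separately. A minimal Python sketch (values copied from the tables in this section; `best_so_far` is an illustrative helper, not an AITuner API):

```python
# Raw per-iteration perf[i] for the no-harness run, copied from the table.
# None stands for NA ("no feasible point above the inherited floor"); it is
# deliberately not zero and is never forward-filled.
RAW_NO_HARNESS = [0.2029, None, None, 0.3863, None, None, None,
                  0.3879, 0.3892, 0.3896, 0.3900, 0.3900]

def best_so_far(raw):
    """Running max over feasible points only; None until the first feasible trial."""
    best, out = None, []
    for p in raw:
        if p is not None and (best is None or p > best):
            best = p
        out.append(best)
    return out

# The raw series is not monotonic once NA points are kept as missing;
# only the derived series is, matching the best-so-far table.
print(best_so_far(RAW_NO_HARNESS))
```

Plot `RAW_NO_HARNESS` with gaps (or invalid-trial markers) at the `None` positions, and plot `best_so_far(...)` as a separate curve if incumbent selection needs to be shown.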
If a plotting script requires numeric values, use `0` only with an explicit label stating that it means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.

Final best:

| Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| no-harness | 12 | `trial-0011`/`trial-0012` | TP8, DP1, EP off, `max-num-batched-tokens` 7936/8064 | 3.1200 | 0.3900 | baseline |
| harness | 3 | `trial-0003` | TP8, DP1, EP off | 3.0900 | 0.3863 | -0.96% |

Harness reached `0.38625 req/s/GPU` at iter3. No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from `0.38625` to `0.39000 req/s/GPU`, an absolute gain of `0.00375 req/s/GPU`, or `0.97%`.

## What the Harness Did

The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:

- incumbent's highest feasible probe: `sampling_u=0.123046875`
- configured `search.high`: `0.125`
- binary-search resolution: `(0.125 - 0.0) / 2^6 = 0.001953125`
- gap to search high: `0.001953125`

Because the incumbent was feasible and within one configured search resolution of `search.high`, the harness emitted `harness-stop-0004` before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing `search.high`; it is not a claim of global engine optimality.

The harness context also made the LLM's responses more directed after failure:

- After the baseline, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around `sampling_u=0.03515625`.
- The LLM first chose TP4/DP2 to use the four idle GPUs while preserving the validated TP4 shard shape. This failed with `connection refused`, matching the no-harness failure family.
- The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
- No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.

## Conclusion

The harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.

The limitation is that the generic search-high stop guard fired before local runtime tuning of `max-num-batched-tokens`, which no-harness used to recover a small additional `0.97%`. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise `search.high` or disable the search-high early stop for a local-polish phase.
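For reference, the search-high stop guard discussed above reduces to a small amount of arithmetic. A hypothetical Python sketch of the rule (illustrative names, not AITuner's actual code), using this study's numbers:

```python
# Hypothetical sketch of the generic search-high saturation rule.

def search_resolution(low: float, high: float, max_probes: int) -> float:
    # Each binary-search probe halves the bracket, so after max_probes
    # probes the achievable resolution is (high - low) / 2**max_probes.
    return (high - low) / 2 ** max_probes

def should_stop(incumbent_u: float, low: float, high: float,
                max_probes: int) -> bool:
    # Stop when the incumbent's highest feasible probe sits within one
    # search resolution of the configured search.high: the study cannot
    # measure a materially higher workload without raising search.high.
    return high - incumbent_u <= search_resolution(low, high, max_probes)

# Numbers from this study: resolution (0.125 - 0.0) / 2**6 = 0.001953125,
# incumbent sampling_u = 0.123046875, gap = 0.001953125 -> stop.
print(search_resolution(0.0, 0.125, 6))          # 0.001953125
print(should_stop(0.123046875, 0.0, 0.125, 6))   # True
```

The same check also shows why raising `search.high` re-enables tuning: with a larger `high`, the gap to the incumbent exceeds one resolution and `should_stop` returns `False` again.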