qwen235b Thinking Prefill Harness Ablation, 2026-05-10

Setup

  • Host: dash0
  • Engine: internal vLLM at /usr/local/bin/vllm
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Trace window: thinking_w20260327_1000
  • Request mode: chat, with completion_tokens_override=1 for prefill-only measurement
  • SLO: TTFT-only stepped p95 pass target, target pass rate 0.95
    • input tokens <=4096: 3000 ms
    • input tokens <=32768: 6000 ms
    • otherwise: 9000 ms
  • Search: sampling_u in [0, 0.125], tolerance 0.001, max probes 6
  • Trial budget: no-harness allowed 12 GPU trials; harness allowed 12 but could stop early
  • Store root: .aituner-prefill
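
The stepped SLO above can be read as a lookup from input length to a TTFT budget, with a probe counted as feasible when at least 95% of its requests meet their budget. A minimal sketch under that reading; `ttft_limit_ms` and `passes_slo` are hypothetical names for illustration, not aituner functions, and the exact pass-rate computation in aituner may differ:

```python
def ttft_limit_ms(input_tokens: int) -> int:
    """Stepped TTFT budget in milliseconds, taken from the Setup list."""
    if input_tokens <= 4096:
        return 3000
    if input_tokens <= 32768:
        return 6000
    return 9000


def passes_slo(requests, target_pass_rate=0.95):
    """requests: iterable of (input_tokens, ttft_ms) pairs.

    Assumption: a probe is feasible when the share of requests that meet
    their stepped budget is at least target_pass_rate.
    """
    requests = list(requests)
    if not requests:
        return False
    ok = sum(1 for tokens, ttft in requests if ttft <= ttft_limit_ms(tokens))
    return ok / len(requests) >= target_pass_rate
```

For example, a 5000-token request with a 4500 ms TTFT passes, because it falls under the 6000 ms step rather than the 3000 ms one.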

The two fresh specs were identical except for study_id and llm.use_harness:

  • no-harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-noharness-rerun2-20260510.json
  • harness: .aituner-prefill/specs/dash0-qwen235b-prefill-thinking-run1-ttft-harness-ablation-12iter-harness-rerun2-20260510.json

Both runs were launched through python3 -m aituner.cli study tune; no proposal or study state was edited manually during tuning.

Result

Throughput is best_request_rate_per_gpu for each trial. A "-" means the trial did not produce a feasible point; "stop" means the harness had already stopped, so no GPU trial was run.

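| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness, per-trial | 0.2029 | - | - | 0.3863 | - | - | - | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness, per-trial | 0.2029 | - | 0.3863 | stop | stop | stop | stop | stop | stop | stop | stop | stop |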

Best-so-far curve:

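| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 0.2029 | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3879 | 0.3892 | 0.3896 | 0.3900 | 0.3900 |
| harness | 0.2029 | 0.2029 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 | 0.3863 |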
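
The best-so-far curve is a running maximum over the per-trial values, with infeasible or skipped trials contributing nothing. A short sketch that rebuilds it from the per-trial numbers; the variable and function names are illustrative, and `None` stands in for both "-" and "stop":

```python
from itertools import accumulate

# Per-trial best_request_rate_per_gpu; None marks an infeasible or skipped trial.
no_harness = [0.2029, None, None, 0.3863, None, None, None,
              0.3879, 0.3892, 0.3896, 0.3900, 0.3900]
harness = [0.2029, None, 0.3863] + [None] * 9  # stopped after three GPU trials


def best_so_far(per_trial):
    """Running maximum, treating None as 'no improvement'."""
    filled = (x if x is not None else float("-inf") for x in per_trial)
    return list(accumulate(filled, max))


print(best_so_far(no_harness))
print(best_so_far(harness))
```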

Final best:

| Variant | GPU trials spent | Best trial | Best config summary | Best req/s | Best req/s/GPU | Final vs no-harness |
|---|---|---|---|---|---|---|
| no-harness | 12 | trial-0011/trial-0012 | TP8, DP1, EP off, max-num-batched-tokens 7936/8064 | 3.1200 | 0.3900 | baseline |
| harness | 3 | trial-0003 | TP8, DP1, EP off | 3.0900 | 0.3863 | -0.96% |

Harness reached 0.38625 req/s/GPU at iter3 (shown rounded as 0.3863 in the tables). No-harness first reached the same TP8 family at iter4, then spent eight more GPU trials to move from 0.38625 to 0.39000 req/s/GPU: an absolute gain of 0.00375 req/s/GPU, or 0.97% relative to the harness result.
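
The 0.97% gain here and the -0.96% in the table describe the same gap measured against different baselines. A quick arithmetic check; the GPU count of 8 is an assumption inferred from TP8/DP1 and from 3.12 req/s = 8 x 0.39 req/s/GPU:

```python
GPUS = 8  # assumption: inferred from TP8/DP1 and the req/s vs req/s/GPU columns

harness_best = 0.38625     # req/s/GPU at trial-0003
no_harness_best = 0.39000  # req/s/GPU at trial-0011/trial-0012

abs_gain = no_harness_best - harness_best
gain_vs_harness = abs_gain / harness_best  # no-harness gain over harness
deficit_vs_no_harness = (harness_best - no_harness_best) / no_harness_best

print(f"{abs_gain:.5f} {gain_vs_harness:.2%} {deficit_vs_no_harness:.2%}")
# prints: 0.00375 0.97% -0.96%
```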

What the Harness Did

The harness did not use a testcase-specific throughput threshold. The stop decision came from the generic search-high saturation rule:

  • incumbent highest feasible probe: sampling_u=0.123046875
  • configured search.high: 0.125
  • binary-search resolution: (0.125 - 0.0) / 2^6 = 0.001953125
  • gap to search high: 0.001953125
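
The incumbent value 0.123046875 is what a plain midpoint bisection over [0, 0.125] would probe last if all six probes were feasible. The sketch below illustrates that; the source does not state aituner's actual probe schedule, so `bisect_probes`, its tolerance handling, and the always-feasible oracle are assumptions for illustration only:

```python
def bisect_probes(low, high, is_feasible, max_probes=6, tolerance=0.001):
    """Return (probes, highest_feasible) for a midpoint bisection."""
    probes, best = [], None
    for _ in range(max_probes):
        if high - low <= tolerance:
            break  # interval already narrower than the tolerance
        mid = (low + high) / 2
        probes.append(mid)
        if is_feasible(mid):
            best, low = mid, mid  # feasible: search higher
        else:
            high = mid            # infeasible: search lower
    return probes, best


# Hypothetical oracle: every probe meets the SLO.
probes, best = bisect_probes(0.0, 0.125, lambda u: True)
print(probes)
print(best, 0.125 - best)
```

Under these assumptions the sixth probe is 0.123046875, and its gap to search.high equals 0.125 / 2^6 = 0.001953125, matching the figures in the list above.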

Because the incumbent was feasible and within one configured search resolution of search.high, the harness emitted harness-stop-0004 before launching another GPU trial. This means the current study could no longer measure a materially higher workload without increasing search.high; it is not a claim of global engine optimality.
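
The stop condition described above can be sketched as a single predicate. This is a reading of the rule as stated, not the harness's actual code; the function name and signature are hypothetical:

```python
def search_high_saturated(incumbent_u, search_low, search_high, max_probes):
    """True when a feasible incumbent lies within one search resolution of search.high.

    incumbent_u is the highest feasible sampling_u so far, or None if no
    probe has been feasible yet.
    """
    resolution = (search_high - search_low) / 2 ** max_probes
    return incumbent_u is not None and (search_high - incumbent_u) <= resolution


# Values from this study: the incumbent sits exactly one resolution below search.high.
print(search_high_saturated(0.123046875, 0.0, 0.125, 6))  # True
```

Because the rule depends only on the search bounds and probe count, it carries no workload-specific threshold, which is why raising search.high is enough to let tuning continue.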

The harness context also made the LLM response more directed after failure:

  • After baseline, it exposed the TTFT-only prefill bottleneck and the sharp queueing knee around sampling_u=0.03515625.
  • The LLM first chose TP4/DP2 to use the idle 4 GPUs while preserving the validated TP4 shard shape. This failed with connection refused, matching the no-harness failure family.
  • The next harness prompt included that failure, and the LLM switched to TP8/DP1 with EP off, explicitly avoiding the failed DP2 family.
  • No-harness inserted an extra EP4 launch-failure trial before reaching TP8/DP1.

Conclusion

Harness accelerated convergence mainly through early stopping, not by finding a much better final config on this setup. It reduced GPU trials from 12 to 3 while preserving 99.0% of the no-harness final throughput. It also reached the first strong TP8 point one trial earlier than no-harness.

The limitation is that the generic search-high stop guard stopped before local runtime tuning of max-num-batched-tokens, which no-harness used to recover a small additional 0.97%. For this setup, that tradeoff is acceptable if the goal is fast convergence under a fixed measurement ceiling; if the goal is exact final throughput, the next study should raise search.high or disable search-high early stop for a local-polish phase.