qwen27b-chat-0-8k Current-Config Fig18 Plan
Question
The earlier tables used best-so-far throughput. That is useful for deciding the best deployable incumbent, but it hides bad proposals because the curve is monotonic by construction. To judge whether the harness makes tuning more directional, the primary curve must be each iteration's measured current config performance.
Why Final Performance Can Be Close
Harness and no-harness can converge to similar final throughput when the search
space contains one dominant simple family. In this setup the dominant family is
TP=2, DP=1 over the run_qwen27b.sh baseline. The no-harness LLM can still
eventually discover that family within 12 iterations, so final best performance
can be close.
What the harness is expected to improve is not necessarily the final 12-iter maximum. It should improve:
- iterations-to-first-good-config;
- number of worse or infeasible proposals after an incumbent is found;
- measured-current config oscillation;
- early-stop behavior once adjacent harness probes no longer justify more GPU trials.
Metrics
- measured-current: each trial's own feasible request_rate_per_gpu. Failed or no-feasible-point trials are recorded as NA.
- accepted-incumbent: best deployable value after each trial. This is the standard best-so-far curve and is monotonic by definition.
- iters-to-best: first iteration where the final best value or equivalent config family appears.
- wasted-trials-after-best: trials after first best that are worse, infeasible, or no-feasible-point.
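The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the tuner's own code: the function names are hypothetical, NA is represented as `None`, and `iters_to_best` matches only on the best value (the "equivalent config family" case would need config metadata that is not modeled here). The sample series is the no-harness run9 measured-current row reported in this document.

```python
from typing import List, Optional

Curve = List[Optional[float]]  # None stands for an NA trial


def accepted_incumbent(measured: Curve) -> Curve:
    """Best-so-far curve: running maximum over feasible trials."""
    best: Optional[float] = None
    out: Curve = []
    for value in measured:
        if value is not None and (best is None or value > best):
            best = value
        out.append(best)
    return out


def iters_to_best(measured: Curve) -> Optional[int]:
    """1-based iteration where the final best value first appears."""
    feasible = [v for v in measured if v is not None]
    if not feasible:
        return None
    return measured.index(max(feasible)) + 1


def wasted_trials_after_best(measured: Curve) -> int:
    """Trials after the first best that are NA or below the best value."""
    first_best = iters_to_best(measured)
    if first_best is None:
        return 0
    best = measured[first_best - 1]
    return sum(1 for v in measured[first_best:] if v is None or v < best)


# measured-current values for no-harness run9, as reported in this document.
run9 = [0.0350, 0.0617, 0.0392, 0.2025] + [None] * 8

print(accepted_incumbent(run9))
print(iters_to_best(run9))
print(wasted_trials_after_best(run9))
```

On the run9 series this reproduces the accepted-incumbent row, gives iters-to-best of 4, and counts the eight NA trials after iter 4 as wasted. The iter 3 dip is not counted because it precedes the first best.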
Historical Run9 Re-Read
Source: .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology on dash0.
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
Interpretation: the no-harness current-config curve already has a regression at iter 3 and then many no-feasible-point runtime probes. The monotonic curve only shows the incumbent policy, not proposal quality.
New Paired Test Plan
Run on dash0 with internal vLLM and the real chat_w20260311_1000 0-8k replay:
- Base spec: configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json.
- Model path: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
- Engine: /usr/local/bin/vllm, baseline aligned with ~/run_qwen27b.sh.
- SLO: 95% pass, stepped TTFT 2s/4s/6s, TPOT<=50ms.
- Search: low=0, high=0.0625, max_probes=6, tolerance=0.001.
- no-harness study: .aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-noharness.
- harness study: .aituner-tight/dash0-qwen27b-tight-slo-10min-run10-chat-0-8k-current-harness.
The result table will report both curves. The harness is considered successful only if it reaches the same or better incumbent in fewer iterations and reduces the measured-current regressions or replaces them with an explicit harness stop.
Run Status
- 2026-05-06 07:05 CST: dash0 checked, 8 H20 GPUs idle.
- 2026-05-06 07:05 CST: generated paired specs under .aituner-tight/specs/.
- 2026-05-06 07:05 CST: started no-harness full 12-iter run in tmux session qwen27b_run10_noharness_20260506.
- 2026-05-06 07:18 CST: stopped the duplicate fresh no-harness run before completion. Reason: run9 is already a completed real 12-iter no-harness run for the same internal vLLM 0-8k setup, while the fresh full-chat run would spend a multi-hour dash0 slot duplicating that curve.
- 2026-05-06 07:20 CST: seeded the harness study with the real run9 baseline measurement as trial-0001, then started the harness run with --skip-baseline in tmux session qwen27b_run10_harness_skipbase_20260506.
- 2026-05-06 07:20 CST: harness generated deterministic trial-0002: {"tensor-parallel-size": 2}.
- 2026-05-06 08:11 CST: harness trial-0002 completed: TP=2, 0.2142 request_rate_per_gpu.
- 2026-05-06 08:19 CST: harness trial-0003 failed at engine launch. Root cause: the old runtime refinement coupled gpu-memory-utilization=0.95 with larger max-num-batched-tokens, causing speculative sampler warmup OOM. This is a generic harness safety bug; fixed locally by removing the automatic memory-utilization bump from runtime refinement.
- 2026-05-06 09:24 CST: harness trial-0004 completed: TP=4, 0.4429 request_rate_per_gpu. All six probes were feasible up to sampling_u=0.0615234375, so this study is near the configured search.high=0.0625 ceiling.
- 2026-05-06 09:25 CST: old harness repeated the same unsafe runtime refinement for TP4 and trial-0005 failed at engine launch for the same OOM reason. The old process was stopped before continuing.
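The trial-0004 note about sampling_u=0.0615234375 is consistent with plain bisection over the configured search range. The sketch below assumes the per-trial search is an ordinary bisection on sampling_u that treats the lower bound as trivially feasible; the source does not state the actual procedure, and `bisect_feasible` and its feasibility callback are hypothetical names.

```python
from typing import Callable, List, Tuple


def bisect_feasible(
    is_feasible: Callable[[float], bool],
    low: float,
    high: float,
    max_probes: int,
    tolerance: float,
) -> Tuple[float, List[float]]:
    """Return (highest feasible point found, list of probed points)."""
    best = low
    probes: List[float] = []
    # Stop when the probe budget is spent or the bracket is narrow enough.
    while len(probes) < max_probes and high - low > tolerance:
        mid = (low + high) / 2
        probes.append(mid)
        if is_feasible(mid):
            best = low = mid  # feasible: move the lower bound up
        else:
            high = mid  # infeasible: move the upper bound down
    return best, probes


# Every probe feasible, with the Search settings from the test plan.
best, probes = bisect_feasible(lambda u: True, 0.0, 0.0625, 6, 0.001)
print(probes)
print(best)
```

Under that assumption, six consecutive feasible probes end at 0.0625 × 63/64 = 0.0615234375, and the remaining bracket (about 0.00098) is already below the tolerance of 0.001. The search therefore cannot report anything higher without raising search.high.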
Current Results
Unit: feasible request_rate_per_gpu. NA means the current trial did not
produce a feasible deployable config.
| Variant | Curve | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness run9 | measured-current | 0.0350 | 0.0617 | 0.0392 | 0.2025 | NA | NA | NA | NA | NA | NA | NA | NA |
| no-harness run9 | accepted-incumbent | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| harness run10 | measured-current | 0.0350 | 0.2142 | NA | 0.4429 | NA | pending | pending | pending | pending | pending | pending | pending |
| harness run10 | accepted-incumbent | 0.0350 | 0.2142 | 0.2142 | 0.4429 | 0.4429 | pending | pending | pending | pending | pending | pending | pending |
The harness result is stronger than the earlier strict replay. It did not merely reach the same TP2 region earlier; it then used the bottleneck/topology evidence to validate TP4 and found a much higher current config.
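One way to put a number on "reached the same region earlier" is to find the first iteration at which each accepted-incumbent curve reaches a fixed target. The sketch below uses the no-harness final incumbent as that target, which is an illustrative choice rather than a threshold defined by the plan; `first_iter_reaching` is a hypothetical helper, and the harness row is truncated at iter 5 because later iterations are still pending.

```python
from typing import List, Optional


def first_iter_reaching(
    incumbent: List[Optional[float]], target: float
) -> Optional[int]:
    """1-based iteration where the incumbent first meets or exceeds target."""
    for i, value in enumerate(incumbent, start=1):
        if value is not None and value >= target:
            return i
    return None


# accepted-incumbent rows from the table above.
run9_incumbent = [0.0350, 0.0617, 0.0617] + [0.2025] * 9
run10_incumbent = [0.0350, 0.2142, 0.2142, 0.4429, 0.4429]  # iters 6-12 pending

target = run9_incumbent[-1]  # no-harness final incumbent, 0.2025
print(first_iter_reaching(run9_incumbent, target))
print(first_iter_reaching(run10_incumbent, target))
```

With the values reported so far, no-harness run9 first reaches 0.2025 at iter 4, while harness run10 passes it at iter 2.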
Interpretation
- Why both variants can look close when only best-so-far is shown: no-harness can eventually find a good simple topology, and best-so-far hides every bad proposal after that point.
- What the current-config curve shows: no-harness regresses at iter 3 (0.0392, below the 0.0617 incumbent) and then spends iters 5 through 12 on no-feasible-point runtime probes. Harness reaches a TP2 config at iter 2 (0.2142, already above the no-harness final 0.2025) and a TP4 config at iter 4 (0.4429).
- Why harness helped: the baseline diagnostics identify TTFT/prefill as the active bottleneck on low-prefix-reuse long prompts. The harness maps that to adjacent TP validation before DP/runtime exploration. The no-harness LLM chose DP2 then DP4 first, which diluted per-GPU throughput and delayed TP.
- Remaining defect found during the run: runtime refinement was too aggressive because it combined a larger max-num-batched-tokens (MBT) with a higher gpu-memory-utilization. This has been fixed so that future runtime validation changes batching headroom without also raising memory pressure.