Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
This note organizes the aggregate report generated at:
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
The experiment is a 2x2 ablation: model strength crossed with use_harness.
It asks whether the harness supplies reusable search structure beyond a stronger
LLM's free-form tuning proposals.
Experiment Design
Case: qwen27b-tight-slo-2x2-aggregate.
Substrate:
- Model served: qwen3.5-27b-256k-0223-internal.
- Hardware: H20, up to 8 GPUs.
- Trace: chat_w20260311_1000, input length filtered to 0-8192 tokens, replay_time_scale=1.0, max_concurrency=32.
- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens, 4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
- Search: sampling_u in [0, 0.0625], tolerance 0.001, max 6 probes.
- Tunable envs: VLLM_ENABLE_TORCH_COMPILE.
- Tunable flags: tensor-parallel-size, data-parallel-size, expert-parallel-size, gpu-memory-utilization, block-size, max-num-batched-tokens, max-num-seqs, enable-prefix-caching, enable-chunked-prefill.
- Topology constraints: TP and DP in {1,2,4,8}, allowed TP*DP products in {1,2,4,8}, EP fixed to 1 for this case.
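The SLO step rule and the topology constraints above can be written down as a small sketch. The function names are hypothetical and not taken from AITuner, and the sketch assumes the TTFT limits are inclusive (`<=`), which the case definition does not state explicitly.

```python
from itertools import product

ALLOWED_SIZES = (1, 2, 4, 8)                 # legal TP and DP values for this case
ALLOWED_PRODUCTS = frozenset({1, 2, 4, 8})   # legal TP*DP products (up to 8 GPUs)
TPOT_LIMIT_MS = 50.0


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT step rule: 2s up to 4096 tokens, 4s up to 32768 tokens, 6s above."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def request_meets_slo(input_tokens: int, ttft_s: float, tpot_ms: float) -> bool:
    """A request passes when both TTFT and TPOT are within their limits (assumed inclusive)."""
    return ttft_s <= ttft_limit_s(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def legal_topologies() -> list[tuple[int, int]]:
    """Enumerate (TP, DP) pairs whose product is an allowed GPU count."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_PRODUCTS
    ]
```

Under these constraints there are ten legal (TP, DP) pairs, so the topology space is small enough to reason about explicitly before touching any runtime knob.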
Arms:
| Arm | Tuner model | Harness | Trial budget used |
|---|---|---|---|
| gpt55_harness | gpt-5.5 | on | 2 |
| gpt55_naive | gpt-5.5 | off | 10 |
| gpt54mini_harness | gpt-5.4-mini | on | 2 |
| gpt54mini_naive | gpt-5.4-mini | off | 10 |
The only intended axis inside each model pair is use_harness. The aggregate
then compares whether the weaker model plus harness can match or exceed the
stronger model without harness.
Aggregate Result
Reference best: 0.4429 req/s/GPU.
Target threshold for convergence comparisons: 95% of reference, or
0.4208 req/s/GPU.
| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
|---|---|---|---|---|---|---|---|---|
| gpt55_harness | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| gpt55_naive | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| gpt54mini_harness | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| gpt54mini_naive | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
Harness wins both harness-vs-naive checks:
| Harness arm | Final vs best naive | AUC vs best naive | Pass |
|---|---|---|---|
| gpt55_harness | 16.2290x | 16.1296x | true |
| gpt54mini_harness | 16.2290x | 16.0720x | true |
The strongest ablation observation is that gpt-5.4-mini + harness matches
gpt-5.5 + harness at the same final throughput and the same trials-to-target,
while both naive arms remain more than 16x below the harness arms by final
per-GPU throughput and AUC.
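The threshold and trials-to-target figures can be re-derived from the reported numbers. The sketch below is an illustration, not the aggregator's code: the helper names are hypothetical, and the per-trial values are the harness trajectories reported in this note, which are monotone so that per-trial and best-so-far values coincide.

```python
from typing import Optional, Sequence

REFERENCE_BEST = 0.4429   # req/s/GPU, reference best from the aggregate report
TARGET_FRACTION = 0.95


def target_threshold(reference: float = REFERENCE_BEST,
                     fraction: float = TARGET_FRACTION) -> float:
    """Convergence target as a fraction of the reference best."""
    return reference * fraction


def trials_to_target(per_trial: Sequence[float], target: float) -> Optional[int]:
    """1-based index of the first trial at or above target, or None if never reached."""
    for index, value in enumerate(per_trial, start=1):
        if value >= target:
            return index
    return None


target = target_threshold()
gpt55_harness = [0.2142, 0.4429]       # req/s/GPU per trial
gpt54mini_harness = [0.1992, 0.4429]

print(round(target, 4))
print(trials_to_target(gpt55_harness, target),
      trials_to_target(gpt54mini_harness, target))
# Final/ref for the naive arms, from their reported final values only.
print(round(0.0273 / REFERENCE_BEST, 4), round(0.0231 / REFERENCE_BEST, 4))
```

The "16.2290x" ratio in the table comes from unrounded values in the aggregate report, so dividing the rounded figures shown here gives a slightly different last digit.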
What The Harness Actually Did
The harness did not perform generic "better prompting". It inserted a measured, structured decision protocol between trial results and the next proposal.
Formally, after each trial t, AITuner observes:
o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
request_rate_t, parallel_size_t, launch status_t)
and optimizes:
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
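As a minimal sketch, the constrained objective can be expressed as a function that returns no score for infeasible trials. The function name and the `None` convention are assumptions for illustration, as is treating parallel_size as TP * DP.

```python
from typing import Optional


def objective(request_rate: float, parallel_size: int, pass_rate: float,
              min_pass_rate: float = 0.95) -> Optional[float]:
    """Return J = request_rate / parallel_size, or None when the pass-rate constraint fails."""
    if parallel_size <= 0:
        raise ValueError("parallel_size must be positive")
    if pass_rate < min_pass_rate:
        return None
    return request_rate / parallel_size


# TP=4, DP=1 harness trial: 0.4429 req/s/GPU at pass rate 0.9718.
# The aggregate rate 4 * 0.4429 is back-computed, assuming parallel_size = TP * DP.
j_tp4 = objective(request_rate=4 * 0.4429, parallel_size=4, pass_rate=0.9718)

# Made-up infeasible trial, included only to show the constraint branch.
j_infeasible = objective(request_rate=10.0, parallel_size=8, pass_rate=0.90)
```

Dividing by parallel_size is what penalizes configurations that use all 8 GPUs without raising the feasible rate proportionally.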
The harness maps the observation into:
b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
score(a) = expected_bottleneck_relief(a)
+ information_gain(a)
+ launch_safety(a)
- regression_risk(a)
- measurement_cost(a)
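The additive score can be sketched as follows. The five term names come from the formula above; the `Candidate` class, the candidate names, and every numeric value are hypothetical and chosen only to show how the ranking works, not measured or taken from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(a: Candidate) -> float:
    """Benefit terms added, risk and cost terms subtracted, as in the formula above."""
    return (a.expected_bottleneck_relief
            + a.information_gain
            + a.launch_safety
            - a.regression_risk
            - a.measurement_cost)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest-scoring candidate first."""
    return sorted(candidates, key=score, reverse=True)


# Illustrative values only (assumed, not from the report).
candidates = [
    Candidate("dp_only_scaling", 0.2, 0.3, 0.8, 0.2, 0.3),
    Candidate("tp_frontier_probe", 0.8, 0.6, 0.7, 0.2, 0.3),
    Candidate("scheduler_knobs", 0.1, 0.2, 0.9, 0.1, 0.3),
]
best = rank(candidates)[0]
print(best.name, round(score(best), 2))
```

The point of the structure is that a candidate must justify itself against the ranked bottleneck, rather than being proposed free-form.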
For this workload, the ranked bottleneck was ttft_prefill: long, heavy-tailed
prompts and a tight TTFT SLO made single-request prefill service time the
active limiter. Under that bottleneck, the high-value candidate family is a
legal TP frontier probe, because increasing TP can reduce prefill compute
latency for one request. DP-only scaling adds replicas but does not shorten the
single-request prefill path, so it can improve aggregate admission while still
failing the per-request TTFT bottleneck and the per-GPU objective.
The actual harness trajectory was:
| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
|---|---|---|---|---|---|
| gpt55_harness | 1 | TP=2, DP=1 | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| gpt55_harness | 2 | TP=4, DP=1 | 0.4429 | 0.9718 | Ranked bottleneck is ttft_prefill; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| gpt54mini_harness | 1 | TP=2, DP=1 | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| gpt54mini_harness | 2 | TP=4, DP=1 | 0.4429 | 0.9727 | Same ttft_prefill topology test as the stronger model. |
The stop was also harness-mediated. Both harness arms stopped after trial 2
because the validator authorized harness_stop with:
search_high_saturated_by_incumbent
The recorded stop diagnosis was:
The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.
So the loop did not stop because an LLM guessed that tuning was done. It stopped because the incumbent saturated the configured search interval under the SLO within binary-search tolerance.
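One way to picture the stop condition is a plain bisection over sampling_u followed by a saturation check. This is a sketch under assumptions: the real probe schedule is not described in this note, the function names are hypothetical, and "binary-search resolution" is taken here to mean the larger of the tolerance and the interval width after max_probes halvings.

```python
from typing import Callable, Optional


def bisect_max_feasible(is_feasible: Callable[[float], bool],
                        low: float = 0.0, high: float = 0.0625,
                        tolerance: float = 0.001, max_probes: int = 6):
    """Bisect for the highest feasible sampling_u; return (best, probes)."""
    best: Optional[float] = None
    probes: list[tuple[float, bool]] = []
    while len(probes) < max_probes and high - low > tolerance:
        mid = (low + high) / 2
        ok = is_feasible(mid)
        probes.append((mid, ok))
        if ok:
            best, low = mid, mid   # feasible: move the lower bound up
        else:
            high = mid             # infeasible: move the upper bound down
    return best, probes


def search_high_saturated(best: Optional[float], search_low: float = 0.0,
                          search_high: float = 0.0625, tolerance: float = 0.001,
                          max_probes: int = 6) -> bool:
    """True when the best feasible probe sits within resolution of search.high."""
    if best is None:
        return False
    resolution = max(tolerance, (search_high - search_low) / 2 ** max_probes)
    return search_high - best <= resolution


# A configuration that passes the SLO at every probed rate saturates the interval.
best_all, probes_all = bisect_max_feasible(lambda u: True)
# A configuration that only passes up to u = 0.03 does not.
best_cap, probes_cap = bisect_max_feasible(lambda u: u <= 0.03)
```

With the configured interval, six halvings of 0.0625 give a width just under the 0.001 tolerance, so a configuration that is feasible at every probe cannot be distinguished from one that is feasible at search.high. Further tuning cannot show a higher feasible rate without widening the interval.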
Which Knobs Were Tuned
The winning harness configuration only changed topology:
base config + tensor-parallel-size=4, data-parallel-size=1
The harness did not tune local scheduler/cache/memory knobs in the winning path. It deliberately tested topology before local runtime knobs because the active bottleneck was single-request TTFT/prefill service time.
The naive arms tuned a different knob family:
| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
|---|---|---|---|
| gpt55_naive | TP=1, DP=8 | max-num-batched-tokens, max-num-seqs, block-size, gpu-memory-utilization, prefix caching, chunked prefill | 0.0273 |
| gpt54mini_naive | TP=1, DP=8 | max-num-batched-tokens, max-num-seqs, block-size, gpu-memory-utilization | 0.0231 |
The first gpt55_naive proposal explicitly chose TP=1, DP=8, reasoning that
horizontal data parallelism should maximize request rate because the model fits
per GPU and TP would add communication overhead. Subsequent naive proposals kept
that DP-heavy topology and searched scheduler/cache/memory details around it.
Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
frontier that solved the bottleneck.
Why This Beats Baseline
The baseline failed because it optimized the wrong causal path.
For a TTFT/prefill-bound workload, the relevant service-time term is the latency
of one request's prefill path. A DP-heavy topology can run more independent
replicas, but each replica still handles a long prompt with TP1 compute latency.
Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
feasible sampling_u, and the objective divides by GPU usage. This is why
TP=1, DP=8 stayed at 0.023-0.027 req/s/GPU despite using all GPUs.
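A toy model makes the causal argument concrete. Everything numeric below is an assumption for illustration (the per-GPU prefill rate and the TP scaling efficiency are not measurements from this experiment), and the model ignores queueing, decode, and communication details.

```python
def prefill_latency_s(prompt_tokens: int, tp: int, dp: int,
                      tokens_per_s_per_gpu: float = 2000.0,
                      tp_efficiency: float = 0.85) -> float:
    """Toy single-request prefill latency.

    dp is accepted but deliberately unused: replicas serve other requests
    and do not shorten the prefill path of any one request.
    """
    speedup = 1.0 if tp == 1 else tp * tp_efficiency
    return prompt_tokens / (tokens_per_s_per_gpu * speedup)


prompt = 8192  # upper end of the filtered trace; TTFT limit is 4s at this length

dp_heavy = prefill_latency_s(prompt, tp=1, dp=8)   # about 4.1 s in this toy model
tp_heavy = prefill_latency_s(prompt, tp=4, dp=1)   # about 1.2 s in this toy model

print(f"TP=1, DP=8: {dp_heavy:.2f}s  TP=4, DP=1: {tp_heavy:.2f}s")
```

In this sketch, adding replicas leaves the long-prompt latency unchanged while TP moves it well inside the TTFT limit. That matches the direction of the measured result, without claiming to predict its magnitude.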
The harness changed the optimization direction:
observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
That sequence is measurable and falsifiable. If TP4 had improved raw latency but
materially regressed request_rate_per_gpu, the harness proposal said it should
reject the hypothesis. If the bottleneck had been admission/queueing with healthy
TTFT/TPOT service times, the same knob-effect model would have favored DP or
max-num-seqs instead. The decision was not "Qwen27B needs TP4"; it was
"ttft_prefill evidence makes TP frontier the next highest-information probe
under current constraints."
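The conditional nature of that decision can be sketched as a lookup from bottleneck class to knob family, plus a helper for the adjacent legal TP step. The `ttft_prefill` label and the knob names are from this note; the `admission_queueing` key, the function name, and the data layout are hypothetical.

```python
from typing import Optional

# Bottleneck class -> knob family to probe first (only the two cases named in this note).
KNOB_FAMILY_BY_BOTTLENECK = {
    "ttft_prefill": ("tensor-parallel-size",),
    "admission_queueing": ("data-parallel-size", "max-num-seqs"),
}


def adjacent_tp_probe(tp: int, dp: int,
                      sizes: tuple[int, ...] = (1, 2, 4, 8),
                      allowed_products: frozenset = frozenset({1, 2, 4, 8}),
                      ) -> Optional[int]:
    """Smallest legal TP above the current one with DP held fixed, or None."""
    larger = [s for s in sizes if s > tp and s * dp in allowed_products]
    return min(larger) if larger else None


print(KNOB_FAMILY_BY_BOTTLENECK["ttft_prefill"])
print(adjacent_tp_probe(tp=2, dp=1))   # next step on the TP frontier
print(adjacent_tp_probe(tp=1, dp=8))   # no legal TP increase while DP stays at 8
```

The last call shows one reason the naive arms were stuck: from TP=1, DP=8 there is no legal TP increase unless DP is reduced first, so reaching the TP frontier requires revisiting the topology choice itself.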
This is also why the weak-model arm matters. The weaker gpt-5.4-mini with the
harness converged to exactly the same TP frontier and final throughput as
gpt-5.5 + harness, while the stronger gpt-5.5 without harness stayed in the
wrong DP-heavy family for its whole budget. The ablation therefore attributes the
gain to the structured harness state and validators, not merely to a stronger
language model or a more verbose prompt.
Evidence Boundary
This report strongly supports the harness mechanism on the Qwen27B tight-SLO case and the model-strength ablation. It should not be overclaimed as universal proof by itself. The correct generalization claim is narrower:
- In this case, the harness improved final quality, convergence speed, AUC, and stop discipline.
- The harness made a weaker model match the stronger harnessed model and beat the stronger naive model by more than 16x.
- The successful decision was expressed in generic terms: SLO-derived bottleneck classification, topology constraints, knob-effect scoring, per-GPU objective, and validator-authorized stop.
- Additional cases are still needed to show the same mechanism across different bottlenecks, for example prefill scheduler pressure, decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.
Original Aggregate Report
# qwen27b-tight-2x2-aggregate-20260623T005838Z
## Aggregate
- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`
## By Kind
| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |
## Cases
### qwen27b-tight-slo-2x2-aggregate
- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`
| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |