Gahow Wang 76ec19224c Document Qwen27B 2x2 harness ablation

2026-06-23 10:08:46 +08:00

Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23

This note organizes the aggregate report generated at:

.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md

The experiment is a 2x2 ablation: model strength crossed with use_harness. It asks whether the harness supplies reusable search structure beyond a stronger LLM's free-form tuning proposals.

Experiment Design

Case: qwen27b-tight-slo-2x2-aggregate.

Substrate:

- Model served: qwen3.5-27b-256k-0223-internal.
- Hardware: H20, up to 8 GPUs.
- Trace: chat_w20260311_1000, input length filtered to 0-8192 tokens, replay_time_scale=1.0, max_concurrency=32.
- SLO: pass rate >= 0.95; TTFT step rule of 2 s for <=4096 input tokens, 4 s for <=32768 input tokens, and 6 s above that; TPOT <= 50 ms.
- Search: sampling_u in [0, 0.0625], tolerance 0.001, max 6 probes.
- Tunable envs: VLLM_ENABLE_TORCH_COMPILE.
- Tunable flags: tensor-parallel-size, data-parallel-size, expert-parallel-size, gpu-memory-utilization, block-size, max-num-batched-tokens, max-num-seqs, enable-prefix-caching, enable-chunked-prefill.
- Topology constraints: TP and DP in {1,2,4,8}, allowed TP*DP products in {1,2,4,8}, EP fixed to 1 for this case.
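
The TTFT step rule and the topology constraints above can be written down directly. The following is a minimal Python sketch, not AITuner's actual code: the function names `ttft_limit_s` and `legal_topologies` are hypothetical, and only the thresholds and allowed sizes come from the case definition.

```python
from itertools import product

ALLOWED_SIZES = (1, 2, 4, 8)     # TP and DP choices from the case definition
ALLOWED_PRODUCTS = {1, 2, 4, 8}  # legal TP*DP products (at most 8 GPUs)


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT budget in seconds for one request, following the step rule."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def legal_topologies() -> list[tuple[int, int]]:
    """All (TP, DP) pairs whose product is allowed; EP is fixed to 1."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_PRODUCTS
    ]


if __name__ == "__main__":
    print(legal_topologies())
    print(ttft_limit_s(8192))
```

Under these constraints there are ten legal (TP, DP) pairs. Because the trace is filtered to 0-8192 input tokens, only the 2 s and 4 s TTFT tiers can apply in this case.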

Arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

The only intended axis inside each model pair is use_harness. The aggregate then compares whether the weaker model plus harness can match or exceed the stronger model without harness.

Aggregate Result

Reference best: 0.4429 req/s/GPU. Target threshold for convergence comparisons: 95% of reference, or 0.4208 req/s/GPU.

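| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |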

Harness wins both harness-vs-naive checks:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |

The strongest ablation observation is that gpt-5.4-mini + harness matches gpt-5.5 + harness in both final throughput and trials-to-target, while both naive arms finish more than 16x below the harness arms in final per-GPU throughput and in normalized AUC.
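
The target threshold and the harness-vs-naive ratios can be rechecked from the rounded table values. This is a small arithmetic sketch with illustrative variable names. Because it divides four-decimal rounded values, its ratios differ slightly from the reported 16.2290x and 16.1296x, which were presumably computed from unrounded measurements.

```python
REFERENCE_BEST = 0.4429   # req/s/GPU, from the aggregate report
TARGET_FRACTION = 0.95

# Rounded values copied from the aggregate table.
final_per_gpu = {
    "gpt55_harness": 0.4429,
    "gpt55_naive": 0.0273,
    "gpt54mini_harness": 0.4429,
    "gpt54mini_naive": 0.0231,
}
normalized_auc = {
    "gpt55_harness": 0.9484,
    "gpt55_naive": 0.0588,
    "gpt54mini_harness": 0.9450,
    "gpt54mini_naive": 0.0498,
}

target = TARGET_FRACTION * REFERENCE_BEST

# "Best naive" is the strongest naive arm on each metric.
best_naive_final = max(v for k, v in final_per_gpu.items() if k.endswith("_naive"))
best_naive_auc = max(v for k, v in normalized_auc.items() if k.endswith("_naive"))


def ratios(arm: str) -> tuple[float, float]:
    """Return (final ratio, AUC ratio) of an arm against the best naive arm."""
    return final_per_gpu[arm] / best_naive_final, normalized_auc[arm] / best_naive_auc


if __name__ == "__main__":
    print(f"target threshold: {target:.4f} req/s/GPU")
    for arm in ("gpt55_harness", "gpt54mini_harness"):
        final_ratio, auc_ratio = ratios(arm)
        reached = final_per_gpu[arm] >= target
        print(f"{arm}: final {final_ratio:.2f}x, AUC {auc_ratio:.2f}x, reached target: {reached}")
```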

What The Harness Actually Did

The harness did not perform generic "better prompting". It inserted a measured, structured decision protocol between trial results and the next proposal.

Formally, after each trial t, AITuner observes:

o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
       request_rate_t, parallel_size_t, launch status_t)

and optimizes:

J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
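
As a reading aid, the constrained objective can be sketched in Python. The function name `per_gpu_objective` and the choice to return `None` for infeasible trials are assumptions made for illustration, and the numbers in the usage lines are made up rather than measured.

```python
from typing import Optional

PASS_RATE_FLOOR = 0.95  # SLO pass-rate constraint from the case definition


def per_gpu_objective(
    request_rate: float, parallel_size: int, pass_rate: float
) -> Optional[float]:
    """Return J = request_rate / parallel_size, or None if the SLO is violated."""
    if parallel_size <= 0:
        raise ValueError("parallel_size must be positive")
    if pass_rate < PASS_RATE_FLOOR:
        return None  # infeasible: the trial does not count toward the objective
    return request_rate / parallel_size


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the report.
    print(per_gpu_objective(request_rate=2.0, parallel_size=4, pass_rate=0.97))
    print(per_gpu_objective(request_rate=2.0, parallel_size=4, pass_rate=0.90))
```

Dividing by `parallel_size` is what makes the objective penalize configurations that occupy more GPUs without raising the feasible request rate proportionally.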

The harness maps the observation into:

b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
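
The additive score can be illustrated with a short Python sketch. The `Candidate` dataclass, the `rank` helper, and every numeric value below are hypothetical; the report does not state how AITuner estimates or scales the individual terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate knob-family probe with its five score terms."""
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(c: Candidate) -> float:
    """Benefits add, risks and costs subtract, mirroring the formula above."""
    return (
        c.expected_bottleneck_relief
        + c.information_gain
        + c.launch_safety
        - c.regression_risk
        - c.measurement_cost
    )


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    # Made-up term values chosen only to show the mechanics.
    candidates = [
        Candidate("dp_only_scaling", 0.1, 0.2, 0.9, 0.1, 0.3),
        Candidate("tp_frontier_probe", 0.8, 0.6, 0.5, 0.2, 0.3),
    ]
    for c in rank(candidates):
        print(c.name, round(score(c), 3))
```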

For this workload, the ranked bottleneck was ttft_prefill: long, heavy-tailed prompts and a tight TTFT SLO made single-request prefill service time the active limiter. Under that bottleneck, the high-value candidate family is a legal TP frontier probe, because increasing TP can reduce prefill compute latency for one request. DP-only scaling adds replicas but does not shorten the single-request prefill path, so it can improve aggregate admission while still violating the per-request TTFT SLO and scoring poorly on the per-GPU objective.

The actual harness trajectory was:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |

The stop was also harness-mediated. Both harness arms stopped after trial 2 because the validator authorized harness_stop with:

search_high_saturated_by_incumbent

The recorded stop diagnosis was:

The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.

So the loop did not stop because an LLM guessed that tuning was done. It stopped because the incumbent saturated the configured search interval under the SLO within binary-search tolerance.
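
The stop condition can be sketched as a simple comparison. This is an assumed reading of `search_high_saturated_by_incumbent`, not the validator's real implementation: both function names are hypothetical, the probe value 0.0620 is made up, and the halving model in `bisection_resolution` is an assumption about how the search narrows its interval.

```python
def bisection_resolution(low: float, high: float, max_probes: int) -> float:
    """Interval width left after max_probes halvings of [low, high]."""
    return (high - low) / (2 ** max_probes)


def search_high_saturated(
    highest_feasible_u: float, search_high: float, resolution: float
) -> bool:
    """True when the best feasible probe sits within `resolution` of search.high."""
    return (search_high - highest_feasible_u) <= resolution


if __name__ == "__main__":
    search_low, search_high, tolerance, max_probes = 0.0, 0.0625, 0.001, 6

    # Six halvings of a 0.0625-wide interval fall just under the 0.001 tolerance.
    print(bisection_resolution(search_low, search_high, max_probes))

    # Hypothetical incumbent probe close to the top of the interval: stop.
    print(search_high_saturated(0.0620, search_high, tolerance))
    # Hypothetical incumbent probe in the middle of the interval: keep searching.
    print(search_high_saturated(0.0300, search_high, tolerance))
```

Framing the stop as a check on measured probes, rather than on the tuner model's opinion, is what makes it auditable after the fact.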

Which Knobs Were Tuned

The winning harness configuration only changed topology:

base config + tensor-parallel-size=4, data-parallel-size=1

The harness did not tune local scheduler/cache/memory knobs in the winning path. It deliberately tested topology before local runtime knobs because the active bottleneck was single-request TTFT/prefill service time.

The naive arms tuned a different knob family:

| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

The first gpt55_naive proposal explicitly chose TP=1, DP=8, reasoning that horizontal data parallelism should maximize request rate because the model fits per GPU and TP would add communication overhead. Subsequent naive proposals kept that DP-heavy topology and searched scheduler/cache/memory details around it. Across 20 naive trial slots total, neither model entered the TP2/TP4 topology frontier that solved the bottleneck.

Why This Beats Baseline

The naive baseline failed because it optimized the wrong causal path.

For a TTFT/prefill-bound workload, the relevant service-time term is the latency of one request's prefill path. A DP-heavy topology can run more independent replicas, but each replica still handles a long prompt with TP1 compute latency. Under a tight per-request TTFT SLO, those replicas do not unlock a much higher feasible sampling_u, and the objective divides by GPU usage. This is why TP=1, DP=8 peaked at only 0.0231-0.0273 req/s/GPU despite using all 8 GPUs.

The harness changed the optimization direction:

observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated

That sequence is measurable and falsifiable. If TP4 had improved raw latency but materially regressed request_rate_per_gpu, the harness proposal stated that the hypothesis should be rejected. If the bottleneck had been admission/queueing with healthy TTFT/TPOT service times, the same knob-effect model would have favored DP or max-num-seqs instead. The decision was not "Qwen27B needs TP4"; it was "ttft_prefill evidence makes TP frontier the next highest-information probe under current constraints."
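
The two ideas in this paragraph, a bottleneck-to-knob mapping and a falsification check, can be sketched as follows. Only the `ttft_prefill` label and the knob names come from the report. The `admission_queueing` key, the function names, and the 5% "material regression" threshold are assumptions chosen for illustration.

```python
# Bottleneck label -> knob families the knob-effect model would favor.
KNOB_FAMILIES_BY_BOTTLENECK = {
    "ttft_prefill": ["tensor-parallel-size"],
    "admission_queueing": ["data-parallel-size", "max-num-seqs"],  # assumed label
}


def preferred_families(bottleneck: str) -> list[str]:
    """Look up the knob families to probe first for a ranked bottleneck."""
    try:
        return KNOB_FAMILIES_BY_BOTTLENECK[bottleneck]
    except KeyError:
        raise ValueError(f"no knob-effect entry for bottleneck {bottleneck!r}") from None


def keep_hypothesis(
    prev_per_gpu: float, new_per_gpu: float, material_regression: float = 0.05
) -> bool:
    """Reject the probe's hypothesis if per-GPU rate fell by more than the threshold."""
    return new_per_gpu >= prev_per_gpu * (1.0 - material_regression)


if __name__ == "__main__":
    print(preferred_families("ttft_prefill"))
    # gpt55_harness, TP2 -> TP4: per-GPU rate rose from 0.2142 to 0.4429.
    print(keep_hypothesis(0.2142, 0.4429))
    # Hypothetical counterexample: a probe that regresses per-GPU rate.
    print(keep_hypothesis(0.2142, 0.1500))
```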

This is also why the weak-model arm matters. The weaker gpt-5.4-mini with the harness converged to exactly the same TP frontier and final throughput as gpt-5.5 + harness, while the stronger gpt-5.5 without harness stayed in the wrong DP-heavy family for its whole budget. The ablation therefore attributes the gain to the structured harness state and validators, not merely to a stronger language model or a more verbose prompt.

Evidence Boundary

This report strongly supports the harness mechanism on the Qwen27B tight-SLO case and the model-strength ablation. It covers a single case, so it should not be read as universal proof on its own. The correct generalization claim is narrower:

- In this case, the harness improved final quality, convergence speed, AUC, and stop discipline.
- The harness made a weaker model match the stronger harnessed model and beat the stronger naive model by more than 16x.
- The successful decision was expressed in generic terms: SLO-derived bottleneck classification, topology constraints, knob-effect scoring, per-GPU objective, and validator-authorized stop.
- Additional cases are still needed to show the same mechanism across different bottlenecks, for example prefill scheduler pressure, decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.

Original Aggregate Report

# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
