aituner/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md

Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23

This note organizes the aggregate report generated at:

.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md

The experiment is a 2x2 ablation: model strength crossed with use_harness. It asks whether the harness supplies reusable search structure that a stronger LLM does not provide through free-form tuning proposals alone.

Experiment Design

Case: qwen27b-tight-slo-2x2-aggregate.

Substrate:

  • Model served: qwen3.5-27b-256k-0223-internal.
  • Hardware: H20, up to 8 GPUs.
  • Trace: chat_w20260311_1000, input length filtered to 0-8192 tokens, replay_time_scale=1.0, max_concurrency=32.
  • SLO: pass rate >= 0.95; TTFT limit stepped by input length (2 s up to 4096 input tokens, 4 s up to 32768, 6 s above that); TPOT <= 50 ms.
  • Search: sampling_u in [0, 0.0625], tolerance 0.001, max 6 probes.
  • Tunable envs: VLLM_ENABLE_TORCH_COMPILE.
  • Tunable flags: tensor-parallel-size, data-parallel-size, expert-parallel-size, gpu-memory-utilization, block-size, max-num-batched-tokens, max-num-seqs, enable-prefix-caching, enable-chunked-prefill.
  • Topology constraints: TP and DP in {1,2,4,8}, allowed TP*DP products in {1,2,4,8}, EP fixed to 1 for this case.
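
The SLO step rule and the topology constraints listed above can be written as small predicates. This is a minimal sketch, not AITuner's implementation: the names `ttft_limit_s`, `request_meets_slo`, and `legal_topologies` are hypothetical, and it assumes the tier boundaries are inclusive and that each request is checked against its TTFT and TPOT limits independently.

```python
from itertools import product

# SLO tiers from the case description: (max input tokens, TTFT limit in seconds).
TTFT_TIERS = [(4096, 2.0), (32768, 4.0)]
TTFT_LIMIT_ABOVE_TIERS_S = 6.0
TPOT_LIMIT_MS = 50.0

# TP and DP values allowed by the topology constraints; EP is fixed to 1.
ALLOWED_SIZES = (1, 2, 4, 8)


def ttft_limit_s(input_tokens: int) -> float:
    """Return the TTFT limit for a request with the given input length."""
    for max_tokens, limit in TTFT_TIERS:
        if input_tokens <= max_tokens:
            return limit
    return TTFT_LIMIT_ABOVE_TIERS_S


def request_meets_slo(input_tokens: int, ttft_s: float, tpot_ms: float) -> bool:
    """Assumed per-request check: both TTFT and TPOT must be within limits."""
    return ttft_s <= ttft_limit_s(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def legal_topologies() -> list[tuple[int, int]]:
    """Enumerate (TP, DP) pairs whose product is itself an allowed GPU count."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_SIZES
    ]
```

Because the trace is filtered to 0-8192 input tokens, only the 2 s and 4 s tiers can apply in this case. Under these constraints the legal set has ten (TP, DP) pairs, including both TP=4, DP=1 and TP=1, DP=8.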

Arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| gpt55_harness | gpt-5.5 | on | 2 |
| gpt55_naive | gpt-5.5 | off | 10 |
| gpt54mini_harness | gpt-5.4-mini | on | 2 |
| gpt54mini_naive | gpt-5.4-mini | off | 10 |

Within each model pair, use_harness is the only variable intended to differ. The aggregate then compares whether the weaker model plus harness can match or exceed the stronger model without harness.

Aggregate Result

Reference best: 0.4429 req/s/GPU. Target threshold for convergence comparisons: 95% of reference, or 0.4208 req/s/GPU.

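| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| gpt55_harness | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| gpt55_naive | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| gpt54mini_harness | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| gpt54mini_naive | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |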

Harness wins both harness-vs-naive checks:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| gpt55_harness | 16.2290x | 16.1296x | true |
| gpt54mini_harness | 16.2290x | 16.0720x | true |

The strongest ablation observation is that gpt-5.4-mini + harness matches gpt-5.5 + harness at the same final throughput and the same trials-to-target, while both naive arms remain more than 16x below the harness arms by final per-GPU throughput and AUC.
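
The target threshold and the ratios in this section can be rechecked from the rounded table values. The snippet below is only an arithmetic sketch; the variable names are illustrative and are not taken from the AITuner report generator.

```python
REFERENCE_BEST = 0.4429  # req/s/GPU, reference best from the aggregate report
TARGET_FRACTION = 0.95

# Final req/s/GPU per arm, copied from the aggregate table (rounded to 4 decimals).
FINAL_PER_GPU = {
    "gpt55_harness": 0.4429,
    "gpt55_naive": 0.0273,
    "gpt54mini_harness": 0.4429,
    "gpt54mini_naive": 0.0231,
}

# Convergence target: 95% of the reference best.
target = TARGET_FRACTION * REFERENCE_BEST

final_over_ref = {arm: v / REFERENCE_BEST for arm, v in FINAL_PER_GPU.items()}
reached_target = {arm: v >= target for arm, v in FINAL_PER_GPU.items()}

# Harness-vs-naive check: compare a harness arm with the best naive arm.
best_naive = max(v for arm, v in FINAL_PER_GPU.items() if arm.endswith("_naive"))
harness_vs_best_naive = FINAL_PER_GPU["gpt55_harness"] / best_naive

for arm, ratio in final_over_ref.items():
    print(f"{arm}: final/ref={ratio:.4f} reached_target={reached_target[arm]}")
print(f"target={target:.4f} harness_vs_best_naive={harness_vs_best_naive:.2f}x")
```

Recomputing from the rounded inputs gives a harness-vs-naive ratio of about 16.22x rather than the reported 16.2290x; the small gap is presumably because the report divides unrounded measurements.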

What The Harness Actually Did

The harness did not perform generic "better prompting". It inserted a measured, structured decision protocol between trial results and the next proposal.

Formally, after each trial t, AITuner observes:

o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
       request_rate_t, parallel_size_t, launch status_t)

and optimizes:

J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
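
As a minimal sketch, the constrained objective could be evaluated per trial as follows. The `TrialResult` type and the `objective` function are hypothetical names for illustration, and the example numbers are made up rather than taken from the experiment.

```python
from dataclasses import dataclass
from typing import Optional

PASS_RATE_FLOOR = 0.95  # SLO pass-rate constraint from the case description


@dataclass(frozen=True)
class TrialResult:
    request_rate: float  # highest measured request rate, in req/s
    parallel_size: int   # number of GPUs the configuration occupies
    pass_rate: float     # fraction of requests that met the SLO


def objective(trial: TrialResult) -> Optional[float]:
    """Return J = request_rate / parallel_size, or None if the trial is infeasible."""
    if trial.pass_rate < PASS_RATE_FLOOR:
        return None
    return trial.request_rate / trial.parallel_size


# Illustrative values only: a feasible 4-GPU trial and an infeasible 8-GPU trial.
feasible = TrialResult(request_rate=2.0, parallel_size=4, pass_rate=0.97)
infeasible = TrialResult(request_rate=3.0, parallel_size=8, pass_rate=0.90)
```

Returning `None` for infeasible trials keeps them out of any maximum over objective values, which is one simple way to express the "subject to" clause.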

The harness maps the observation into:

b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
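
The additive score above can be sketched as a ranking over candidate actions. Everything here is an assumption for illustration: the `CandidateEstimate` type, the idea that each term is a plain float on a shared scale, and the example numbers. The source does not say how AITuner estimates or weights the individual terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEstimate:
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(c: CandidateEstimate) -> float:
    """Benefit terms minus cost terms, mirroring the formula term by term."""
    return (
        c.expected_bottleneck_relief
        + c.information_gain
        + c.launch_safety
        - c.regression_risk
        - c.measurement_cost
    )


def rank(candidates: list[CandidateEstimate]) -> list[CandidateEstimate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Made-up estimates for a ttft_prefill bottleneck; not measured values.
candidates = [
    CandidateEstimate("raise tensor-parallel-size", 0.8, 0.6, 0.5, 0.25, 0.25),
    CandidateEstimate("raise data-parallel-size", 0.1, 0.2, 0.5, 0.25, 0.25),
]
ranked = rank(candidates)
```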

For this workload, the ranked bottleneck was ttft_prefill: long, heavy-tailed prompts and a tight TTFT SLO made single-request prefill service time the active limiter. Under that bottleneck, the high-value candidate family is a legal TP frontier probe, because increasing TP can reduce prefill compute latency for one request. DP-only scaling adds replicas but does not shorten the single-request prefill path, so it can improve aggregate admission while still failing the per-request TTFT bottleneck and the per-GPU objective.

The actual harness trajectory was:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
| --- | ---: | --- | ---: | ---: | --- |
| gpt55_harness | 1 | TP=2, DP=1 | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| gpt55_harness | 2 | TP=4, DP=1 | 0.4429 | 0.9718 | Ranked bottleneck is ttft_prefill; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| gpt54mini_harness | 1 | TP=2, DP=1 | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| gpt54mini_harness | 2 | TP=4, DP=1 | 0.4429 | 0.9727 | Same ttft_prefill topology test as the stronger model. |

The stop was also harness-mediated. Both harness arms stopped after trial 2 because the validator authorized harness_stop with:

search_high_saturated_by_incumbent

The recorded stop diagnosis was:

The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.

So the loop did not stop because an LLM guessed that tuning was done. It stopped because the incumbent saturated the configured search interval under the SLO within binary-search tolerance.
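
A hedged sketch of how such a stop condition can arise, using the search settings from the case (sampling_u in [0, 0.0625], tolerance 0.001, max 6 probes). It assumes a plain midpoint bisection; the real probe order inside AITuner may differ, and `bisect_feasible` and `search_high_saturated` are hypothetical names.

```python
from typing import Callable, Optional


def bisect_feasible(
    is_feasible: Callable[[float], bool],
    low: float,
    high: float,
    tolerance: float,
    max_probes: int,
) -> Optional[float]:
    """Return the highest feasible probe found by bisection, or None if none passed."""
    best = None
    probes = 0
    while probes < max_probes and high - low > tolerance:
        mid = (low + high) / 2
        probes += 1
        if is_feasible(mid):
            best = mid
            low = mid   # feasible: search the upper half
        else:
            high = mid  # infeasible: search the lower half
    return best


def search_high_saturated(
    highest_feasible: Optional[float], search_high: float, tolerance: float
) -> bool:
    """True when the best feasible probe is within tolerance of search.high."""
    return highest_feasible is not None and search_high - highest_feasible <= tolerance


# A configuration that passes the SLO at every probed load saturates the interval.
best = bisect_feasible(lambda u: True, low=0.0, high=0.0625, tolerance=0.001, max_probes=6)
saturated = search_high_saturated(best, search_high=0.0625, tolerance=0.001)
```

With these settings, six halvings shrink the interval to 0.0625 / 64, which is just under the 0.001 tolerance, so a configuration that is feasible at every probe ends within resolution of search.high. Once that happens, the configured interval cannot distinguish any better configuration, which is why a stop is justified without asking the LLM.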

Which Knobs Were Tuned

The winning harness configuration only changed topology:

base config + tensor-parallel-size=4, data-parallel-size=1

The harness did not tune local scheduler/cache/memory knobs in the winning path. It deliberately tested topology before local runtime knobs because the active bottleneck was single-request TTFT/prefill service time.

The naive arms tuned a different knob family:

| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| gpt55_naive | TP=1, DP=8 | max-num-batched-tokens, max-num-seqs, block-size, gpu-memory-utilization, prefix caching, chunked prefill | 0.0273 |
| gpt54mini_naive | TP=1, DP=8 | max-num-batched-tokens, max-num-seqs, block-size, gpu-memory-utilization | 0.0231 |

The first gpt55_naive proposal explicitly chose TP=1, DP=8, reasoning that horizontal data parallelism should maximize request rate because the model fits per GPU and TP would add communication overhead. Subsequent naive proposals kept that DP-heavy topology and searched scheduler/cache/memory details around it. Across 20 naive trial slots total, neither model entered the TP2/TP4 topology frontier that solved the bottleneck.

Why This Beats Baseline

The baseline failed because it optimized the wrong causal path.

For a TTFT/prefill-bound workload, the relevant service-time term is the latency of one request's prefill path. A DP-heavy topology can run more independent replicas, but each replica still handles a long prompt with TP1 compute latency. Under a tight per-request TTFT SLO, those replicas do not unlock a much higher feasible sampling_u, and the objective divides by GPU usage. This is why TP=1, DP=8 stayed at 0.023-0.027 req/s/GPU despite using all 8 GPUs.
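
To make the per-GPU division concrete, the reported per-GPU rates can be converted back to aggregate feasible rates. This is a back-of-the-envelope sketch that assumes parallel_size equals TP x DP (EP is fixed to 1 in this case); `aggregate_rate` is a hypothetical helper.

```python
def aggregate_rate(per_gpu_rate: float, tp: int, dp: int) -> float:
    """Invert J = request_rate / parallel_size, assuming parallel_size = TP * DP."""
    return per_gpu_rate * tp * dp


# Best naive arm: TP=1, DP=8 at 0.0273 req/s/GPU on 8 GPUs.
naive_total = aggregate_rate(0.0273, tp=1, dp=8)

# Harness arms: TP=4, DP=1 at 0.4429 req/s/GPU on 4 GPUs.
harness_total = aggregate_rate(0.4429, tp=4, dp=1)

print(f"naive   TP1xDP8: {naive_total:.4f} req/s on 8 GPUs")
print(f"harness TP4xDP1: {harness_total:.4f} req/s on 4 GPUs")
print(f"aggregate ratio: {harness_total / naive_total:.2f}x")
```

Under that assumption the DP-heavy configuration sustains about 0.22 req/s in total, while TP=4, DP=1 sustains about 1.77 req/s on half the GPUs. The naive topology therefore loses on aggregate feasible rate as well as on the per-GPU objective.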

The harness changed the optimization direction:

observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated

That sequence is measurable and falsifiable. If TP4 had improved raw latency but materially regressed request_rate_per_gpu, the harness proposal said it should reject the hypothesis. If the bottleneck had been admission/queueing with healthy TTFT/TPOT service times, the same knob-effect model would have favored DP or max-num-seqs instead. The decision was not "Qwen27B needs TP4"; it was "ttft_prefill evidence makes TP frontier the next highest-information probe under current constraints."

This is also why the weak-model arm matters. The weaker gpt-5.4-mini with the harness converged to exactly the same TP frontier and final throughput as gpt-5.5 + harness, while the stronger gpt-5.5 without harness stayed in the wrong DP-heavy family for its whole budget. The ablation therefore attributes the gain to the structured harness state and validators, not merely to a stronger language model or a more verbose prompt.

Evidence Boundary

This report strongly supports the harness mechanism on the Qwen27B tight-SLO case and the model-strength ablation. It covers a single case, however, so it does not by itself establish that the harness helps universally. The correct generalization claim is narrower:

  • In this case, the harness improved final quality, convergence speed, AUC, and stop discipline.
  • The harness made a weaker model match the stronger harnessed model and beat the stronger naive model by more than 16x.
  • The successful decision was expressed in generic terms: SLO-derived bottleneck classification, topology constraints, knob-effect scoring, per-GPU objective, and validator-authorized stop.
  • Additional cases are still needed to show the same mechanism across different bottlenecks, for example prefill scheduler pressure, decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.

Original Aggregate Report

# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |