Document Qwen27B 2x2 harness ablation

2026-06-23 10:08:46 +08:00
parent e67bc86240
commit 76ec19224c


@@ -0,0 +1,251 @@
# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
This note organizes the aggregate report generated at:
```text
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
```
The experiment is a 2x2 ablation: tuner-model strength crossed with `use_harness`.
It asks whether the harness supplies reusable search structure beyond what a
stronger LLM's free-form tuning proposals provide.
## Experiment Design
Case: `qwen27b-tight-slo-2x2-aggregate`.
Substrate:
- Model served: `qwen3.5-27b-256k-0223-internal`.
- Hardware: H20, up to 8 GPUs.
- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
`replay_time_scale=1.0`, `max_concurrency=32`.
- SLO: pass rate >= 0.95; TTFT <= 2 s for inputs of <= 4096 tokens, <= 4 s for
inputs of <= 32768 tokens, and <= 6 s above that; TPOT <= 50 ms.
- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
`expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
`max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
`enable-chunked-prefill`.
- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
`{1,2,4,8}`, EP fixed to 1 for this case.
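
The topology constraints above define a small discrete search space. As a minimal sketch (the constant and function names are illustrative and are not taken from AITuner), the legal `(TP, DP)` pairs can be enumerated like this:

```python
from itertools import product

ALLOWED_SIZES = (1, 2, 4, 8)       # legal values for TP and for DP
ALLOWED_PRODUCTS = {1, 2, 4, 8}    # legal TP*DP products on up to 8 GPUs
FIXED_EP = 1                       # EP is fixed to 1 for this case


def legal_topologies():
    """Return every (tp, dp) pair that satisfies the constraints above."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_PRODUCTS
    ]


topologies = legal_topologies()
```

Under these constraints the space has ten topologies. Both `TP=4, DP=1` and `TP=1, DP=8` are members, so the two arm families discussed later differ in which legal point they chose, not in what was allowed.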
Arms:
| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
The only intended difference within each model pair is `use_harness`. The
aggregate then asks whether the weaker model with the harness can match or
exceed the stronger model without it.
## Aggregate Result
Reference best: `0.4429 req/s/GPU`.
Target threshold for convergence comparisons: 95% of reference, or
`0.4208 req/s/GPU`.
| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
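
The target threshold and the `Final/ref` column follow directly from the reference best. The sketch below recomputes them from the rounded table values; the variable names are illustrative.

```python
REFERENCE_BEST = 0.4429    # req/s/GPU, reference best from the report
TARGET_FRACTION = 0.95     # convergence target as a fraction of reference

# Final req/s/GPU per arm, copied from the table above.
FINAL_PER_GPU = {
    "gpt55_harness": 0.4429,
    "gpt55_naive": 0.0273,
    "gpt54mini_harness": 0.4429,
    "gpt54mini_naive": 0.0231,
}

target = TARGET_FRACTION * REFERENCE_BEST

summary = {
    arm: {
        "final_over_ref": round(final / REFERENCE_BEST, 4),
        "reached_target": final >= target,
    }
    for arm, final in FINAL_PER_GPU.items()
}

for arm, row in summary.items():
    print(arm, row)
```

Because the inputs are already rounded to four decimals, ratios derived from them can differ in the last digits from ratios the report computed on unrounded values.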
Harness wins both harness-vs-naive checks:
| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |
The strongest ablation observation is that `gpt-5.4-mini + harness` reaches the
same final throughput as `gpt-5.5 + harness` in the same number of trials, while
both naive arms finish more than 16x below the harness arms in final per-GPU
throughput and in AUC.
## What The Harness Actually Did
The harness did not perform generic "better prompting". It inserted a measured,
structured decision protocol between trial results and the next proposal.
Formally, after each trial `t`, AITuner observes:
```text
o_t = (config_t, probe_history_t, pass_rate_t, latency_slo_failures_t,
request_rate_t, parallel_size_t, launch_status_t)
```
and maximizes the per-GPU feasible request rate:
```text
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
```
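
The objective and its SLO constraint can be sketched as follows. This is not AITuner's code: the `RequestResult` type and all function names are hypothetical, and it assumes that `parallel_size` is the number of GPUs in use and that a request passes only when both its TTFT and TPOT limits hold.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_PASS_RATE = 0.95   # SLO pass-rate floor from the experiment design
MAX_TPOT_S = 0.050     # TPOT limit of 50 ms, expressed in seconds


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT step rule: 2 s up to 4096 tokens, 4 s up to 32768, 6 s above."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


@dataclass(frozen=True)
class RequestResult:
    input_tokens: int
    ttft_s: float
    tpot_s: float


def request_passes(r: RequestResult) -> bool:
    """A request passes when it meets both the TTFT and the TPOT limit."""
    return r.ttft_s <= ttft_limit_s(r.input_tokens) and r.tpot_s <= MAX_TPOT_S


def pass_rate(results: Sequence[RequestResult]) -> float:
    """Fraction of passing requests; an empty probe counts as 0.0."""
    if not results:
        return 0.0
    return sum(request_passes(r) for r in results) / len(results)


def objective(request_rate: float, parallel_size: int,
              pass_rate_value: float) -> Optional[float]:
    """J = request_rate / parallel_size, or None when the SLO is not met."""
    if pass_rate_value < MIN_PASS_RATE:
        return None
    return request_rate / parallel_size
```

Returning `None` for an infeasible probe keeps infeasible configurations from being ranked against feasible ones by raw throughput.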
The harness maps the observation into:
```text
b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
score(a) = expected_bottleneck_relief(a)
+ information_gain(a)
+ launch_safety(a)
- regression_risk(a)
- measurement_cost(a)
```
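
To make the scoring rule concrete, here is a minimal sketch that ranks candidate probes by the unweighted sum shown above. The `Candidate` type is hypothetical and the numeric component values are invented for illustration only; they are not measurements from this experiment, and the real harness may weight or compute the terms differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(c: Candidate) -> float:
    """Sum the three benefit terms and subtract the two cost terms."""
    return (c.expected_bottleneck_relief
            + c.information_gain
            + c.launch_safety
            - c.regression_risk
            - c.measurement_cost)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Illustrative values only, chosen to mimic a ttft_prefill bottleneck.
candidates = [
    Candidate("tensor-parallel-size: 2 -> 4", 0.8, 0.6, 0.7, 0.2, 0.3),
    Candidate("data-parallel-size: 1 -> 2", 0.1, 0.3, 0.8, 0.2, 0.3),
    Candidate("max-num-seqs: raise", 0.1, 0.2, 0.9, 0.3, 0.1),
]

best = rank(candidates)[0]
print(best.name, round(score(best), 2))
```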
For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
prompts and a tight TTFT SLO made single-request prefill service time the
active limiter. Under that bottleneck, the high-value candidate family is a
legal TP frontier probe, because increasing TP can reduce prefill compute
latency for one request. DP-only scaling adds replicas but does not shorten the
single-request prefill path, so it can raise aggregate admission capacity while
the per-request TTFT violations persist and the per-GPU objective stays low.
The actual harness trajectory was:
| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
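
The trajectory above is enough to recompute the convergence metrics for the harness arms. The report does not state how normalized AUC is defined, so the sketch below assumes one plausible definition: the mean of the best-so-far value divided by the reference, over a 10-trial horizon, with the last value carried forward after an early stop. Applied to the rounded table values, that assumed definition reproduces the reported 0.9484 and 0.9450.

```python
from typing import List, Optional

REFERENCE_BEST = 0.4429            # req/s/GPU
TARGET = 0.95 * REFERENCE_BEST     # convergence target
BUDGET = 10                        # assumed AUC horizon (naive trial count)

# Per-trial req/s/GPU, copied from the trajectory table above.
TRAJECTORIES = {
    "gpt55_harness": [0.2142, 0.4429],
    "gpt54mini_harness": [0.1992, 0.4429],
}


def best_so_far(values: List[float]) -> List[float]:
    """Running maximum of the per-trial objective."""
    best, curve = 0.0, []
    for value in values:
        best = max(best, value)
        curve.append(best)
    return curve


def trials_to_target(values: List[float], target: float = TARGET) -> Optional[int]:
    """1-based index of the first trial whose best-so-far reaches the target."""
    for index, best in enumerate(best_so_far(values), start=1):
        if best >= target:
            return index
    return None


def normalized_auc(values: List[float], budget: int = BUDGET,
                   reference: float = REFERENCE_BEST) -> float:
    """Assumed definition: mean best-so-far / reference over the budget."""
    curve = best_so_far(values)[:budget]
    if not curve:
        return 0.0
    curve = curve + [curve[-1]] * (budget - len(curve))  # carry forward
    return sum(curve) / (budget * reference)


for arm, values in TRAJECTORIES.items():
    print(arm, trials_to_target(values), round(normalized_auc(values), 4))
```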
The stop was also harness-mediated. Both harness arms stopped after trial 2
because the validator authorized `harness_stop` with:
```text
search_high_saturated_by_incumbent
```
The recorded stop diagnosis was:
```text
The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.
```
So the loop did not stop because an LLM guessed that tuning was done. It stopped
because the incumbent saturated the configured search interval under the SLO
within binary-search tolerance.
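
A minimal sketch of that stop condition, assuming the validator only needs the incumbent's probe list, `search.high`, and the tolerance. The `Probe` type and the function name are hypothetical, and the probe values in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

SEARCH_HIGH = 0.0625   # upper bound of the sampling_u search interval
TOLERANCE = 0.001      # configured binary-search resolution


@dataclass(frozen=True)
class Probe:
    sampling_u: float
    feasible: bool     # True when the probe met the SLO


def search_high_saturated(probes: Sequence[Probe],
                          high: float = SEARCH_HIGH,
                          tolerance: float = TOLERANCE) -> bool:
    """True when the highest measured probe is feasible and within
    `tolerance` of `high`, so the interval cannot be pushed further."""
    if not probes:
        return False
    highest = max(probes, key=lambda p: p.sampling_u)
    return highest.feasible and (high - highest.sampling_u) <= tolerance


# Hypothetical probe histories, not values from the experiment logs.
saturated = search_high_saturated([Probe(0.03125, True), Probe(0.0625, True)])
not_saturated = search_high_saturated([Probe(0.03125, True), Probe(0.0625, False)])
```

Once this predicate holds, further trials cannot raise the measured `sampling_u` for the incumbent, which is the sense in which the stop was validator-authorized rather than guessed.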
## Which Knobs Were Tuned
The winning harness configuration only changed topology:
```text
base config + tensor-parallel-size=4, data-parallel-size=1
```
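
As a sketch of how small the winning patch is, the snippet below merges it into a base configuration and renders value-bearing flags. The base values are placeholders, since this note does not list the base config, and the helper names are illustrative. Boolean flags such as `enable-prefix-caching` are deliberately not handled.

```python
from typing import Dict, List

WINNING_PATCH = {"tensor-parallel-size": 4, "data-parallel-size": 1}


def apply_patch(base: Dict[str, int], patch: Dict[str, int]) -> Dict[str, int]:
    """Return a new config in which patch values override base values."""
    merged = dict(base)
    merged.update(patch)
    return merged


def gpus_used(config: Dict[str, int]) -> int:
    """GPU count implied by the topology, assuming EP stays at 1."""
    return config["tensor-parallel-size"] * config["data-parallel-size"]


def to_cli_args(config: Dict[str, int]) -> List[str]:
    """Render each value-bearing flag as `--name value`."""
    args: List[str] = []
    for name, value in config.items():
        args += [f"--{name}", str(value)]
    return args


# Placeholder base config: these values are assumptions for illustration.
base = {"tensor-parallel-size": 1, "data-parallel-size": 1}
config = apply_patch(base, WINNING_PATCH)
print(to_cli_args(config), gpus_used(config))
```

The winning topology occupies 4 GPUs, so the per-GPU objective divides the feasible request rate by 4 rather than by the 8 GPUs available.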
The harness did not tune local scheduler/cache/memory knobs in the winning path.
It deliberately tested topology before local runtime knobs because the active
bottleneck was single-request TTFT/prefill service time.
The naive arms tuned a different knob family:
| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
horizontal data parallelism should maximize request rate because the model fits
per GPU and TP would add communication overhead. Subsequent naive proposals kept
that DP-heavy topology and searched scheduler/cache/memory details around it.
Across 20 naive trial slots in total, neither naive arm probed the TP2/TP4
topology frontier that resolved the bottleneck.
## Why This Beats Baseline
The naive baseline failed because it tuned knobs that lie off the causal path of the bottleneck.
For a TTFT/prefill-bound workload, the relevant service-time term is the latency
of one request's prefill path. A DP-heavy topology can run more independent
replicas, but each replica still handles a long prompt with TP1 compute latency.
Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
feasible `sampling_u`, and the objective divides by GPU usage. This is why
`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
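
The argument can be illustrated with a deliberately simple toy model. Everything numeric here is an assumption chosen for illustration (the per-GPU prefill rate and the TP scaling efficiency are not measured values), and queueing is ignored entirely. The only point is structural: DP does not appear in the single-request prefill estimate, while TP does.

```python
def ttft_limit_s(input_tokens: int) -> float:
    """TTFT step rule from the experiment design."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def prefill_seconds(input_tokens: int, tp: int,
                    tokens_per_s_per_gpu: float = 1500.0,  # assumed rate
                    tp_efficiency: float = 0.85) -> float:  # assumed scaling
    """Toy estimate: prefill work is split across `tp` GPUs with some loss."""
    speedup = 1.0 if tp == 1 else tp * tp_efficiency
    return input_tokens / (tokens_per_s_per_gpu * speedup)


def meets_ttft(input_tokens: int, tp: int, dp: int) -> bool:
    """`dp` is accepted but unused: replicas do not shorten one prefill."""
    del dp
    return prefill_seconds(input_tokens, tp) <= ttft_limit_s(input_tokens)


for tp, dp in [(1, 1), (1, 8), (2, 1), (4, 1)]:
    print(f"TP={tp} DP={dp} 8192-token prompt within TTFT:",
          meets_ttft(8192, tp, dp))
```

In this toy setting a long prompt fails the TTFT limit at `TP=1` regardless of how many replicas are added, and passes once TP is raised. Real behavior also depends on batching, queueing, and communication cost, none of which is modeled here.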
The harness changed the optimization direction:
```text
observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
```
That sequence is measurable and falsifiable. If TP4 had improved raw latency but
materially regressed `request_rate_per_gpu`, the harness proposal stated that the
TP hypothesis should be rejected. If the bottleneck had been admission/queueing with healthy
TTFT/TPOT service times, the same knob-effect model would have favored DP or
`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
"`ttft_prefill` evidence makes TP frontier the next highest-information probe
under current constraints."
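
The conditional nature of that decision can be sketched as a lookup from bottleneck label to knob family, followed by an adjacent legal step. Only `ttft_prefill` is a label taken from this note; `admission_queueing` and the function name are hypothetical, and the sketch maps the queueing case to DP alone even though `max-num-seqs` is named as an alternative.

```python
from typing import Dict, Optional

SIZES = (1, 2, 4, 8)               # legal TP and DP values
LEGAL_PRODUCTS = {1, 2, 4, 8}      # legal TP*DP products

KNOB_FAMILY = {
    "ttft_prefill": "tensor-parallel-size",
    "admission_queueing": "data-parallel-size",   # hypothetical label
}


def next_probe(bottleneck: str, tp: int, dp: int) -> Optional[Dict[str, int]]:
    """Propose the adjacent legal step in the knob family for the bottleneck,
    or None when the label is unknown or no legal step exists."""
    family = KNOB_FAMILY.get(bottleneck)
    if family is None:
        return None
    current = tp if family == "tensor-parallel-size" else dp
    larger = [size for size in SIZES if size > current]
    if not larger:
        return None
    step = larger[0]
    new_tp, new_dp = (step, dp) if family == "tensor-parallel-size" else (tp, step)
    if new_tp * new_dp not in LEGAL_PRODUCTS:
        return None
    return {"tensor-parallel-size": new_tp, "data-parallel-size": new_dp}


print(next_probe("ttft_prefill", tp=2, dp=1))
print(next_probe("admission_queueing", tp=1, dp=4))
```

With `ttft_prefill` evidence and an incumbent at `TP=2, DP=1`, the sketch proposes `TP=4, DP=1`, matching the second harness trial. A different bottleneck label yields a different family from the same code path.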
This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
harness converged to exactly the same TP frontier and final throughput as
`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
wrong DP-heavy family for its whole budget. The ablation therefore attributes the
gain to the structured harness state and validators, not merely to a stronger
language model or a more verbose prompt.
## Evidence Boundary
This report strongly supports the harness mechanism on the Qwen27B tight-SLO
case and the model-strength ablation. On its own, it is not proof that the
mechanism generalizes. The correct generalization claim is narrower:
- In this case, the harness improved final quality, convergence speed, AUC, and
stop discipline.
- The harness made a weaker model match the stronger harnessed model and beat
the stronger naive model by more than 16x.
- The successful decision was expressed in generic terms: SLO-derived
bottleneck classification, topology constraints, knob-effect scoring,
per-GPU objective, and validator-authorized stop.
- Additional cases are still needed to show the same mechanism across different
bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
memory/KV pressure, and admission/queueing pressure.
## Original Aggregate Report
```text
# qwen27b-tight-2x2-aggregate-20260623T005838Z
## Aggregate
- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`
## By Kind
| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |
## Cases
### qwen27b-tight-slo-2x2-aggregate
- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`
| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
```