Document Qwen27B 2x2 harness ablation
# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23

This note organizes the aggregate report generated at:

```text
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
```

The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
It asks whether the harness supplies reusable search structure beyond a stronger
LLM's free-form tuning proposals.

## Experiment Design

Case: `qwen27b-tight-slo-2x2-aggregate`.

Substrate:

- Model served: `qwen3.5-27b-256k-0223-internal`.
- Hardware: H20, up to 8 GPUs.
- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
  `replay_time_scale=1.0`, `max_concurrency=32`.
- SLO: pass rate >= 0.95, TTFT step rule of 2 s for <=4096 input tokens,
  4 s for <=32768 input tokens, 6 s above that, and TPOT <= 50 ms.
- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
  `enable-chunked-prefill`.
- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
  `{1,2,4,8}`, EP fixed to 1 for this case.
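
The SLO step rule and the topology constraints listed above can be sketched in a few lines of Python. This is a minimal illustration, not AITuner's actual code: the function names are hypothetical, the TTFT limits are assumed to be inclusive upper bounds, and each request is assumed to be summarized as `(input_tokens, ttft_s, tpot_ms)`.

```python
from itertools import product

# Legal TP and DP values, and legal TP*DP products, as stated in the note.
ALLOWED_SIZES = (1, 2, 4, 8)


def legal_topologies():
    """Enumerate (tp, dp) pairs that satisfy the topology constraints (EP = 1)."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_SIZES
    ]


def ttft_limit_s(input_tokens):
    """TTFT step rule: 2 s up to 4096 tokens, 4 s up to 32768, 6 s above."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def request_meets_slo(input_tokens, ttft_s, tpot_ms):
    """A request passes when both TTFT and TPOT (<= 50 ms) are within limits."""
    return ttft_s <= ttft_limit_s(input_tokens) and tpot_ms <= 50.0


def probe_is_feasible(requests, min_pass_rate=0.95):
    """A probe is feasible when at least 95% of its requests meet the SLO."""
    passed = sum(request_meets_slo(*r) for r in requests)
    return passed / len(requests) >= min_pass_rate
```

Under these constraints there are ten legal `(TP, DP)` pairs. `(4, 1)` and `(1, 8)` are both legal, while `(4, 4)` is not because its product exceeds 8.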

Arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

The only intended axis inside each model pair is `use_harness`. The aggregate
then compares whether the weaker model plus harness can match or exceed the
stronger model without harness.

## Aggregate Result

Reference best: `0.4429 req/s/GPU`.
Target threshold for convergence comparisons: 95% of reference, or
`0.4208 req/s/GPU`.

| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
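
The threshold and the `Final/ref` and `Trials to target` columns can be reproduced with simple arithmetic. The sketch below assumes that "trials to target" means the 1-based index of the first trial whose best-so-far value reaches the threshold. That definition is an assumption, as are the helper names. The per-trial values for `gpt55_harness` are taken from the trajectory table later in this note.

```python
REFERENCE_BEST = 0.4429   # req/s/GPU, from the report
TARGET_FRACTION = 0.95    # from the report


def target_threshold(reference=REFERENCE_BEST, fraction=TARGET_FRACTION):
    """Convergence threshold: a fixed fraction of the reference best."""
    return reference * fraction


def final_over_ref(final, reference=REFERENCE_BEST):
    """Final per-GPU throughput as a fraction of the reference best."""
    return final / reference


def trials_to_target(per_trial_values, threshold):
    """Assumed definition: first 1-based trial whose best-so-far >= threshold."""
    best = 0.0
    for index, value in enumerate(per_trial_values, start=1):
        best = max(best, value)
        if best >= threshold:
            return index
    return None  # target never reached (shown as "-" in the table)


threshold = target_threshold()
gpt55_harness_trials = [0.2142, 0.4429]  # from the trajectory table
print(round(threshold, 4))
print(trials_to_target(gpt55_harness_trials, threshold))
print(round(final_over_ref(0.0273), 4), round(final_over_ref(0.0231), 4))
```

With the rounded table values this gives a threshold of 0.4208, two trials to target for the harness arm, and naive `Final/ref` values of 0.0616 and 0.0522, matching the table.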

Harness wins both harness-vs-naive checks:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |
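
As a sanity check, the ratios in this table can be approximated from the rounded values in the aggregate table. The sketch assumes that "best naive" means the larger of the two naive arms' values. The small differences from the reported ratios are consistent with the report having used unrounded inputs, though that is an inference rather than something the report states.

```python
# Rounded values copied from the aggregate table above.
final = {"gpt55_harness": 0.4429, "gpt54mini_harness": 0.4429,
         "gpt55_naive": 0.0273, "gpt54mini_naive": 0.0231}
auc = {"gpt55_harness": 0.9484, "gpt54mini_harness": 0.9450,
       "gpt55_naive": 0.0588, "gpt54mini_naive": 0.0498}


def ratio_vs_best_naive(metric, harness_arm):
    """Harness value divided by the best (largest) naive value for a metric."""
    best_naive = max(v for arm, v in metric.items() if arm.endswith("_naive"))
    return metric[harness_arm] / best_naive


for arm in ("gpt55_harness", "gpt54mini_harness"):
    print(arm,
          round(ratio_vs_best_naive(final, arm), 2),
          round(ratio_vs_best_naive(auc, arm), 2))
```

Both harness arms come out at roughly 16.2x on final throughput and roughly 16.1x on AUC, in line with the reported figures.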

The strongest ablation observation is that `gpt-5.4-mini + harness` matches
`gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
while both naive arms remain more than 16x below the harness arms by final
per-GPU throughput and AUC.

## What The Harness Actually Did

The harness did not perform generic "better prompting". It inserted a measured,
structured decision protocol between trial results and the next proposal.

Formally, after each trial `t`, AITuner observes:

```text
o_t = (config_t, probe_history_t, pass_rate_t, latency/SLO failures_t,
       request_rate_t, parallel_size_t, launch_status_t)
```

and optimizes:

```text
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
```
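
The objective can be written as a small function. This is a sketch under two assumptions that the note does not state outright: that `parallel_size` equals TP x DP (EP is fixed to 1 in this case), and that an infeasible probe simply has no objective value. The function name is hypothetical, and the aggregate request rates in the example are back-computed from the per-GPU figures in this note rather than taken from the report.

```python
def per_gpu_objective(request_rate, tp, dp, pass_rate, min_pass_rate=0.95):
    """Return request_rate / (tp * dp), or None if the SLO pass rate is not met."""
    if pass_rate < min_pass_rate:
        return None  # infeasible: the constraint pass_rate >= 0.95 is violated
    parallel_size = tp * dp  # assumption: EP = 1, so GPUs used = TP * DP
    return request_rate / parallel_size


# Back-computed aggregate rates (per-GPU value times GPU count).
tp4_dp1 = per_gpu_objective(request_rate=1.7716, tp=4, dp=1, pass_rate=0.9718)
tp2_dp1 = per_gpu_objective(request_rate=0.4284, tp=2, dp=1, pass_rate=0.9572)
print(tp4_dp1, tp2_dp1)  # about 0.4429 and 0.2142 req/s/GPU
```

Dividing by GPU count is what makes the objective insensitive to simply adding replicas: a configuration only scores higher if each GPU sustains more feasible load.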

The harness maps the observation into:

```text
b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
```
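
One way to read the scoring rule is as an unweighted additive score over candidate knob families. The sketch below is illustrative only: the note does not say how each term is estimated or weighted, so the `Candidate` type, the candidate names, and every numeric value here are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate knob family with the five score terms (all hypothetical)."""
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(a):
    """Additive score in the same form as score(a) above (unit weights assumed)."""
    return (a.expected_bottleneck_relief
            + a.information_gain
            + a.launch_safety
            - a.regression_risk
            - a.measurement_cost)


def rank(candidates):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Placeholder values chosen only to show the mechanics of the ranking.
candidates = [
    Candidate("tp_frontier", 0.8, 0.6, 0.7, 0.2, 0.3),
    Candidate("dp_only_scaling", 0.2, 0.3, 0.8, 0.2, 0.3),
    Candidate("max_num_seqs_sweep", 0.1, 0.2, 0.9, 0.1, 0.2),
]
for c in rank(candidates):
    print(c.name, round(score(c), 2))
```

The point of the structure is that a proposal has to justify itself on each term separately, which is different from asking the model for a single free-form "best next config".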

For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
prompts and a tight TTFT SLO made single-request prefill service time the
active limiter. Under that bottleneck, the high-value candidate family is a
legal TP frontier probe, because increasing TP can reduce prefill compute
latency for one request. DP-only scaling adds replicas but does not shorten the
single-request prefill path, so it can improve aggregate admission while still
failing the per-request TTFT bottleneck and the per-GPU objective.

The actual harness trajectory was:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |

The stop was also harness-mediated. Both harness arms stopped after trial 2
because the validator authorized `harness_stop` with:

```text
search_high_saturated_by_incumbent
```

The recorded stop diagnosis was:

```text
The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.
```

So the loop did not stop because an LLM guessed that tuning was done. It stopped
because the incumbent saturated the configured search interval under the SLO
within binary-search tolerance.
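
The stop condition can be illustrated with a plain bisection over `sampling_u`, using the search settings from the experiment design (`[0, 0.0625]`, tolerance 0.001, at most 6 probes). This is a sketch of the idea, not the tool's implementation: how AITuner actually orders its probes is not documented here, and the function names and the feasibility oracle are hypothetical.

```python
def max_feasible_u(is_feasible, low=0.0, high=0.0625, tol=0.001, max_probes=6):
    """Bisect for the largest feasible sampling_u.

    Returns (best_u, probes_used); best_u is None if no probe was feasible.
    """
    lo, hi = low, high
    best, probes = None, 0
    while hi - lo > tol and probes < max_probes:
        mid = (lo + hi) / 2
        probes += 1
        if is_feasible(mid):
            lo = best = mid  # feasible: move the lower bound up
        else:
            hi = mid         # infeasible: move the upper bound down
    return best, probes


def search_high_saturated(best_u, high=0.0625, tol=0.001):
    """True when the highest feasible probe is within tolerance of search.high."""
    return best_u is not None and high - best_u <= tol


# An incumbent that passes the SLO at every probed load saturates the interval.
best_u, probes = max_feasible_u(lambda u: True)
print(best_u, probes, search_high_saturated(best_u))

# A configuration that only sustains u <= 0.03 does not.
weak_u, _ = max_feasible_u(lambda u: u <= 0.03)
print(weak_u, search_high_saturated(weak_u))
```

With six halvings the interval width is 0.0625 / 64, which is just under the 0.001 tolerance, so an always-feasible incumbent ends within tolerance of `search.high` exactly when the probe budget runs out. At that point further tuning cannot be distinguished by this search range, which is the situation the stop diagnosis describes.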

## Which Knobs Were Tuned

The winning harness configuration only changed topology:

```text
base config + tensor-parallel-size=4, data-parallel-size=1
```

The harness did not tune local scheduler/cache/memory knobs in the winning path.
It deliberately tested topology before local runtime knobs because the active
bottleneck was single-request TTFT/prefill service time.

The naive arms tuned a different knob family:

| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
horizontal data parallelism should maximize request rate because the model fits
per GPU and TP would add communication overhead. Subsequent naive proposals kept
that DP-heavy topology and searched scheduler/cache/memory details around it.
Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
frontier that solved the bottleneck.

## Why This Beats Baseline

The baseline failed because it optimized the wrong causal path.

For a TTFT/prefill-bound workload, the relevant service-time term is the latency
of one request's prefill path. A DP-heavy topology can run more independent
replicas, but each replica still handles a long prompt with TP1 compute latency.
Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
feasible `sampling_u`, and the objective divides by GPU usage. This is why
`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.

The harness changed the optimization direction:

```text
observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
  -> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
```

That sequence is measurable and falsifiable. If TP4 had improved raw latency but
materially regressed `request_rate_per_gpu`, the harness proposal said it should
reject the hypothesis. If the bottleneck had been admission/queueing with healthy
TTFT/TPOT service times, the same knob-effect model would have favored DP or
`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
"`ttft_prefill` evidence makes TP frontier the next highest-information probe
under current constraints."
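
The conditional nature of that decision can be sketched as a lookup from bottleneck label to knob family, plus a rejection test on the per-GPU objective. Only `ttft_prefill` is a label that appears in the report. The `admission_queueing` label, the function names, and the 5% regression margin are hypothetical and used purely for illustration.

```python
# Bottleneck label -> knob family to probe next, following the paragraph above.
KNOB_FAMILY_BY_BOTTLENECK = {
    "ttft_prefill": ["tensor-parallel-size"],
    # Hypothetical label for the admission/queueing case described above.
    "admission_queueing": ["data-parallel-size", "max-num-seqs"],
}


def next_probe_family(bottleneck):
    """Pick the knob family for a ranked bottleneck; unknown labels are an error."""
    try:
        return KNOB_FAMILY_BY_BOTTLENECK[bottleneck]
    except KeyError:
        raise ValueError(f"no knob-effect model for bottleneck {bottleneck!r}")


def keep_hypothesis(candidate_per_gpu, incumbent_per_gpu, margin=0.05):
    """Reject a probe that materially regresses req/s/GPU.

    The margin defining 'materially' is an assumed value, not one from the report.
    """
    return candidate_per_gpu >= incumbent_per_gpu * (1.0 - margin)


print(next_probe_family("ttft_prefill"))
print(keep_hypothesis(candidate_per_gpu=0.4429, incumbent_per_gpu=0.2142))
```

In the recorded `gpt55_harness` run, TP4 roughly doubled the per-GPU rate over TP2, so the hypothesis would be kept under any reasonable margin.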

This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
harness converged to exactly the same TP frontier and final throughput as
`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
wrong DP-heavy family for its whole budget. The ablation therefore attributes the
gain to the structured harness state and validators, not merely to a stronger
language model or a more verbose prompt.

## Evidence Boundary

This report strongly supports the harness mechanism on the Qwen27B tight-SLO
case and the model-strength ablation. It should not be overclaimed as universal
proof by itself. The correct generalization claim is narrower:

- In this case, the harness improved final quality, convergence speed, AUC, and
  stop discipline.
- The harness made a weaker model match the stronger harnessed model and beat
  the stronger naive model by more than 16x.
- The successful decision was expressed in generic terms: SLO-derived
  bottleneck classification, topology constraints, knob-effect scoring,
  per-GPU objective, and validator-authorized stop.
- Additional cases are still needed to show the same mechanism across different
  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
  memory/KV pressure, and admission/queueing pressure.

## Original Aggregate Report

```text
# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
```