Document Qwen27B 2x2 harness ablation

2026-06-23 10:08:46 +08:00
parent e67bc86240
commit 76ec19224c
1 changed files with 251 additions and 0 deletions
--- a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
+++ b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
@@ -0,0 +1,251 @@
 # Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
 This note organizes the aggregate report generated at:
 ```text
 .aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
 ```
 The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
 It asks whether the harness supplies reusable search structure beyond a stronger
 LLM's free-form tuning proposals.
 ## Experiment Design
 Case: `qwen27b-tight-slo-2x2-aggregate`.
 Substrate:
 - Model served: `qwen3.5-27b-256k-0223-internal`.
 - Hardware: H20, up to 8 GPUs.
 - Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
  `replay_time_scale=1.0`, `max_concurrency=32`.
 - SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
  4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
 - Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
 - Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
 - Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
  `enable-chunked-prefill`.
 - Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
  `{1,2,4,8}`, EP fixed to 1 for this case.
 Arms:
 | Arm | Tuner model | Harness | Trial budget used |
 | --- | --- | --- | ---: |
 | `gpt55_harness` | `gpt-5.5` | on | 2 |
 | `gpt55_naive` | `gpt-5.5` | off | 10 |
 | `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
 | `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
 The only intended axis inside each model pair is `use_harness`. The aggregate
 then compares whether the weaker model plus harness can match or exceed the
 stronger model without harness.
 ## Aggregate Result
 Reference best: `0.4429 req/s/GPU`.
 Target threshold for convergence comparisons: 95% of reference, or
 `0.4208 req/s/GPU`.
 | Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
 | `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
 | `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
 | `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
 Harness wins both harness-vs-naive checks:
 | Harness arm | Final vs best naive | AUC vs best naive | Pass |
 | --- | ---: | ---: | --- |
 | `gpt55_harness` | 16.2290x | 16.1296x | true |
 | `gpt54mini_harness` | 16.2290x | 16.0720x | true |
 The strongest ablation observation is that `gpt-5.4-mini + harness` matches
 `gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
 while both naive arms remain more than 16x below the harness arms by final
 per-GPU throughput and AUC.
 ## What The Harness Actually Did
 The harness did not perform generic "better prompting". It inserted a measured,
 structured decision protocol between trial results and the next proposal.
 Formally, after each trial `t`, AITuner observes:
 ```text
 o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
       request_rate_t, parallel_size_t, launch status_t)
 ```
 and optimizes:
 ```text
 J(config_t) = request_rate_t / parallel_size_t
 subject to pass_rate_t >= 0.95.
 ```
 The harness maps the observation into:
 ```text
 b_t = ranked_bottleneck(o_t)
 A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
 score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
 ```
 For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
 prompts and a tight TTFT SLO made single-request prefill service time the
 active limiter. Under that bottleneck, the high-value candidate family is a
 legal TP frontier probe, because increasing TP can reduce prefill compute
 latency for one request. DP-only scaling adds replicas but does not shorten the
 single-request prefill path, so it can improve aggregate admission while still
 failing the per-request TTFT bottleneck and the per-GPU objective.
 The actual harness trajectory was:
 | Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
 | --- | ---: | --- | ---: | ---: | --- |
 | `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
 | `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
 | `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
 | `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
 The stop was also harness-mediated. Both harness arms stopped after trial 2
 because the validator authorized `harness_stop` with:
 ```text
 search_high_saturated_by_incumbent
 ```
 The recorded stop diagnosis was:
 ```text
 The incumbent's highest measured probe is feasible and is within the configured
 binary-search resolution of search.high.
 ```
 So the loop did not stop because an LLM guessed that tuning was done. It stopped
 because the incumbent saturated the configured search interval under the SLO
 within binary-search tolerance.
 ## Which Knobs Were Tuned
 The winning harness configuration only changed topology:
 ```text
 base config + tensor-parallel-size=4, data-parallel-size=1
 ```
 The harness did not tune local scheduler/cache/memory knobs in the winning path.
 It deliberately tested topology before local runtime knobs because the active
 bottleneck was single-request TTFT/prefill service time.
 The naive arms tuned a different knob family:
 | Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
 | --- | --- | --- | ---: |
 | `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
 | `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
 The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
 horizontal data parallelism should maximize request rate because the model fits
 per GPU and TP would add communication overhead. Subsequent naive proposals kept
 that DP-heavy topology and searched scheduler/cache/memory details around it.
 Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
 frontier that solved the bottleneck.
 ## Why This Beats Baseline
 The baseline failed because it optimized the wrong causal path.
 For a TTFT/prefill-bound workload, the relevant service-time term is the latency
 of one request's prefill path. A DP-heavy topology can run more independent
 replicas, but each replica still handles a long prompt with TP1 compute latency.
 Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
 feasible `sampling_u`, and the objective divides by GPU usage. This is why
 `TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
 The harness changed the optimization direction:
 ```text
 observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
 -> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
 ```
 That sequence is measurable and falsifiable. If TP4 had improved raw latency but
 materially regressed `request_rate_per_gpu`, the harness proposal said it should
 reject the hypothesis. If the bottleneck had been admission/queueing with healthy
 TTFT/TPOT service times, the same knob-effect model would have favored DP or
 `max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
 "`ttft_prefill` evidence makes TP frontier the next highest-information probe
 under current constraints."
 This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
 harness converged to exactly the same TP frontier and final throughput as
 `gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
 wrong DP-heavy family for its whole budget. The ablation therefore attributes the
 gain to the structured harness state and validators, not merely to a stronger
 language model or a more verbose prompt.
 ## Evidence Boundary
 This report strongly supports the harness mechanism on the Qwen27B tight-SLO
 case and the model-strength ablation. It should not be overclaimed as universal
 proof by itself. The correct generalization claim is narrower:
 - In this case, the harness improved final quality, convergence speed, AUC, and
  stop discipline.
 - The harness made a weaker model match the stronger harnessed model and beat
  the stronger naive model by more than 16x.
 - The successful decision was expressed in generic terms: SLO-derived
  bottleneck classification, topology constraints, knob-effect scoring,
  per-GPU objective, and validator-authorized stop.
 - Additional cases are still needed to show the same mechanism across different
  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
  memory/KV pressure, and admission/queueing pressure.
 ## Original Aggregate Report
 ```text
 # qwen27b-tight-2x2-aggregate-20260623T005838Z
 ## Aggregate
 - Cases: `1`
 - Harness-vs-naive pass/checks: `2`/`2`
 - Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`
 ## By Kind
 | Kind | Arms | Mean final/ref | Mean AUC | Target reached |
 | --- | ---: | ---: | ---: | ---: |
 | `harness` | 2 | 1.0000 | 0.9467 | 2 |
 | `naive` | 2 | 0.0569 | 0.0543 | 0 |
 ## Cases
 ### qwen27b-tight-slo-2x2-aggregate
 - Reference best req/s/GPU: `0.4429`
 - Target fraction: `0.95`
 - Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`
 | Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
 | `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
 | `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
 | `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
 | `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
 | Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
 | --- | ---: | ---: | ---: | --- |
 | `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
 | `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
 ```