diff --git a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
new file mode 100644
index 0000000..dab994c
--- /dev/null
+++ b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
@@ -0,0 +1,251 @@
+# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
+
+This note organizes the aggregate report generated at:
+
+```text
+.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
+```
+
+The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
+It asks whether the harness supplies reusable search structure beyond a stronger
+LLM's free-form tuning proposals.
+
+## Experiment Design
+
+Case: `qwen27b-tight-slo-2x2-aggregate`.
+
+Substrate:
+
+- Model served: `qwen3.5-27b-256k-0223-internal`.
+- Hardware: H20, up to 8 GPUs.
+- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
+  `replay_time_scale=1.0`, `max_concurrency=32`.
+- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
+  4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
+- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
+- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
+- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
+  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
+  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
+  `enable-chunked-prefill`.
+- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
+  `{1,2,4,8}`, EP fixed to 1 for this case.
+
+Arms:
+
+| Arm | Tuner model | Harness | Trial budget used |
+| --- | --- | --- | ---: |
+| `gpt55_harness` | `gpt-5.5` | on | 2 |
+| `gpt55_naive` | `gpt-5.5` | off | 10 |
+| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
+| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
+
+The only intended axis inside each model pair is `use_harness`.
+The aggregate then compares whether the weaker model plus harness can match
+or exceed the stronger model without harness.
+
+## Aggregate Result
+
+Reference best: `0.4429 req/s/GPU`.
+Target threshold for convergence comparisons: 95% of reference, or
+`0.4208 req/s/GPU`.
+
+| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
+| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
+| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
+| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
+
+Harness wins both harness-vs-naive checks:
+
+| Harness arm | Final vs best naive | AUC vs best naive | Pass |
+| --- | ---: | ---: | --- |
+| `gpt55_harness` | 16.2290x | 16.1296x | true |
+| `gpt54mini_harness` | 16.2290x | 16.0720x | true |
+
+The strongest ablation observation is that `gpt-5.4-mini + harness` matches
+`gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
+while both naive arms remain more than 16x below the harness arms by final
+per-GPU throughput and AUC.
+
+## What The Harness Actually Did
+
+The harness did not perform generic "better prompting". It inserted a measured,
+structured decision protocol between trial results and the next proposal.
+
+Formally, after each trial `t`, AITuner observes:
+
+```text
+o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
+       request_rate_t, parallel_size_t, launch status_t)
+```
+
+and optimizes:
+
+```text
+J(config_t) = request_rate_t / parallel_size_t
+subject to pass_rate_t >= 0.95.
+```
+
+The harness maps the observation into:
+
+```text
+b_t = ranked_bottleneck(o_t)
+A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
+score(a) = expected_bottleneck_relief(a)
+         + information_gain(a)
+         + launch_safety(a)
+         - regression_risk(a)
+         - measurement_cost(a)
+```
+
+For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
+prompts and a tight TTFT SLO made single-request prefill service time the
+active limiter. Under that bottleneck, the high-value candidate family is a
+legal TP frontier probe, because increasing TP can reduce prefill compute
+latency for one request. DP-only scaling adds replicas but does not shorten the
+single-request prefill path, so it can improve aggregate admission while still
+failing the per-request TTFT bottleneck and the per-GPU objective.
+
+The actual harness trajectory was:
+
+| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
+| --- | ---: | --- | ---: | ---: | --- |
+| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
+| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
+| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
+| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
+
+The stop was also harness-mediated. Both harness arms stopped after trial 2
+because the validator authorized `harness_stop` with:
+
+```text
+search_high_saturated_by_incumbent
+```
+
+The recorded stop diagnosis was:
+
+```text
+The incumbent's highest measured probe is feasible and is within the configured
+binary-search resolution of search.high.
+```
+
+So the loop did not stop because an LLM guessed that tuning was done.
+It stopped because the incumbent saturated the configured search interval
+under the SLO within binary-search tolerance.
+
+## Which Knobs Were Tuned
+
+The winning harness configuration only changed topology:
+
+```text
+base config + tensor-parallel-size=4, data-parallel-size=1
+```
+
+The harness did not tune local scheduler/cache/memory knobs in the winning path.
+It deliberately tested topology before local runtime knobs because the active
+bottleneck was single-request TTFT/prefill service time.
+
+The naive arms tuned a different knob family:
+
+| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
+| --- | --- | --- | ---: |
+| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
+| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
+
+The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
+horizontal data parallelism should maximize request rate because the model fits
+per GPU and TP would add communication overhead. Subsequent naive proposals kept
+that DP-heavy topology and searched scheduler/cache/memory details around it.
+Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
+frontier that solved the bottleneck.
+
+## Why This Beats Baseline
+
+The baseline failed because it optimized the wrong causal path.
+
+For a TTFT/prefill-bound workload, the relevant service-time term is the latency
+of one request's prefill path. A DP-heavy topology can run more independent
+replicas, but each replica still handles a long prompt with TP1 compute latency.
+Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
+feasible `sampling_u`, and the objective divides by GPU usage. This is why
+`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
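The per-GPU objective described above can be illustrated with a short sketch. It assumes `parallel_size` equals TP * DP (EP is fixed to 1 in this case); the function name and signature are hypothetical and are not AITuner's API. Aggregate request rates are back-computed from the per-GPU figures in this note (per-GPU rate times GPU count), not separately logged values.

```python
from typing import Optional

PASS_RATE_FLOOR = 0.95  # SLO pass-rate constraint from the experiment design


def per_gpu_objective(request_rate: float, tp: int, dp: int,
                      pass_rate: float) -> Optional[float]:
    """Return req/s/GPU, or None when the pass-rate constraint is violated.

    Assumption: parallel_size = TP * DP (hypothetical; EP is fixed to 1 here).
    """
    if pass_rate < PASS_RATE_FLOOR:
        return None  # infeasible under the SLO, so it cannot become incumbent
    return request_rate / (tp * dp)


# Harness trials from the trajectory table (gpt55_harness arm).
tp2 = per_gpu_objective(request_rate=0.2142 * 2, tp=2, dp=1, pass_rate=0.9572)
tp4 = per_gpu_objective(request_rate=0.4429 * 4, tp=4, dp=1, pass_rate=0.9718)

# Hypothetical comparison: the same aggregate rate as the TP4 trial, spread
# over 8 GPUs, would score only half as much per GPU.
same_rate_on_8_gpus = per_gpu_objective(0.4429 * 4, tp=1, dp=8, pass_rate=0.9718)

# Hypothetical infeasible trial: any rate is rejected if the pass rate is low.
rejected = per_gpu_objective(2.0, tp=1, dp=8, pass_rate=0.90)

print(tp2, tp4, same_rate_on_8_gpus, rejected)
```

Dividing by the GPU count is what makes a DP-heavy topology expensive: adding replicas only pays off if the feasible aggregate rate grows at least as fast as the number of GPUs consumed.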
+
+The harness changed the optimization direction:
+
+```text
+observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
+-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
+```
+
+That sequence is measurable and falsifiable. If TP4 had improved raw latency but
+materially regressed `request_rate_per_gpu`, the harness proposal said to reject
+the hypothesis. If the bottleneck had been admission/queueing with healthy
+TTFT/TPOT service times, the same knob-effect model would have favored DP or
+`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
+"`ttft_prefill` evidence makes TP frontier the next highest-information probe
+under current constraints."
+
+This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
+harness converged to exactly the same TP frontier and final throughput as
+`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
+wrong DP-heavy family for its whole budget. The ablation therefore attributes the
+gain to the structured harness state and validators, not merely to a stronger
+language model or a more verbose prompt.
+
+## Evidence Boundary
+
+This report strongly supports the harness mechanism on the Qwen27B tight-SLO
+case and the model-strength ablation. It should not be read as universal proof
+on its own. The correct generalization claim is narrower:
+
+- In this case, the harness improved final quality, convergence speed, AUC, and
+  stop discipline.
+- The harness made a weaker model match the stronger harnessed model and beat
+  the stronger naive model by more than 16x.
+- The successful decision was expressed in generic terms: SLO-derived
+  bottleneck classification, topology constraints, knob-effect scoring,
+  per-GPU objective, and validator-authorized stop.
+- Additional cases are still needed to show the same mechanism across different
+  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
+  memory/KV pressure, and admission/queueing pressure.
+
+## Original Aggregate Report
+
+```text
+# qwen27b-tight-2x2-aggregate-20260623T005838Z
+
+## Aggregate
+
+- Cases: `1`
+- Harness-vs-naive pass/checks: `2`/`2`
+- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`
+
+## By Kind
+
+| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
+| --- | ---: | ---: | ---: | ---: |
+| `harness` | 2 | 1.0000 | 0.9467 | 2 |
+| `naive` | 2 | 0.0569 | 0.0543 | 0 |
+
+## Cases
+
+### qwen27b-tight-slo-2x2-aggregate
+
+- Reference best req/s/GPU: `0.4429`
+- Target fraction: `0.95`
+- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`
+
+| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
+| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
+| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
+| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
+| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
+
+| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
+| --- | ---: | ---: | ---: | --- |
+| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
+| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
+```
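The derived columns in the aggregate report can be re-checked from the per-arm figures. The following is a minimal sketch with hypothetical names; it is not part of the AITuner report tooling. Because its inputs are the four-decimal values printed in the tables, the recomputed ratios match the report only to about two decimals (for example, roughly 16.22 here against the reported 16.2290, which was presumably computed from unrounded values).

```python
REFERENCE_BEST = 0.4429   # req/s/GPU, from the report
TARGET_FRACTION = 0.95    # target fraction, from the report

# arm -> (final req/s/GPU, normalized AUC), copied from the aggregate table
ARMS = {
    "gpt55_harness": (0.4429, 0.9484),
    "gpt55_naive": (0.0273, 0.0588),
    "gpt54mini_harness": (0.4429, 0.9450),
    "gpt54mini_naive": (0.0231, 0.0498),
}

target = REFERENCE_BEST * TARGET_FRACTION

# "Best naive" is the maximum over the naive arms for each metric.
naive = [v for name, v in ARMS.items() if name.endswith("_naive")]
best_naive_final = max(final for final, _ in naive)
best_naive_auc = max(auc for _, auc in naive)

summary = {}
for arm, (final, auc) in ARMS.items():
    summary[arm] = {
        "final_over_ref": final / REFERENCE_BEST,
        "reached_target": final >= target,
        "final_vs_best_naive": final / best_naive_final,
        "auc_vs_best_naive": auc / best_naive_auc,
    }

print(f"target threshold: {target:.4f} req/s/GPU")
for arm, row in summary.items():
    print(arm, {k: round(v, 4) if isinstance(v, float) else v
                for k, v in row.items()})
```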