aituner/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md

# Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23

This document summarizes the aggregate report below and explains why the harness
makes tuning faster and more effective:

```text
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
```

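This experiment is a 2x2 ablation that crosses tuner-model strength with whether
`use_harness` is enabled. The core question is whether the harness provides
reusable search structure, rather than an incidental gain from a stronger LLM or
a longer prompt.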

## Experiment design

Case: `qwen27b-tight-slo-2x2-aggregate`.

Experimental setup:

- Served model: `qwen3.5-27b-256k-0223-internal`.
- Hardware: H20, up to 8 GPUs.
- Trace: `chat_w20260311_1000`, filtered to input lengths of 0-8192 tokens,
  `replay_time_scale=1.0`, `max_concurrency=32`.
- SLO: pass rate >= 0.95. The TTFT step rule is 2 s for inputs of <=4096 tokens,
  4 s for inputs of <=32768 tokens, and 6 s for longer inputs. TPOT <= 50 ms.
- Search: binary-search probes over `sampling_u in [0, 0.0625]`, tolerance 0.001,
  max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
  `enable-chunked-prefill`.
- Topology constraints: TP and DP are each in `{1,2,4,8}`, the allowed TP*DP
  products are `{1,2,4,8}`, and EP is fixed at 1 in this case.
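The SLO in the setup above can be written down as a small check. This is a minimal sketch, not AITuner's implementation: the `RequestResult` type and its field names are hypothetical, and it assumes that a request passes only when both its TTFT and TPOT limits hold and that the pass rate is the fraction of passing requests.

```python
from dataclasses import dataclass

TARGET_PASS_RATE = 0.95  # from the study spec
TPOT_LIMIT_S = 0.050     # TPOT <= 50 ms


@dataclass(frozen=True)
class RequestResult:
    """Hypothetical per-request measurement; field names are illustrative."""
    input_tokens: int
    ttft_s: float
    tpot_s: float


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT step rule: 2 s up to 4096 tokens, 4 s up to 32768, else 6 s."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def meets_slo(r: RequestResult) -> bool:
    # Assumption: both latency limits must hold for the request to pass.
    return r.ttft_s <= ttft_limit_s(r.input_tokens) and r.tpot_s <= TPOT_LIMIT_S


def pass_rate(results: list) -> float:
    if not results:
        return 0.0
    return sum(meets_slo(r) for r in results) / len(results)


def probe_is_feasible(results: list) -> bool:
    """A probe at one sampling_u level is feasible if the pass rate meets the target."""
    return pass_rate(results) >= TARGET_PASS_RATE
```

Under this reading, a probe with 19 passing requests out of 20 sits exactly at the 0.95 target and counts as feasible.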

2x2 arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

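Within a given tuner model, the main difference between arms is `use_harness`.
The cross-model comparison tests whether a weaker model with the harness can
match or beat a stronger model doing naive tuning.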

## Aggregate result

Reference best: `0.4429 req/s/GPU`.
Convergence target: 95% of the reference, i.e. `0.4208 req/s/GPU`.

| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

All harness-vs-naive checks pass:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |

The key ablation signal: `gpt-5.4-mini + harness` and `gpt-5.5 + harness` reach
the same final throughput, and both hit the target in 2 trials, while both naive
arms remain more than 16x below the harness arms after using all 10 trials.
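The report does not define how "Trials to target" and "Normalized AUC" are computed. The sketch below shows one definition that reproduces the harness rows, under two assumptions: the curve is the best-so-far `req/s/GPU` divided by the reference, and an arm that stops early is padded with its final incumbent up to a 10-trial budget. The per-trial harness values come from the per-trial table later in this document; the function names are illustrative.

```python
REFERENCE_BEST = 0.4429  # req/s/GPU, from the report
TARGET_FRACTION = 0.95   # from the report
TRIAL_BUDGET = 10        # assumption: curves are padded to the naive budget


def best_so_far(per_trial):
    """Running maximum of the per-trial objective."""
    out, best = [], 0.0
    for value in per_trial:
        best = max(best, value)
        out.append(best)
    return out


def trials_to_target(per_trial, reference=REFERENCE_BEST, fraction=TARGET_FRACTION):
    """1-based index of the first trial whose incumbent reaches the target, else None."""
    target = fraction * reference
    for i, value in enumerate(best_so_far(per_trial), start=1):
        if value >= target:
            return i
    return None


def normalized_auc(per_trial, reference=REFERENCE_BEST, budget=TRIAL_BUDGET):
    """Mean of best-so-far / reference over the budget, padding with the last incumbent."""
    curve = best_so_far(per_trial)[:budget]
    if not curve:
        return 0.0
    curve = curve + [curve[-1]] * (budget - len(curve))
    return sum(value / reference for value in curve) / budget


gpt55_harness = [0.2142, 0.4429]
gpt54mini_harness = [0.1992, 0.4429]
print(trials_to_target(gpt55_harness), round(normalized_auc(gpt55_harness), 4))
print(trials_to_target(gpt54mini_harness), round(normalized_auc(gpt54mini_harness), 4))
```

With these assumptions both harness arms reach the target at trial 2, and the AUC values round to 0.9484 and 0.9450, which agrees with the table. That agreement suggests, but does not confirm, that the report pads early-stopped arms in this way.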

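## Agent loop flowchart

The diagram below is an abstract view of the current harness-based agent loop.
The LLM can still take part in proposals, but instead of a raw text history it
receives a structured observation, a bottleneck diagnosis, candidate actions,
and validator constraints. The validator can also authorize a stop, and it can
block repeated failures and illegal configurations.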

```mermaid
flowchart TD
    A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
    B --> C[Binary-search probes over sampling_u]
    C --> D[Build observation o_t]
    D --> E[Bottleneck classifier]
    E --> F[Candidate family generator]
    F --> G[Score candidate actions]
    G --> H[Prompt renderer / planner]
    H --> I[LLM or deterministic harness proposal]
    I --> J{Config validator}
    J -- invalid, repeated, unsafe --> F
    J -- valid config_patch --> B
    G --> K{Stop validator}
    K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
    K -- useful candidates remain --> H
```

In this loop, the harness's job is not to polish the prompt but to turn tuning
into a measurement-constrained decision process:

```text
measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
```
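The "binary-search probes over `sampling_u`" step at the start of this loop can be sketched as plain bisection. This is an illustration under assumptions: the actual probe order used by AITuner is not described in this document, and `is_feasible` is a hypothetical callback standing in for "replay the trace at load level `u` and check the SLO pass rate". The defaults mirror the study spec (`[0, 0.0625]`, tolerance 0.001, max 6 probes).

```python
from typing import Callable, List, Optional, Tuple


def probe_sampling_u(
    is_feasible: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.0625,
    tolerance: float = 0.001,
    max_probes: int = 6,
) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
    """Bisect sampling_u and return (highest feasible u or None, probe history)."""
    history: List[Tuple[float, bool]] = []
    best_feasible: Optional[float] = None
    lo, hi = low, high
    while hi - lo > tolerance and len(history) < max_probes:
        mid = (lo + hi) / 2
        ok = is_feasible(mid)
        history.append((mid, ok))
        if ok:
            # Feasible: remember it and search the upper half.
            best_feasible, lo = mid, mid
        else:
            # Infeasible: search the lower half.
            hi = mid
    return best_feasible, history


# Example: a config that is feasible at every probed load level.
best, probes = probe_sampling_u(lambda u: True)
print(best, len(probes))
```

Six halvings of a 0.0625-wide interval leave a width of 0.0625 / 64, which is just under the 0.001 tolerance, so the probe budget and the tolerance in the spec are consistent with each other under this scheme.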

## Formal design: observation

After each trial, AITuner does not just record a natural-language summary; it
builds a structured observation:

```text
o_t = (
  config_t,
  probe_history_t,
  pass_rate_t,
  latency/SLO_failure_profile_t,
  request_rate_t,
  parallel_size_t,
  launch_status_t,
  prior_failures_t,
  incumbent_t
)
```

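The most important observation fields in this experiment are:

- `config_t`: the `flag_patch` and `env_patch` of the current trial, for example
  `TP=2, DP=1`.
- `probe_history_t`: the feasible/infeasible outcomes of the binary-search
  probes at different `sampling_u` values.
- `pass_rate_t`: the SLO pass rate, and whether it meets the 0.95 target.
- `latency/SLO_failure_profile_t`: which of TTFT and TPOT triggers SLO pressure
  first.
- `request_rate_t`: the request rate the current config sustains under the SLO.
- `parallel_size_t`: the parallel size the config actually uses, needed to
  normalize the per-GPU objective.
- `prior_failures_t`: which earlier configs failed to launch or had no feasible
  probe, so that failed attempts are not repeated.
- `incumbent_t`: the current best config and its `request_rate_per_gpu`.

The objective function is: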

```text
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95
```

In other words, the harness optimizes `req/s/GPU` subject to the SLO, not raw
throughput and not whatever configuration the LLM subjectively considers
"stronger".
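As a rough illustration, the observation tuple and the objective could be represented as follows. The class, the field types, and the `"ok"` launch status value are hypothetical, and the example assumes `parallel_size = TP * DP`; the request rate in the example is back-computed from the reported 0.4429 req/s/GPU rather than taken from a log.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Hypothetical container mirroring the fields of o_t."""
    config: dict
    probe_history: list           # (sampling_u, feasible) pairs
    pass_rate: float
    slo_failure_profile: dict     # e.g. which of TTFT/TPOT fails first
    request_rate: float           # req/s sustained under the SLO
    parallel_size: int            # GPUs used; assumed to be TP * DP here
    launch_status: str = "ok"     # illustrative status value
    prior_failures: list = field(default_factory=list)
    incumbent: Optional[dict] = None


def objective(obs: Observation, target_pass_rate: float = 0.95) -> Optional[float]:
    """J = request_rate / parallel_size, defined only when the SLO constraint holds."""
    if obs.launch_status != "ok" or obs.pass_rate < target_pass_rate:
        return None
    return obs.request_rate / obs.parallel_size


tp4 = Observation(
    config={"tensor-parallel-size": 4, "data-parallel-size": 1},
    probe_history=[],
    pass_rate=0.9718,
    slo_failure_profile={},
    request_rate=1.7716,  # back-computed: 0.4429 req/s/GPU * 4 GPUs
    parallel_size=4,
)
print(objective(tp4))
```

Returning `None` for an infeasible or failed trial, rather than 0, keeps "no valid measurement" distinct from "a valid but poor measurement"; that is a design choice of this sketch, not something the report specifies.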

## Formal design: bottleneck classifier

The `bottleneck classifier` maps an observation to ranked bottleneck hypotheses:

```text
b_t = ranked_bottleneck(o_t)
```

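It does not ask "which knob looks commonly used". It asks "which part of the
system do the current SLO failures and latency profile show to be limiting the
objective".

Common classes include: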

| Bottleneck | Typical evidence | Preferred knob family |
| --- | --- | --- |
| `ttft_prefill` | TTFT approaches or exceeds the SLO on long prompts; prefill service time is the bottleneck | Raise TP, adjust prefill batching |
| `decode_tpot` | TPOT p95/p99 exceeds the SLO; decode token latency is the bottleneck | Adjust `max-num-seqs`, raise TP, reduce decode contention |
| `admission_queueing` | Waiting/arrival lag grows while service time alone does not necessarily degrade | Raise DP, adjust admission/concurrency knobs |
| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Back off TP, compare the DP/TP tradeoff |

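In this experiment, both harness arms ranked `ttft_prefill` as the top
bottleneck. The workload has heavy-tailed long prompts and the TTFT SLO is
tight, so per-request prefill service time is the main limit. DP-only scaling
adds replicas but cannot shorten the prefill path of a single long prompt, so it
is not the first priority.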
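To make the mapping from evidence to ranked hypotheses concrete, here is a deliberately simple rule-based sketch. The signal names, their normalization to the range 0 to 1, and the example values are all hypothetical; the real classifier's inputs and rules are not given in this document.

```python
def rank_bottlenecks(signals: dict) -> list:
    """Return (bottleneck, evidence score) pairs, strongest hypothesis first.

    Every key read from `signals` is an illustrative, normalized indicator.
    """
    launch_failed = 1.0 if signals.get("launch_failed") else 0.0
    scores = {
        # Share of requests violating the TTFT limit.
        "ttft_prefill": signals.get("ttft_violation_frac", 0.0),
        # Share of requests violating the TPOT limit.
        "decode_tpot": signals.get("tpot_violation_frac", 0.0),
        # Growth of waiting/arrival lag during the probe.
        "admission_queueing": signals.get("queue_lag_growth", 0.0),
        # KV cache pressure, or a launch failure such as OOM.
        "memory_kv": max(signals.get("kv_pressure", 0.0), launch_failed),
        # Loss of per-GPU efficiency after raising TP.
        "topology_comm": signals.get("per_gpu_efficiency_drop", 0.0),
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative signals for a TTFT-bound probe (not measured values).
ranking = rank_bottlenecks(
    {"ttft_violation_frac": 0.04, "tpot_violation_frac": 0.005, "queue_lag_growth": 0.01}
)
print(ranking[0][0])
```

The point of the sketch is the shape of the output: a ranked list rather than a single label, so that a later stage can fall back to the second hypothesis if the first one is refuted by measurement.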

## Formal design: candidate family

The `candidate family generator` produces a comparable action family from the
bottleneck and the topology constraints:

```text
A_t = candidate_knob_families(
  b_t,
  topology_constraints,
  prior_failures_t,
  incumbent_t
)
```

In this case:

- `b_t = ttft_prefill`.
- The allowed TP frontier is `TP=1 -> TP=2 -> TP=4 -> TP=8`.
- The allowed DP frontier is `DP=1,2,4,8`, but DP-only scaling does not directly
  relieve per-request prefill latency.
- EP is fixed at 1, so expert parallelism is not explored.
- No topology has failed before, so the launch risk of an adjacent TP probe is
  low.

So the harness chose:

```text
trial-0001: TP=2, DP=1
trial-0002: TP=4, DP=1
```

This does not hard-code "Qwen27B should use TP4". If the classifier had output
`admission_queueing`, the candidate family would lean toward DP or
`max-num-seqs`; if it had output `memory_kv`, it would lean toward
memory/cache/sequence knobs.
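A minimal sketch of the topology part of this generator, assuming the constraints from the experiment design (TP and DP in `{1,2,4,8}`, TP*DP in `{1,2,4,8}`). The function names, the "double along one axis" stepping rule, and the assumed `TP=1, DP=1` starting point are illustrative, not taken from AITuner.

```python
from itertools import product

SIZES = (1, 2, 4, 8)
ALLOWED_PRODUCTS = {1, 2, 4, 8}


def legal_topologies():
    """All (TP, DP) pairs that satisfy the topology constraints."""
    return [(tp, dp) for tp, dp in product(SIZES, SIZES) if tp * dp in ALLOWED_PRODUCTS]


def candidate_topologies(bottleneck, incumbent, prior_failures=frozenset()):
    """Propose the adjacent topology step suggested by the top-ranked bottleneck."""
    tp, dp = incumbent
    if bottleneck == "ttft_prefill":
        wanted = [(tp * 2, dp)]   # next point on the TP frontier
    elif bottleneck == "admission_queueing":
        wanted = [(tp, dp * 2)]   # more replicas
    else:
        wanted = []               # other classes map to runtime knobs, omitted here
    legal = set(legal_topologies())
    # Drop illegal steps and signatures that already failed.
    return [t for t in wanted if t in legal and t not in prior_failures]


print(len(legal_topologies()))
print(candidate_topologies("ttft_prefill", (1, 1)))
print(candidate_topologies("ttft_prefill", (2, 1)))
```

Starting from the assumed `TP=1, DP=1` base, two `ttft_prefill` steps give `TP=2, DP=1` and then `TP=4, DP=1`, the same sequence as the two harness trials. The same function returns a DP step for `admission_queueing`, which is the sense in which the choice is driven by the diagnosis rather than by the model name.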

## Formal design: scoring

Every candidate action is scored with the same abstraction:

```text
score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
```

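In this experiment the terms mean the following:

- `expected_bottleneck_relief`: TP2/TP4 are expected to lower long-prefill
  compute latency, which acts directly on `ttft_prefill`.
- `information_gain`: a TP frontier probe distinguishes "compute-latency relief
  is needed" from "admission/replica capacity is simply insufficient".
- `launch_safety`: TP2 and TP4 both satisfy the topology constraints and repeat
  no failed signature.
- `regression_risk`: raising TP adds communication overhead and may hurt per-GPU
  efficiency, so it must be validated with `request_rate_per_gpu`.
- `measurement_cost`: every GPU trial is expensive, so a high-information
  topology probe takes priority over several local runtime tweaks.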
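The scoring abstraction can be illustrated with an unweighted sum, exactly as the formula is written. The numeric term values below are invented purely to show the mechanics; the document does not report the actual per-term scores or any weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateScore:
    """One candidate action with the five scoring terms (illustrative values)."""
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float

    @property
    def score(self) -> float:
        # Benefits are added, risk and cost are subtracted.
        return (
            self.expected_bottleneck_relief
            + self.information_gain
            + self.launch_safety
            - self.regression_risk
            - self.measurement_cost
        )


candidates = [
    # Hypothetical term values for a ttft_prefill diagnosis.
    CandidateScore("TP=2, DP=1", 0.6, 0.5, 0.9, 0.2, 0.3),
    CandidateScore("max-num-seqs tweak", 0.1, 0.1, 0.9, 0.1, 0.3),
]
ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
for c in ranked:
    print(c.name, round(c.score, 2))
```

Both candidates cost one GPU trial here, so the topology probe wins on relief and information gain alone, which is the argument the `measurement_cost` bullet makes in prose.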

The measured results bear out this scoring:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Interpretation |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but does not yet saturate search high. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Moving further along the TP frontier relieves the prefill bottleneck further and reaches the reference best. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model follows the same mechanism path. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | The weaker model with the harness matches the stronger model with the harness. |

## Formal design: validator stop

A stop is not the LLM saying "I think this is good enough". A stop must pass the
`stop validator`:

```text
stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false
```

The stop record in this experiment is:

```text
tuning_stop_reason: harness_stop
validator_reason: search_high_saturated_by_incumbent
diagnosis: The incumbent's highest measured probe is feasible and is within the
configured binary-search resolution of search.high.
```

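This means:

1. The incumbent's highest measured probe is already feasible.
2. That feasible probe is within the binary-search tolerance of `search.high`.
3. Within the current search range and SLO constraints, spending more GPU trials
   is unlikely to raise the measured objective.
4. The validator therefore authorizes a stop and keeps the current incumbent.

This gives the harness stop discipline: it neither stops early because the LLM
is overconfident nor keeps burning budget after search high is saturated.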
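The saturation condition can be sketched as a small predicate over the incumbent's probe history. The function name, the argument layout, and the `search_high_not_saturated` reason string are hypothetical; only `search_high_saturated_by_incumbent` appears in the stop record. The example probe values are the six bisection midpoints of `[0, 0.0625]` for a config that is feasible at every probe, not values copied from a log.

```python
def validate_stop(incumbent_probes, search_high, tolerance):
    """Return (stop, reason) for the incumbent's (sampling_u, feasible) probes."""
    if incumbent_probes:
        # The highest measured probe must itself be feasible ...
        top_u, top_ok = max(incumbent_probes, key=lambda p: p[0])
        # ... and lie within the search resolution of search.high.
        if top_ok and search_high - top_u <= tolerance:
            return True, "search_high_saturated_by_incumbent"
    return False, "search_high_not_saturated"


probes = [
    (0.03125, True),
    (0.046875, True),
    (0.0546875, True),
    (0.05859375, True),
    (0.060546875, True),
    (0.0615234375, True),
]
print(validate_stop(probes, search_high=0.0625, tolerance=0.001))
```

Because the check depends only on measured probes and the configured search range, the LLM cannot trigger it by assertion; it can only propose configurations whose measurements then satisfy or fail the predicate.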

## Which knobs were actually tuned

The harness winning path changed only the topology:

```text
base config + tensor-parallel-size=4, data-parallel-size=1
```

It tuned no scheduler/cache/memory knobs on the winning path, because under a
`ttft_prefill` bottleneck the first action is to shorten per-request prefill
service time.

The naive arms went in a different direction:

| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

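The first proposal from `gpt55_naive` explicitly chose `TP=1, DP=8`, reasoning
that the model fits on a single GPU, so horizontal data parallelism should
maximize request rate while TP would add communication overhead. Later naive
proposals kept the DP-heavy topology and searched only over runtime knobs.
Across the 20 trial slots of the two naive arms, none entered the TP2/TP4
topology frontier.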

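## Why it beats the baseline

The baseline failed because it optimized the wrong causal path.

For a `ttft_prefill`-bound workload, the critical service time is the prefill
latency of a single request. A DP-heavy topology adds replicas, but each replica
still processes long prompts at TP1, so it cannot significantly shorten the
per-request prefill path. Under a tight TTFT SLO this keeps the feasible
`sampling_u` very low, and dividing by the GPU count then leaves only
`0.0231-0.0273 req/s/GPU`.

The harness optimization path is: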

```text
observed SLO pressure
-> classify as ttft_prefill
-> choose legal TP frontier probe
-> measure feasible req/s/GPU under the same SLO
-> stop only when search.high is saturated by incumbent
```

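This path is measurable and falsifiable. If TP4 had lowered latency but
`request_rate_per_gpu` had dropped noticeably, the harness would have rejected
the hypothesis. If the bottleneck were admission/queueing rather than
TTFT/prefill, the same knob-effect model would favor DP or `max-num-seqs` over
the TP frontier.

So the result is not "in the Qwen27B case our prompt steered the model into
saying TP4". The more accurate conclusion is that the harness used SLO-derived
bottleneck evidence to steer the search toward the right knob family, then
validated that direction with the per-GPU objective and the validator stop.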

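## Evidence boundary

This report strongly supports the harness mechanism on the Qwen27B tight-SLO
case, but on its own it is not proof of generality. The conclusions that hold
today are:

- In this case, the harness improved final quality, convergence speed, AUC, and
  stop discipline together.
- `gpt-5.4-mini + harness` matches `gpt-5.5 + harness` and clearly beats
  `gpt-5.5 + naive`, which indicates that the gain comes mainly from the
  harness's structured state and validator rather than from a stronger model
  alone.
- The successful path relies on general mechanisms: SLO-derived bottleneck
  classification, topology constraints, knob-effect scoring, a per-GPU
  objective, and a validator-authorized stop.
- Further validation is still needed on other bottlenecks and cases, such as
  prefill scheduler pressure, decode TPOT pressure, memory/KV pressure, and
  admission/queueing pressure.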

## Excerpt from the original aggregate report

```text
# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
```