# Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23
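
This document summarizes the aggregate report below and explains why the harness
makes tuning faster and more effective:

```text
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
```

The experiment is a 2x2 ablation that crosses tuner-model strength with whether
`use_harness` is enabled. The core question is whether the harness provides
reusable search structure, rather than an incidental gain from a stronger LLM
or a longer prompt.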

## Experiment design

Case: `qwen27b-tight-slo-2x2-aggregate`.

Experiment base:

- Served model: `qwen3.5-27b-256k-0223-internal`.
- Hardware: H20, up to 8 GPUs.
- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
  `replay_time_scale=1.0`, `max_concurrency=32`.
- SLO: pass rate >= 0.95; the TTFT step rule is 2 s for <=4096 input tokens,
  4 s for <=32768 input tokens, and 6 s for longer inputs; TPOT <= 50 ms.
- Search: binary-search probing over `sampling_u in [0, 0.0625]`, tolerance 0.001,
  max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
  `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
  `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
  `enable-chunked-prefill`.
- Topology constraints: TP and DP are each in `{1,2,4,8}`, the allowed TP*DP
  products are `{1,2,4,8}`, and EP is fixed at 1 in this case.

2x2 arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

Within one tuner model, the main difference is `use_harness`. The cross-model
comparison tests whether a weaker model with the harness can match or beat a
stronger model's naive tuning.
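The SLO rule from the experiment base can be sketched as a small feasibility check. This is a minimal sketch, not AITuner's implementation: the `RequestResult` record and its field names are hypothetical, and it assumes a request passes only when it meets both its TTFT and TPOT limits, with the pass rate being the fraction of passing requests.

```python
from dataclasses import dataclass

TPOT_LIMIT_S = 0.050      # TPOT <= 50 ms
TARGET_PASS_RATE = 0.95   # pass rate >= 0.95


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT budget in seconds, following the step rule in the SLO bullet."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


@dataclass
class RequestResult:
    """Hypothetical per-request record; field names are assumptions."""
    input_tokens: int
    ttft_s: float
    tpot_s: float


def request_passes(r: RequestResult) -> bool:
    # Assumption: a request must meet both latency limits to count as passing.
    return r.ttft_s <= ttft_limit_s(r.input_tokens) and r.tpot_s <= TPOT_LIMIT_S


def probe_is_feasible(results: list[RequestResult]) -> bool:
    """A probe is feasible when enough requests meet their latency limits."""
    if not results:
        return False
    passed = sum(request_passes(r) for r in results)
    return passed / len(results) >= TARGET_PASS_RATE
```

Because the trace is filtered to 0-8192 input tokens, only the 2 s and 4 s tiers of the step rule can apply in this case.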

## Aggregate result

Reference best: `0.4429 req/s/GPU`.
Convergence target: 95% of the reference, i.e. `0.4208 req/s/GPU`.

| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

All harness-vs-naive checks pass:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |

The key ablation signal: `gpt-5.4-mini + harness` and `gpt-5.5 + harness` reach
the same final throughput, and both reach the target in 2 trials, whereas both
naive arms, after using all 10 trials, remain more than 16x below the harness arms.
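The derived columns can be re-checked from the rounded table values. The sketch below only redoes the arithmetic; the dictionary layout and variable names are illustrative, and the inputs are the four-decimal values printed in the tables, not the raw measurements.

```python
REFERENCE_BEST = 0.4429   # req/s/GPU, from the report
TARGET_FRACTION = 0.95

# Rounded values copied from the aggregate table.
final = {
    "gpt55_harness": 0.4429,
    "gpt55_naive": 0.0273,
    "gpt54mini_harness": 0.4429,
    "gpt54mini_naive": 0.0231,
}
auc = {
    "gpt55_harness": 0.9484,
    "gpt55_naive": 0.0588,
    "gpt54mini_harness": 0.9450,
    "gpt54mini_naive": 0.0498,
}

# Convergence target and Final/ref column.
target = TARGET_FRACTION * REFERENCE_BEST
final_over_ref = {arm: value / REFERENCE_BEST for arm, value in final.items()}

# Harness-vs-naive ratios use the best naive arm as the denominator.
best_naive_final = max(v for arm, v in final.items() if arm.endswith("_naive"))
best_naive_auc = max(v for arm, v in auc.items() if arm.endswith("_naive"))
final_ratio = {arm: final[arm] / best_naive_final
               for arm in final if arm.endswith("_harness")}
auc_ratio = {arm: auc[arm] / best_naive_auc
             for arm in auc if arm.endswith("_harness")}

print(round(target, 4), final_ratio, auc_ratio)
```

From the rounded inputs the final-throughput ratio comes out near 16.22 and the AUC ratios near 16.13 and 16.07. The report's 16.2290, 16.1296, and 16.0720 differ in the later digits, presumably because the report divides unrounded values.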

## Agent loop flowchart

Below is an abstract flow of the current harness-based agent loop. The LLM can
still take part in proposals, but what it receives is not a raw text history;
it gets structured observations, a bottleneck diagnosis, candidate actions, and
validator constraints. The validator can also authorize a stop, or block
repeated failures and illegal configs.

```mermaid
flowchart TD
    A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
    B --> C[Binary-search probes over sampling_u]
    C --> D[Build observation o_t]
    D --> E[Bottleneck classifier]
    E --> F[Candidate family generator]
    F --> G[Score candidate actions]
    G --> H[Prompt renderer / planner]
    H --> I[LLM or deterministic harness proposal]
    I --> J{Config validator}
    J -- invalid, repeated, unsafe --> F
    J -- valid config_patch --> B
    G --> K{Stop validator}
    K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
    K -- useful candidates remain --> H
```

In this loop, the harness's role is not to polish the prompt; it turns tuning
into a measurement-constrained decision process:

```text
measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
```
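The `Config validator` node in the flowchart rejects invalid, repeated, or unsafe proposals. A minimal sketch of such a gate follows. It is an illustration, not AITuner's code: the function names, the reason strings, the default of 1 for a missing parallel size, and the choice to treat "unsafe" as "matches a previously failed config" are all assumptions.

```python
ALLOWED_SIZES = {1, 2, 4, 8}  # allowed TP, DP, and TP*DP values in this case


def config_signature(patch: dict) -> tuple:
    """Order-independent identity for a config patch."""
    return tuple(sorted(patch.items()))


def validate_patch(patch: dict, tried: set, failed: set) -> tuple[bool, str]:
    """Return (accepted, reason) for a proposed flag patch."""
    tp = patch.get("tensor-parallel-size", 1)  # assumed default
    dp = patch.get("data-parallel-size", 1)    # assumed default
    if tp not in ALLOWED_SIZES or dp not in ALLOWED_SIZES:
        return False, "invalid_topology"
    if tp * dp not in ALLOWED_SIZES:
        return False, "invalid_topology"
    signature = config_signature(patch)
    if signature in failed:
        return False, "repeats_failed_config"
    if signature in tried:
        return False, "repeated_config"
    return True, "ok"


# Usage: the second proposal repeats the first and is sent back.
tried: set = set()
failed: set = set()
first = {"tensor-parallel-size": 2, "data-parallel-size": 1}
accepted, reason = validate_patch(first, tried, failed)
if accepted:
    tried.add(config_signature(first))
print(validate_patch(first, tried, failed))
```

A rejected proposal goes back to the candidate generator instead of consuming a GPU trial, which is what the `J -- invalid, repeated, unsafe --> F` edge expresses.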

## Formal design: observation

After each trial, AITuner does not just record a natural-language summary; it
builds a structured observation:

```text
o_t = (
  config_t,
  probe_history_t,
  pass_rate_t,
  latency/SLO_failure_profile_t,
  request_rate_t,
  parallel_size_t,
  launch_status_t,
  prior_failures_t,
  incumbent_t
)
```

The most important observation fields in this experiment are:

- `config_t`: the current trial's `flag_patch` and `env_patch`, e.g. `TP=2, DP=1`.
- `probe_history_t`: the feasible/infeasible results of binary-search probes at
  different `sampling_u` values.
- `pass_rate_t`: whether the target pass rate of 0.95 is met.
- `latency/SLO_failure_profile_t`: whether TTFT or TPOT triggers SLO pressure first.
- `request_rate_t`: the request rate the current config sustains under the SLO.
- `parallel_size_t`: the parallel size the config actually uses, used to normalize
  the per-GPU objective.
- `prior_failures_t`: which earlier configs hit a launch failure or had no feasible
  point, to avoid repeating failed attempts.
- `incumbent_t`: the current best config and its `request_rate_per_gpu`.

The objective is:

```text
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95
```

In other words, the harness optimizes `req/s/GPU` subject to the SLO, not raw
throughput, and not whichever config the LLM subjectively considers "stronger".
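The observation tuple and the objective can be sketched as a dataclass plus one function. This is a minimal sketch under stated assumptions: the field types, the `"ok"` launch status string, and returning `None` for an infeasible trial are illustrative choices, and the example request rate is back-computed from the reported 0.4429 req/s/GPU assuming `parallel_size = TP * DP = 4`; it is not a measured value from the report.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """Structured o_t; field types are assumptions, names follow the tuple."""
    config: dict                      # flag_patch / env_patch
    probe_history: list               # [(sampling_u, feasible), ...]
    pass_rate: float
    slo_failure_profile: dict         # e.g. which of TTFT/TPOT fails first
    request_rate: float               # req/s sustained under the SLO
    parallel_size: int                # GPUs used, assumed TP * DP here
    launch_status: str                # "ok" is a hypothetical success value
    prior_failures: list = field(default_factory=list)
    incumbent: Optional[dict] = None


def objective(obs: Observation, target_pass_rate: float = 0.95) -> Optional[float]:
    """J = request_rate / parallel_size, defined only when the SLO is met."""
    if obs.launch_status != "ok" or obs.pass_rate < target_pass_rate:
        return None
    return obs.request_rate / obs.parallel_size


# Illustrative observation for the TP=4, DP=1 trial.
obs = Observation(
    config={"tensor-parallel-size": 4, "data-parallel-size": 1},
    probe_history=[],
    pass_rate=0.9718,
    slo_failure_profile={},
    request_rate=1.7716,              # back-computed: 0.4429 * 4
    parallel_size=4,
    launch_status="ok",
)
print(objective(obs))
```

Normalizing by `parallel_size` is what makes a 4-GPU and an 8-GPU config comparable: a config that uses more GPUs has to raise the sustained request rate proportionally to score the same.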

## Formal design: bottleneck classifier

The `bottleneck classifier` maps an observation to ranked bottleneck hypotheses:

```text
b_t = ranked_bottleneck(o_t)
```

It does not ask "which knob looks commonly used"; it asks "which part of the
system do the current SLO failures and latency profile show to be limiting the
objective".

Common classes include:

| Bottleneck | Typical evidence | Preferred knob family |
| --- | --- | --- |
| `ttft_prefill` | TTFT near or above the SLO on long prompts; prefill service time is the bottleneck | Raise TP, adjust prefill batching |
| `decode_tpot` | TPOT p95/p99 exceeds the SLO; decode token latency is the bottleneck | Adjust `max-num-seqs`, raise TP, reduce decode contention |
| `admission_queueing` | Waiting/arrival lag grows while service time itself is not necessarily worse | Raise DP, adjust admission/concurrency knobs |
| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Back off TP, compare the DP/TP tradeoff |

In this experiment, both harness arms identified the ranked bottleneck as
`ttft_prefill`. The workload has heavy-tailed long prompts and the TTFT SLO is
tight, so the prefill service time of a single request is the main limit.
DP-only adds replicas but cannot shorten the prefill path of one long prompt,
so it is not the first priority.
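The mapping from evidence to a ranked hypothesis list and then to a knob family can be sketched as follows. The knob families are transcribed from the table above; everything else is hypothetical. In particular, the evidence format (one strength value in [0, 1] per bottleneck) and the sort-by-strength rule are assumptions for illustration, not the classifier AITuner uses.

```python
# Preferred knob families, transcribed from the table above.
KNOB_FAMILIES = {
    "ttft_prefill": ["raise TP", "adjust prefill batching"],
    "decode_tpot": ["adjust max-num-seqs", "raise TP", "reduce decode contention"],
    "admission_queueing": ["raise DP", "adjust admission/concurrency knobs"],
    "memory_kv": ["adjust gpu-memory-utilization", "adjust block-size",
                  "adjust sequence/token caps"],
    "topology_comm": ["back off TP", "compare DP/TP tradeoff"],
}


def rank_bottlenecks(evidence: dict[str, float]) -> list[str]:
    """Rank hypotheses by evidence strength, dropping those with no signal.

    `evidence` maps a bottleneck name to a hypothetical strength in [0, 1].
    """
    ranked = sorted(KNOB_FAMILIES, key=lambda name: evidence.get(name, 0.0),
                    reverse=True)
    return [name for name in ranked if evidence.get(name, 0.0) > 0.0]


# Illustrative evidence: strong TTFT pressure, mild queueing signal.
evidence = {"ttft_prefill": 0.8, "admission_queueing": 0.3}
ranking = rank_bottlenecks(evidence)
top_family = KNOB_FAMILIES[ranking[0]]
print(ranking, top_family)
```

Keeping the ranking separate from the knob lookup means a different diagnosis reuses the same table and simply lands on a different row.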

## Formal design: candidate family

The `candidate family generator` produces comparable action families from the
bottleneck and the topology constraints:

```text
A_t = candidate_knob_families(
  b_t,
  topology_constraints,
  prior_failures_t,
  incumbent_t
)
```

In this case:

- `b_t = ttft_prefill`.
- The allowed TP frontier is `TP=1 -> TP=2 -> TP=4 -> TP=8`.
- The allowed DP frontier is `DP=1,2,4,8`, but DP-only does not directly relieve
  single-request prefill latency.
- EP is fixed at 1, so expert parallelism is not explored.
- No topology has failed before, so the launch risk of an adjacent TP probe is low.

So the harness chose:

```text
trial-0001: TP=2, DP=1
trial-0002: TP=4, DP=1
```

This does not hard-code "Qwen27B should use TP4". If the classifier had output
`admission_queueing`, the candidate family would lean toward DP or
`max-num-seqs`; if it had output `memory_kv`, it would lean toward
memory/cache/sequence knobs.
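The topology constraints and the adjacent-TP step can be sketched as a small enumeration. This is an illustration under assumptions: the function names are hypothetical, and starting the frontier from `TP=1` is inferred from the frontier listed above rather than stated as the base config.

```python
SIZES = (1, 2, 4, 8)  # allowed values for TP, DP, and the TP*DP product


def legal_topologies() -> list[tuple[int, int]]:
    """All (TP, DP) pairs whose product is also an allowed size."""
    return [(tp, dp) for tp in SIZES for dp in SIZES if tp * dp in SIZES]


def next_tp_probe(current_tp: int, dp: int = 1,
                  failed: frozenset = frozenset()):
    """Next legal step up the TP frontier at fixed DP, skipping failed ones."""
    legal = set(legal_topologies())
    for tp in SIZES:
        candidate = (tp, dp)
        if tp > current_tp and candidate in legal and candidate not in failed:
            return candidate
    return None


# Walking the frontier from an assumed TP=1 start reproduces the two trials.
first = next_tp_probe(1)
second = next_tp_probe(first[0])
print(legal_topologies(), first, second)
```

With these constraints there are ten legal `(TP, DP)` pairs, so stepping along one frontier at a time is what keeps the trial count small.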

## Formal design: scoring

Every candidate action is scored with the same abstraction:

```text
score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)
```

In this experiment the terms mean:

- `expected_bottleneck_relief`: TP2/TP4 are expected to lower long-prefill compute
  latency, acting directly on `ttft_prefill`.
- `information_gain`: a TP frontier probe can distinguish "needs compute-latency
  relief" from "just not enough admission/replicas".
- `launch_safety`: TP2 and TP4 both satisfy the topology constraints and repeat no
  failed signature.
- `regression_risk`: raising TP adds communication overhead and may hurt per-GPU
  efficiency, so it must be verified with `request_rate_per_gpu`.
- `measurement_cost`: each GPU trial is expensive, so a high-information topology
  probe takes priority over several local runtime tweaks.

The measured results bear out this scoring:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Explanation |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but has not yet saturated search high. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | The TP frontier further relieves the prefill bottleneck and reaches the reference best. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model also picks the same mechanism path. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Weaker model plus harness matches stronger model plus harness. |
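The scoring abstraction can be written down as a small data structure. The sketch below only encodes the signed sum from the formula; the numeric term values are invented purely to show how a topology probe could outrank a local runtime tweak, and they are not values produced by the harness.

```python
from dataclasses import dataclass


@dataclass
class CandidateScore:
    """One candidate action with the five terms of score(a)."""
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float

    @property
    def total(self) -> float:
        # Benefits are added, risks and costs are subtracted.
        return (self.expected_bottleneck_relief
                + self.information_gain
                + self.launch_safety
                - self.regression_risk
                - self.measurement_cost)


# Illustrative numbers only (assumptions, not measured or reported values).
candidates = [
    CandidateScore("TP frontier probe", 0.5, 0.25, 0.25, 0.125, 0.25),
    CandidateScore("runtime knob tweak", 0.125, 0.125, 0.25, 0.125, 0.25),
]
best = max(candidates, key=lambda c: c.total)
print(best.name, best.total)
```

Both candidates cost one GPU trial here, so the ranking is decided by relief and information gain, which matches the argument in the `measurement_cost` bullet.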

## Formal design: validator stop

Stopping is not the LLM saying "I think this is about good enough". A stop must
pass the `stop validator`:

```text
stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false
```

The stop recorded in this experiment is:

```text
tuning_stop_reason: harness_stop
validator_reason: search_high_saturated_by_incumbent
diagnosis: The incumbent's highest measured probe is feasible and is within the
configured binary-search resolution of search.high.
```

This means:

1. The incumbent's highest measured probe is already feasible.
2. That feasible probe is within the binary-search tolerance of `search.high`.
3. Within the current search range and SLO constraints, spending more GPU trials
   is unlikely to raise the measured objective.
4. The validator therefore authorizes the stop and keeps the current incumbent.

This gives the harness stop discipline: it neither stops early because the LLM
is overconfident, nor keeps burning budget after search high is saturated.
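The `search_high_saturated_by_incumbent` condition can be sketched as a predicate over the incumbent's probe history. This is a minimal sketch, assuming the history is a list of `(sampling_u, feasible)` pairs and that the "configured binary-search resolution" equals the tolerance of 0.001 from the experiment design; the function name is hypothetical.

```python
def search_high_saturated(probes: list[tuple[float, bool]],
                          search_high: float = 0.0625,
                          tolerance: float = 0.001) -> bool:
    """True when the highest measured probe is feasible and close to search.high."""
    if not probes:
        return False
    # Highest measured probe, whether or not it was feasible.
    highest_u, highest_feasible = max(probes, key=lambda p: p[0])
    if not highest_feasible:
        return False
    return search_high - highest_u <= tolerance


# Feasible at the top of the range: the validator may authorize a stop.
print(search_high_saturated([(0.03125, True), (0.0625, True)]))
# Feasible only at mid-range: there is still headroom below search.high.
print(search_high_saturated([(0.03125, True)]))
```

The predicate says nothing about whether a larger `search.high` would reveal more headroom; it only states that, within the configured range, another trial cannot raise the measured objective.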

## Which knobs were actually tuned

The harness winning path changed only the topology:

```text
base config + tensor-parallel-size=4, data-parallel-size=1
```

It did not tune scheduler/cache/memory knobs on the winning path because, under
a `ttft_prefill` bottleneck, the first move is to shorten single-request prefill
service time.

The naive arms went in a different direction:

| Arm | Topology used by all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
the model fits on a single GPU, so horizontal data parallelism should maximize
request rate, while TP would add communication overhead. After that, the naive
proposals kept the DP-heavy topology and searched only over runtime knobs.
Across both naive arms, none of the 20 trial slots entered the TP2/TP4 topology
frontier.

## Why it beats the baseline

The baseline failed because it optimized the wrong causal path.

For a `ttft_prefill`-bound workload, the critical service time is the prefill
latency of a single request. A DP-heavy topology adds replicas, but each replica
still processes long prompts at TP1, so it cannot significantly shorten the
single-request prefill path. Under a tight TTFT SLO this leaves the feasible
`sampling_u` very low, and after dividing by the GPU count the result is only
`0.02-0.027 req/s/GPU`.

The harness's optimization path is:

```text
observed SLO pressure
  -> classify as ttft_prefill
  -> choose legal TP frontier probe
  -> measure feasible req/s/GPU under the same SLO
  -> stop only when search.high is saturated by incumbent
```

This path is measurable and falsifiable. If TP4 lowered latency but
`request_rate_per_gpu` dropped noticeably, the harness would reject the
hypothesis. If the bottleneck were admission/queueing rather than TTFT/prefill,
the same knob-effect model would favor DP or `max-num-seqs` instead of the TP
frontier.

So the result is not "in the Qwen27B case, our prompt induced the model to say
TP4". A more accurate conclusion: the harness used SLO-derived bottleneck
evidence to steer the search toward the right knob family, then verified that
direction with the per-GPU objective and the validator stop.
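The "measurable and falsifiable" step can be sketched as an incumbent update that accepts a probe only when it is SLO-feasible and improves `req/s/GPU`. The function name and the trial record layout are hypothetical; the two trial records use the `gpt55_harness` values from the scoring table, and the TP8 record in the usage note is an invented counterexample, not a measured result.

```python
from typing import Optional


def update_incumbent(incumbent: Optional[dict], trial: dict,
                     target_pass_rate: float = 0.95) -> Optional[dict]:
    """Keep the trial only if it meets the SLO and improves req/s/GPU."""
    if trial["pass_rate"] < target_pass_rate:
        return incumbent
    if incumbent is None or trial["req_s_gpu"] > incumbent["req_s_gpu"]:
        return trial
    return incumbent


# gpt55_harness trials, values taken from the scoring table.
trials = [
    {"patch": "TP=2, DP=1", "req_s_gpu": 0.2142, "pass_rate": 0.9572},
    {"patch": "TP=4, DP=1", "req_s_gpu": 0.4429, "pass_rate": 0.9718},
]

incumbent = None
for trial in trials:
    incumbent = update_incumbent(incumbent, trial)
print(incumbent["patch"])
```

If a later probe, say a hypothetical `TP=8, DP=1` trial with a lower `req_s_gpu`, came back feasible but less efficient per GPU, the same function would leave the TP4 incumbent in place, which is the rejection described above.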

## Evidence boundary

This report strongly supports the harness mechanism on the Qwen27B tight-SLO
case, but it cannot stand alone as proof of generality. The conclusions that
currently hold are:

- In this case, the harness improved final quality, convergence speed, AUC, and
  stop discipline at the same time.
- `gpt-5.4-mini + harness` matches `gpt-5.5 + harness` and significantly beats
  `gpt-5.5 + naive`, which indicates that the gain comes mainly from the
  harness's structured state and validator rather than merely from a stronger
  model.
- The successful path uses general mechanisms: SLO-derived bottleneck
  classification, topology constraints, knob-effect scoring, the per-GPU
  objective, and validator-authorized stop.
- Further validation is still needed on other bottlenecks/cases, such as prefill
  scheduler pressure, decode TPOT pressure, memory/KV pressure, and
  admission/queueing pressure.

## Raw aggregate report excerpt

```text
# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
```