Localize Qwen27B harness ablation doc

2026-06-23 18:14:35 +08:00
parent 76ec19224c
commit 861d754f29


@@ -1,37 +1,38 @@
# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
# Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23
This note organizes the aggregate report generated at:
This note organizes the aggregate report below and explains why the harness makes tuning faster and more effective:
```text
.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
```
The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
It asks whether the harness supplies reusable search structure beyond a stronger
LLM's free-form tuning proposals.
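The experiment is a 2x2 ablation: model strength crossed with whether `use_harness` is enabled.
The core question is whether the harness supplies reusable search structure, rather than
an incidental gain from a stronger LLM or a longer prompt.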
## Experiment Design
## Experiment design
Case: `qwen27b-tight-slo-2x2-aggregate`.
Case: `qwen27b-tight-slo-2x2-aggregate`
Substrate:
Experiment substrate:
- Model served: `qwen3.5-27b-256k-0223-internal`.
- Hardware: H20, up to 8 GPUs.
- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
`replay_time_scale=1.0`, `max_concurrency=32`.
- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
- Served model: `qwen3.5-27b-256k-0223-internal`
- Hardware: H20, up to 8 GPUs
- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
`replay_time_scale=1.0`, `max_concurrency=32`
- SLO: pass rate >= 0.95; TTFT step rule of 2s for <=4096 input tokens,
4s for <=32768 input tokens, and 6s for longer inputs; TPOT <= 50 ms
- Search: binary-search probing over `sampling_u in [0, 0.0625]`, tolerance 0.001,
max 6 probes.
- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`
- Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
`expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
`max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
`enable-chunked-prefill`.
- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
`{1,2,4,8}`, EP fixed to 1 for this case.
`enable-chunked-prefill`
- Topology constraints: TP and DP both in `{1,2,4,8}`, allowed TP*DP products in
`{1,2,4,8}`, EP fixed to 1 for this case.
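The SLO step rule and the topology constraints listed above can be written down compactly. The following is a minimal sketch, not AITuner's actual code: the function names are hypothetical, and treating the TTFT thresholds as inclusive (`<=`) on latency is an assumption.

```python
from itertools import product

ALLOWED_SIZES = (1, 2, 4, 8)      # legal TP and DP values for this case
ALLOWED_PRODUCTS = {1, 2, 4, 8}   # legal TP*DP products (at most 8 GPUs)
TPOT_LIMIT_MS = 50.0              # TPOT <= 50 ms


def ttft_limit_s(input_tokens: int) -> float:
    """TTFT step rule: 2s up to 4096 tokens, 4s up to 32768, 6s above."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0


def request_meets_slo(input_tokens: int, ttft_s: float, tpot_ms: float) -> bool:
    """Hypothetical per-request check; inclusive comparison is assumed."""
    return ttft_s <= ttft_limit_s(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def legal_topologies() -> list[tuple[int, int]]:
    """Enumerate (TP, DP) pairs whose product is an allowed GPU count."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, repeat=2)
        if tp * dp in ALLOWED_PRODUCTS
    ]
```

Under these constraints there are ten legal `(TP, DP)` pairs, so the topology space is small enough that each probe can be chosen deliberately.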
Arms:
2x2 arms:
| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
@@ -40,15 +41,13 @@ Arms:
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
The only intended axis inside each model pair is `use_harness`. The aggregate
then compares whether the weaker model plus harness can match or exceed the
stronger model without harness.
Within a single tuner model, the main difference is `use_harness`. The cross-model
comparison asks whether a weaker model plus harness can match or exceed the stronger
model's naive tuning.
## Aggregate Result
## Aggregate result
Reference best: `0.4429 req/s/GPU`.
Target threshold for convergence comparisons: 95% of reference, or
`0.4208 req/s/GPU`.
Reference best: `0.4429 req/s/GPU`
Convergence target: 95% of reference, i.e. `0.4208 req/s/GPU`
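As a quick arithmetic check of the threshold and of the `Final/ref` column, here is a small sketch using only numbers quoted in this report (the helper names are hypothetical):

```python
REFERENCE_BEST = 0.4429              # req/s/GPU, from the aggregate report
TARGET = 0.95 * REFERENCE_BEST       # convergence threshold


def final_over_ref(final_req_s_gpu: float) -> float:
    """Ratio of an arm's final per-GPU rate to the reference best."""
    return final_req_s_gpu / REFERENCE_BEST


def reached_target(final_req_s_gpu: float) -> bool:
    return final_req_s_gpu >= TARGET


print(round(TARGET, 4))                  # 0.4208
print(round(final_over_ref(0.0231), 4))  # 0.0522
```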
| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
@@ -57,42 +56,154 @@ Target threshold for convergence comparisons: 95% of reference, or
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
Harness wins both harness-vs-naive checks:
All harness-vs-naive checks pass:
| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |
The strongest ablation observation is that `gpt-5.4-mini + harness` matches
`gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
while both naive arms remain more than 16x below the harness arms by final
per-GPU throughput and AUC.
The key ablation signal is that `gpt-5.4-mini + harness` and
`gpt-5.5 + harness` reach the same final throughput, both hitting the target in 2 trials,
while the two naive arms remain more than 16x below the harness arms after using all 10 trials.
## What The Harness Actually Did
## Agent loop flowchart
The harness did not perform generic "better prompting". It inserted a measured,
structured decision protocol between trial results and the next proposal.
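Below is an abstract flow of the current harnessed agent loop. The LLM can still take part
in proposals, but it no longer receives a raw-text history; it receives structured
observations, a bottleneck diagnosis, candidate actions, and validator constraints. The
validator can authorize a stop, and it can also block repeated failures or illegal configs.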
Formally, after each trial `t`, AITuner observes:
```text
o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
request_rate_t, parallel_size_t, launch status_t)
```mermaid
flowchart TD
A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
B --> C[Binary-search probes over sampling_u]
C --> D[Build observation o_t]
D --> E[Bottleneck classifier]
E --> F[Candidate family generator]
F --> G[Score candidate actions]
G --> H[Prompt renderer / planner]
H --> I[LLM or deterministic harness proposal]
I --> J{Config validator}
J -- invalid, repeated, unsafe --> F
J -- valid config_patch --> B
G --> K{Stop validator}
K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
K -- useful candidates remain --> H
```
and optimizes:
In this loop, the harness's role is not to make the prompt prettier; it turns tuning into
a measurement-constrained decision process:
```text
measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
```
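The measurement step of this loop is a binary search over `sampling_u`. The sketch below shows one plausible form of it, assuming feasibility is monotone in `sampling_u` and that probes bisect from the midpoint; the report does not state the actual probe order, and `probe_sampling_u` and `is_feasible` are hypothetical names.

```python
from typing import Callable, Optional


def probe_sampling_u(
    is_feasible: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.0625,
    tolerance: float = 0.001,
    max_probes: int = 6,
) -> tuple[Optional[float], list[tuple[float, bool]]]:
    """Bisect [low, high] for the largest sampling_u that meets the SLO.

    Returns the best feasible sampling_u (or None) and the probe history.
    """
    best: Optional[float] = None
    history: list[tuple[float, bool]] = []
    while len(history) < max_probes and (high - low) > tolerance:
        mid = (low + high) / 2
        ok = is_feasible(mid)      # one replay of the trace at this rate
        history.append((mid, ok))
        if ok:
            best, low = mid, mid   # feasible: search higher
        else:
            high = mid             # infeasible: search lower
    return best, history
```

With the configured range, six bisections shrink the interval to `0.0625 / 64`, which is just under the 0.001 tolerance, so the probe budget and the tolerance are consistent with each other under this assumed strategy.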
## Formal design: observation
After each trial, AITuner does not just record a natural-language summary; it builds a structured observation:
```text
o_t = (
config_t,
probe_history_t,
pass_rate_t,
latency/SLO_failure_profile_t,
request_rate_t,
parallel_size_t,
launch_status_t,
prior_failures_t,
incumbent_t
)
```
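The most important observation fields in this experiment are:
- `config_t`: the current trial's `flag_patch` and `env_patch`, for example `TP=2, DP=1`.
- `probe_history_t`: the feasible/infeasible results obtained by binary-search probing at
different `sampling_u` values.
- `pass_rate_t`: whether the target pass rate of 0.95 is met.
- `latency/SLO_failure_profile_t`: whether TTFT or TPOT triggers SLO pressure first.
- `request_rate_t`: the request rate the current config can sustain under the SLO.
- `parallel_size_t`: the parallel size the config actually uses, needed to normalize the per-GPU objective.
- `prior_failures_t`: which earlier configs failed to launch or had no feasible point, to avoid repeating failed attempts.
- `incumbent_t`: the current best config and its `request_rate_per_gpu`.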
The objective function is:
```text
J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95.
subject to pass_rate_t >= 0.95
```
The harness maps the observation into:
In other words, the harness optimizes `req/s/GPU` subject to the SLO, not raw throughput,
and not whichever config the LLM subjectively considers "stronger".
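The constrained objective can be sketched as a small function. This is illustrative only: the function name is hypothetical, and returning `None` for an SLO-violating config is one possible convention, not necessarily what AITuner does.

```python
from typing import Optional

TARGET_PASS_RATE = 0.95


def per_gpu_objective(
    request_rate: float, parallel_size: int, pass_rate: float
) -> Optional[float]:
    """J = request_rate / parallel_size, defined only when the SLO pass rate holds."""
    if parallel_size <= 0:
        raise ValueError("parallel_size must be positive")
    if pass_rate < TARGET_PASS_RATE:
        return None  # constraint violated: the config has no valid objective
    return request_rate / parallel_size
```

For example, if `parallel_size` is taken to be TP*DP (an assumption), a `TP=4, DP=1` config sustaining about 1.77 req/s at a 0.9718 pass rate scores about 0.44 req/s/GPU, whereas any config with a pass rate below 0.95 scores nothing regardless of its raw throughput.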
## Formal design: bottleneck classifier
The `bottleneck classifier` maps the observation to ranked bottleneck hypotheses:
```text
b_t = ranked_bottleneck(o_t)
A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
```
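It does not judge "which knob looks commonly used"; it judges "which part of the system the
current SLO failures and latency profile show to be limiting the objective".
Common classes include: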
| Bottleneck | Typical evidence | Preferred knob family |
| --- | --- | --- |
| `ttft_prefill` | TTFT approaches or exceeds the SLO on long prompts; prefill service time is the bottleneck | Raise TP, adjust prefill batching |
| `decode_tpot` | TPOT p95/p99 exceeds the SLO; decode token latency is the bottleneck | Adjust `max-num-seqs`, raise TP, reduce decode contention |
| `admission_queueing` | Waiting/arrival lag grows while service time alone is not necessarily worse | Raise DP, adjust admission/concurrency knobs |
| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Back off TP, compare the DP/TP tradeoff |
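In this experiment, both harness arms identified the top-ranked bottleneck as
`ttft_prefill`. The workload has heavy-tailed long prompts and the TTFT SLO is tight,
so single-request prefill service time is the main limiter. DP-only scaling only adds replicas;
it cannot shorten one long prompt's prefill path, so it is not the first priority.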
## Formal design: candidate family
The `candidate family generator` produces comparable action families from the bottleneck
and the topology constraints:
```text
A_t = candidate_knob_families(
b_t,
topology_constraints,
prior_failures_t,
incumbent_t
)
```
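In this case:
- `b_t = ttft_prefill`.
- The allowed TP frontier is `TP=1 -> TP=2 -> TP=4 -> TP=8`.
- The allowed DP frontier is `DP=1,2,4,8`, but DP-only does not directly relieve single-request
prefill latency.
- EP is fixed to 1, so expert parallelism is not explored.
- There is no previously failed topology, so the launch risk of an adjacent TP probe is low.
So the harness chose: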
```text
trial-0001: TP=2, DP=1
trial-0002: TP=4, DP=1
```
This does not hard-code "Qwen27B should use TP4". If the classifier had output
`admission_queueing`, the candidate family would lean toward DP or `max-num-seqs`; if it had
output `memory_kv`, it would lean toward memory/cache/sequence knobs.
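To make the family selection concrete, here is a rough sketch of how a bottleneck label could select a knob family and how an adjacent TP frontier probe could be picked. The family names in `PREFERRED_FAMILY` paraphrase the bottleneck table, and `next_tp_probe` is a hypothetical helper rather than the harness's real interface.

```python
from typing import Optional

TP_FRONTIER = (1, 2, 4, 8)
ALLOWED_PRODUCTS = {1, 2, 4, 8}

# Hypothetical family labels, one per bottleneck class in the table.
PREFERRED_FAMILY = {
    "ttft_prefill": "tp_frontier",
    "decode_tpot": "decode_concurrency",
    "admission_queueing": "dp_or_admission",
    "memory_kv": "memory_cache_sequence",
    "topology_comm": "tp_backoff",
}


def next_tp_probe(
    current_tp: int, dp: int, failed: set[tuple[int, int]]
) -> Optional[int]:
    """Return the nearest legal TP above current_tp, skipping failed topologies."""
    for tp in TP_FRONTIER:
        if tp <= current_tp:
            continue
        if tp * dp not in ALLOWED_PRODUCTS:
            return None  # any larger TP would exceed the GPU budget too
        if (tp, dp) in failed:
            continue     # do not repeat a known launch failure
        return tp
    return None
```

Starting from `TP=1, DP=1` with no prior failures, this yields `TP=2` and then `TP=4`, which matches the two-trial trajectory reported for the harness arms.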
## Formal design: scoring
Each candidate action is scored under the same abstraction:
```text
score(a) = expected_bottleneck_relief(a)
+ information_gain(a)
+ launch_safety(a)
@@ -100,118 +211,122 @@ score(a) = expected_bottleneck_relief(a)
- measurement_cost(a)
```
For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
prompts and a tight TTFT SLO made single-request prefill service time the
active limiter. Under that bottleneck, the high-value candidate family is a
legal TP frontier probe, because increasing TP can reduce prefill compute
latency for one request. DP-only scaling adds replicas but does not shorten the
single-request prefill path, so it can improve aggregate admission while still
failing the per-request TTFT bottleneck and the per-GPU objective.
In this experiment the terms mean:
The actual harness trajectory was:
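- `expected_bottleneck_relief`: TP2/TP4 are expected to lower long-prefill compute latency,
acting directly on `ttft_prefill`.
- `information_gain`: a TP frontier probe can distinguish "compute-latency relief is needed"
from "admission/replica capacity is simply insufficient".
- `launch_safety`: TP2 and TP4 both satisfy the topology constraints, and neither repeats a failed signature.
- `regression_risk`: raising TP adds communication overhead and may hurt per-GPU efficiency, so it must
be verified with `request_rate_per_gpu`.
- `measurement_cost`: each GPU trial is expensive, so a high-information topology probe takes priority over
several local runtime tweaks.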
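The additive score can be sketched as a small data class. Two things here are assumptions: `regression_risk` is subtracted (its sign is not visible in this diff), and every numeric value is made up purely to show the ranking mechanics, not taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateScore:
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float     # assumed to be a penalty term
    measurement_cost: float

    def total(self) -> float:
        return (
            self.expected_bottleneck_relief
            + self.information_gain
            + self.launch_safety
            - self.regression_risk
            - self.measurement_cost
        )


def rank(candidates: list[CandidateScore]) -> list[CandidateScore]:
    """Order candidates from highest to lowest total score."""
    return sorted(candidates, key=lambda c: c.total(), reverse=True)


# Illustrative values only.
tp_probe = CandidateScore("TP=2, DP=1", 0.8, 0.6, 0.9, 0.3, 0.4)
runtime_tweak = CandidateScore("max-num-seqs tweak", 0.2, 0.2, 0.9, 0.1, 0.4)
ranked = rank([runtime_tweak, tp_probe])
```

With these placeholder numbers the topology probe ranks first because it carries both more expected relief and more information for the same trial cost, which is the qualitative argument the bullets make.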
| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
The actual results validated this scoring:
| Arm | Trial | Patch | req/s/GPU | Pass rate | Explanation |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but does not yet saturate search high. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | The TP frontier further relieves the prefill bottleneck and reaches the reference best. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model also chooses the same mechanism path. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | The weaker model plus harness matches the stronger model plus harness. |
The stop was also harness-mediated. Both harness arms stopped after trial 2
because the validator authorized `harness_stop` with:
## Formal design: validator stop
Stop is not the LLM saying "I think this is about good enough". A stop must pass the `stop validator`:
```text
search_high_saturated_by_incumbent
stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false
```
The recorded stop diagnosis was:
The stop record in this experiment is:
```text
The incumbent's highest measured probe is feasible and is within the configured
binary-search resolution of search.high.
tuning_stop_reason: harness_stop
validator_reason: search_high_saturated_by_incumbent
diagnosis: The incumbent's highest measured probe is feasible and is within the
configured binary-search resolution of search.high.
```
So the loop did not stop because an LLM guessed that tuning was done. It stopped
because the incumbent saturated the configured search interval under the SLO
within binary-search tolerance.
This means:
## Which Knobs Were Tuned
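1. The incumbent's highest measured probe is already feasible.
2. That feasible probe is already within the binary-search tolerance of `search.high`.
3. Under the current search interval and SLO constraints, spending more GPU trials is unlikely to improve the measured objective.
4. The validator therefore authorizes the stop and keeps the current incumbent.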
The winning harness configuration only changed topology:
This gives the harness stop discipline: it neither stops casually because the LLM is prematurely
confident, nor keeps burning budget after search high is already saturated.
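The saturation condition in the recorded diagnosis can be expressed as a short predicate. This is a sketch under the assumption that "binary-search resolution" equals the configured tolerance; the function name mirrors the validator reason string but is otherwise hypothetical.

```python
def search_high_saturated_by_incumbent(
    incumbent_probes: list[tuple[float, bool]],
    search_high: float,
    tolerance: float,
) -> bool:
    """True if the incumbent's highest measured probe is feasible and
    lies within `tolerance` of search_high."""
    if not incumbent_probes:
        return False  # nothing measured, so nothing can be saturated
    highest_u, highest_ok = max(incumbent_probes, key=lambda p: p[0])
    return highest_ok and (search_high - highest_u) <= tolerance


# Example: a feasible probe at 0.0615234375 is within 0.001 of 0.0625.
probes = [(0.03125, True), (0.0615234375, True)]
can_stop = search_high_saturated_by_incumbent(probes, 0.0625, 0.001)
```

Because the check reads only measured probe results, the stop decision can be audited after the fact without consulting the LLM's own reasoning.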
## Which knobs were actually tuned
The harness winning path only changed topology:
```text
base config + tensor-parallel-size=4, data-parallel-size=1
```
The harness did not tune local scheduler/cache/memory knobs in the winning path.
It deliberately tested topology before local runtime knobs because the active
bottleneck was single-request TTFT/prefill service time.
It did not tune scheduler/cache/memory knobs on the winning path, because under a `ttft_prefill`
bottleneck the first action is to shorten single-request prefill service time.
The naive arms tuned a different knob family:
The naive arms went in a different direction:
| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
horizontal data parallelism should maximize request rate because the model fits
per GPU and TP would add communication overhead. Subsequent naive proposals kept
that DP-heavy topology and searched scheduler/cache/memory details around it.
Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
frontier that solved the bottleneck.
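The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that the model fits on a single GPU,
so horizontal data parallelism should maximize request rate, while TP would add communication overhead.
Later naive proposals kept the DP-heavy topology and searched only around runtime knobs.
Across the 20 trial slots of the two naive arms combined, none entered the TP2/TP4 topology frontier.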
## Why This Beats Baseline
## Why this beats the baseline
The baseline failed because it optimized the wrong causal path.
The baseline failed because it optimized the wrong causal path.
For a TTFT/prefill-bound workload, the relevant service-time term is the latency
of one request's prefill path. A DP-heavy topology can run more independent
replicas, but each replica still handles a long prompt with TP1 compute latency.
Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
feasible `sampling_u`, and the objective divides by GPU usage. This is why
`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
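For a `ttft_prefill`-bound workload, the key service time is a single request's prefill latency.
A DP-heavy topology can add replicas, but each replica still processes long prompts at TP1,
so it cannot meaningfully shorten the single-request prefill path. Under a tight TTFT SLO this keeps
the feasible `sampling_u` very low; after dividing by GPU count to get `req/s/GPU`, the result is only
`0.02-0.027 req/s/GPU`.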
The harness changed the optimization direction:
The harness's optimization path is:
```text
observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
-> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
observed SLO pressure
-> classify as ttft_prefill
-> choose legal TP frontier probe
-> measure feasible req/s/GPU under the same SLO
-> stop only when search.high is saturated by incumbent
```
That sequence is measurable and falsifiable. If TP4 had improved raw latency but
materially regressed `request_rate_per_gpu`, the harness proposal said it should
reject the hypothesis. If the bottleneck had been admission/queueing with healthy
TTFT/TPOT service times, the same knob-effect model would have favored DP or
`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
"`ttft_prefill` evidence makes TP frontier the next highest-information probe
under current constraints."
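This path is measurable and falsifiable. If TP4 had lowered latency but
`request_rate_per_gpu` had dropped noticeably, the harness would reject the hypothesis. If the
bottleneck were admission/queueing rather than TTFT/prefill, the same knob-effect model
would favor DP or `max-num-seqs` rather than the TP frontier.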
This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
harness converged to exactly the same TP frontier and final throughput as
`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
wrong DP-heavy family for its whole budget. The ablation therefore attributes the
gain to the structured harness state and validators, not merely to a stronger
language model or a more verbose prompt.
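So the result is not "in the Qwen27B case, our prompt induced the model to say TP4". The more
accurate conclusion is that the harness used SLO-derived bottleneck evidence to steer the search
toward the right knob family, then verified that direction with the per-GPU objective and the validator stop.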
## Evidence Boundary
## Evidence boundary
This report strongly supports the harness mechanism on the Qwen27B tight-SLO
case and the model-strength ablation. It should not be overclaimed as universal
proof by itself. The correct generalization claim is narrower:
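This report strongly supports the harness mechanism on the Qwen27B tight-SLO case, but on its own
it cannot serve as proof of generality. The conclusions that currently hold are: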
- In this case, the harness improved final quality, convergence speed, AUC, and
stop discipline.
- The harness made a weaker model match the stronger harnessed model and beat
the stronger naive model by more than 16x.
- The successful decision was expressed in generic terms: SLO-derived
bottleneck classification, topology constraints, knob-effect scoring,
per-GPU objective, and validator-authorized stop.
- Additional cases are still needed to show the same mechanism across different
bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
memory/KV pressure, and admission/queueing pressure.
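- In this case, the harness improved final quality, convergence speed, AUC, and
stop discipline at the same time.
- `gpt-5.4-mini + harness` matches `gpt-5.5 + harness` and clearly exceeds
`gpt-5.5 + naive`, which indicates the gain comes mainly from the harness's structured state
and validators rather than purely from a stronger model.
- The successful path used generic mechanisms: SLO-derived bottleneck classification, topology
constraints, knob-effect scoring, per-GPU objective, and validator-authorized stop.
- Further validation on other bottlenecks and cases is still needed, for example prefill scheduler
pressure, decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.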
## Original Aggregate Report
## Original aggregate report excerpt
```text
# qwen27b-tight-2x2-aggregate-20260623T005838Z