Gahow Wang 861d754f29 Localize Qwen27B harness ablation doc

2026-06-23 18:14:35 +08:00


Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23

This document summarizes the aggregate report below and explains why the harness makes tuning faster and more effective:

.aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md

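The experiment is a 2x2 ablation that crosses tuner-model strength with whether use_harness is enabled. The core question is whether the harness provides a reusable search structure, rather than an incidental gain from a stronger LLM or a longer prompt.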

Experiment design

Case: qwen27b-tight-slo-2x2-aggregate.

Experiment setup:

Served model: qwen3.5-27b-256k-0223-internal.
Hardware: H20, up to 8 GPUs.
Trace: chat_w20260311_1000, filtered to input lengths of 0-8192 tokens, replay_time_scale=1.0, max_concurrency=32.
SLO: pass rate >= 0.95; TTFT step rule of 2 s for <=4096 input tokens, 4 s for <=32768 input tokens, and 6 s for longer inputs; TPOT <= 50 ms.
Search: binary-search probing over sampling_u in [0, 0.0625], tolerance 0.001, max 6 probes.
Tunable envs: VLLM_ENABLE_TORCH_COMPILE.
Tunable flags: tensor-parallel-size, data-parallel-size, expert-parallel-size, gpu-memory-utilization, block-size, max-num-batched-tokens, max-num-seqs, enable-prefix-caching, enable-chunked-prefill.
Topology constraints: TP and DP are each in {1,2,4,8}, the allowed TP*DP products are {1,2,4,8}, and EP is fixed to 1 in this case.
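The TTFT step rule and the topology constraints listed above can be written down as a small sketch. The function and constant names (`ttft_limit_s`, `legal_topologies`, `ALLOWED_SIZES`) are illustrative assumptions, not AITuner's actual API; only the thresholds and allowed sets come from the setup.

```python
from itertools import product

def ttft_limit_s(input_tokens: int) -> float:
    """TTFT SLO step rule from the setup: 2 s up to 4096 input tokens,
    4 s up to 32768 input tokens, 6 s for anything longer."""
    if input_tokens <= 4096:
        return 2.0
    if input_tokens <= 32768:
        return 4.0
    return 6.0

ALLOWED_SIZES = (1, 2, 4, 8)      # legal values for TP and for DP
ALLOWED_PRODUCTS = {1, 2, 4, 8}   # legal values for TP*DP (at most 8 GPUs)

def legal_topologies():
    """Enumerate (TP, DP) pairs that satisfy the constraints; EP is fixed to 1."""
    return [
        (tp, dp)
        for tp, dp in product(ALLOWED_SIZES, ALLOWED_SIZES)
        if tp * dp in ALLOWED_PRODUCTS
    ]

if __name__ == "__main__":
    print(ttft_limit_s(8192))   # longest prompt kept by the trace filter
    print(legal_topologies())
```

Under this rule the trace filter (0-8192 input tokens) only ever exercises the 2 s and 4 s tiers, and both the harness path (TP=4, DP=1) and the naive path (TP=1, DP=8) are legal topologies.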

2x2 arms:

| Arm | Tuner model | Harness | Trial budget used |
| --- | --- | --- | ---: |
| `gpt55_harness` | `gpt-5.5` | on | 2 |
| `gpt55_naive` | `gpt-5.5` | off | 10 |
| `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
| `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

Within one tuner model, the main difference is use_harness. The cross-model comparison tests whether a weaker model with the harness can match or beat naive tuning with a stronger model.

Aggregate result

Reference best: 0.4429 req/s/GPU. Convergence target: 95% of the reference, i.e. 0.4208 req/s/GPU.

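| Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | naive | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |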

All harness-vs-naive checks pass:

| Harness arm | Final vs best naive | AUC vs best naive | Pass |
| --- | ---: | ---: | --- |
| `gpt55_harness` | 16.2290x | 16.1296x | true |
| `gpt54mini_harness` | 16.2290x | 16.0720x | true |

The key ablation signal: gpt-5.4-mini + harness and gpt-5.5 + harness reach the same final throughput, and both reach the target in 2 trials, while both naive arms remain more than 16x below the harness arms after using all 10 trials.
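As a sanity check, the convergence target and the harness-vs-naive ratio can be recomputed from the rounded table values. This is only a sketch: the variable names are illustrative, and because the inputs are rounded to four decimals the recomputed ratio (about 16.22) differs slightly from the reported 16.2290, which was presumably computed from unrounded measurements.

```python
REFERENCE_BEST = 0.4429   # req/s/GPU, from the aggregate report
TARGET_FRACTION = 0.95    # convergence target as a fraction of the reference

# Final req/s/GPU per arm, copied from the aggregate table (rounded values).
final_per_gpu = {
    "gpt55_harness": 0.4429,
    "gpt55_naive": 0.0273,
    "gpt54mini_harness": 0.4429,
    "gpt54mini_naive": 0.0231,
}

target = REFERENCE_BEST * TARGET_FRACTION
best_naive = max(v for arm, v in final_per_gpu.items() if arm.endswith("_naive"))
reached_target = sorted(arm for arm, v in final_per_gpu.items() if v >= target)
final_vs_best_naive = {
    arm: v / best_naive
    for arm, v in final_per_gpu.items()
    if arm.endswith("_harness")
}

print(f"target = {target:.4f} req/s/GPU")
print("reached target:", reached_target)
for arm, ratio in final_vs_best_naive.items():
    print(f"{arm}: {ratio:.2f}x best naive")
```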

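Agent loop flowchart

Below is an abstract flow of the current harness-driven agent loop. The LLM can still take part in proposals, but instead of a raw text history it receives a structured observation, a bottleneck diagnosis, candidate actions, and validator constraints. The validator can also authorize a stop and block repeated failures or illegal configurations.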

flowchart TD
    A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
    B --> C[Binary-search probes over sampling_u]
    C --> D[Build observation o_t]
    D --> E[Bottleneck classifier]
    E --> F[Candidate family generator]
    F --> G[Score candidate actions]
    G --> H[Prompt renderer / planner]
    H --> I[LLM or deterministic harness proposal]
    I --> J{Config validator}
    J -- invalid, repeated, unsafe --> F
    J -- valid config_patch --> B
    G --> K{Stop validator}
    K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
    K -- useful candidates remain --> H

In this loop, the harness's role is not to polish the prompt, but to turn tuning into a measurement-constrained decision process:

measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
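The measurement step of this loop (binary-search probes over sampling_u) could look roughly like the sketch below. It assumes plain bisection over [0, 0.0625] with tolerance 0.001 and at most 6 probes, as stated in the setup; the actual probe order used by AITuner is not described in this document, and `probe_sampling_u` and `is_feasible` are hypothetical names. `is_feasible(u)` stands in for replaying the trace at sampling rate `u` and checking the SLO pass rate.

```python
from typing import Callable, List, Optional, Tuple

def probe_sampling_u(
    is_feasible: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.0625,
    tolerance: float = 0.001,
    max_probes: int = 6,
) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
    """Bisect [low, high] for the largest feasible sampling_u.

    Returns the highest feasible probe (or None) and the probe history.
    """
    history: List[Tuple[float, bool]] = []
    best: Optional[float] = None
    lo, hi = low, high
    for _ in range(max_probes):
        if hi - lo <= tolerance:
            break  # interval already narrower than the tolerance
        mid = (lo + hi) / 2
        ok = is_feasible(mid)
        history.append((mid, ok))
        if ok:
            best, lo = mid, mid   # feasible: search the upper half
        else:
            hi = mid              # infeasible: search the lower half
    return best, history

if __name__ == "__main__":
    # Hypothetical config that stays within SLO over the whole search range.
    best, history = probe_sampling_u(lambda u: True)
    print(best, len(history))
```

With these parameters, a config that is feasible at every probe ends with its highest probe less than 0.001 below the upper bound after 6 probes, since 0.0625 / 2^6 is about 0.00098.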

Formal design: observation

After each trial, AITuner records not just a natural-language summary but a structured observation:

o_t = (
  config_t,
  probe_history_t,
  pass_rate_t,
  latency/SLO_failure_profile_t,
  request_rate_t,
  parallel_size_t,
  launch_status_t,
  prior_failures_t,
  incumbent_t
)

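The most important observation fields in this experiment are:

config_t: the current trial's flag_patch and env_patch, e.g. TP=2, DP=1.
probe_history_t: feasible/infeasible results from binary-search probes at different sampling_u values.
pass_rate_t: the measured pass rate, checked against the 0.95 target.
latency/SLO_failure_profile_t: which of TTFT and TPOT triggers SLO pressure first.
request_rate_t: the request rate the current config sustains under the SLO.
parallel_size_t: the parallel size the config actually uses, used to normalize the per-GPU objective.
prior_failures_t: which earlier configs had launch failures or no feasible probe, so they are not retried.
incumbent_t: the current best config and its request_rate_per_gpu.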
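One way to hold the observation tuple in code is a plain dataclass. This is a sketch, not AITuner's real schema: the field types are assumptions, and the example instance below fills in values that are not in the report (the probe history, the failure profile, and a request rate back-computed as 0.2142 req/s/GPU times an assumed parallel size of TP*DP = 2).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class Observation:
    """Structured record of one trial, mirroring the fields of o_t."""
    config: Dict[str, object]                 # flag_patch and env_patch
    probe_history: List[Tuple[float, bool]]   # (sampling_u, feasible)
    pass_rate: float
    slo_failure_profile: Dict[str, float]     # e.g. share of TTFT vs TPOT violations
    request_rate: float                       # req/s sustained under the SLO
    parallel_size: int                        # GPUs used; assumed TP*DP here
    launch_status: str
    prior_failures: List[Dict[str, object]] = field(default_factory=list)
    incumbent: Optional[Dict[str, object]] = None

# Illustrative instance shaped like trial 1 of gpt55_harness (TP=2, DP=1).
obs = Observation(
    config={"tensor-parallel-size": 2, "data-parallel-size": 1},
    probe_history=[(0.03125, True)],                 # hypothetical probe
    pass_rate=0.9572,                                # from the results table
    slo_failure_profile={"ttft": 1.0, "tpot": 0.0},  # hypothetical split
    request_rate=0.4284,                             # assumed: 0.2142 * 2 GPUs
    parallel_size=2,
    launch_status="ok",
)
print(obs.request_rate / obs.parallel_size)
```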

The objective is:

J(config_t) = request_rate_t / parallel_size_t
subject to pass_rate_t >= 0.95

That is, the harness optimizes req/s/GPU subject to the SLO, not raw throughput, and not whatever configuration the LLM subjectively considers "stronger".
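The constrained objective can be sketched as a function that returns no score at all when the pass-rate constraint is violated. The name `objective` and the choice of `None` for infeasible configs are assumptions made for illustration. The example request rate of 1.7716 req/s is back-computed as 0.4429 req/s/GPU times 4 GPUs for TP=4, DP=1, assuming parallel_size equals TP*DP.

```python
from typing import Optional

TARGET_PASS_RATE = 0.95

def objective(request_rate: float, parallel_size: int, pass_rate: float) -> Optional[float]:
    """J(config) = request_rate / parallel_size, subject to pass_rate >= 0.95.

    Returns None when the SLO constraint is not met, so an infeasible
    config can never outrank a feasible one.
    """
    if parallel_size <= 0:
        raise ValueError("parallel_size must be positive")
    if pass_rate < TARGET_PASS_RATE:
        return None
    return request_rate / parallel_size

if __name__ == "__main__":
    print(objective(1.7716, 4, 0.9718))  # feasible: per-GPU rate
    print(objective(2.0, 8, 0.90))       # hypothetical SLO miss: no score
```

Dividing by the parallel size is what makes an 8-GPU DP-heavy config compete on equal terms with a 4-GPU TP config.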

Formal design: bottleneck classifier

The bottleneck classifier maps an observation to ranked bottleneck hypotheses:

b_t = ranked_bottleneck(o_t)

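It does not judge "which knob looks commonly used", but "which part of the system the current SLO failures and latency profile show to be limiting the objective".

Common classes include:

| Bottleneck | Typical evidence | Preferred knob family |
| --- | --- | --- |
| `ttft_prefill` | TTFT approaches or exceeds the SLO on long prompts; prefill service time is the bottleneck | Raise TP, adjust prefill batching |
| `decode_tpot` | TPOT p95/p99 exceeds the SLO; decode token latency is the bottleneck | Adjust `max-num-seqs`, raise TP, reduce decode contention |
| `admission_queueing` | Waiting/arrival lag grows while service time itself is not necessarily worse | Raise DP, adjust admission/concurrency knobs |
| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Step TP back, compare the DP/TP tradeoff |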

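In this experiment both harness arms ranked ttft_prefill as the top bottleneck. The workload has heavy-tailed long prompts and the TTFT SLO is tight, so the prefill service time of a single request is the main limit. DP-only scaling adds replicas but cannot shorten the prefill path of one long prompt, so it is not the first priority.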
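A minimal sketch of the ranking step, assuming the classifier has already reduced the observation to one evidence strength per bottleneck label. How AITuner derives those strengths is not described here, so the input format and the numbers in the example are hypothetical; only the label names come from the table above.

```python
from typing import Dict, List, Tuple

BOTTLENECK_LABELS = (
    "ttft_prefill",
    "decode_tpot",
    "admission_queueing",
    "memory_kv",
    "topology_comm",
)

def ranked_bottleneck(evidence: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank bottleneck hypotheses by evidence strength, strongest first.

    `evidence` maps a label to a non-negative strength; labels with no
    supporting evidence are dropped from the ranking.
    """
    scored = [(label, evidence.get(label, 0.0)) for label in BOTTLENECK_LABELS]
    supported = [item for item in scored if item[1] > 0.0]
    return sorted(supported, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical strengths for a workload dominated by TTFT violations.
    print(ranked_bottleneck({"ttft_prefill": 0.8, "admission_queueing": 0.3}))
```

Returning a ranked list rather than a single label keeps the runner-up hypotheses available if the top one is later refuted by measurement.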

Formal design: candidate family

The candidate family generator produces a comparable action family from the bottleneck and the topology constraints:

A_t = candidate_knob_families(
  b_t,
  topology_constraints,
  prior_failures_t,
  incumbent_t
)

In this case:

b_t = ttft_prefill.
The allowed TP frontier is TP=1 -> TP=2 -> TP=4 -> TP=8.
The allowed DP frontier is DP=1,2,4,8, but DP alone does not directly relieve single-request prefill latency.
EP is fixed to 1, so expert parallelism is not explored.
No topology has failed so far, so the launch risk of an adjacent TP probe is low.

So the harness chose:

trial-0001: TP=2, DP=1
trial-0002: TP=4, DP=1

This is not a hard-coded rule that "Qwen27B should use TP4". If the classifier had output admission_queueing, the candidate family would lean toward DP or max-num-seqs; if it had output memory_kv, it would lean toward memory/cache/sequence knobs.
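The topology part of the generator can be sketched as "take the next adjacent step on the frontier that matches the bottleneck". This is a simplified assumption: the real generator also covers runtime knob families such as max-num-seqs and memory/cache knobs, which are omitted here, and the signature is hypothetical. The starting point TP=1, DP=1 in the example is also an assumption about the base config.

```python
from typing import FrozenSet, List, Optional, Tuple

SIZES = (1, 2, 4, 8)   # legal TP and DP values
MAX_GPUS = 8

def next_step(value: int) -> Optional[int]:
    """Next larger size on the frontier, or None at the end."""
    larger = [s for s in SIZES if s > value]
    return larger[0] if larger else None

def candidate_knob_families(
    bottleneck: str,
    tp: int,
    dp: int,
    failed: FrozenSet[Tuple[int, int]] = frozenset(),
) -> List[Tuple[int, int]]:
    """Propose adjacent (TP, DP) probes for topology-related bottlenecks."""
    candidates: List[Tuple[int, int]] = []
    if bottleneck == "ttft_prefill":
        nxt = next_step(tp)          # shorten single-request prefill
        if nxt is not None:
            candidates.append((nxt, dp))
    elif bottleneck == "admission_queueing":
        nxt = next_step(dp)          # add replicas to drain the queue
        if nxt is not None:
            candidates.append((tp, nxt))
    # Drop probes that exceed the GPU budget or repeat a failed topology.
    return [c for c in candidates if c[0] * c[1] <= MAX_GPUS and c not in failed]

if __name__ == "__main__":
    print(candidate_knob_families("ttft_prefill", tp=1, dp=1))
    print(candidate_knob_families("ttft_prefill", tp=2, dp=1))
```

Starting from TP=1, DP=1 with a ttft_prefill diagnosis, this reproduces the TP=2 then TP=4 sequence taken by the harness arms, while an admission_queueing diagnosis would step DP instead.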

Formal design: scoring

Every candidate action is scored with the same abstraction:

score(a) = expected_bottleneck_relief(a)
         + information_gain(a)
         + launch_safety(a)
         - regression_risk(a)
         - measurement_cost(a)

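In this experiment the terms mean:

expected_bottleneck_relief: TP2/TP4 are expected to reduce long-prefill compute latency, acting directly on ttft_prefill.
information_gain: a TP frontier probe can distinguish "compute-latency relief is needed" from "admission/replica capacity is merely insufficient".
launch_safety: TP2 and TP4 both satisfy the topology constraints and do not repeat a failed signature.
regression_risk: raising TP adds communication overhead and may hurt per-GPU efficiency, so it must be verified against request_rate_per_gpu.
measurement_cost: every GPU trial is expensive, so a high-information topology probe takes priority over several local runtime tweaks.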
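The additive score can be sketched directly from the formula. The term values below are hypothetical numbers on an arbitrary 0-1 scale, chosen only to illustrate why a topology probe can outrank a local runtime tweak; the document does not give the actual estimates or any weights.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionEstimate:
    """Per-action estimates for the five terms of the score."""
    name: str
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float

def score(a: ActionEstimate) -> float:
    """Benefits add, risk and cost subtract, exactly as in the formula."""
    return (
        a.expected_bottleneck_relief
        + a.information_gain
        + a.launch_safety
        - a.regression_risk
        - a.measurement_cost
    )

# Hypothetical estimates under a ttft_prefill diagnosis.
tp_probe = ActionEstimate("TP=2, DP=1", 0.8, 0.6, 0.9, 0.3, 0.5)
runtime_tweak = ActionEstimate("max-num-seqs tweak", 0.1, 0.2, 0.9, 0.1, 0.5)

ranked = sorted([runtime_tweak, tp_probe], key=score, reverse=True)
print([(a.name, round(score(a), 2)) for a in ranked])
```

Both actions cost one GPU trial in this illustration, so the ordering is decided by expected relief and information gain.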

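The measured results are consistent with this scoring:

| Arm | Trial | Patch | req/s/GPU | Pass rate | Interpretation |
| --- | ---: | --- | ---: | ---: | --- |
| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but does not yet saturate search high. |
| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | The TP frontier further relieves the prefill bottleneck and reaches the reference best. |
| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model takes the same mechanism path. |
| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | The weaker model with the harness matches the stronger model with the harness. |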

Formal design: validator stop

A stop is not the LLM saying "I think this is good enough". A stop must pass the stop validator:

stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false

The stop record in this experiment is:

tuning_stop_reason: harness_stop
validator_reason: search_high_saturated_by_incumbent
diagnosis: The incumbent's highest measured probe is feasible and is within the
configured binary-search resolution of search.high.

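This means:

The incumbent's highest measured probe is already feasible.
That feasible probe is within the binary-search tolerance of search.high.
Within the current search range and SLO constraints, spending more GPU trials is unlikely to improve the measured objective.
The validator therefore authorizes the stop and keeps the current incumbent.

This gives the harness stop discipline: it neither stops early because the LLM is overconfident, nor keeps burning budget after search high is saturated.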
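The `search_high_saturated_by_incumbent` condition can be sketched as a pure function over the incumbent's probe history. This assumes that the "configured binary-search resolution" equals the search tolerance (0.001) and that search.high is 0.0625, as in the setup; the function name and the (sampling_u, feasible) history format are hypothetical.

```python
from typing import List, Tuple

def search_high_saturated(
    incumbent_probes: List[Tuple[float, bool]],
    search_high: float = 0.0625,
    tolerance: float = 0.001,
) -> bool:
    """True when the incumbent's highest measured probe is feasible and
    lies within `tolerance` of `search_high`."""
    if not incumbent_probes:
        return False
    highest_u, feasible = max(incumbent_probes, key=lambda p: p[0])
    return feasible and (search_high - highest_u) <= tolerance

if __name__ == "__main__":
    # Hypothetical history of a config that passed every bisection probe.
    saturated = [
        (0.03125, True), (0.046875, True), (0.0546875, True),
        (0.05859375, True), (0.060546875, True), (0.0615234375, True),
    ]
    print(search_high_saturated(saturated))
```

Because the check reads only measured probes, the LLM cannot trigger a stop by asserting confidence; it can only propose, and the validator decides.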

Which knobs were actually tuned

The harness winning path changed only the topology:

base config + tensor-parallel-size=4, data-parallel-size=1

It did not tune scheduler/cache/memory knobs on the winning path because, under a ttft_prefill bottleneck, the first action is to shorten single-request prefill service time.

The naive arms went in a different direction:

| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
| --- | --- | --- | ---: |
| `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
| `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

The first proposal of gpt55_naive explicitly chose TP=1, DP=8, reasoning that the model fits on a single GPU, so horizontal data parallelism should maximize request rate, while TP would add communication overhead. Subsequent naive proposals kept the DP-heavy topology and searched only over runtime knobs. Across the two naive arms, none of the 20 trial slots entered the TP2/TP4 topology frontier.

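Why it beats the baseline

The baseline fails because it optimizes the wrong causal path.

For a ttft_prefill-bound workload, the critical service time is the prefill latency of a single request. A DP-heavy topology adds replicas, but each replica still processes long prompts at TP1, so it cannot significantly shorten the single-request prefill path. Under a tight TTFT SLO this keeps the feasible sampling_u very low, and after dividing by the GPU count the result is only 0.0231-0.0273 req/s/GPU.

The harness's optimization path is: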

observed SLO pressure
-> classify as ttft_prefill
-> choose legal TP frontier probe
-> measure feasible req/s/GPU under the same SLO
-> stop only when search.high is saturated by incumbent

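This path is measurable and falsifiable. If TP4 lowered latency but request_rate_per_gpu dropped markedly, the harness would reject the hypothesis. If the bottleneck were admission/queueing rather than TTFT/prefill, the same knob-effect model would favor DP or max-num-seqs over the TP frontier.

So the result is not that "we prompted the model into saying TP4 for the Qwen27B case". More precisely, the harness used SLO-derived bottleneck evidence to steer the search toward the right knob family, then verified that direction with the per-GPU objective and the validator stop.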
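The "falsifiable" part of this argument amounts to an accept/reject rule on the measured per-GPU objective. The sketch below is an assumption about how such a rule could look, not AITuner's implementation: a candidate replaces the incumbent only if it is feasible and strictly improves req/s/GPU. The TP=8 result in the example is hypothetical and was not measured in this experiment.

```python
from typing import Optional, Tuple

# (config label, req/s/GPU under the SLO, or None when no probe was feasible)
Trial = Tuple[str, Optional[float]]

def update_incumbent(incumbent: Optional[Trial], candidate: Trial) -> Tuple[Optional[Trial], bool]:
    """Return (new incumbent, accepted?) after measuring `candidate`."""
    _, value = candidate
    if value is None:
        return incumbent, False        # infeasible: hypothesis rejected
    if incumbent is None or incumbent[1] is None or value > incumbent[1]:
        return candidate, True         # measured improvement: accept
    return incumbent, False            # regression on req/s/GPU: reject

if __name__ == "__main__":
    inc, _ = update_incumbent(None, ("TP=2, DP=1", 0.2142))       # measured
    inc, accepted_tp4 = update_incumbent(inc, ("TP=4, DP=1", 0.4429))  # measured
    # Hypothetical: a TP=8 probe that lowers latency but hurts per-GPU rate.
    inc, accepted_tp8 = update_incumbent(inc, ("TP=8, DP=1", 0.30))
    print(inc, accepted_tp4, accepted_tp8)
```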

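Limits of the evidence

This report strongly supports the harness mechanism on the Qwen27B tight-SLO case, but it cannot serve on its own as proof of generality. The conclusions that currently hold are:

In this case, the harness improved final quality, convergence speed, AUC, and stop discipline together.
gpt-5.4-mini + harness matches gpt-5.5 + harness and clearly outperforms gpt-5.5 + naive, which indicates the gain comes mainly from the harness's structured state and validator rather than from a stronger model alone.
The successful path uses general mechanisms: SLO-derived bottleneck classification, topology constraints, knob-effect scoring, a per-GPU objective, and a validator-authorized stop.
Further validation is still needed on other bottlenecks/cases, such as prefill scheduler pressure, decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.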

Excerpt from the original aggregate report

# qwen27b-tight-2x2-aggregate-20260623T005838Z

## Aggregate

- Cases: `1`
- Harness-vs-naive pass/checks: `2`/`2`
- Winner counts: `{"final_best": {"gpt55_harness": 1}, "fastest_to_target": {"gpt55_harness": 1}, "normalized_auc": {"gpt55_harness": 1}}`

## By Kind

| Kind | Arms | Mean final/ref | Mean AUC | Target reached |
| --- | ---: | ---: | ---: | ---: |
| `harness` | 2 | 1.0000 | 0.9467 | 2 |
| `naive` | 2 | 0.0569 | 0.0543 | 0 |

## Cases

### qwen27b-tight-slo-2x2-aggregate

- Reference best req/s/GPU: `0.4429`
- Target fraction: `0.95`
- Winners: `{"final_best": "gpt55_harness", "fastest_to_target": "gpt55_harness", "normalized_auc": "gpt55_harness"}`

| Arm | Kind | Trials | Final/GPU | Final/ref | TTT | AUC | Failed | No feasible |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `gpt55_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9484 | 0 | 0 |
| `gpt55_naive` | `naive` | 10 | 0.0273 | 0.0616 | - | 0.0588 | 2 | 2 |
| `gpt54mini_harness` | `harness` | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
| `gpt54mini_naive` | `naive` | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

| Harness | Final vs best naive | Target speedup | AUC vs best naive | Pass |
| --- | ---: | ---: | ---: | --- |
| `gpt55_harness` | 16.2290 | - | 16.1296 | `True` |
| `gpt54mini_harness` | 16.2290 | - | 16.0720 | `True` |
