From 861d754f29e7385a1a587dda04364224328b841c Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Tue, 23 Jun 2026 18:14:35 +0800
Subject: [PATCH] Localize Qwen27B harness ablation doc

---
 ...en27b-tight-2x2-model-ablation-20260623.md | 351 ++++++++++++------
 1 file changed, 233 insertions(+), 118 deletions(-)

diff --git a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
index dab994c..fbd0db7 100644
--- a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
+++ b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
@@ -1,37 +1,38 @@
-# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
+# Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23

-This note organizes the aggregate report generated at:
+This note organizes the aggregate report below and explains why the harness makes tuning faster and more effective:

 ```text
 .aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
 ```

-The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
-It asks whether the harness supplies reusable search structure beyond a stronger
-LLM's free-form tuning proposals.
+The experiment is a 2x2 ablation: model strength crossed with whether `use_harness` is enabled.
+The core question is whether the harness supplies reusable search structure, rather than an incidental gain from a stronger LLM
+or a longer prompt.

-## Experiment Design
+## Experiment design

-Case: `qwen27b-tight-slo-2x2-aggregate`.
+Case: `qwen27b-tight-slo-2x2-aggregate`.

-Substrate:
+Experiment substrate:

-- Model served: `qwen3.5-27b-256k-0223-internal`.
-- Hardware: H20, up to 8 GPUs.
-- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
-  `replay_time_scale=1.0`, `max_concurrency=32`.
-- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
-  4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
-- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
-- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
+- Served model: `qwen3.5-27b-256k-0223-internal`.
+- Hardware: H20, up to 8 GPUs.
+- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
+  `replay_time_scale=1.0`, `max_concurrency=32`.
+- SLO: pass rate >= 0.95; TTFT step rule of 2s for <=4096 input tokens,
+  4s for <=32768 input tokens, 6s for longer inputs; TPOT <= 50 ms.
+- Search: binary-search probing over `sampling_u in [0, 0.0625]`, tolerance 0.001,
+  max 6 probes.
+- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
 - Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
   `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
   `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
-  `enable-chunked-prefill`.
-- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
-  `{1,2,4,8}`, EP fixed to 1 for this case.
+  `enable-chunked-prefill`.
+- Topology constraints: TP and DP are both in `{1,2,4,8}`, the allowed TP*DP products are
+  `{1,2,4,8}`, and EP is fixed to 1 in this case.

-Arms:
+2x2 arms:

 | Arm | Tuner model | Harness | Trial budget used |
 | --- | --- | --- | ---: |
@@ -40,15 +41,13 @@ Arms:
 | `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
 | `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |

-The only intended axis inside each model pair is `use_harness`. The aggregate
-then compares whether the weaker model plus harness can match or exceed the
-stronger model without harness.
+Within a single tuner model, the main difference is `use_harness`. The cross-model comparison asks whether
+a weaker model plus harness can match or exceed naive tuning by the stronger model.

-## Aggregate Result
+## Aggregate result

-Reference best: `0.4429 req/s/GPU`.
-Target threshold for convergence comparisons: 95% of reference, or
-`0.4208 req/s/GPU`.
+Reference best: `0.4429 req/s/GPU`.
+Convergence target: 95% of reference, i.e. `0.4208 req/s/GPU`.

 | Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
@@ -57,42 +56,154 @@ Target threshold for convergence comparisons: 95% of reference, or
 | `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
 | `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |

-Harness wins both harness-vs-naive checks:
+All harness-vs-naive checks pass:

 | Harness arm | Final vs best naive | AUC vs best naive | Pass |
 | --- | ---: | ---: | --- |
 | `gpt55_harness` | 16.2290x | 16.1296x | true |
 | `gpt54mini_harness` | 16.2290x | 16.0720x | true |

-The strongest ablation observation is that `gpt-5.4-mini + harness` matches
-`gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
-while both naive arms remain more than 16x below the harness arms by final
-per-GPU throughput and AUC.
+The key ablation signal is that `gpt-5.4-mini + harness` and
+`gpt-5.5 + harness` reach the same final throughput, and both reach the target in 2 trials,
+while both naive arms remain more than 16x below the harness arms after using all 10 trials.

-## What The Harness Actually Did
+## Agent loop flowchart

-The harness did not perform generic "better prompting". It inserted a measured,
-structured decision protocol between trial results and the next proposal.
+Below is an abstract flow of the current harnessed agent loop. The LLM can still take part in the proposal,
+but what it receives is not raw text history; it gets structured observation, bottleneck diagnosis,
+candidate actions, and validator constraints. The validator can also authorize a stop or block
+repeated failures and illegal configs.

-Formally, after each trial `t`, AITuner observes:
-
-```text
-o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
-       request_rate_t, parallel_size_t, launch status_t)
+```mermaid
+flowchart TD
+    A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
+    B --> C[Binary-search probes over sampling_u]
+    C --> D[Build observation o_t]
+    D --> E[Bottleneck classifier]
+    E --> F[Candidate family generator]
+    F --> G[Score candidate actions]
+    G --> H[Prompt renderer / planner]
+    H --> I[LLM or deterministic harness proposal]
+    I --> J{Config validator}
+    J -- invalid, repeated, unsafe --> F
+    J -- valid config_patch --> B
+    G --> K{Stop validator}
+    K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
+    K -- useful candidates remain --> H
 ```

-and optimizes:
+In this loop, the harness's role is not to write a prettier prompt but to turn tuning into
+a measurement-constrained decision process:
+
+```text
+measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
+```
+
+## Formal design: observation
+
+After each trial, AITuner does not just record a natural-language summary; it forms a structured observation:
+
+```text
+o_t = (
+  config_t,
+  probe_history_t,
+  pass_rate_t,
+  latency/SLO_failure_profile_t,
+  request_rate_t,
+  parallel_size_t,
+  launch_status_t,
+  prior_failures_t,
+  incumbent_t
+)
+```
+
+The most important observation fields in this experiment are:
+
+- `config_t`: the current trial's `flag_patch` and `env_patch`, for example `TP=2, DP=1`.
+- `probe_history_t`: the feasible/infeasible results from binary-search probes at different `sampling_u`
+  values.
+- `pass_rate_t`: whether the target pass rate of 0.95 is met.
+- `latency/SLO_failure_profile_t`: whether TTFT or TPOT triggers SLO pressure first.
+- `request_rate_t`: the request rate the current config can sustain under the SLO.
+- `parallel_size_t`: the parallel size the config actually uses, for normalizing the per-GPU objective.
+- `prior_failures_t`: which earlier configs ended in launch failed or no feasible, to avoid repeating failed attempts.
+- `incumbent_t`: the current best config and its `request_rate_per_gpu`.
+
+The objective function is:

 ```text
 J(config_t) = request_rate_t / parallel_size_t
-subject to pass_rate_t >= 0.95.
+subject to pass_rate_t >= 0.95
 ```

-The harness maps the observation into:
+In other words, the harness optimizes `req/s/GPU` once the SLO is met, not raw throughput,
+and not whatever config the LLM subjectively considers "stronger".
+
+## Formal design: bottleneck classifier
+
+The `bottleneck classifier` maps the observation to ranked bottleneck hypotheses:

 ```text
 b_t = ranked_bottleneck(o_t)
-A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
+```
+
+What it judges is not "which knob looks commonly used" but "which system stage the current SLO failure and latency profile
+show to be limiting the objective".
+
+Common classes include:
+
+| Bottleneck | Typical evidence | Preferred knob family |
+| --- | --- | --- |
+| `ttft_prefill` | TTFT near or above the SLO on long prompts; prefill service time is the bottleneck | Raise TP, adjust prefill batching |
+| `decode_tpot` | TPOT p95/p99 above the SLO; decode token latency is the bottleneck | Adjust `max-num-seqs`, raise TP, reduce decode contention |
+| `admission_queueing` | waiting/arrival lag grows while service time alone is not necessarily worse | Raise DP, adjust admission/concurrency knobs |
+| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
+| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Back off TP, compare the DP/TP tradeoff |
+
+In this experiment, both harness arms identified the ranked bottleneck as
+`ttft_prefill`. The workload has heavy-tailed long prompts and the TTFT SLO is tight,
+so the prefill service time of a single request is the main limiter. DP-only scaling just adds replicas;
+it cannot shorten the prefill path of one long prompt, so it is not the first priority.
+
+## Formal design: candidate family
+
+The `candidate family generator` produces comparable action families from the bottleneck and
+the topology constraints:
+
+```text
+A_t = candidate_knob_families(
+  b_t,
+  topology_constraints,
+  prior_failures_t,
+  incumbent_t
+)
+```
+
+In this case:
+
+- `b_t = ttft_prefill`.
+- The allowed TP frontier is `TP=1 -> TP=2 -> TP=4 -> TP=8`.
+- The allowed DP frontier is `DP=1,2,4,8`, but DP-only scaling does not directly relieve single-request prefill
+  latency.
+- EP is fixed to 1, so expert parallelism is not explored.
+- No topology had failed before, so the launch risk of an adjacent TP probe is low.
+
+So the harness chose:
+
+```text
+trial-0001: TP=2, DP=1
+trial-0002: TP=4, DP=1
+```
+
+This does not hard-code "Qwen27B should use TP4". If the classifier had output
+`admission_queueing`, the candidate family would lean toward DP or `max-num-seqs`; had it output
+`memory_kv`, it would lean toward memory/cache/sequence knobs.
+
+## Formal design: scoring
+
+Every candidate action is scored with the same abstraction:
+
+```text
 score(a) = expected_bottleneck_relief(a)
          + information_gain(a)
          + launch_safety(a)
@@ -100,118 +211,122 @@ score(a) = expected_bottleneck_relief(a)
          - measurement_cost(a)
 ```

-For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
-prompts and a tight TTFT SLO made single-request prefill service time the
-active limiter. Under that bottleneck, the high-value candidate family is a
-legal TP frontier probe, because increasing TP can reduce prefill compute
-latency for one request. DP-only scaling adds replicas but does not shorten the
-single-request prefill path, so it can improve aggregate admission while still
-failing the per-request TTFT bottleneck and the per-GPU objective.
+The terms mean the following in this experiment:

-The actual harness trajectory was:
+- `expected_bottleneck_relief`: TP2/TP4 is expected to lower long-prefill compute latency,
+  acting directly on `ttft_prefill`.
+- `information_gain`: the TP frontier probe can distinguish "compute-latency relief is needed"
+  from "admission/replica capacity is simply insufficient".
+- `launch_safety`: TP2 and TP4 both satisfy the topology constraints and repeat no failed signature.
+- `regression_risk`: raising TP adds communication overhead that may hurt per-GPU efficiency, so it must be verified with
+  `request_rate_per_gpu`.
+- `measurement_cost`: each GPU trial is expensive, so a high-information topology probe takes priority over
+  several local runtime tweaks.

-| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
+The measured results bear out this scoring:
+
+| Arm | Trial | Patch | req/s/GPU | Pass rate | Explanation |
 | --- | ---: | --- | ---: | ---: | --- |
-| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
-| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
-| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
-| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
+| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but does not yet saturate search high. |
+| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | The TP frontier keeps relieving the prefill bottleneck and reaches the reference best. |
+| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model also picks the same mechanism path. |
+| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | The weaker model plus harness matches the stronger model plus harness. |

-The stop was also harness-mediated. Both harness arms stopped after trial 2
-because the validator authorized `harness_stop` with:
+## Formal design: validator stop
+
+A stop is not the LLM saying "I think this is good enough". A stop must pass the `stop validator`:

 ```text
-search_high_saturated_by_incumbent
+stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false
 ```

-The recorded stop diagnosis was:
+The stop recorded in this experiment is:

 ```text
-The incumbent's highest measured probe is feasible and is within the configured
-binary-search resolution of search.high.
+tuning_stop_reason: harness_stop
+validator_reason: search_high_saturated_by_incumbent
+diagnosis: The incumbent's highest measured probe is feasible and is within the
+configured binary-search resolution of search.high.
 ```

-So the loop did not stop because an LLM guessed that tuning was done. It stopped
-because the incumbent saturated the configured search interval under the SLO
-within binary-search tolerance.
+This means:

-## Which Knobs Were Tuned
+1. The incumbent's highest measured probe is already feasible.
+2. That feasible probe is already within the binary-search tolerance of `search.high`.
+3. Under the current search interval and SLO constraints, spending more GPU trials is unlikely to raise the measured objective.
+4. The validator therefore authorizes the stop and keeps the current incumbent.

-The winning harness configuration only changed topology:
+This gives the harness stop discipline: it neither stops arbitrarily because the LLM is prematurely confident, nor
+keeps burning budget after search high is already saturated.
+
+## Which knobs were actually tuned
+
+The harness winning path only changed topology:

 ```text
 base config + tensor-parallel-size=4, data-parallel-size=1
 ```

-The harness did not tune local scheduler/cache/memory knobs in the winning path.
-It deliberately tested topology before local runtime knobs because the active
-bottleneck was single-request TTFT/prefill service time.
+It did not tune scheduler/cache/memory knobs in the winning path, because under the `ttft_prefill`
+bottleneck the first move is to shorten single-request prefill service time.

-The naive arms tuned a different knob family:
+The naive arms went in a different direction:

-| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
+| Arm | Topology used by all trials | Runtime knobs that varied | Best req/s/GPU |
 | --- | --- | --- | ---: |
 | `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
 | `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |

-The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
-horizontal data parallelism should maximize request rate because the model fits
-per GPU and TP would add communication overhead. Subsequent naive proposals kept
-that DP-heavy topology and searched scheduler/cache/memory details around it.
-Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
-frontier that solved the bottleneck.
+The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that the model fits on a single GPU,
+so horizontal data parallelism should maximize request rate, while TP would add communication overhead.
+Later naive proposals kept the DP-heavy topology and searched only around runtime knobs.
+Across the two naive arms, none of the 20 trial slots entered the TP2/TP4 topology frontier.

-## Why This Beats Baseline
+## Why this beats the baseline

-The baseline failed because it optimized the wrong causal path.
+The reason the baseline failed is that it optimized the wrong causal path.

-For a TTFT/prefill-bound workload, the relevant service-time term is the latency
-of one request's prefill path. A DP-heavy topology can run more independent
-replicas, but each replica still handles a long prompt with TP1 compute latency.
-Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
-feasible `sampling_u`, and the objective divides by GPU usage. This is why
-`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
+For a `ttft_prefill`-bound workload, the key service time is the prefill latency of a single request.
+A DP-heavy topology can add replicas, but each replica still handles long prompts at TP1;
+it cannot significantly shorten the single-request prefill path. Under a tight TTFT SLO this keeps the feasible
+`sampling_u` very low; after dividing by the GPU count to get `req/s/GPU`, the result is only
+`0.02-0.027 req/s/GPU`.

-The harness changed the optimization direction:
+The harness's optimization path is:

 ```text
-observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
--> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
+observed SLO pressure
+-> classify as ttft_prefill
+-> choose legal TP frontier probe
+-> measure feasible req/s/GPU under the same SLO
+-> stop only when search.high is saturated by incumbent
 ```

-That sequence is measurable and falsifiable. If TP4 had improved raw latency but
-materially regressed `request_rate_per_gpu`, the harness proposal said it should
-reject the hypothesis. If the bottleneck had been admission/queueing with healthy
-TTFT/TPOT service times, the same knob-effect model would have favored DP or
-`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
-"`ttft_prefill` evidence makes TP frontier the next highest-information probe
-under current constraints."
+This path is measurable and falsifiable. If TP4 lowered latency but
+`request_rate_per_gpu` dropped noticeably, the harness would reject the hypothesis. If the
+bottleneck were admission/queueing rather than TTFT/prefill, the same knob-effect model
+would favor DP or `max-num-seqs` rather than the TP frontier.

-This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
-harness converged to exactly the same TP frontier and final throughput as
-`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
-wrong DP-heavy family for its whole budget. The ablation therefore attributes the
-gain to the structured harness state and validators, not merely to a stronger
-language model or a more verbose prompt.
+So the result is not "in the Qwen27B case, our prompt induced the model to say TP4". A more accurate conclusion is:
+the harness used SLO-derived bottleneck evidence to steer the search toward the right knob family,
+then verified that direction with the per-GPU objective and the validator stop.

-## Evidence Boundary
+## Evidence boundary

-This report strongly supports the harness mechanism on the Qwen27B tight-SLO
-case and the model-strength ablation. It should not be overclaimed as universal
-proof by itself. The correct generalization claim is narrower:
-- In this case, the harness improved final quality, convergence speed, AUC, and
-  stop discipline.
-- The harness made a weaker model match the stronger harnessed model and beat
-  the stronger naive model by more than 16x.
-- The successful decision was expressed in generic terms: SLO-derived
-  bottleneck classification, topology constraints, knob-effect scoring,
-  per-GPU objective, and validator-authorized stop.
-- Additional cases are still needed to show the same mechanism across different
-  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
-  memory/KV pressure, and admission/queueing pressure.
+- 在这个 case 中,harness 同时提升了 final quality、convergence speed、AUC 和 + stop discipline。 +- `gpt-5.4-mini + harness` 匹配 `gpt-5.5 + harness`,并显著超过 + `gpt-5.5 + naive`,说明收益主要来自 harness 的结构化状态和 validator,而不是 + 单纯来自更强模型。 +- 成功路径用的是通用机制:SLO-derived bottleneck classification、topology + constraints、knob-effect scoring、per-GPU objective、validator-authorized stop。 +- 还需要在其他 bottleneck/case 上继续验证,例如 prefill scheduler pressure、 + decode TPOT pressure、memory/KV pressure、admission/queueing pressure。 -## Original Aggregate Report +## 原始 aggregate report 摘录 ```text # qwen27b-tight-2x2-aggregate-20260623T005838Z