From 861d754f29e7385a1a587dda04364224328b841c Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Tue, 23 Jun 2026 18:14:35 +0800
Subject: [PATCH] Localize Qwen27B harness ablation doc

---
 ...en27b-tight-2x2-model-ablation-20260623.md | 351 ++++++++++++------
 1 file changed, 233 insertions(+), 118 deletions(-)
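
Note: the per-GPU objective and the validator stop rule that the doc formalizes can be sketched as below. This is a minimal illustration, not AITuner code: `Trial`, `per_gpu_objective`, and `search_high_saturated` are hypothetical names, `parallel_size = TP * DP` is an assumption (EP is fixed to 1 in this case), treating infeasible trials as zero is an assumption, and the sample numbers are illustrative rather than taken from the report.

```python
from dataclasses import dataclass

TARGET_PASS_RATE = 0.95   # SLO pass-rate target from the doc
SEARCH_HIGH = 0.0625      # upper bound of the sampling_u search range
TOLERANCE = 0.001         # binary-search tolerance


@dataclass
class Trial:
    """Hypothetical container for one trial's measurements."""
    tp: int
    dp: int
    request_rate: float        # req/s sustained under the SLO
    pass_rate: float
    highest_feasible_u: float  # largest sampling_u probe that met the SLO


def per_gpu_objective(trial: Trial) -> float:
    """J = request_rate / parallel_size, counted only if the pass rate holds."""
    if trial.pass_rate < TARGET_PASS_RATE:
        return 0.0  # assumption: infeasible trials contribute nothing
    return trial.request_rate / (trial.tp * trial.dp)  # assumes parallel_size = TP * DP


def search_high_saturated(trial: Trial) -> bool:
    """True when the best feasible probe is within tolerance of search.high."""
    return SEARCH_HIGH - trial.highest_feasible_u <= TOLERANCE


# Illustrative numbers only, not measurements from the report.
incumbent = Trial(tp=4, dp=1, request_rate=2.0, pass_rate=0.97,
                  highest_feasible_u=0.0620)
print(per_gpu_objective(incumbent))      # 0.5
print(search_high_saturated(incumbent))  # True
```

Dividing by the parallel size is what makes a `TP=1, DP=8` config compete on equal terms with `TP=4, DP=1`: adding GPUs only helps if the feasible request rate grows at least as fast.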

diff --git a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
index dab994c..fbd0db7 100644
--- a/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
+++ b/docs/harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md
@@ -1,37 +1,38 @@
-# Qwen27B Tight-SLO 2x2 Harness Ablation - 2026-06-23
+# Qwen27B tight-SLO 2x2 harness ablation - 2026-06-23
 
-This note organizes the aggregate report generated at:
+This note organizes the following aggregate report and explains why the harness makes tuning faster and more effective:
 
 ```text
 .aituner-reports/qwen27b-tight-2x2-aggregate-20260623T005838Z/report.md
 ```
 
-The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
-It asks whether the harness supplies reusable search structure beyond a stronger
-LLM's free-form tuning proposals.
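+The experiment is a 2x2 ablation: model strength crossed with `use_harness`.
+The core question is whether the harness supplies reusable search structure,
+rather than an incidental gain from a stronger LLM or a longer prompt.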
 
-## Experiment Design
+## Experiment design
 
-Case: `qwen27b-tight-slo-2x2-aggregate`.
+Case: `qwen27b-tight-slo-2x2-aggregate`.
 
-Substrate:
+Substrate:
 
-- Model served: `qwen3.5-27b-256k-0223-internal`.
-- Hardware: H20, up to 8 GPUs.
-- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
-  `replay_time_scale=1.0`, `max_concurrency=32`.
-- SLO: pass rate >= 0.95, TTFT step rule of 2s for <=4096 input tokens,
-  4s for <=32768 input tokens, 6s above that, and TPOT <= 50 ms.
-- Search: `sampling_u` in `[0, 0.0625]`, tolerance 0.001, max 6 probes.
-- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
+- Served model: `qwen3.5-27b-256k-0223-internal`.
+- Hardware: H20, up to 8 GPUs.
+- Trace: `chat_w20260311_1000`, input length filtered to 0-8192 tokens,
+  `replay_time_scale=1.0`, `max_concurrency=32`.
+- SLO: pass rate >= 0.95; TTFT step rule of 2s for <=4096 input tokens,
+  4s for <=32768 input tokens, 6s above that; TPOT <= 50 ms.
+- Search: binary-search probes over `sampling_u` in `[0, 0.0625]`,
+  tolerance 0.001, max 6 probes.
+- Tunable envs: `VLLM_ENABLE_TORCH_COMPILE`.
 - Tunable flags: `tensor-parallel-size`, `data-parallel-size`,
   `expert-parallel-size`, `gpu-memory-utilization`, `block-size`,
   `max-num-batched-tokens`, `max-num-seqs`, `enable-prefix-caching`,
-  `enable-chunked-prefill`.
-- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
-  `{1,2,4,8}`, EP fixed to 1 for this case.
+  `enable-chunked-prefill`.
+- Topology constraints: TP and DP in `{1,2,4,8}`, allowed TP*DP products in
+  `{1,2,4,8}`, EP fixed to 1 for this case.
 
-Arms:
+2x2 arms:
 
 | Arm | Tuner model | Harness | Trial budget used |
 | --- | --- | --- | ---: |
@@ -40,15 +41,13 @@ Arms:
 | `gpt54mini_harness` | `gpt-5.4-mini` | on | 2 |
 | `gpt54mini_naive` | `gpt-5.4-mini` | off | 10 |
 
-The only intended axis inside each model pair is `use_harness`. The aggregate
-then compares whether the weaker model plus harness can match or exceed the
-stronger model without harness.
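+Within each tuner model, the main difference is `use_harness`. The cross-model comparison
+then checks whether the weaker model plus harness can match or exceed the stronger model's naive tuning.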
 
-## Aggregate Result
+## Aggregate result
 
-Reference best: `0.4429 req/s/GPU`.
-Target threshold for convergence comparisons: 95% of reference, or
-`0.4208 req/s/GPU`.
+Reference best: `0.4429 req/s/GPU`.
+Convergence target: 95% of reference, or `0.4208 req/s/GPU`.
 
 | Arm | Kind | Trials | Final req/s/GPU | Final/ref | Trials to target | Normalized AUC | Failed | No feasible |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
@@ -57,42 +56,154 @@ Target threshold for convergence comparisons: 95% of reference, or
 | `gpt54mini_harness` | harness | 2 | 0.4429 | 1.0000 | 2 | 0.9450 | 0 | 0 |
 | `gpt54mini_naive` | naive | 10 | 0.0231 | 0.0522 | - | 0.0498 | 1 | 1 |
 
-Harness wins both harness-vs-naive checks:
+Both harness-vs-naive checks pass:
 
 | Harness arm | Final vs best naive | AUC vs best naive | Pass |
 | --- | ---: | ---: | --- |
 | `gpt55_harness` | 16.2290x | 16.1296x | true |
 | `gpt54mini_harness` | 16.2290x | 16.0720x | true |
 
-The strongest ablation observation is that `gpt-5.4-mini + harness` matches
-`gpt-5.5 + harness` at the same final throughput and the same trials-to-target,
-while both naive arms remain more than 16x below the harness arms by final
-per-GPU throughput and AUC.
+The strongest ablation signal is that `gpt-5.4-mini + harness` and
+`gpt-5.5 + harness` reach the same final throughput, both hitting the target in 2 trials,
+while both naive arms remain more than 16x below the harness arms after using all 10 trials.
 
-## What The Harness Actually Did
+## Agent loop flowchart
 
-The harness did not perform generic "better prompting". It inserted a measured,
-structured decision protocol between trial results and the next proposal.
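+Below is an abstract flow of the current harnessed agent loop. The LLM can still take part in proposals,
+but it receives structured observations, a bottleneck diagnosis, candidate actions,
+and validator constraints instead of raw text history; the validator can also authorize a stop or block
+repeated failures and illegal configs.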
 
-Formally, after each trial `t`, AITuner observes:
-
-```text
-o_t = (config_t, probe history_t, pass-rate_t, latency/SLO failures_t,
-       request_rate_t, parallel_size_t, launch status_t)
+```mermaid
+flowchart TD
+    A[Study spec: trace, SLO, search range, tunable knobs] --> B[Run one engine config]
+    B --> C[Binary-search probes over sampling_u]
+    C --> D[Build observation o_t]
+    D --> E[Bottleneck classifier]
+    E --> F[Candidate family generator]
+    F --> G[Score candidate actions]
+    G --> H[Prompt renderer / planner]
+    H --> I[LLM or deterministic harness proposal]
+    I --> J{Config validator}
+    J -- invalid, repeated, unsafe --> F
+    J -- valid config_patch --> B
+    G --> K{Stop validator}
+    K -- search_high_saturated_by_incumbent --> L[Stop and keep incumbent]
+    K -- useful candidates remain --> H
 ```
 
-and optimizes:
+In this loop, the harness does not make the prompt prettier; it turns tuning into
+a measurement-constrained decision process:
+
+```text
+measurement -> diagnosis -> candidate family -> scored action -> validated proposal/stop
+```
+
+## Formal design: observation
+
+After each trial, AITuner does not just record a natural-language summary; it builds a structured observation:
+
+```text
+o_t = (
+  config_t,
+  probe_history_t,
+  pass_rate_t,
+  latency/SLO_failure_profile_t,
+  request_rate_t,
+  parallel_size_t,
+  launch_status_t,
+  prior_failures_t,
+  incumbent_t
+)
+```
+
+The most important observation fields in this experiment are:
+
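+- `config_t`: the current trial's `flag_patch` and `env_patch`, for example `TP=2, DP=1`.
+- `probe_history_t`: feasible/infeasible results from binary-search probes at different
+  `sampling_u` values.
+- `pass_rate_t`: the measured pass rate, checked against the 0.95 target.
+- `latency/SLO_failure_profile_t`: whether TTFT or TPOT triggers SLO pressure first.
+- `request_rate_t`: the request rate the current config sustains under the SLO.
+- `parallel_size_t`: the parallel size the config actually uses, needed to normalize the per-GPU objective.
+- `prior_failures_t`: configs that previously failed to launch or had no feasible probe, so they are not retried.
+- `incumbent_t`: the current best config and its `request_rate_per_gpu`.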
+
+The objective is:
 
 ```text
 J(config_t) = request_rate_t / parallel_size_t
-subject to pass_rate_t >= 0.95.
+subject to pass_rate_t >= 0.95
 ```
 
-The harness maps the observation into:
+That is, the harness optimizes `req/s/GPU` subject to the SLO, not raw throughput
+and not whatever config the LLM subjectively considers "stronger".
+
+## Formal design: bottleneck classifier
+
+The `bottleneck classifier` maps the observation to ranked bottleneck hypotheses:
 
 ```text
 b_t = ranked_bottleneck(o_t)
-A_t = candidate_knob_families(b_t, topology_constraints, prior_failures)
+```
+
+It does not ask "which knob looks commonly used" but "which system stage do the current SLO failures
+and latency profile show to be limiting the objective".
+
+Common classes include:
+
+| Bottleneck | Typical evidence | Preferred knob family |
+| --- | --- | --- |
+| `ttft_prefill` | TTFT near or above the SLO on long prompts; prefill service time is the limiter | Raise TP, adjust prefill batching |
+| `decode_tpot` | TPOT p95/p99 above the SLO; decode token latency is the limiter | Adjust `max-num-seqs`, raise TP, reduce decode contention |
+| `admission_queueing` | Waiting/arrival lag grows while service time alone is not necessarily worse | Raise DP, adjust admission/concurrency knobs |
+| `memory_kv` | KV cache pressure, preemption, OOM, launch failure | Adjust `gpu-memory-utilization`, `block-size`, sequence/token caps |
+| `topology_comm` | Raising TP lowers latency but per-GPU efficiency drops | Back off TP, compare the DP/TP tradeoff |
+
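+In this experiment, both harness arms identified the top-ranked bottleneck as
+`ttft_prefill`. The workload has heavy-tailed long prompts and a tight TTFT SLO,
+so single-request prefill service time is the main limiter. DP-only scaling adds replicas
+but does not shorten one long prompt's prefill path, so it is not the first priority.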
+
+## Formal design: candidate family
+
+The `candidate family generator` uses the bottleneck and topology constraints to produce a comparable
+action family:
+
+```text
+A_t = candidate_knob_families(
+  b_t,
+  topology_constraints,
+  prior_failures_t,
+  incumbent_t
+)
+```
+
+In this case:
+
+- `b_t = ttft_prefill`.
+- The allowed TP frontier is `TP=1 -> TP=2 -> TP=4 -> TP=8`.
+- The allowed DP frontier is `DP=1,2,4,8`, but DP-only does not directly relieve single-request prefill
+  latency.
+- EP is fixed to 1, so expert parallelism is not explored.
+- No topology had failed previously, so the launch risk of an adjacent TP probe is low.
+
+So the harness chose:
+
+```text
+trial-0001: TP=2, DP=1
+trial-0002: TP=4, DP=1
+```
+
+This does not hardcode "Qwen27B should use TP4". If the classifier had output
+`admission_queueing`, the candidate family would lean toward DP or `max-num-seqs`; had it output
+`memory_kv`, it would lean toward memory/cache/sequence knobs.
+
+## Formal design: scoring
+
+Every candidate action is scored with the same abstraction:
+
+```text
 score(a) = expected_bottleneck_relief(a)
          + information_gain(a)
          + launch_safety(a)
@@ -100,118 +211,122 @@ score(a) = expected_bottleneck_relief(a)
          - measurement_cost(a)
 ```
 
-For this workload, the ranked bottleneck was `ttft_prefill`: long, heavy-tailed
-prompts and a tight TTFT SLO made single-request prefill service time the
-active limiter. Under that bottleneck, the high-value candidate family is a
-legal TP frontier probe, because increasing TP can reduce prefill compute
-latency for one request. DP-only scaling adds replicas but does not shorten the
-single-request prefill path, so it can improve aggregate admission while still
-failing the per-request TTFT bottleneck and the per-GPU objective.
+In this experiment, the terms mean:
 
-The actual harness trajectory was:
+- `expected_bottleneck_relief`: TP2/TP4 are expected to reduce long-prefill compute latency,
+  acting directly on `ttft_prefill`.
+- `information_gain`: a TP frontier probe distinguishes "needs compute-latency relief"
+  from "admission/replica capacity is simply insufficient".
+- `launch_safety`: TP2/TP4 both satisfy the topology constraints and repeat no failed signature.
+- `regression_risk`: raising TP adds communication overhead and may hurt per-GPU efficiency, so it must be
+  validated with `request_rate_per_gpu`.
+- `measurement_cost`: each GPU trial is expensive, so a high-information topology probe takes priority over
+  several local runtime tweaks.
 
-| Arm | Trial | Patch | req/s/GPU | Pass rate | Diagnosis |
+The measured results bear out this scoring:
+
+| Arm | Trial | Patch | req/s/GPU | Pass rate | Interpretation |
 | --- | ---: | --- | ---: | ---: | --- |
-| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | TTFT/prefill; adjacent TP increase should reduce long-prefill latency. |
-| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | Ranked bottleneck is `ttft_prefill`; compare TP4 vs TP2 to distinguish compute-latency relief from replica/admission effects. |
-| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | TTFT/prefill; adjacent TP increase is the safest throughput-improving probe. |
-| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | Same `ttft_prefill` topology test as the stronger model. |
+| `gpt55_harness` | 1 | `TP=2, DP=1` | 0.2142 | 0.9572 | The adjacent TP probe already meets the SLO but does not yet saturate search high. |
+| `gpt55_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9718 | The TP frontier further relieves the prefill bottleneck and reaches the reference best. |
+| `gpt54mini_harness` | 1 | `TP=2, DP=1` | 0.1992 | 0.9707 | The weaker model also chooses the same mechanism path. |
+| `gpt54mini_harness` | 2 | `TP=4, DP=1` | 0.4429 | 0.9727 | The weaker model plus harness matches the stronger model plus harness. |
 
-The stop was also harness-mediated. Both harness arms stopped after trial 2
-because the validator authorized `harness_stop` with:
+## Formal design: validator stop
+
+A stop is not the LLM saying "I think this is good enough". Every stop must pass the `stop validator`:
 
 ```text
-search_high_saturated_by_incumbent
+stop(o_t, incumbent_t, search_state_t, candidate_set_t) -> true/false
 ```
 
-The recorded stop diagnosis was:
+The stop recorded in this experiment is:
 
 ```text
-The incumbent's highest measured probe is feasible and is within the configured
-binary-search resolution of search.high.
+tuning_stop_reason: harness_stop
+validator_reason: search_high_saturated_by_incumbent
+diagnosis: The incumbent's highest measured probe is feasible and is within the
+configured binary-search resolution of search.high.
 ```
 
-So the loop did not stop because an LLM guessed that tuning was done. It stopped
-because the incumbent saturated the configured search interval under the SLO
-within binary-search tolerance.
+This means:
 
-## Which Knobs Were Tuned
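+1. The incumbent's highest measured probe is already feasible.
+2. That feasible probe is within binary-search tolerance of `search.high`.
+3. Under the current search interval and SLO, spending more GPU trials is unlikely to raise the measured objective.
+4. The validator therefore authorizes the stop and keeps the current incumbent.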
 
-The winning harness configuration only changed topology:
+This gives the harness stop discipline: it neither stops early on LLM overconfidence nor
+keeps burning budget after search high is saturated.
+
+## Which knobs were actually tuned
+
+The harness winning path only changed topology:
 
 ```text
 base config + tensor-parallel-size=4, data-parallel-size=1
 ```
 
-The harness did not tune local scheduler/cache/memory knobs in the winning path.
-It deliberately tested topology before local runtime knobs because the active
-bottleneck was single-request TTFT/prefill service time.
+It tuned no scheduler/cache/memory knobs in the winning path because, under the `ttft_prefill`
+bottleneck, the first action is to shorten single-request prefill service time.
 
-The naive arms tuned a different knob family:
+The naive arms went in a different direction:
 
-| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
+| Arm | Topology used in all trials | Runtime knobs varied | Best req/s/GPU |
 | --- | --- | --- | ---: |
 | `gpt55_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization`, prefix caching, chunked prefill | 0.0273 |
 | `gpt54mini_naive` | `TP=1, DP=8` | `max-num-batched-tokens`, `max-num-seqs`, `block-size`, `gpu-memory-utilization` | 0.0231 |
 
-The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that
-horizontal data parallelism should maximize request rate because the model fits
-per GPU and TP would add communication overhead. Subsequent naive proposals kept
-that DP-heavy topology and searched scheduler/cache/memory details around it.
-Across 20 naive trial slots total, neither model entered the TP2/TP4 topology
-frontier that solved the bottleneck.
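+The first `gpt55_naive` proposal explicitly chose `TP=1, DP=8`, reasoning that the model fits on one GPU,
+so horizontal data parallelism should maximize request rate while TP would add communication overhead.
+Subsequent naive proposals kept the DP-heavy topology and searched only runtime knobs around it.
+Across 20 trial slots in total, neither naive arm entered the TP2/TP4 topology frontier.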
 
-## Why This Beats Baseline
+## Why this beats the baseline
 
-The baseline failed because it optimized the wrong causal path.
+The baseline failed because it optimized the wrong causal path.
 
-For a TTFT/prefill-bound workload, the relevant service-time term is the latency
-of one request's prefill path. A DP-heavy topology can run more independent
-replicas, but each replica still handles a long prompt with TP1 compute latency.
-Under a tight per-request TTFT SLO, those replicas do not unlock a much higher
-feasible `sampling_u`, and the objective divides by GPU usage. This is why
-`TP=1, DP=8` stayed near `0.02-0.027 req/s/GPU` despite using all GPUs.
+For a `ttft_prefill`-bound workload, the key service time is one request's prefill latency.
+A DP-heavy topology adds replicas, but each replica still handles a long prompt at TP1,
+so it cannot significantly shorten the single-request prefill path. Under a tight TTFT SLO, the feasible
+`sampling_u` stays very low; dividing by the GPU count then yields only
+`0.02-0.027 req/s/GPU`.
 
-The harness changed the optimization direction:
+The harness optimization path is:
 
 ```text
-observed SLO pressure -> classify as TTFT/prefill -> prefer legal TP frontier
--> measure per-GPU feasible rate under the same SLO -> stop when search.high is saturated
+observed SLO pressure
+-> classify as ttft_prefill
+-> choose legal TP frontier probe
+-> measure feasible req/s/GPU under the same SLO
+-> stop only when search.high is saturated by incumbent
 ```
 
-That sequence is measurable and falsifiable. If TP4 had improved raw latency but
-materially regressed `request_rate_per_gpu`, the harness proposal said it should
-reject the hypothesis. If the bottleneck had been admission/queueing with healthy
-TTFT/TPOT service times, the same knob-effect model would have favored DP or
-`max-num-seqs` instead. The decision was not "Qwen27B needs TP4"; it was
-"`ttft_prefill` evidence makes TP frontier the next highest-information probe
-under current constraints."
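+This path is measurable and falsifiable. If TP4 had lowered latency but
+`request_rate_per_gpu` had dropped materially, the harness would reject the hypothesis. If the
+bottleneck had been admission/queueing rather than TTFT/prefill, the same knob-effect model
+would have favored DP or `max-num-seqs` over the TP frontier.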
 
-This is also why the weak-model arm matters. The weaker `gpt-5.4-mini` with the
-harness converged to exactly the same TP frontier and final throughput as
-`gpt-5.5 + harness`, while the stronger `gpt-5.5` without harness stayed in the
-wrong DP-heavy family for its whole budget. The ablation therefore attributes the
-gain to the structured harness state and validators, not merely to a stronger
-language model or a more verbose prompt.
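+So the result is not "our prompt coaxed the model into saying TP4 for the Qwen27B case". More precisely,
+the harness used SLO-derived bottleneck evidence to steer the search toward the right knob family,
+then verified that direction with the per-GPU objective and validator stop.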
 
-## Evidence Boundary
+## Evidence boundary
 
-This report strongly supports the harness mechanism on the Qwen27B tight-SLO
-case and the model-strength ablation. It should not be overclaimed as universal
-proof by itself. The correct generalization claim is narrower:
+This report strongly supports the harness mechanism on the Qwen27B tight-SLO case, but it is not by itself proof of generality.
+The conclusions that currently hold are:
 
-- In this case, the harness improved final quality, convergence speed, AUC, and
-  stop discipline.
-- The harness made a weaker model match the stronger harnessed model and beat
-  the stronger naive model by more than 16x.
-- The successful decision was expressed in generic terms: SLO-derived
-  bottleneck classification, topology constraints, knob-effect scoring,
-  per-GPU objective, and validator-authorized stop.
-- Additional cases are still needed to show the same mechanism across different
-  bottlenecks, for example prefill scheduler pressure, decode TPOT pressure,
-  memory/KV pressure, and admission/queueing pressure.
+- In this case, the harness improved final quality, convergence speed, AUC, and
+  stop discipline.
+- `gpt-5.4-mini + harness` matches `gpt-5.5 + harness` and beats
+  `gpt-5.5 + naive` by more than 16x, so the gain comes mainly from the harness's structured state and validators rather than
+  from a stronger model alone.
+- The successful path used generic mechanisms: SLO-derived bottleneck classification, topology
+  constraints, knob-effect scoring, per-GPU objective, and validator-authorized stop.
+- Further validation on other bottlenecks/cases is still needed, for example prefill scheduler pressure,
+  decode TPOT pressure, memory/KV pressure, and admission/queueing pressure.
 
-## Original Aggregate Report
+## Excerpt from the original aggregate report
 
 ```text
 # qwen27b-tight-2x2-aggregate-20260623T005838Z