Document no-LLM harness mechanism

2026-06-25 10:32:29 +08:00
parent 013b01baa1
commit ce36cd79af
3 changed files with 385 additions and 3 deletions
--- a/docs/aituner-harness-summary.md
+++ b/docs/aituner-harness-summary.md
@@ -1,5 +1,23 @@
 # AITuner Harness Summary

+## No-LLM Deterministic Planner
+
+The current harness is more than a set of prompt hints for the LLM. Even without an LLM endpoint,
+it can act as a deterministic planner and complete a full tuning run:
+
+1. Run the baseline first to obtain real probe/SLO evidence.
+2. Build the trial profile and bottleneck hypotheses from the probe history.
+3. Generate legal candidate actions from the topology/runtime intervention grammar.
+4. Score the candidates by expected relief, information gain, launch safety, and regression risk.
+5. If a high-scoring candidate exists, write `harness-proposal-XXXX` directly.
+6. If the candidates are exhausted and the validator confirms that post-incumbent validation is
+   sufficient, write `harness-stop-XXXX`.
+7. Call the LLM only when the harness can neither propose nor stop; without an LLM endpoint,
+   the tune loop fails explicitly.
+
+For the full mechanism and the real Qwen30B no-LLM trajectory, see
+[No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md).
+
 ## What The Harness Adds

 The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -72,10 +72,10 @@ closed-form performance model of kernels, KV cache, communication, and queueing. The more robust and also stronger

 | Claim | Current status | Evidence docs | Key gaps |
 | --- | --- | --- | --- |
-| C1. The harness turns raw knob search into mechanism-guided intervention search and improves optimization under a fixed budget | Strong signal already | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Add the Qwen235B decode 2x2 aggregate; add the mechanism ablation |
-| C2. The gain comes from the harness-defined substrate and does not depend on a particular strong LLM | Partially available | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md) | Run `BO/heuristic + harness` vs `BO/heuristic + raw knobs` |
+| C1. The harness turns raw knob search into mechanism-guided intervention search and improves optimization under a fixed budget | Strong signal already | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Add the Qwen235B decode 2x2 aggregate; add the mechanism ablation |
+| C2. The gain comes from the harness-defined substrate and does not depend on a particular strong LLM | Qwen30B no-LLM closed loop complete; Qwen27B weak/strong model results available | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md) | Run `BO/heuristic + harness` vs `BO/heuristic + raw knobs` |
 | C3. Weak planner + harness can match or beat strong LLM naive | Supported on Qwen27B; Qwen235B in progress | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen235B prefill progress](harness-ablation/qwen235b-prefill-2x2-progress-20260623.md) | Finish Qwen235B decode 2x2; update the prefill final doc |
-| C4. Attribution and the intervention grammar contribute mechanistically, not just by adding prompt information | Design exists; rigorous evidence lacking | [AITuner summary](aituner-harness-summary.md) | Run shuffled attribution / no attribution / no grammar / no topology-first / no validator ablations |
+| C4. Attribution and the intervention grammar contribute mechanistically, not just by adding prompt information | Design and no-LLM case documented; rigorous ablation lacking | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [AITuner summary](aituner-harness-summary.md) | Run shuffled attribution / no attribution / no grammar / no topology-first / no validator ablations |
 | C5. AITuner finds a near-optimal region rather than just one feasible config | Qwen30B gives an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or an expert-config comparison |
 | C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done on Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
 | C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main experimental claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
--- a/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
+++ b/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
@@ -0,0 +1,364 @@
+# No-LLM Harness Mechanism - 2026-06-25
+
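+This document answers one core question: without calling an LLM, why can the harness still find a configuration automatically?
+
+The conclusion up front: no-LLM mode does not mean "no planner". The current harness is itself a
+deterministic planner. In AITuner the LLM is just a replaceable proposal backend; when the
+harness can derive the next step from observations, bottleneck attribution, candidate families, and the stop validator, the tuning
+loop uses the harness proposal directly and does not query the LLM.
+
+## Where the LLM sits in the tune loop
+
+In each round, `study tune` decides in this order: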
+
+```text
+state + study spec + workload/probe results
+        |
+        v
+build_harness_context(...)
+        |
+        +--> build_harness_stop_proposal(context)
+        |       if true: write harness-stop and exit
+        |
+        +--> build_harness_guided_proposal(context)
+        |       if true: run this deterministic proposal
+        |
+        +--> call_llm_for_proposal(...)
+                only if no harness stop/proposal exists
+```
+
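+So in a no-LLM run with `study.llm.endpoint = null`, as long as the harness can produce
+a deterministic proposal or a deterministic stop in every round, the whole experiment can run without any LLM call.
+If the harness can neither propose nor stop and there is no LLM endpoint, AITuner raises an error instead of
+silently degrading to random search.
+
+The current Qwen30B stopfix run is exactly such a complete closed loop: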
+
+```text
+.aituner/qwen30b-harness-only-medium-stopfix-dash1-20260624T144701Z/
+```
+
+It has no LLM endpoint, yet it completed 9 measured trials, after which the validator wrote
+`harness_stop`.
+
+## What the harness does is not prompt engineering
+
+What the harness does can be formalized as:
+
+```text
+H = (O, B, G, S, V)
+
+O: Observation schema
+   Converts the workload, trial probes, SLO failures, launch failures, and topology constraints
+   into structured state.
+
+B: Bottleneck attribution
+   Attributes SLO violations to a serving regime, e.g. ttft_prefill, decode_tpot,
+   admission_or_queueing, launch_or_memory.
+
+G: Intervention grammar
+   Organizes raw knobs into semantically meaningful candidate families, e.g. topology, batching,
+   sequence admission, KV memory headroom.
+
+S: Scoring policy
+   Scores candidate interventions and picks the next step that is most informative and most likely
+   to improve SLO-constrained req/s/GPU.
+
+V: Validator / stop policy
+   Blocks illegal, repeated, known-failing, or pointless proposals; allows stop only after the
+   remaining high-value candidates have been measured.
+```
+
+An LLM can read this structured information and generate a proposal, but in no-LLM mode `H` can generate
+proposals by itself. In other words, our core idea is to turn:
+
+```text
+raw config vector search
+```
+
+into:
+
+```text
+mechanism-guided intervention search
+```
+
+This is why it works even without an LLM.
+
+## Agent loop flowchart
+
+```mermaid
+flowchart TD
+  A[Baseline or latest measured trial] --> B[Load probe history and trial result]
+  B --> C[Build workload L-C-A profile]
+  B --> D[Build TrialProfile]
+  C --> E[Rank bottleneck hypotheses]
+  D --> E
+  E --> F[Generate legal candidate actions]
+  F --> G[Score candidates]
+  G --> H{High-value candidate?}
+  H -- yes --> I[Emit harness-proposal]
+  I --> J[Run real vLLM trial over search range]
+  J --> B
+  H -- no --> K{Validator stop allowed?}
+  K -- yes --> L[Emit harness-stop]
+  K -- no --> M{LLM endpoint exists?}
+  M -- yes --> N[Ask LLM backend]
+  M -- no --> O[Fail loudly: no proposal source]
+```
+
+## Observation: what the harness sees
+
+In each round the harness does not guess from natural-language logs; it reads structured state:
+
+- `StudySpec`
+  - hardware: GPU count, GPU model;
+  - engine: base flags/envs, tunable flags/envs, topology constraints;
+  - trace: request mode, window id, input-length filter, output-length override;
+  - SLO: TTFT/TPOT rule, target pass rate;
+  - search: load range, tolerance, probe budget.
+- `window_summary` / `WorkloadProfile`
+  - L: request length distribution, tail ratio;
+  - C: prefix/cache reuse;
+  - A: arrival rate, burstiness, interarrival variation.
+- Recent trials
+  - config patch;
+  - best feasible request rate;
+  - request_rate_per_gpu;
+  - pass rate;
+  - probe history;
+  - latency p50/p95/p99;
+  - SLO failure reason counts;
+  - launch/runtime failure stage.
+
+This data is condensed into `recent_trial_diagnostics` and `trial_profiles`; later steps use only these structured
+fields.
+
+## Bottleneck classifier: how the direction is chosen
+
+The harness maintains a set of ranked bottleneck hypotheses:
+
+```text
+ttft_prefill
+decode_tpot
+admission_or_queueing
+launch_or_memory
+```
+
+Its input is not a single threshold but several kinds of evidence:
+
+- workload default: a long prompt tail leans toward `ttft_prefill`;
+- request mode: decode-only with a TPOT SLO leans toward `decode_tpot`;
+- probe sequence: the active bottleneck of recent trials is weighted more heavily than that of older trials;
+- failed reason counts:
+  - `ttft_ms>...` supports `ttft_prefill`;
+  - `tpot_ms>...` supports `decode_tpot`;
+  - `arrival_lag_s>` / `probe_elapsed_s>` supports `admission_or_queueing`;
+- launch failure / OOM: supports `launch_or_memory`.
+
+In the code this is not a hard-coded single label but a ranked list with confidences. For example, if a recent probe
+clearly shows a TPOT failure, the `decode_tpot` score rises; if the workload also has a long prompt tail,
+`ttft_prefill` is still kept as a secondary hypothesis.
+
+## Candidate family: how raw knobs become interventions
+
+The harness does not sample blindly over all tunable flags. It first groups the knobs into
+intervention families with system-level meaning:
+
+| Family | Representative knobs | Mechanism |
+| --- | --- | --- |
+| topology | `tensor-parallel-size`, `data-parallel-size`, EP knobs | Changes per-request parallelism, replica count, and the communication/efficiency tradeoff |
+| batching | `max-num-batched-tokens`, `enable-chunked-prefill` | Changes prefill/decode batching and HoL blocking |
+| admission | `max-num-seqs` | Changes concurrent admission and the TPOT/TTFT tail |
+| KV memory | `gpu-memory-utilization` | Changes the number of KV cache blocks and the sustainable concurrency |
+| failure memory | failed signatures | Blocks repeats of known launch/runtime failure directions |
+
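+The key point: candidates come from the tunable schema and topology constraints of the current `StudySpec`.
+For example, topology candidates enumerate only legal TP/DP/EP combinations; if there is no direct evidence for EP, the generic
+topology search does not introduce EP on its own.
+
+## Scoring: why topology comes first, then gmu
+
+A candidate action is scored roughly as: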
+
+```text
+score = expected_bottleneck_relief * bottleneck_confidence
+      + information_gain
+      + launch_safety
+      - regression_risk
+```
+
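+`experiment_plan.next_action` then picks the highest-scoring candidate. If its score exceeds a threshold, the harness generates the
+proposal directly; otherwise control passes to the stop validator or the LLM fallback.
+
+This scoring encodes several system principles:
+
+1. Topology is the first-order serving decision.
+   While the TP frontier has not been fully measured, runtime tweaks such as `gpu-memory-utilization` and `max-num-seqs`
+   do not jump ahead of topology.
+
+2. Topology is not "bigger is better".
+   Both the scoring and the final winner use `request_rate_per_gpu`, not the raw request rate. TP4 may have higher total
+   throughput, but if per-GPU efficiency drops once more GPUs are used, it does not become the incumbent.
+
+3. Runtime tuning must be anchored on the incumbent topology.
+   Once the topology has been validated, runtime proposals preserve the current best topology and only tune
+   `gpu-memory-utilization`, `max-num-seqs`, and `max-num-batched-tokens` on top of it.
+
+4. Measurement decides the final answer.
+   A candidate is only a hypothesis; whether it is accepted is decided by the SLO-constrained
+   `request_rate_per_gpu` of a real trial.
+
+## Validator stop: why it does not stop too early
+
+Harness stop is not "stop once a decent config is found". The current stop validator checks several conditions: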
+
+- `search_high_saturated_by_incumbent`
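+  - the incumbent's highest feasible probe is already close to the configured search high;
+  - this means the current measurement range is saturated.
+- `topology_frontier_requires_probe`
+  - if the active bottleneck still calls for a higher TP frontier that has not been measured, stop is forbidden.
+- `experiment_plan_has_high_value_candidate`
+  - if a high-scoring candidate remains, stop is forbidden.
+- `post_incumbent_validation_exhausted`
+  - after a strong incumbent there must be at least some post-incumbent validation;
+  - the validation covers the topology/runtime families or reaches a sufficient count;
+  - no validation trial beats the incumbent;
+  - only then is a clean stop allowed.
+
+So the validator acts as a fail-safe: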
+
+```text
+a wrong proposal wastes at most one trial;
+a wrong stop ends the search, so it must be authorized by the deterministic validator.
+```
+
+## What actually happened in the Qwen30B no-LLM run
+
+Run:
+
+```text
+qwen30b-harness-only-medium-stopfix-dash1-20260624T144701Z
+```
+
+Setup:
+
+- Model: `Qwen/Qwen3-30B-A3B`
+- Engine: community vLLM 0.20
+- Hardware: 8x H20, TP/DP/EP frontier allowed
+- Trace: chat 0-8k, output 128, replay time scale 0.1
+- SLO: target pass rate 0.95, TTFT step rule, TPOT 50ms
+- LLM endpoint: `null`
+
+Real trial path:
+
+| Trial | Source | Config patch | req/s/GPU | pass rate | Harness explanation |
+| --- | --- | --- | ---: | ---: | --- |
+| 0001 | baseline | `{}` | 2.2000 | 1.0000 | Establishes the baseline and probe evidence |
+| 0002 | harness | `TP=2` | 3.2583 | 1.0000 | Under latency/SLO pressure, probes the adjacent TP first |
+| 0003 | harness | `TP=4` | 2.0917 | 1.0000 | Validates the higher TP frontier; raw total throughput is higher but per-GPU is lower |
+| 0004 | harness | `TP=2, gmu=0.92` | 3.2583 | 1.0000 | Topology has settled; starts the KV headroom climb on the incumbent topology |
+| 0005 | harness | `TP=2, gmu=0.94` | 3.2583 | 1.0000 | Continues the small-step hill climb; no improvement but no failure |
+| 0006 | harness | `TP=2, gmu=0.96` | 3.3333 | 1.0000 | KV headroom yields a higher feasible frontier |
+| 0007 | harness | `TP=2, gmu=0.97` | 3.4333 | 1.0000 | Reaches the safe ceiling and becomes the incumbent |
+| 0008 | harness | `TP=4, DP=2` | 1.0458 | 1.0000 | Post-incumbent topology validation; does not beat the incumbent |
+| 0009 | harness | `TP=8` | 1.0458 | 1.0000 | Continued frontier validation; does not beat the incumbent |
+| 0010 | harness stop | stop | - | - | validator: `post_incumbent_validation_exhausted` |
+
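+No external LLM made any decision in this process. Every proposal came from the harness:
+
+1. the baseline observes the feasible frontier of the current engine under the SLO;
+2. the bottleneck/mechanism model treats topology as the first-order intervention;
+3. TP2 is measured and accepted, because per-GPU rises from 2.2 to 3.2583;
+4. TP4 is measured and rejected as incumbent, because per-GPU drops to 2.0917;
+5. once the topology frontier settles, `gpu-memory-utilization` is raised in small steps on TP2;
+6. `gmu=0.97` reaches 3.4333;
+7. nearby topologies are measured again, confirming nothing is better;
+8. the validator authorizes stop.
+
+## Why this is not hard-coded for Qwen30B
+
+The path may look as if "the harness knows the answer is TP2+gmu0.97", but the code is not written that way.
+
+What is not hard-coded: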
+
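+- the model name is not hard-coded;
+- Qwen30B is not hard-coded;
+- `TP=2` is not hard-coded as the final answer;
+- `gmu=0.97` is not hard-coded as the best value;
+- real measurement is not skipped;
+- TP4/TP8 are not rejected up front; they are actually run and compared.
+
+The domain knowledge actually built into the harness is:
+
+- TP/DP/EP form the topology family and must satisfy the topology constraints;
+- topology is usually the first-order serving intervention and must be validated before runtime tweaks;
+- raw throughput is not the objective; comparisons across topologies use `request_rate_per_gpu`;
+- `gpu-memory-utilization` is a KV memory headroom tweak and should only be hill-climbed in small steps on the incumbent topology;
+- launch failures and tested signatures are hard negative evidence;
+- stop must be authorized by the validator; the proposer cannot stop on its own say-so.
+
+These are system-mechanism constraints, not a case-specific prompt.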
+
+## How it differs from BO / raw heuristics
+
+The search space of plain BO or a raw heuristic is usually:
+
+```text
+config = {tp, dp, ep, gmu, max_num_seqs, max_num_batched_tokens, ...}
+score = measured req/s/GPU
+```
+
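+This causes several problems:
+
+- it does not know which knobs belong to the topology family and which to the runtime family;
+- it may waste many trials tuning runtime before the TP frontier has been measured;
+- it may repeat known launch failures;
+- it may mistake a config with high raw throughput but poor GPU efficiency for a promising direction;
+- it can hardly explain "which bottleneck hypothesis this trial is trying to falsify".
+
+The harness-shaped search space is: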
+
+```text
+state -> bottleneck hypothesis -> legal intervention family -> measured verdict
+```
+
+BO, a bandit, an LLM, or a deterministic heuristic can therefore all plug in behind the harness. What they optimize is not a
+raw knob vector but an intervention graph with serving semantics.
+
+This is also the core of our new framing:
+
+```text
+black-box optimization
+  -> grey-box / mechanism-guided experimental optimization
+```
+
+## Evidence still to be added
+
+The no-LLM Qwen30B run shows that the deterministic harness can close the loop end to end, but the paper still needs:
+
+1. Planner-agnostic ablation
+   - `raw BO` vs `harness-guided BO`;
+   - `raw heuristic` vs `harness deterministic policy`;
+   - show that the gain comes from the harness substrate rather than from a particular LLM.
+
+2. Mechanism ablation
+   - no attribution;
+   - shuffled attribution;
+   - no topology-first;
+   - no intervention grammar;
+   - no validator/failure memory.
+
+3. Near-optimum evidence
+   - run a local grid on 1-2 cases;
+   - show that the harness path finds a near-optimal region, not just one feasible config.
+
+4. Cross-case robustness
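+   - pick one more decode-heavy or long-prefill case;
+   - verify that the candidate family switches sensibly across workloads/SLOs.
+
+## One-sentence summary
+
+The no-LLM harness can find a configuration automatically because it already implements an experiment planner built around serving mechanisms:
+it attributes trial observations to a bottleneck, maps the bottleneck to a legal intervention family, updates the incumbent from
+real SLO-constrained req/s/GPU measurements, and finally lets the validator decide whether to stop.
+The LLM is only a replaceable proposal backend for this planner, not a necessary core of AITuner.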
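
The scoring formula and the propose / stop / LLM / fail ordering described in the new document can be sketched roughly as follows. This is a minimal illustration, not the AITuner implementation: the `Candidate` class, the `next_decision` helper, the `threshold` default of 0.5, and all numeric inputs are hypothetical. The ordering follows the flowchart (candidate check first, then validator stop), which is equivalent to the text diagram because the validator forbids stop while a high-value candidate remains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One candidate intervention (hypothetical container for the score terms)."""
    name: str
    expected_bottleneck_relief: float
    bottleneck_confidence: float
    information_gain: float
    launch_safety: float
    regression_risk: float

    def score(self) -> float:
        # Same shape as the formula in the "Scoring" section of the doc.
        return (self.expected_bottleneck_relief * self.bottleneck_confidence
                + self.information_gain
                + self.launch_safety
                - self.regression_risk)


def next_decision(candidates: List[Candidate], stop_allowed: bool,
                  has_llm_endpoint: bool, threshold: float = 0.5) -> str:
    """Pick the decision source for one round (threshold value is assumed)."""
    best: Optional[Candidate] = max(candidates, key=Candidate.score, default=None)
    if best is not None and best.score() > threshold:
        return f"harness-proposal:{best.name}"   # deterministic proposal
    if stop_allowed:
        return "harness-stop"                    # validator-authorized stop
    if has_llm_endpoint:
        return "llm-proposal"                    # fallback backend
    # No proposal source at all: fail loudly instead of searching randomly.
    raise RuntimeError("no proposal source: cannot propose or stop, no LLM endpoint")


# Illustrative inputs only; these numbers are not taken from the real run.
tp2 = Candidate("TP=2", 0.8, 0.9, 0.3, 0.2, 0.1)
gmu = Candidate("gmu=0.92", 0.3, 0.9, 0.1, 0.2, 0.1)
print(next_decision([tp2, gmu], stop_allowed=False, has_llm_endpoint=False))
```

With these made-up inputs the topology candidate outranks the runtime candidate, so the round is decided by the harness without any LLM call; with no candidates, no stop permission, and no endpoint, the helper raises instead of returning a guess.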