Update AITuner roadmap framing

2026-06-24 11:45:42 +08:00
parent c0a9235b80
commit d85572e7b5
1 changed files with 186 additions and 52 deletions
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -1,81 +1,215 @@
 # AITuner roadmap

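-This document maintains only a minimal roadmap: what we want to claim, where the existing evidence is, and which evidence is still missing.
-Detailed experiment procedures go into the corresponding topic documents; this file is not a running log.
+This document maintains only a minimal roadmap: paper framing, the claim tree, existing evidence, and the highest-priority experiments.
+Detailed experiment logs go into the corresponding topic documents.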

-## Paper frame
+## Paper thesis

-The core of AITuner is not "tuning knobs with an LLM" but an SLO-aware tuning agent workflow:
+The core of AITuner is not "tuning knobs with an LLM". A more accurate framing is:

 ```text
-measurement -> observation -> bottleneck classifier -> candidate family
-            -> SLO-constrained scoring -> validator -> proposal / stop
+black-box knob optimization
+  -> grey-box / mechanism-guided experimental optimization
 ```

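-The LLM acts as the planner and is not the only contribution. The harness gives the planner
-domain-specific system knowledge and decision boundaries, turning tuning from open-ended
-knob guessing into a measurement-constrained optimization process.
+In other words, AITuner still measures the objective function through real experiments, but it no
+longer treats the serving engine as a fully black-box `config vector -> scalar score`. The harness
+converts the workload, SLO failures, probe traces, topology constraints, and failure memory into a
+structured serving mechanism state, and restricts the next search step to interpretable, verifiable interventions.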

-## Scope decision
-
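-The main line of the paper focuses on the vLLM serving engine first, so that the argument for the workflow/harness mechanism is complete:
+The LLM is therefore not an irreplaceable core. It is a planner backend / copilot; the core systems
+contribution is a planner-agnostic tuning substrate: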

 ```text
-vLLM cases first: workflow/harness effectiveness, mechanism, robustness, near-optimum evidence
-multi-engine later: engine adapter abstraction, low adaptation cost, one SGLang-style validation case
+Harness H = (O, R, G, V, M)
+
+O: observation schema
+   workload L/C/A profile + probe trace + latency/SLO failure + launch status
+
+R: regime attribution
+   SLO violation -> prefill-bound / decode-bound / admission-bound / memory-bound / launch-bound
+
+G: serving intervention grammar
+   regime -> legal intervention families, not raw arbitrary knobs
+
+V: validator
+   tunable schema + topology constraints + no-repeat + failure memory + stop authority
+
+M: measurement/scoring protocol
+   SLO-constrained feasible frontier, req/s/GPU, latency quantiles, pass-rate guard
 ```

-The main claim is therefore not written as "fully validated on all serving engines". A safer statement is:
+The planner is replaceable:

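-- AITuner's control loop uses an engine adapter abstraction: launch recipe, healthcheck,
-  OpenAI-compatible request API, engine-specific flag/env mapping, topology constraints.
-- Current experiments concentrate on vLLM because the vLLM cases are enough to fully show whether the harness workflow is effective.
-- Compatibility with other serving engines is argued as architecture portability; it is upgraded to an
-  evaluated claim only after adding an adapter for an engine such as SGLang and at least one validation case.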
+```text
+pi in {LLM, BO, bandit, deterministic heuristic, tree search}
+```

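-## Design points
+AITuner's strong claim should be: the same planner placed in the harness-shaped space converges faster,
+is more stable, and ends closer to the optimum than in the raw knob space; weak models and non-LLM
+planners also benefit from this substrate.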

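-| Design point | Role | Property to be shown |
-| --- | --- | --- |
-| Observation | Structures the config, probe history, SLO failures, latency profile, incumbent, and failed signatures | The LLM sees computable state, not just natural-language logs |
-| Bottleneck classifier | Separates TTFT/prefill, decode TPOT, admission/queueing, and memory/launch failure | Proposal direction is consistent with the measured bottleneck |
-| Candidate family | Maps the bottleneck to a topology/runtime/cache/admission knob family | The search space is compressed without hard-coding any single case |
-| SLO-constrained scoring | Evaluates candidates by `max feasible req/s/GPU` | The optimization objective matches the production SLO and does not chase raw throughput |
-| Validator / stop | Blocks illegal, repeated, and failed families; stops when the search high is saturated or no valid candidate remains | Reduces GPU burn while avoiding premature stopping |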
+## Why not pure white-box
+
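+We should not claim full white-box optimization. AITuner has no closed-form performance model that analyzes the
+vLLM scheduler, kernels, KV cache, communication, and queueing. The safer and also stronger framing is grey-box: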
+
+- the objective is still determined by real measurement;
+- the action space, constraints, failure attribution, and intervention semantics are driven by system knowledge;
+- each trial is a counterfactual experiment rather than a blind sample of a knob vector.
+
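+## Key design points
+
+| Design point | Stronger framing | Role | What must be shown |
+| --- | --- | --- | --- |
+| Observation | mechanism state | Structures workload shape, probe trace, SLO failure, and launch/memory failure | The agent sees computable state, not natural-language logs |
+| Bottleneck classifier | SLO violation attribution | Attributes a failure to a serving regime instead of only checking which metric exceeds its threshold | Attribution is causally linked to the interventions that subsequently work |
+| Candidate family | serving intervention grammar | Lifts raw knobs into topology / batching / admission / memory interventions | The search space is compressed without hard-coding any particular case |
+| Scoring | counterfactual verdict | Uses the SLO frontier and req/s/GPU to judge whether an intervention supports the hypothesis | The final winner is decided by measurement, not by the LLM |
+| Validator / stop | fail-safe control | Forbids illegal, repeated, and known-failed families; only the validator may authorize a stop | A wrong attribution at most wastes trials and never contaminates the incumbent |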

 ## Claim roadmap

-| Claim | Current status | Evidence docs | Gap |
+| Claim | Current status | Evidence docs | Key gap |
 | --- | --- | --- | --- |
-| Harness converges faster and reaches a higher ceiling than naive | Strong evidence exists | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Complete the Qwen235B decode 2x2 aggregate |
-| The harness gain does not come from the strong model itself | One complete 2x2 exists; a second is running | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen235B prefill progress](harness-ablation/qwen235b-prefill-2x2-progress-20260623.md) | Finish Qwen235B decode 2x2; update the prefill final doc |
-| Harness is robust to SLO changes | Qwen30B done | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one second case for an SLO sweep rather than immediately spreading across many cases |
-| Harness finds a reasonable / near-optimum config | Partial explanation exists; rigorous evidence is lacking | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md), [AITuner summary](aituner-harness-summary.md) | Run a local grid or expert-config comparison on 1-2 cases |
-| Every harness component contributes | Design explanation exists; component ablation is lacking | [AITuner summary](aituner-harness-summary.md) | Run no-classifier / no-family / no-validator / no-stop ablations |
-| The workflow can be ported to other serving engines through an engine adapter | Feasible by design; not a main experimental claim for now | Current `EngineLaunchSpec` / launch recipe abstraction | After the vLLM main line is done, build an SGLang adapter and one low-cost validation case |
+| C1. The harness turns raw knob search into mechanism-guided intervention search and improves optimization under a fixed budget | Strong signal exists | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Add the Qwen235B decode 2x2 aggregate; add the mechanism ablation |
+| C2. The gain comes from the harness-defined substrate and does not depend on a particular strong LLM | Partially supported | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md) | Run `BO/heuristic + harness` vs `BO/heuristic + raw knobs` |
+| C3. Weak planner + harness can match or exceed strong LLM naive | Supported on Qwen27B; Qwen235B in progress | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen235B prefill progress](harness-ablation/qwen235b-prefill-2x2-progress-20260623.md) | Finish Qwen235B decode 2x2; update the prefill final doc |
+| C4. Attribution and the intervention grammar contribute mechanistically, beyond simply adding prompt information | Design exists; rigorous evidence is lacking | [AITuner summary](aituner-harness-summary.md) | Run shuffled attribution / no attribution / no grammar / no topology-first / no validator ablations |
+| C5. AITuner finds a near-optimal region, not just one feasible config | Qwen30B gives an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or expert-config comparison |
+| C6. AITuner moves to the appropriate frontier as SLO tightness changes | Qwen30B done | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more case of a different kind for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
+| C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main experimental claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM main line is done, build an SGLang adapter and one low-cost validation case |

-## Current top priorities
+## Highest-priority experiments

-1. Finish and write up the Qwen235B decode 2x2.
-   Goal: answer whether `weak model + harness` can beat `strong model + naive`.
+### P0. Finish the Qwen235B decode 2x2 and compile the aggregate

-2. Update the Qwen235B prefill 2x2 final document.
-   Goal: keep the roadmap from pointing at a stale progress document.
+Purpose: fill in the most central `harness on/off x strong/weak planner` evidence and answer:

-3. Select one case for the near-optimum proof.
-   Goal: use a small grid or an expert-config comparison to show that the harness finds a reasonable optimal region rather than a
-   prompt coincidence.
+```text
+weak LLM + harness >= strong LLM naive ?
+```

-4. Decide on the second SLO robustness case.
-   Goal: show that the result holds beyond Qwen30B, without blindly spreading experiments yet.
+Expected outputs:

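-5. Design the engine adapter migration experiment, but defer running it.
-   Goal: once the vLLM evidence chain is complete, use one SGLang-style case to show that adaptation cost is low, rather than diverting the main line early.
+- 2x2 table: best-so-far req/s/GPU for each arm under the same iteration budget;
+- convergence curve / normalized AUC;
+- trial path and main config patches for each arm;
+- an explanation of why naive goes wrong and how the harness reaches the right intervention through regime attribution.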
+
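+Priority rationale: the experiment is already running, the incremental cost is the lowest, and it directly supports C1/C3.
+
+### P1. Planner-agnostic substrate experiment
+
+Purpose: show that AITuner is not an LLM tuner but a harness-defined optimization substrate.
+
+Minimal experiment matrix: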
+
+| Planner | Raw knob space | Harness intervention space |
+| --- | --- | --- |
+| deterministic heuristic | raw heuristic | harness policy |
+| BO or lightweight bandit | raw BO | harness-guided BO |
+| weak LLM | naive weak LLM | weak LLM + harness |
+| strong LLM | naive strong LLM | strong LLM + harness |
+
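+If implementing BO is expensive, first use the deterministic harness policy as the non-LLM planner baseline:
+that alone would already show that the approach works without an LLM. BO can be added afterwards to strengthen the argument.
+
+Expected plots:
+
+- x-axis: trial budget;
+- y-axis: best-so-far SLO-constrained req/s/GPU;
+- line groups: raw knob space vs harness intervention space;
+- separate bars: invalid launch rate, repeated config rate, wasted trial rate.
+
+Priority rationale: this is the key experiment for the new framing. Without it, the paper is still easy to read as "LLM prompt
+engineering".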
+
+### P2. Mechanism ablation
+
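+Purpose: show that the harness is not just a pile of extra information, and that attribution, the intervention grammar,
+and the validator each contribute an effective mechanism.
+
+Suggested ablations: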
+
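+| Variant | What is removed / broken | Expected to show |
+| --- | --- | --- |
+| full AITuner | Nothing | Best |
+| no attribution | No regime attribution; only the scalar score and historical results are given | Attribution contributes to choosing the direction |
+| shuffled attribution | Regime labels are deliberately shuffled while text length is preserved | The gain comes from semantic correctness, not from more prompt tokens |
+| no intervention grammar | Any tunable knob is allowed; family guidance is removed | Action-space shaping contributes |
+| no topology-first | Runtime knobs may take priority over topology interventions | Topology is the first-order decision in LLM serving |
+| no validator/failure memory | Repeated configs and known launch-failure families are allowed | Fail-safe control reduces GPU burn |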
+
+Expected plots:
+
+- mechanism ablation bars: final best, AUC, TTT;
+- waste breakdown: invalid launch, repeated config, wrong-family trial;
+- case study trace: a comparison of the first 3-5 proposals of each variant.
+
+Priority rationale: this is the core evidence for answering novelty concerns.
+
+### P3. Near-optimum / expert baseline evidence
+
+Purpose: show that AITuner does not merely find a config that "converges but performs poorly".
+
+Prefer one cost-controlled case for a local grid:
+
+```text
+topology: TP/DP frontier
+runtime: small neighborhood of max-num-seqs, max-num-batched-tokens, gpu-memory-utilization
+objective: max feasible req/s/GPU under pass_rate >= 0.95
+```
+
+Expected plots:
+
+- local grid heatmap;
+- AITuner trial path overlay;
+- AITuner best vs grid best vs expert config;
+- near-optimum gap, e.g. `AITuner >= 95% of local grid optimum`.
+
+Priority rationale: this is necessary evidence for the claim "tunes to the best config, not a poor converged config".
+
+### P4. Second SLO robustness case
+
+Purpose: show that the SLO robustness seen on Qwen30B is not a single-case phenomenon.
+
+Do not roll out a large sweep first. Start with one case whose mechanism differs from Qwen30B:
+
+- a decode-heavy case, to observe TP/DP redistribution and concurrency/memory interventions;
+- or a long-prefill / tight-TTFT case, to observe TP and prefill batching interventions.
+
+Expected plots:
+
+- x-axis: SLO tightness;
+- y-axis: best feasible req/s/GPU;
+- marker/color: selected intervention regime;
+- annotation: final TP/DP/MNS/MBT;
+- show the frontier/right shift or regime transition as the SLO is relaxed.
+
+Priority rationale: important, but it should come after the planner-agnostic and mechanism ablation experiments.
+
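+### P5. SGLang / multi-engine adapter validation
+
+Purpose: show that the intervention grammar can be lowered to different serving engines through an adapter.
+
+Deferred for now; it is not a high-priority experiment ahead of the vLLM main line. Once C1-C5 are stable, run one low-cost case: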
+
+```text
+same workload profile
+same SLO objective
+same intervention grammar
+different engine adapter
+```
+
+Priority rationale: it extends generality but cannot replace the mechanism proof on the vLLM main line.

 ## Not doing for now

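-- Do not launch many SLO sweeps at once for now; first align the paper frame and the highest-value claims.
-- Do not rewrite all old documents in Chinese for now; use Chinese only for new evidence documents and the roadmap, and migrate old documents gradually as needed.
-- Do not claim global optimality for now; until a grid/expert baseline exists, claim only that near-optimum evidence is pending.
-- Do not put multi-engine support into the main claim for now; describe it as an adapter-based design, and add SGLang validation
-  once the vLLM mechanism evidence is complete.
+- Do not write the main claim as "the LLM is smarter than BO". The new claim is that the harness substrate is useful
+  for many kinds of planners.
+- Do not claim full white-box or global optimality. The safer framing for now is grey-box, near-optimum,
+  fixed-budget utility.
+- Do not spread out across many SLO sweeps. First fill in the mechanism ablation, planner-agnostic, and near-optimum experiments.
+- Do not put multi-engine support into the main experimental claim. Describe it as an adapter-based design, and add one SGLang validation
+  once the vLLM evidence chain is complete.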
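
The measurement/scoring protocol `M` and the near-optimum gap from P3 can be illustrated with a short Python sketch. This is only an assumption-laden illustration, not AITuner code: the `Trial` record, its field names, the helper functions, and the sample numbers are all hypothetical. Only the `pass_rate >= 0.95` guard and the "95% of local grid optimum" target come from the roadmap text.

```python
from dataclasses import dataclass
from typing import List, Optional

PASS_RATE_GUARD = 0.95  # guard taken from the P3 objective in the roadmap


@dataclass(frozen=True)
class Trial:
    """One measured configuration (hypothetical record, not the AITuner schema)."""
    name: str
    req_per_s_per_gpu: float
    pass_rate: float        # fraction of requests that met the SLO
    launched: bool = True   # False if the engine failed to start


def feasible_score(trial: Trial) -> Optional[float]:
    """Return req/s/GPU if the trial launched and passes the SLO guard, else None."""
    if not trial.launched or trial.pass_rate < PASS_RATE_GUARD:
        return None
    return trial.req_per_s_per_gpu


def best_so_far(trials: List[Trial]) -> List[float]:
    """Best feasible score after each trial; stays 0.0 until a trial is feasible."""
    curve: List[float] = []
    best = 0.0
    for trial in trials:
        score = feasible_score(trial)
        if score is not None and score > best:
            best = score
        curve.append(best)
    return curve


def near_optimum_ratio(tuner_best: float, grid_best: float) -> float:
    """Ratio of the tuner's best feasible score to the local grid optimum."""
    if grid_best <= 0:
        raise ValueError("grid_best must be positive")
    return tuner_best / grid_best


# Made-up numbers, for illustration only.
trials = [
    Trial("baseline", 1.0, 0.99),
    Trial("fast-but-violating", 3.0, 0.80),             # infeasible: below the guard
    Trial("launch-failure", 0.0, 0.0, launched=False),  # infeasible: never started
    Trial("tuned", 1.9, 0.96),
]
curve = best_so_far(trials)
print(curve)
print(near_optimum_ratio(curve[-1], grid_best=2.0))
```

The infeasible trial with the highest raw throughput never enters the curve, which is the point of scoring on the SLO-constrained frontier rather than on raw throughput.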