
# AITuner roadmap

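This document maintains only the minimal roadmap: paper framing, the claim tree, existing evidence, and the highest-priority experiments.
Detailed experiment logs belong in the corresponding topic documents.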

## Paper thesis

The core of AITuner is not "tuning knobs with an LLM." A more accurate framing is:

```text
black-box knob optimization
  -> grey-box / mechanism-guided experimental optimization
```

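In other words, AITuner still measures the objective function through real experiments,
but it no longer treats the serving engine as a fully black-box
`config vector -> scalar score` mapping. The harness converts workload, SLO failures,
probe traces, topology constraints, and failure memory into a structured serving
mechanism state, and restricts the next search step to interpretable, verifiable interventions.

The LLM is therefore not the irreplaceable core. It is a planner backend / copilot; the
core system contribution is a planner-agnostic tuning substrate: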

```text
Harness H = (O, R, G, V, M)

O: observation schema
   workload L/C/A profile + probe trace + latency/SLO failure + launch status

R: regime attribution
   SLO violation -> prefill-bound / decode-bound / admission-bound / memory-bound / launch-bound

G: serving intervention grammar
   regime -> legal intervention families, not raw arbitrary knobs

V: validator
   tunable schema + topology constraints + no-repeat + failure memory + stop authority

M: measurement/scoring protocol
   SLO-constrained feasible frontier, req/s/GPU, latency quantiles, pass-rate guard
```
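
As a rough illustration of how `R`, `G`, and `V` fit together, the sketch below maps each regime to a set of intervention families and lets a validator filter out families already recorded as failing. The names `Regime`, `GRAMMAR`, and `Validator`, and the specific regime-to-family mapping, are hypothetical and are not taken from `src/aituner/harness.py`; only the regime labels and the four family names (topology, batching, admission, memory) come from this document.

```python
from dataclasses import dataclass, field
from enum import Enum


class Regime(Enum):
    """R: regimes an SLO violation can be attributed to."""
    PREFILL_BOUND = "prefill-bound"
    DECODE_BOUND = "decode-bound"
    ADMISSION_BOUND = "admission-bound"
    MEMORY_BOUND = "memory-bound"
    LAUNCH_BOUND = "launch-bound"


# G: regime -> legal intervention families.
# The mapping itself is an illustrative assumption, not the real grammar.
GRAMMAR = {
    Regime.PREFILL_BOUND: ("topology", "batching"),
    Regime.DECODE_BOUND: ("topology", "admission"),
    Regime.ADMISSION_BOUND: ("admission", "batching"),
    Regime.MEMORY_BOUND: ("memory", "topology"),
    Regime.LAUNCH_BOUND: ("topology", "memory"),
}


@dataclass
class Validator:
    """V: drops families that failure memory has already marked as failing."""
    failed_families: set = field(default_factory=set)

    def legal_families(self, regime):
        return [f for f in GRAMMAR[regime] if f not in self.failed_families]


validator = Validator(failed_families={"memory"})
print(validator.legal_families(Regime.MEMORY_BOUND))  # ['topology']
```

The point of the sketch is that a planner never sees raw knobs directly: it receives only the families that survive both the grammar and the validator.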

The current `src/aituner/harness.py` is a prototype: it already demonstrates the feasibility
of a no-LLM loop and mechanism-guided proposals, but it still contains many rule-based
heuristics and cannot serve as the final harness contribution. The new target design is described in [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md).

The planner is replaceable:

```text
pi in {LLM, BO, bandit, deterministic heuristic, tree search}
```
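
One way to picture a replaceable planner is a small interface that only ever chooses among candidates the harness has already declared legal. The sketch below is an assumption-laden illustration: `Planner`, `FirstCandidatePlanner`, `RandomPlanner`, `propose`, and the candidate strings are all hypothetical names, not the project's API.

```python
import random
from typing import Protocol, Sequence


class Planner(Protocol):
    """Any backend (LLM, BO, bandit, heuristic) only needs a choose() method."""

    def choose(self, candidates: Sequence[str]) -> str: ...


class FirstCandidatePlanner:
    """Deterministic heuristic: take the first candidate offered."""

    def choose(self, candidates):
        return candidates[0]


class RandomPlanner:
    """Seeded random baseline over the same legal candidates."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def choose(self, candidates):
        return self._rng.choice(list(candidates))


def propose(planner, legal_candidates):
    """Ask the planner for a choice and reject anything outside the legal set."""
    choice = planner.choose(legal_candidates)
    if choice not in legal_candidates:
        raise ValueError(f"planner returned illegal candidate: {choice!r}")
    return choice


candidates = ["tp-downshift", "gmu-climb", "mns-raise"]  # made-up labels
print(propose(FirstCandidatePlanner(), candidates))  # tp-downshift
```

Swapping planners then changes only the object passed to `propose`; the legality check stays with the harness side.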

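AITuner's strong claim should be: the same planner, placed in the harness-shaped space,
converges faster, more stably, and closer to the optimum than in the raw knob space; weak
models and non-LLM planners also benefit from this substrate.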

## Why not pure white-box

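We should not claim full white-box optimization. AITuner has no closed-form performance
model of the vLLM scheduler, kernels, KV cache, communication, and queueing. The safer and also stronger framing is grey-box:

- the objective is still determined by real measurement;
- the action space, constraints, failure attribution, and intervention semantics are driven by system knowledge;
- each trial is a counterfactual experiment rather than a blind sample of a knob vector.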

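## Key design points

The current harness design semantics, module assumptions, and limitations are described in
[AITuner Harness Design Contract](aituner-harness-design-contract.md). The roadmap tracks only
claims and experiment priorities; the design contract precisely defines what we can and cannot say.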

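| Design point | Stronger framing | Role | What must be shown |
| --- | --- | --- | --- |
| Observation | mechanism state | Structures workload shape, probe trace, SLO failure, and launch/memory failure | The agent sees a computable state, not natural-language logs |
| Bottleneck classifier | SLO violation attribution | Attributes failures to a serving regime rather than only checking which metric exceeded its threshold | Attribution is causally linked to the subsequent effective intervention |
| Candidate family | serving intervention grammar | Lifts raw knobs into topology / batching / admission / memory interventions | The search space is compressed without hard-coding any specific case |
| Scoring | counterfactual verdict | Uses the SLO frontier and req/s/GPU to judge whether an intervention supports the hypothesis | The final winner is decided by measurement, not by the LLM |
| Validator / stop | fail-safe control | Forbids illegal, repeated, and known-failing families; only the validator can authorize stop | A wrong attribution at worst wastes trials and does not contaminate the incumbent |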

## Claim roadmap

| Claim | Current status | Evidence docs | Key gaps |
| --- | --- | --- | --- |
| C1. The harness turns raw knob search into mechanism-guided intervention search and improves fixed-budget optimization results | Strong signal already | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Add the Qwen235B decode 2x2 aggregate; add the mechanism ablation |
| C2. The gain comes from the harness-defined substrate and does not depend on any particular strong LLM | Qwen30B no-LLM loop fully closed; Qwen27B weak/strong model results available | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md) | Run `BO/heuristic + harness` vs `BO/heuristic + raw knobs` |
| C3. Weak planner + harness can match or exceed naive strong LLM | Supported on Qwen27B; Qwen235B in progress | [Qwen27B 2x2](harness-ablation/qwen27b-tight-2x2-model-ablation-20260623.md), [Qwen235B prefill progress](harness-ablation/qwen235b-prefill-2x2-progress-20260623.md) | Finish the Qwen235B decode 2x2; update the prefill final doc |
| C4. Attribution and the intervention grammar contribute mechanistically, not just through more prompt information | Design and no-LLM case written up; rigorous ablation still lacking | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md), [AITuner summary](aituner-harness-summary.md) | Run shuffled attribution / no attribution / no grammar / no topology-first / no validator ablations |
| C5. AITuner finds a near-optimal region rather than just one feasible config | Qwen30B provides an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or expert-config comparison |
| C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
| C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main experimental claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
| C8. The harness can recover from bad starting points and does not rely solely on a trusted base config | A single adversarial bad-start has passed first repair; distribution-level robustness cannot be claimed | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [Bad-start robustness suite](harness-ablation/bad-start-robustness-suite-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | Run the random/adversarial start distribution per the pre-registered 20-case suite |

## Highest-priority experiments

### P0a. Declarative harness redesign gate

Goal: stop adding testcase-specific rules to `harness.py` and refactor the harness into a
declarative intervention grammar plus a coverage-relative validator.

Minimum acceptance criteria:

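- CandidateSet is fully enumerated and a snapshot is persisted;
- CandidateSet v1 is initially limited to the concrete candidates the current harness generator
  actually constructs, with no claim of enumerating the full Cartesian knob space;
  `candidate_set_hash`, eligible/blocked records, and the blocked reason summary are already
  implemented in the harness context and the `harness/candidate-set-*.json` sidecar;
- `harness_priority` is separated from backend ranking;
- CoverageUnit is structured, and stop cannot rely on the exact signature alone;
- `search_high_saturated_by_incumbent` cannot bypass CandidateSet coverage; for the `req/s/GPU`
  objective, the search must continue while a topology/resource-efficiency contrast is still uncovered;
- add an `auto_search_high` measurement policy: it may automatically raise the ceiling within the
  existing trace; if `search.high=1.0` is still insufficient, it must report
  `measurement_ceiling_insufficient` and wait for human confirmation, and must not silently
  repeat windows or synthesize arrivals;
- normalized full-config signature: no-repeat cannot look only at the patch signature; the base
  config and a no-op patch must be recognized as the same full config; implemented in `48911b6`
  and passed in the dash1 bad-start validation;
- materialized effective signature: a runtime-only proposal must first inherit the incumbent
  topology along the real execution path before the no-repeat check; shared
  signature/canonicalization has been added, and the CLI hard-vetoes repeated
  LLM/manual/harness proposals before entering a trial;
- failure invalidation has a conservative region predicate and retry/unblock conditions;
- grammar/policy/capability each carry a version and anti-overfitting static checks;
- LLM/BO can only select legal candidates and cannot bypass the validator.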
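
The normalized full-config signature criterion in the list above can be illustrated with a minimal sketch: hash the config that results from applying the patch, not the patch itself, so that a no-op patch collides with the base config. The function name `full_config_signature` and the config keys are hypothetical and do not reflect what `48911b6` actually implements; the sketch also assumes flat configs and does not normalize numeric types (for example `64` vs `64.0`).

```python
import hashlib
import json


def full_config_signature(base_config, patch):
    """Hash the materialized full config rather than the patch."""
    full = {**base_config, **patch}  # patch values override base values
    # sort_keys makes the serialization independent of key insertion order
    canonical = json.dumps(full, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; key names are assumptions.
base = {"tp": 2, "gmu": 0.9, "max-num-seqs": 64}
seen = {full_config_signature(base, {})}

noop_patch = {"max-num-seqs": 64}   # restates a base value
real_patch = {"max-num-seqs": 128}  # changes the config

print(full_config_signature(base, noop_patch) in seen)  # True
print(full_config_signature(base, real_patch) in seen)  # False
```

A patch-only signature would treat `noop_patch` as new and spend a trial re-measuring the base config; hashing the full config avoids that.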

Why this priority: without completing this gate first, further bad-start/SLO/2x2 experiments
would only validate a rule-based prototype.

### P0b. Bad-start recovery confirmation after redesign

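Goal: determine whether the harness can only start from a trusted base config, or whether it can
also recover toward the right direction from a clearly unreasonable initial config.

The pre-registered suite is described in [Bad-start robustness suite](harness-ablation/bad-start-robustness-suite-20260626.md).

Minimal experiment matrix: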

| Case | Initial config | What it demonstrates |
| --- | --- | --- |
| bad-topology | `TP=8, DP=1` | A high-TP start first triggers an adjacent lower-TP bracket |
| bad-runtime | `TP=2, gmu=0.5, max-num-seqs=8` | Low runtime headroom jumps back to the nominal floor |
| combined-bad | `TP=8, gmu=0.5, max-num-seqs=8` | Topology recovery and runtime recovery can be chained |

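Note: this does not mean running a single hand-crafted bad case first. The random/adversarial
start distribution must be run after the declarative harness is in place, and the distribution-level results must be reported.

Expected figure:

- x-axis: trial index;
- y-axis: best-so-far SLO-constrained req/s/GPU;
- line groups: trusted-start vs bad-start cases;
- annotation: proposal family sequence, e.g. `TP downshift`, `gmu floor jump`, `gmu climb`.

Launch conditions: complete P0a first; then confirm that the dash fleet has an idle 8xH20 machine; start the run only after user confirmation.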

### P0c. Complete the Qwen235B decode 2x2 and compile the aggregate

Goal: fill in the most central `harness on/off x strong/weak planner` evidence and answer:

```text
weak LLM + harness >= strong LLM naive ?
```

Expected outputs:

- 2x2 table: best-so-far req/s/GPU of each arm under the same iteration budget;
- convergence curve / normalized AUC;
- trial path and main config patches of each arm;
- an explanation of why naive goes wrong and how the harness reaches the correct intervention through regime attribution.
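
For the convergence curve and normalized AUC, one possible definition is sketched below. It is an assumption, not the project's agreed metric: infeasible trials are scored as `0.0`, the curve is the running maximum of per-trial scores, and the AUC is the mean of that curve divided by a reference value such as the best score across all arms. The function names are hypothetical.

```python
def best_so_far(scores):
    """Running maximum of per-trial scores (infeasible trials scored 0.0)."""
    best, curve = 0.0, []
    for score in scores:
        best = max(best, score)
        curve.append(best)
    return curve


def normalized_auc(scores, reference):
    """Mean of the best-so-far curve, as a fraction of a reference score."""
    if not scores or reference <= 0:
        raise ValueError("need at least one trial and a positive reference")
    curve = best_so_far(scores)
    return sum(curve) / (len(curve) * reference)


# Made-up req/s/GPU values for a 4-trial arm.
arm = [0.0, 1.0, 0.5, 2.0]
print(best_so_far(arm))          # [0.0, 1.0, 1.0, 2.0]
print(normalized_auc(arm, 2.0))  # 0.5
```

Under this definition an arm that reaches the reference score on its first trial gets 1.0, so the metric rewards early progress as well as the final best.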

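Why this priority: the experiment is already running, its incremental cost is the lowest, and it directly supports C1/C3.

### P1. Planner-agnostic substrate experiment

Goal: show that AITuner is not an LLM tuner but a harness-defined optimization substrate.

Minimal experiment matrix: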

| Planner | Raw knob space | Harness intervention space |
| --- | --- | --- |
| deterministic heuristic | raw heuristic | harness policy |
| BO or lightweight bandit | raw BO | harness-guided BO |
| weak LLM | naive weak LLM | weak LLM + harness |
| strong LLM | naive strong LLM | strong LLM + harness |

If implementing BO is costly, first use the deterministic harness policy as the non-LLM planner
baseline: it already shows that the approach works without an LLM. Then add BO to strengthen the argument.

Expected figure:

- x-axis: trial budget;
- y-axis: best-so-far SLO-constrained req/s/GPU;
- line groups: raw knob space vs harness intervention space;
- separate bars: invalid launch rate, repeated config rate, wasted trial rate.
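
The three bar metrics could be computed from per-trial outcome labels roughly as follows. This is a sketch under stated assumptions: the outcome labels and the function name `waste_rates` are hypothetical, and "wasted trial" is taken here to mean a trial that was either an invalid launch or a repeated config, which may be narrower than the definition the paper ends up using.

```python
from collections import Counter


def waste_rates(outcomes):
    """Fraction of trials per waste category, given one label per trial."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    invalid = counts["invalid_launch"] / total
    repeated = counts["repeated_config"] / total
    return {
        "invalid_launch_rate": invalid,
        "repeated_config_rate": repeated,
        # Assumption: wasted = invalid launch or repeated config.
        "wasted_trial_rate": invalid + repeated,
    }


# Made-up outcome sequence for one arm.
print(waste_rates(["ok", "invalid_launch", "ok", "repeated_config"]))
```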

Why this priority: this is the key experiment for the new framing. Without it, the paper can
still easily be read as "LLM prompt engineering."

### P2. Mechanism ablation

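Goal: show that the harness internals are not just stacked-up information, and that attribution,
the intervention grammar, and the validator each contribute an effective mechanism.

Proposed ablations: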

| Variant | What is removed/broken | Expected demonstration |
| --- | --- | --- |
| full AITuner | Nothing | Best |
| no attribution | Provide no regime attribution, only the scalar score and past results | Attribution contributes to direction selection |
| shuffled attribution | Deliberately shuffle regime labels while preserving text length | The gain comes from semantic correctness, not from more prompt tokens |
| no intervention grammar | Allow arbitrary tunable knobs and remove family guidance | Action-space shaping contributes |
| no topology-first | Runtime knobs may take priority over topology interventions | Topology is the first-order decision in LLM serving |
| no validator/failure memory | Allow repeated and known launch-failure families | Fail-safe control reduces GPU burn |
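
For the shuffled attribution variant, one simple way to guarantee that every regime label is wrong is to map each label to its successor in a randomly ordered cycle. The sketch below is illustrative only: `shuffled_label_map` is a hypothetical helper, and because the labels differ in length, swapping them preserves text length only approximately.

```python
import random

REGIMES = [
    "prefill-bound",
    "decode-bound",
    "admission-bound",
    "memory-bound",
    "launch-bound",
]


def shuffled_label_map(seed):
    """Map every regime label to a different one (no label maps to itself)."""
    rng = random.Random(seed)
    order = REGIMES[:]
    rng.shuffle(order)
    # Each label maps to the next label in a random cyclic order, so with
    # more than one distinct label no label can map to itself.
    return {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}


mapping = shuffled_label_map(seed=7)
print(mapping["memory-bound"] != "memory-bound")  # True
```

A plain random permutation could leave some labels unchanged, which would dilute the ablation; the cyclic construction rules that out.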

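Expected figure:

- mechanism ablation bars: final best, AUC, TTT;
- waste breakdown: invalid launch, repeat config, wrong-family trial;
- case study trace: comparison of the first 3-5 proposals of each variant.

Why this priority: this is the core evidence for answering novelty concerns.

### P3. Near-optimum / expert baseline evidence

Goal: show that AITuner does not merely find a config that "converges but performs poorly."

Prefer one cost-controlled case for a local grid: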

```text
topology: TP/DP frontier
runtime: small neighborhood of max-num-seqs, max-num-batched-tokens, gpu-memory-utilization
objective: max feasible req/s/GPU under pass_rate >= 0.95
```

Expected figure:

- local grid heatmap;
- AITuner trial path overlay;
- AITuner best vs grid best vs expert config;
- near-optimum gap, e.g. `AITuner >= 95% of local grid optimum`.
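
The near-optimum gap can be computed from grid results as sketched below, using the `pass_rate >= 0.95` feasibility rule from the objective above. The record layout, the key names `pass_rate` and `req_s_per_gpu`, and all numbers are made up for illustration; they are not measured results.

```python
def best_feasible(results, min_pass_rate=0.95):
    """Highest req/s/GPU among results that meet the pass-rate guard."""
    feasible = [r for r in results if r["pass_rate"] >= min_pass_rate]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r["req_s_per_gpu"])


# Hypothetical local grid; values are placeholders, not measurements.
grid = [
    {"tp": 2, "mns": 64, "pass_rate": 0.97, "req_s_per_gpu": 1.0},
    {"tp": 4, "mns": 64, "pass_rate": 0.99, "req_s_per_gpu": 2.0},
    {"tp": 4, "mns": 128, "pass_rate": 0.90, "req_s_per_gpu": 2.5},  # infeasible
]

grid_best = best_feasible(grid)
aituner_best = 1.9  # placeholder for AITuner's best feasible req/s/GPU
ratio = aituner_best / grid_best["req_s_per_gpu"]
print(grid_best["tp"], grid_best["mns"], round(ratio, 2))  # 4 64 0.95
```

Note that the fastest grid point is excluded because it fails the pass-rate guard, so the gap is measured against the best feasible point, not the best raw throughput.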

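Why this priority: this is necessary evidence for the claim that AITuner tunes to the best config rather than a poorly performing converged one.

### P4. A second SLO robustness case

Goal: show that the Qwen30B SLO robustness result is not a single-case phenomenon.

Do not roll out a large-scale sweep first. Start with one case whose mechanism differs from Qwen30B:

- a decode-heavy case, to observe TP/DP redistribution and concurrency/memory interventions;
- or a long-prefill / tight-TTFT case, to observe TP and prefill batching interventions.

Expected figure:

- x-axis: SLO tightness;
- y-axis: best feasible req/s/GPU;
- marker/color: selected intervention regime;
- annotation: final TP/DP/MNS/MBT;
- show the frontier/right shift or regime transition as the SLO is relaxed.

Why this priority: important, but it should rank after the planner-agnostic and mechanism ablation experiments.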

### P5. SGLang / multi-engine adapter validation

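Goal: show that the intervention grammar can be lowered to different serving engines through adapters.

Deferred for now; this is not a high-priority experiment ahead of the vLLM mainline. Once C1-C5 are stable, run one low-cost case: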

```text
same workload profile
same SLO objective
same intervention grammar
different engine adapter
```

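Why this priority: it extends generality but cannot replace the mechanism evidence from the vLLM mainline.

## Deferred for now

- Do not frame the main claim as "LLM is smarter than BO." The new claim is that the harness
  substrate is useful for many kinds of planners.
- Do not claim full white-box or global optimality. The safer framing is grey-box,
  near-optimum, fixed-budget utility.
- Do not roll out a large number of SLO sweeps. First fill in the mechanism ablation, planner-agnostic, and near-optimum experiments.
- Do not put multi-engine support into the main experimental claims. Write it up as an
  adapter-based design first, and add one SGLang validation once the vLLM evidence chain is complete.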