gahow/aituner

Gahow Wang 95ad124a1b Document auto search high policy

2026-06-26 19:53:30 +08:00

AITuner roadmap

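This document maintains only a minimal roadmap: the paper framing, the claim tree, the existing evidence, and the highest-priority experiments. Detailed experiment logs belong in the corresponding topic documents.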

Paper thesis

The core of AITuner is not "tuning parameters with an LLM". A more accurate framing is:

black-box knob optimization
  -> grey-box / mechanism-guided experimental optimization

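In other words, AITuner still measures the objective function through real experiments, but it no longer treats the serving engine as a fully black-box mapping from config vector to scalar score. The harness converts the workload, SLO failures, probe traces, topology constraints, and failure memory into a structured serving mechanism state, and restricts the next search step to interventions that are interpretable and verifiable.

The LLM is therefore not an irreplaceable core. The LLM is a planner backend / copilot; the core system contribution is a planner-agnostic tuning substrate: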

Harness H = (O, R, G, V, M)

O: observation schema
   workload L/C/A profile + probe trace + latency/SLO failure + launch status

R: regime attribution
   SLO violation -> prefill-bound / decode-bound / admission-bound / memory-bound / launch-bound

G: serving intervention grammar
   regime -> legal intervention families, not raw arbitrary knobs

V: validator
   tunable schema + topology constraints + no-repeat + failure memory + stop authority

M: measurement/scoring protocol
   SLO-constrained feasible frontier, req/s/GPU, latency quantiles, pass-rate guard
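The O, R, and G components above can be illustrated with a minimal Python sketch. It is not the code in src/aituner/harness.py: the `Observation` field names, the attribution order, the 0.5 threshold, and the regime-to-family mapping are all assumptions chosen only to show the shape of the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    PREFILL_BOUND = "prefill-bound"
    DECODE_BOUND = "decode-bound"
    ADMISSION_BOUND = "admission-bound"
    MEMORY_BOUND = "memory-bound"
    LAUNCH_BOUND = "launch-bound"


@dataclass(frozen=True)
class Observation:
    """O: a structured observation (all field names are hypothetical)."""
    launched: bool
    out_of_memory: bool
    queue_wait_share: float      # share of latency spent waiting for admission
    ttft_violation_rate: float   # fraction of requests violating the TTFT SLO
    tpot_violation_rate: float   # fraction of requests violating the TPOT SLO


def attribute(obs: Observation) -> Regime:
    """R: toy attribution rule; the ordering and threshold are assumptions."""
    if not obs.launched:
        return Regime.LAUNCH_BOUND
    if obs.out_of_memory:
        return Regime.MEMORY_BOUND
    if obs.queue_wait_share > 0.5:
        return Regime.ADMISSION_BOUND
    if obs.ttft_violation_rate >= obs.tpot_violation_rate:
        return Regime.PREFILL_BOUND
    return Regime.DECODE_BOUND


# G: regime -> legal intervention families (illustrative mapping only).
GRAMMAR = {
    Regime.PREFILL_BOUND: ("topology", "batching"),
    Regime.DECODE_BOUND: ("topology", "batching", "memory"),
    Regime.ADMISSION_BOUND: ("admission", "batching"),
    Regime.MEMORY_BOUND: ("memory", "topology"),
    Regime.LAUNCH_BOUND: ("topology", "memory"),
}


def legal_families(obs: Observation) -> tuple:
    """Return the intervention families the planner may choose from."""
    return GRAMMAR[attribute(obs)]
```

The point of the sketch is that the planner never sees raw knobs at this stage: it receives a regime and a short list of families, and the validator (V) and the measurement protocol (M) decide what happens next.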

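The current src/aituner/harness.py is a prototype: it already demonstrates the feasibility of a no-LLM loop and of mechanism-guided proposals, but it still contains many rule-based heuristics and cannot serve as the final harness contribution. The new target design is described in Declarative intervention harness design.

The planner is replaceable: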

pi in {LLM, BO, bandit, deterministic heuristic, tree search}

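AITuner's strong claim should be: the same planner, placed in the harness-shaped space, is faster, more stable, and closer to the optimum than when placed in the raw knob space; weak models and non-LLM planners also benefit from this substrate.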
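A planner-agnostic loop can be sketched as follows, assuming a hypothetical `Planner` protocol with a single `choose` method. The class names, the candidate labels, and the `measure` callback are illustrative and are not taken from the AITuner code base; the only property the sketch tries to show is that any backend picks from a list of legal candidates and cannot step outside it.

```python
import random
from typing import Callable, List, Protocol, Sequence, Tuple

History = Sequence[Tuple[str, float]]


class Planner(Protocol):
    """Any backend (LLM, BO, bandit, heuristic) only has to implement choose()."""

    def choose(self, candidates: Sequence[str], history: History) -> str: ...


class FirstLegalPlanner:
    """Deterministic heuristic: take the first candidate in harness order."""

    def choose(self, candidates: Sequence[str], history: History) -> str:
        return candidates[0]


class RandomPlanner:
    """Seeded random baseline over the same legal candidates."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def choose(self, candidates: Sequence[str], history: History) -> str:
        return self._rng.choice(list(candidates))


def run_loop(
    planner: Planner,
    legal_candidates: Sequence[str],
    measure: Callable[[str], float],
    budget: int,
) -> List[Tuple[str, float]]:
    """Run up to `budget` trials; the planner may only pick legal candidates."""
    remaining = list(legal_candidates)
    history: List[Tuple[str, float]] = []
    while remaining and len(history) < budget:
        pick = planner.choose(tuple(remaining), tuple(history))
        if pick not in remaining:
            raise ValueError(f"planner proposed an illegal candidate: {pick!r}")
        remaining.remove(pick)  # no-repeat: a measured candidate is not offered again
        history.append((pick, measure(pick)))  # the score comes from measurement
    return history
```

Swapping `FirstLegalPlanner` for `RandomPlanner` (or an LLM-backed class with the same method) changes only the order in which candidates are tried, which is the comparison the planner-agnostic claim needs.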

Why not pure white-box

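We should not claim full white-box optimization. AITuner has no closed-form performance model of the vLLM scheduler, kernels, KV cache, communication, and queueing. The safer and also stronger statement is grey-box:

the objective is still determined by real measurements;
the action space, constraints, failure attribution, and intervention semantics are driven by system knowledge;
each trial is a counterfactual experiment rather than a blind sample of a knob vector.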

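Key design points

Design point	Stronger statement	Role	What must be shown
Observation	mechanism state	Structures workload shape, probe trace, SLO failure, and launch/memory failure	The agent sees a computable state, not natural-language logs
Bottleneck classifier	SLO violation attribution	Attributes a failure to a serving regime instead of only checking which metric exceeds its threshold	Attribution is causally linked to the interventions that subsequently work
Candidate family	serving intervention grammar	Lifts raw knobs into topology / batching / admission / memory interventions	The search space is compressed without hard-coding any specific case
Scoring	counterfactual verdict	Uses the SLO frontier and req/s/GPU to judge whether an intervention supports the hypothesis	The final winner is decided by measurement, not by the LLM
Validator / stop	fail-safe control	Forbids illegal, repeated, and known-failed families; only the validator can authorize stop	A wrong attribution at worst wastes trials and does not contaminate the incumbent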
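The "Validator / stop" row can be made concrete with a small sketch. It assumes a hypothetical `Candidate` with a family name and a hashable patch, and hypothetical knob names `tp`, `dp`, and `gmu`. Blocking a whole family after one launch failure is a deliberate simplification for illustration, not the project's failure-invalidation rule.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    family: str
    patch: tuple  # sorted (knob, value) pairs; hashable, so it doubles as a signature


@dataclass
class Validator:
    tunable: frozenset           # knob names allowed by the tunable schema
    num_gpus: int                # topology constraint: tp * dp must fit
    tried: set = field(default_factory=set)
    failed_families: set = field(default_factory=set)

    def reject_reason(self, cand: Candidate) -> Optional[str]:
        """Return why a candidate is illegal, or None if it may be run."""
        knobs = dict(cand.patch)
        if not set(knobs) <= self.tunable:
            return "unknown_knob"
        if knobs.get("tp", 1) * knobs.get("dp", 1) > self.num_gpus:
            return "topology_exceeds_gpus"
        if cand.patch in self.tried:
            return "repeat"
        if cand.family in self.failed_families:
            return "known_failed_family"
        return None

    def record(self, cand: Candidate, launched: bool) -> None:
        """Update no-repeat state and failure memory after a trial."""
        self.tried.add(cand.patch)
        if not launched:
            self.failed_families.add(cand.family)

    def may_stop(self, candidates: Iterable[Candidate]) -> bool:
        """Stop authority: stop only when no legal candidate is left."""
        return all(self.reject_reason(c) is not None for c in candidates)
```

Because `may_stop` lives on the validator, a planner that believes it has converged still cannot end the search while a legal, untried candidate remains.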

Claim roadmap

Claim	Current status	Evidence documents	Key gap
C1. The harness turns raw knob search into mechanism-guided intervention search and improves fixed-budget optimization	Strong signal already available	No-LLM harness mechanism, Qwen27B 2x2, Qwen30B SLO robustness	Add the Qwen235B decode 2x2 aggregate; add the mechanism ablation
C2. The gain comes from the harness-defined substrate and does not depend on a particular strong LLM	Qwen30B no-LLM loop is fully closed; Qwen27B weak/strong model results exist	No-LLM harness mechanism, Qwen27B 2x2	Run `BO/heuristic + harness` vs `BO/heuristic + raw knobs`
C3. Weak planner + harness can match or exceed strong LLM naive	Supported on Qwen27B; Qwen235B in progress	Qwen27B 2x2, Qwen235B prefill progress	Finish Qwen235B decode 2x2; update the prefill final doc
C4. Attribution and the intervention grammar contribute mechanistically, not merely through more prompt information	Design and no-LLM case are written up; strict ablation is lacking	No-LLM harness mechanism, AITuner summary	Run shuffled attribution / no attribution / no grammar / no topology-first / no validator ablations
C5. AITuner finds a near-optimal region, not just a feasible config	Qwen30B gives an explanatory signal	Qwen30B SLO robustness	Pick 1-2 cases for a local grid or an expert-config comparison
C6. AITuner moves to the appropriate frontier as SLO tightness changes	Done on Qwen30B	Qwen30B SLO robustness	Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition
C7. The engine adapter makes the intervention grammar portable to other serving engines	Feasible by design; not a main experimental claim for now	`EngineLaunchSpec` / launch recipe / tunable schema	After the vLLM main line is complete, build an SGLang adapter and one low-cost validation case
C8. The harness can recover from bad starting points and does not rely only on a trusted base config	Counterexample found; cannot be claimed currently	Declarative intervention harness design, Bad-start stop counterexample, No-LLM harness mechanism	After refactoring to grammar/operator + coverage-relative stop, run a random/adversarial start distribution

Highest-priority experiments

P0a. Declarative harness redesign gate

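Goal: stop adding testcase-specific rules to harness.py, and refactor the harness into a declarative intervention grammar + coverage-relative validator.

Minimum acceptance criteria:

CandidateSet is fully enumerated and persisted as a snapshot;
harness_priority is separated from backend ranking;
CoverageUnit is structured, and stop cannot depend only on the exact signature;
search_high_saturated_by_incumbent cannot bypass CandidateSet coverage; for the req/s/GPU objective, the search must continue while a topology/resource-efficiency contrast is still uncovered;
add an auto_search_high measurement policy: it may automatically raise the ceiling within the existing trace; if search.high=1.0 is still insufficient, it must report measurement_ceiling_insufficient and wait for human confirmation, and must not silently repeat the window or synthesize arrivals;
Failure invalidation has a conservative region predicate and retry/unblock conditions;
grammar/policy/capability each have a version and anti-overfitting static checks;
LLM/BO can only select legal candidates and cannot bypass the validator.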
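The auto_search_high item in the list above can be sketched as a small decision function. This is a guess at the intended behavior, assuming that search.high is a fraction of the recorded trace in (0, 1] and that the ceiling is raised multiplicatively. The function name, the `saturated` flag, the growth factor, and the `"ok"` / `"raised"` status strings are hypothetical; only `measurement_ceiling_insufficient` comes from the roadmap.

```python
from typing import Tuple


def next_search_high(
    current_high: float,
    saturated: bool,
    growth: float = 2.0,
) -> Tuple[float, str]:
    """Decide the next measurement ceiling.

    current_high: current search.high, assumed to be a trace fraction in (0, 1].
    saturated:    True if the incumbent is still feasible at the current ceiling,
                  i.e. the measurement cannot distinguish better configs.
    """
    if not 0.0 < current_high <= 1.0:
        raise ValueError("search.high must be in (0, 1]")
    if growth <= 1.0:
        raise ValueError("growth must be greater than 1")
    if not saturated:
        return current_high, "ok"
    if current_high >= 1.0:
        # The whole trace is already in use: do not repeat the window or
        # synthesize arrivals; report and wait for a human decision instead.
        return 1.0, "measurement_ceiling_insufficient"
    return min(1.0, current_high * growth), "raised"
```

For example, under these assumptions a saturated ceiling of 0.25 is raised to 0.5, a saturated ceiling of 0.75 is clamped to 1.0, and a saturated ceiling of 1.0 yields `measurement_ceiling_insufficient` without any further change.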

Priority rationale: without completing this gate first, further extending the bad-start/SLO/2x2 experiments only validates a rule-based prototype.

P0b. Bad-start recovery confirmation after redesign

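Goal: answer whether the harness can only start from a trusted base config, or whether it can recover toward the right direction from an obviously unreasonable initial config.

Minimal experiment matrix:

Case	Initial config	What it demonstrates
bad-topology	`TP=8, DP=1`	A high-TP start first tries the adjacent lower-TP bracket
bad-runtime	`TP=2, gmu=0.5, max-num-seqs=8`	Low runtime headroom jumps back to the nominal floor
combined-bad	`TP=8, gmu=0.5, max-num-seqs=8`	Topology recovery and runtime recovery can be chained

Note: this does not mean running a single hand-crafted bad case first. The random/adversarial start distribution must be run after the declarative harness is in place, and distribution-level results must be reported.

Expected figure:

x-axis: trial index;
y-axis: best-so-far SLO-constrained req/s/GPU;
line groups: trusted-start vs bad-start cases;
annotation: proposal family sequence, for example TP downshift, gmu floor jump, gmu climb.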
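The y-axis of this figure can be computed with a running maximum over feasible trials. The sketch below assumes each trial is recorded as a `(req_per_s_per_gpu, pass_rate)` pair and that feasibility means `pass_rate >= 0.95`, the guard used in the local-grid objective later in this roadmap. The record layout and function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_so_far(
    trials: Sequence[Tuple[float, float]],
    min_pass_rate: float = 0.95,
) -> List[float]:
    """Best feasible req/s/GPU after each trial; 0.0 until a feasible trial exists.

    trials: (req_per_s_per_gpu, pass_rate) in the order the trials were run.
    """
    best = 0.0
    curve = []
    for rate, pass_rate in trials:
        # Infeasible trials (SLO pass rate below the guard) never update the best.
        if pass_rate >= min_pass_rate and rate > best:
            best = rate
        curve.append(best)
    return curve
```

A bad-start run and a trusted-start run can then be plotted on the same axes by calling `best_so_far` on each trial log.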

Launch conditions: first complete P0a; then confirm that the dash fleet has an idle 8xH20 machine; start the run only after user confirmation.

P0c. Finish Qwen235B decode 2x2 and compile the aggregate

Goal: fill in the most central harness on/off x strong/weak planner evidence, answering:

weak LLM + harness >= strong LLM naive ?

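Expected deliverables:

2x2 table: best-so-far req/s/GPU for each arm under the same iteration budget;
convergence curve / normalized AUC;
trial path and main config patches for each arm;
an explanation of why naive goes wrong and how the harness reaches the correct intervention through regime attribution.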
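The roadmap does not define "normalized AUC". One plausible definition, used here purely as an assumption, is the mean of the best-so-far curve divided by a shared reference value such as the best result across all four arms, so that arms with the same budget are comparable on a 0 to 1 scale.

```python
from typing import Sequence


def normalized_auc(curve: Sequence[float], reference: float) -> float:
    """Mean of a best-so-far curve divided by a reference throughput.

    curve:     best-so-far req/s/GPU after each trial (same budget for every arm).
    reference: shared normalizer, e.g. the best value observed across all arms.
    """
    if not curve:
        raise ValueError("curve must contain at least one trial")
    if reference <= 0:
        raise ValueError("reference must be positive")
    return sum(curve) / (reference * len(curve))


# Toy curves, not experimental results: an arm that reaches the same final
# value earlier gets a higher normalized AUC.
early = normalized_auc([2.0, 2.0, 2.0, 2.0], reference=2.0)
late = normalized_auc([0.0, 1.0, 2.0, 2.0], reference=2.0)
assert early > late
```

With this definition two arms can share the same final best and still differ in AUC, which is what the convergence comparison is meant to expose.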

Priority rationale: the experiment is already running, its incremental cost is the lowest, and it directly supports C1/C3.

P1. Planner-agnostic substrate experiment

Goal: show that AITuner is not an LLM tuner but a harness-defined optimization substrate.

Minimal experiment matrix:

Planner	Raw knob space	Harness intervention space
deterministic heuristic	raw heuristic	harness policy
BO or lightweight bandit	raw BO	harness-guided BO
weak LLM	naive weak LLM	weak LLM + harness
strong LLM	naive strong LLM	strong LLM + harness

If BO is expensive to implement, first use the deterministic harness policy as the non-LLM planner baseline: it already shows that the loop works without an LLM. BO can then be added to strengthen the argument.

Expected figure:

x-axis: trial budget;
y-axis: best-so-far SLO-constrained req/s/GPU;
line groups: raw knob space vs harness intervention space;
separate bars: invalid launch rate, repeated config rate, wasted trial rate.
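The three bar metrics can be computed from a trial log. The sketch assumes each trial is a `(config_signature, launched)` pair and, as a working definition that the roadmap does not state, counts a trial as wasted if it is either an invalid launch or a repeat of an earlier config.

```python
from typing import Dict, Hashable, Sequence, Tuple


def waste_rates(trials: Sequence[Tuple[Hashable, bool]]) -> Dict[str, float]:
    """Fraction of trials that were invalid launches, repeats, or either.

    trials: (config_signature, launched) in run order; both fields are
            hypothetical names for whatever the trial log actually stores.
    """
    if not trials:
        raise ValueError("need at least one trial")
    seen = set()
    invalid = repeated = wasted = 0
    for signature, launched in trials:
        is_repeat = signature in seen
        seen.add(signature)
        invalid += not launched
        repeated += is_repeat
        wasted += (not launched) or is_repeat  # counted once even if both apply
    n = len(trials)
    return {
        "invalid_launch_rate": invalid / n,
        "repeated_config_rate": repeated / n,
        "wasted_trial_rate": wasted / n,
    }
```

Computing the same dictionary for the raw-knob arm and the harness arm of each planner gives the bars directly.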

Priority rationale: this is the key experiment for the new framing. Without it, the paper is still easily read as "LLM prompt engineering".

P2. Mechanism ablation

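Goal: show that the harness internals are not ordinary information stacking, but that attribution, intervention grammar, and validator each contribute an effective mechanism.

Suggested ablations:

Variant	What is removed/broken	Expected evidence
full AITuner	Nothing	Best
no attribution	No regime attribution; only the scalar score and past results are given	Attribution contributes to direction selection
shuffled attribution	Regime labels are deliberately shuffled while text length is preserved	The gain comes from semantic correctness, not from more prompt tokens
no intervention grammar	Arbitrary tunable knobs are allowed; family guidance is removed	Action-space shaping contributes
no topology-first	Runtime knobs may take priority over topology interventions	Topology is the first-order decision in LLM serving
no validator/failure memory	Repeated and known launch-failure families are allowed	Fail-safe control reduces GPU burn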
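For the shuffled attribution variant, one way to guarantee that every label is actually wrong is to map each regime to a different regime from the same vocabulary. The sketch below builds such a mapping as a single cycle over a seeded shuffle. This is one possible construction, not necessarily the one the experiments will use, and since the labels differ slightly in length the prompt length is only approximately preserved.

```python
import random
from typing import Dict

REGIMES = (
    "prefill-bound",
    "decode-bound",
    "admission-bound",
    "memory-bound",
    "launch-bound",
)


def shuffled_label_map(seed: int) -> Dict[str, str]:
    """Map every regime label to a different label (a derangement).

    The labels are shuffled with a seeded RNG and each label is mapped to its
    successor in that order, wrapping around. With two or more distinct labels,
    no label maps to itself and every label is used exactly once as an output.
    """
    order = list(REGIMES)
    random.Random(seed).shuffle(order)
    return {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}


def corrupt(true_regime: str, seed: int) -> str:
    """Return the deliberately wrong label shown to the planner in this variant."""
    return shuffled_label_map(seed)[true_regime]
```

Fixing the seed per run keeps the corruption consistent across trials, so the planner sees a stable but semantically wrong attribution rather than noise.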

Expected figures:

mechanism ablation bar: final best, AUC, TTT;
waste breakdown: invalid launch, repeat config, wrong-family trial;
case study trace: comparison of the first 3-5 proposals of each variant.

Priority rationale: this is the core evidence for answering novelty concerns.

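P3. Near-optimum / expert baseline evidence

Goal: show that AITuner does not merely find a config that "converges but performs poorly".

Prefer one cost-controlled case for a local grid:

topology: TP/DP frontier
runtime: small neighborhood of max-num-seqs, max-num-batched-tokens, gpu-memory-utilization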
objective: max feasible req/s/GPU under pass_rate >= 0.95

Expected figures:

local grid heatmap;
AITuner trial path overlay;
AITuner best vs grid best vs expert config;
near-optimum gap, for example AITuner >= 95% of local grid optimum.
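The grid optimum and the near-optimum ratio follow directly from the objective stated above (max feasible req/s/GPU under pass_rate >= 0.95). The sketch assumes grid results are stored as a dictionary from a config key to `(req_per_s_per_gpu, pass_rate)`. The key layout and all numbers in the example are toy values, not measurements.

```python
from typing import Dict, Hashable, Optional, Tuple


def grid_optimum(
    grid: Dict[Hashable, Tuple[float, float]],
    min_pass_rate: float = 0.95,
) -> Optional[float]:
    """Best req/s/GPU among grid points that satisfy the pass-rate guard."""
    feasible = [rate for rate, pass_rate in grid.values() if pass_rate >= min_pass_rate]
    return max(feasible) if feasible else None


def near_optimum_ratio(aituner_best: float, optimum: float) -> float:
    """AITuner best as a fraction of the local grid optimum."""
    if optimum <= 0:
        raise ValueError("grid optimum must be positive")
    return aituner_best / optimum


# Toy grid keyed by (TP, DP, max-num-seqs); values are (req/s/GPU, pass_rate).
toy_grid = {
    (2, 4, 64): (1.6, 0.99),
    (4, 2, 64): (2.0, 0.97),
    (4, 2, 128): (2.4, 0.80),  # fastest point, but infeasible under the guard
}
optimum = grid_optimum(toy_grid)
ratio = near_optimum_ratio(1.9, optimum)
```

In the toy grid the fastest point is excluded because it fails the pass-rate guard, which is why the comparison has to be made against the feasible optimum rather than the raw maximum.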

Priority rationale: this is the necessary evidence for the claim "tunes to the best config, not a poorly performing converged config".

P4. A second SLO robustness case

Goal: show that the SLO robustness on Qwen30B is not a single-case phenomenon.

Do not roll out a large sweep first. Start by choosing one case whose mechanism differs from Qwen30B:

a decode-heavy case, to observe TP/DP redistribution and concurrency/memory interventions;
or a long-prefill / tight-TTFT case, to observe TP and prefill batching interventions.

Expected figure:

x-axis: SLO tightness;
y-axis: best feasible req/s/GPU;
marker/color: selected intervention regime;
annotation: final TP/DP/MNS/MBT;
show the frontier/right shift or regime transition as the SLO is relaxed.

Priority rationale: important, but it should come after the planner-agnostic and mechanism ablation experiments.

P5. SGLang / multi-engine adapter validation

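Goal: show that the intervention grammar can be lowered to different serving engines through an adapter.

Deferred for now; it is not a high-priority experiment ahead of the vLLM main line. Once C1-C5 are stable, run one low-cost case: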

same workload profile
same SLO objective
same intervention grammar
different engine adapter

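Priority rationale: it extends generality, but it cannot replace the mechanism evidence on the vLLM main line.

Out of scope for now

Do not yet write the main claim as "the LLM is smarter than BO". The new claim is that the harness substrate is useful for many kinds of planners.
Do not yet claim full white-box or global optimality. For now the safer terms are grey-box, near-optimum, and fixed-budget utility.
Do not yet spread out a large number of SLO sweeps. First fill in the mechanism ablation, planner-agnostic, and near-optimum experiments.
Do not yet put multi-engine support into the main experimental claim. Write it as an adapter-based design first, and add an SGLang validation once the vLLM evidence chain is complete.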
