gahow/aituner

Gahow Wang 08429e5da8 Refine harness design flow overview

2026-06-29 20:41:54 +08:00

AITuner Harness Design Contract

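This document summarizes the design semantics of the current AITuner harness. It is neither an experiment log nor final paper text. Its purpose is to state clearly which system contributions we can claim, what each module does, and which assumptions and limitations are implicit.

Core conclusion:

The contribution of the AITuner harness is not "an LLM can tune knobs", nor is it "a set of expert if/else rules".

The goal of the harness is to turn black-box knob search into:
measurement-grounded, mechanism-guided, validator-controlled experiments.

In other words, the planner can be an LLM, BO, a bandit, a deterministic heuristic, or a human choice. The harness is responsible for converting observations into auditable mechanism hypotheses, generating legal candidates, and confirming or rejecting those hypotheses with real measurements.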

Core State Machine

flowchart LR
    S[State<br/>workload, constraints, history] --> E[Evidence<br/>SLO symptoms to mechanism signals]
    E --> C[CandidateSet<br/>typed interventions]
    C --> V[Validator<br/>legal, novel, covered?]
    V -->|run trial| M[Measurement<br/>verdict]
    M --> S
    V -->|no justified candidate| X[Stop / report]

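The core loop of the harness has only five steps:

State: maintains the workload, SLO, engine/hardware constraints, and historical trial measurements.
Evidence: turns probe results from raw logs into serving-stage symptom signals.
CandidateSet: generates a finite number of typed interventions in the mechanism space.
Validator: checks legality, full-config novelty, failure memory, and coverage.
Measurement: executes the validated intervention and updates the state with the real SLO verdict; if there is no justified candidate, it stops or reports a measurement/coverage gap.

This state machine expresses the minimal design of the harness and does not depend on a specific planner. An LLM, BO, a bandit, or a deterministic heuristic can only rank or select within the CandidateSet; it cannot bypass the Validator to construct a config directly, and it cannot unilaterally decide to stop.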
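The loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: `run_harness`, its callback parameters (`generate`, `validate`, `select`, `measure`), and the `stop_reason` strings are hypothetical names. The planner callback returns only an index into the validated list, so it has no way to construct a config or to end the loop on its own.

```python
from typing import Callable, List


def run_harness(
    state: dict,
    generate: Callable[[dict], List[dict]],     # Evidence + CandidateSet
    validate: Callable[[dict, dict], bool],     # Validator
    select: Callable[[dict, List[dict]], int],  # planner: returns an index only
    measure: Callable[[dict], dict],            # runs the real trial
    max_trials: int = 10,
) -> dict:
    """Run the State -> Evidence -> CandidateSet -> Validator -> Measurement loop."""
    state.setdefault("history", [])
    for _ in range(max_trials):
        # Only validated candidates are ever shown to the planner.
        legal = [c for c in generate(state) if validate(state, c)]
        if not legal:
            # Stop is decided here, by the harness, not by the planner.
            state["stop_reason"] = "no_justified_candidate"
            return state
        chosen = legal[select(state, legal)]
        verdict = measure(chosen)
        state["history"].append({"candidate": chosen, "verdict": verdict})
    state["stop_reason"] = "trial_budget_exhausted"
    return state
```

Any planner that can map a candidate list to an index fits this shape, which is one way to keep the loop planner-agnostic.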

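Core Design Invariants

All lower-level modules below serve three invariants:

Invariant	Meaning	Why it matters
Measurement-grounded	Every state transition is updated by a real probe/SLO verdict	Prevents the planner from treating natural-language guesses as facts
Mechanism-typed	A candidate is not a bare knob vector but an intervention such as topology/scheduler/admission/cache	Reduces search dimensionality and gives every trial an interpretable hypothesis
Validator-controlled	Candidates and stop decisions must pass legality, no-repeat, coverage, and failure guards	Prevents repeated experiments, illegal configs, and premature stops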

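From High Level to Low Level

The following sections proceed by implementation layer:

Observation schema defines what the harness can see;
Evidence compiler explains how symptoms become mechanism evidence;
Mechanism space explains where the candidate space comes from;
CandidateSet explains how interventions are constructed;
Planner interface explains the boundary for LLM/BO/heuristic planners;
Validator explains what may be executed and when stopping is allowed.

Each layer separates two things: what the current prototype concretely does, and the assumptions and limitations of those choices.

Detailed Module Semantics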

1. Observation Schema

The harness first converts the result of a trial/probe into a structured observation:

O_t = {
  workload summary,
  SLO rules,
  effective engine config,
  best feasible probe,
  limiting probe,
  failed_reason_counts,
  early_stop_reason,
  pass_rate,
  request_rate_per_gpu,
  launch / OOM status
}

Here, failed_reason_counts is defined as a multiset count of request-level SLO violation reasons:

ttft_ms>threshold       the request's TTFT exceeds the SLO
tpot_ms>threshold       the request's TPOT exceeds the SLO
arrival_lag_s>limit     synthetic arrivals can no longer keep up
probe_elapsed_s>limit   total probe time exceeds the limit
slo_pass_rate_unrecoverable
                        too many requests have already failed; the target pass rate is mathematically unreachable
request_failed / timeout / completion mismatch
                        request-level failure
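The `slo_pass_rate_unrecoverable` reason can be read as a counting argument: once enough requests have failed, even a perfect remainder cannot reach the target. The sketch below assumes the pass rate is passed requests divided by a known total planned for the probe. The function name and that definition are assumptions for illustration, not the prototype's exact rule.

```python
def pass_rate_unrecoverable(failed: int, total: int, target: float) -> bool:
    """True if the target pass rate is unreachable even when every
    remaining request passes."""
    best_case_passed = total - failed  # all not-yet-failed requests pass
    return best_case_passed < target * total
```

With a target of 0.9 over 10 planned requests, two failures already leave at most 8 passes, so the probe can stop early.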

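Important limitations:

A single request can contribute to both ttft_ms>... and tpot_ms>...;
failed_reason_counts is symptom evidence, not root-cause ground truth;
queueing/admission signals come mainly from early stops in the probe scheduling layer, not from an exact decomposition of individual request latency.

The document and the paper must therefore avoid saying "the failed count proves that the root cause is TTFT/TPOT". A more accurate statement is: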

failed_reason_counts gives SLO violation symptoms.
Harness infers serving-stage hypotheses from these symptoms.
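A small example shows why the counts are a multiset of symptoms rather than a partition of failed requests. The helper name `count_failed_reasons` and the per-request input shape are assumptions for illustration; the reason strings use the placeholder forms from the list above.

```python
from collections import Counter
from typing import Iterable, List


def count_failed_reasons(per_request_reasons: Iterable[List[str]]) -> Counter:
    """Aggregate per-request violation reasons into a multiset count."""
    counts: Counter = Counter()
    for reasons in per_request_reasons:
        counts.update(reasons)  # one request may add several reasons
    return counts


requests = [
    ["ttft_ms>threshold", "tpot_ms>threshold"],  # violates both SLOs
    ["ttft_ms>threshold"],                       # violates TTFT only
    [],                                          # passed
]
counts = count_failed_reasons(requests)
failed_requests = sum(1 for reasons in requests if reasons)
# The reason total is 3 while only 2 requests failed.
```

Because the totals do not add up to the number of failed requests, ratios derived from these counts describe symptom frequency, not a share of root causes.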

2. Evidence Compiler / Bottleneck Hypotheses

The current prototype aggregates in two layers.

The first layer is the active bottleneck of a single probe. The current implementation uses a count majority:

ttft_count = sum(count(reason startswith "ttft"))
tpot_count = sum(count(reason startswith "tpot"))
other_request_failed_count = non-TTFT, non-TPOT request failures

if ttft_count >= max(tpot_count, other_request_failed_count):
    active = ttft_prefill
elif tpot_count >= max(ttft_count, other_request_failed_count):
    active = decode_tpot
else:
    active = admission_or_queueing

This step is a heuristic. Its semantic basis is that TTFT, TPOT, and arrival lag correspond to different serving stages, but "majority label = root cause" does not hold.
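A runnable version of the count-majority rule makes two edge cases visible: ties resolve toward `ttft_prefill`, and an empty count also yields `ttft_prefill`. The sketch assumes every reason that does not start with `ttft` or `tpot` counts as "other"; that grouping and the function name are assumptions, not the prototype's exact code.

```python
from typing import Dict


def active_bottleneck(failed_reason_counts: Dict[str, int]) -> str:
    """Count-majority label for one probe, mirroring the rule above."""
    ttft = sum(n for r, n in failed_reason_counts.items() if r.startswith("ttft"))
    tpot = sum(n for r, n in failed_reason_counts.items() if r.startswith("tpot"))
    # Assumption: every remaining reason is treated as "other".
    other = sum(failed_reason_counts.values()) - ttft - tpot

    if ttft >= max(tpot, other):
        return "ttft_prefill"          # also taken on ties and on empty input
    if tpot >= max(ttft, other):
        return "decode_tpot"
    return "admission_or_queueing"
```

Both edge cases are reasons to treat the hard label as weak evidence rather than as a diagnosis.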

The second layer is a set of ranked hypotheses across trials. It combines the following evidence into a score:

workload prior: decode-only plus a TPOT SLO favors the decode hypothesis; a long prompt tail favors the prefill hypothesis;
the active label of the latest probe;
historical probe evidence;
the TTFT/TPOT/queueing symptom ratio in failed_reason_counts;
launch failure / OOM.

A more robust target design should replace the first-layer hard label with a soft evidence vector:

e_prefill   = normalized count of TTFT symptoms
e_decode    = normalized count of TPOT symptoms
e_admission = normalized count of arrival lag / elapsed / unrecoverable / request failures
e_memory    = launch or OOM evidence

The candidate generator should produce the top mechanism probes from the evidence distribution rather than trusting a single hard dominant bottleneck. The hard label in the current prototype is an engineering approximation, not the final contribution.
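One possible reading of the soft evidence vector is sketched below. It assumes normalization by the total symptom count, treats every non-TTFT, non-TPOT reason as admission evidence, and models memory evidence as a 0/1 flag. These choices and the names `evidence_vector` and `top_mechanisms` are assumptions, not the prototype's design.

```python
from typing import Dict, List


def evidence_vector(counts: Dict[str, int], launch_or_oom: bool = False) -> Dict[str, float]:
    """Turn symptom counts into a soft evidence vector."""
    ttft = sum(n for r, n in counts.items() if r.startswith("ttft"))
    tpot = sum(n for r, n in counts.items() if r.startswith("tpot"))
    total = sum(counts.values())
    other = total - ttft - tpot
    denom = total or 1  # avoid division by zero when nothing failed
    return {
        "prefill": ttft / denom,
        "decode": tpot / denom,
        "admission": other / denom,
        "memory": 1.0 if launch_or_oom else 0.0,
    }


def top_mechanisms(evidence: Dict[str, float], k: int = 2) -> List[str]:
    """Return up to k mechanisms with non-zero evidence, strongest first."""
    ranked = sorted(evidence, key=evidence.get, reverse=True)
    return [name for name in ranked[:k] if evidence[name] > 0]
```

Generating probes for the top k mechanisms keeps a second hypothesis alive when the leading one is only marginally ahead.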

3. Mechanism Space

The harness does not search blindly in the Cartesian product of raw knobs:

raw space = {TP, DP, EP, GMU, MNS, MBT, chunked-prefill, ...}

It first maps knobs to controllable mechanisms in the serving pipeline:

Mechanism family	Example knobs	Mechanism meaning	Typical evidence
Topology / resource partition	TP, DP, EP, visible GPUs	Changes compute/memory distribution, replica count, per-GPU efficiency	TTFT/TPOT pressure, topology frontier not yet covered
Prefill scheduler	chunked prefill, MBT	Changes the prefill quantum and head-of-line blocking	TTFT symptoms, long prompt tail, low prefix reuse
Admission / concurrency	MNS	Changes the number of active sequences and batch/admission pressure	arrival lag, pass-rate unrecoverable, concurrency underuse
KV/cache headroom	GMU, block/cache knobs	Changes KV cache blocks and memory feasibility	cache pressure, launch/memory, SLO pressure remaining after topology has settled
Launch/memory feasibility	env, memory-affecting flags	Confirms whether the engine can launch and whether it OOMs	launch failure, OOM
Frontier delta transfer	measured runtime delta applied to other Pareto anchors	Projects a measured runtime change onto an unmeasured frontier anchor	a positive runtime delta on the same topology, and other Pareto anchors exist

These families are grounded not in the winning config of any particular case, but in the LLM serving pipeline:

arrival/admission -> prefill -> decode -> memory/launch feasibility

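Every family must satisfy three constraints:

It corresponds to an interpretable serving mechanism;
It generates only candidates that are legal under the engine schema and hardware constraints;
Its confirm/reject condition is decided by real measurement.

Limitations:

Mechanism families are domain-specific, not engine-agnostic magic;
Engines such as vLLM and SGLang name their knobs differently, so an adapter is needed to map engine knobs onto the same mechanism vocabulary;
The families themselves are grounded in the system, but the current score constants and some gates are still heuristic.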
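The adapter idea can be illustrated with two lookup tables. The canonical knob abbreviations follow the table above; the engine-side names in `ENGINE_X_ALIASES` are made up for a hypothetical engine and are not real vLLM or SGLang options. `families_touched` is likewise a hypothetical helper.

```python
from typing import Dict, Set

# Canonical knob -> mechanism family, following the table above.
FAMILY_OF: Dict[str, str] = {
    "TP": "topology",
    "DP": "topology",
    "EP": "topology",
    "chunked_prefill": "prefill_scheduler",
    "MBT": "prefill_scheduler",
    "MNS": "admission_concurrency",
    "GMU": "kv_cache_headroom",
}

# Hypothetical engine-specific names; a real adapter would list the
# engine's actual option names here.
ENGINE_X_ALIASES: Dict[str, str] = {
    "x_tensor_parallel": "TP",
    "x_max_seqs": "MNS",
    "x_mem_fraction": "GMU",
}


def families_touched(engine_patch: Dict[str, object], aliases: Dict[str, str]) -> Set[str]:
    """Map an engine-level patch to the mechanism families it changes."""
    families = set()
    for knob in engine_patch:
        if knob not in aliases:
            # Knobs outside the declared schema are rejected, not guessed.
            raise KeyError(f"knob outside the declared schema: {knob}")
        families.add(FAMILY_OF[aliases[knob]])
    return families
```

Supporting a second engine would then mean adding another alias table while the family vocabulary stays fixed.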

4. CandidateSet / Intervention Generation

A candidate is not just "a patch". A legal candidate should contain:

candidate = {
  mechanism_family,
  config_patch,
  hypothesis,
  expected_effects,
  confirm_condition,
  reject_condition,
  effective_full_config_signature
}

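The candidate order in the current prototype is roughly:

topology candidates;
frontier-delta projection candidates;
runtime candidates.

The runtime candidates in turn include families such as prefill scheduler, MBT, MNS, and GMU.

Design assumptions:

topology/resource partition usually moves the larger capacity frontier;
runtime knobs are usually local refinements within the same topology;
when the topology frontier is not yet covered, premature runtime hill-climbing can trap the search in a bad topology;
when a runtime delta has already shown a measured gain on one topology, projecting that delta onto other Pareto anchors is a cheaper interaction test than a full factorial grid.

Limitations:

topology-before-runtime is a strong prior, not a theorem; it needs an ablation;
frontier delta transfer depends on measured history and cannot work when the history is too short;
some target steps and score constants in the current prototype are still hand-tuned heuristics.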
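Frontier delta transfer can be sketched as applying one measured runtime patch to each Pareto anchor and dropping anything already tested. The lowercase knob keys, the tuple-based `signature`, and `project_delta` are hypothetical names chosen for this example; the prototype may represent configs differently.

```python
from typing import Dict, List, Set, Tuple

Config = Dict[str, object]


def signature(config: Config) -> Tuple:
    """Order-independent signature of a full config."""
    return tuple(sorted(config.items()))


def project_delta(delta: Config, anchors: List[Config], tested: Set[Tuple]) -> List[Config]:
    """Apply a measured runtime delta to each anchor, skipping tested configs."""
    projected = []
    for anchor in anchors:
        full = {**anchor, **delta}  # the delta overrides the anchor's runtime knobs
        if signature(full) not in tested:
            projected.append(full)
    return projected
```

The number of projected candidates grows with the number of anchors rather than with the product of all knob values, which is where the saving over a factorial grid comes from.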

5. Planner Interface

The planner's responsibilities should be limited to:

rank/select candidate from CandidateSet
explain why this candidate is worth the next trial

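The planner should not:

construct knobs outside the schema;
bypass topology / memory constraints;
repeat an effective full config that has already been tested;
unilaterally decide to stop;
treat a natural-language guess as a measurement verdict.

This is also why a no-LLM harness can work: as long as the CandidateSet and Validator are informative enough, a deterministic planner can complete the tuning as well. The value of an LLM lies in combining evidence, explaining tradeoffs, and ranking when there are many candidates, not in being the sole source of tuning correctness.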
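The planner boundary can be expressed as a narrow interface. In this sketch the planner returns an index into the validated candidate list, so the type itself rules out constructing a config. `Planner`, `EvidenceGreedyPlanner`, and the `state["evidence"]` layout are hypothetical names for illustration, and the deterministic planner is one possible no-LLM baseline rather than the prototype's planner.

```python
from typing import Protocol, Sequence


class Planner(Protocol):
    def select(self, state: dict, candidates: Sequence[dict]) -> int:
        """Return the index of the chosen candidate and nothing else."""
        ...


class EvidenceGreedyPlanner:
    """Deterministic planner: pick the candidate whose mechanism family
    currently has the most evidence."""

    def select(self, state: dict, candidates: Sequence[dict]) -> int:
        evidence = state.get("evidence", {})
        scores = [evidence.get(c["mechanism_family"], 0.0) for c in candidates]
        # max() keeps the first index on ties, so the generator's
        # candidate order acts as the tie-breaker.
        return max(range(len(candidates)), key=scores.__getitem__)
```

An LLM-backed planner would implement the same `select` method, with its explanation logged separately from the returned index.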

6. Validator / Stop Authority

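The Validator is what keeps the harness from degenerating into prompt engineering. It is responsible for:

canonicalizing the effective full config;
rejecting no-ops and repeats;
checking legal topology / visible GPUs / tunable schema;
recording failure memory;
judging the measurement ceiling, for example whether search.high is too low;
forbidding a premature stop when candidate coverage is insufficient;
authorizing a stop only when both the coverage and measurement guards are satisfied.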
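The stop authority can be sketched as a guard that returns a decision together with a reason. The sketch assumes that coverage means every declared mechanism family has been tried, and that `search.high` is the upper bound of the probed request rate, so a best feasible probe at that bound signals a measurement ceiling. Both readings, and the name `authorize_stop`, are assumptions.

```python
from typing import Iterable, Tuple


def authorize_stop(
    declared_families: Iterable[str],
    covered_families: Iterable[str],
    best_feasible_rate: float,
    search_high: float,
) -> Tuple[bool, str]:
    """Authorize a stop only if the coverage and measurement guards both hold."""
    missing = sorted(set(declared_families) - set(covered_families))
    if missing:
        return False, "coverage gap: " + ", ".join(missing)
    if best_feasible_rate >= search_high:
        # The best feasible probe sits at the upper search bound, so the
        # true capacity may be higher than what was measured.
        return False, "measurement ceiling: search.high too low"
    return True, "stop authorized"
```

Returning the reason string lets the harness report a coverage or measurement gap instead of stopping silently.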

Important design correction:

no-repeat must use normalized effective full-config signature,
not patch signature.

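The reason is that a runtime-only patch inherits the incumbent topology at materialization time. If only the patch signature is checked, {"gmu": 0.9} may be mistaken for a new config even though, when actually executed, it may materialize into a full config that has already been measured.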
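The difference between the two signatures can be shown in a few lines. This sketch assumes a full config is a flat dict and that a patch overrides incumbent keys; `materialize`, `full_signature`, and `is_repeat` are hypothetical names, and sorted-key JSON is only one possible normalization.

```python
import json
from typing import Dict, Set


def materialize(incumbent: Dict[str, object], patch: Dict[str, object]) -> Dict[str, object]:
    """A patch inherits every knob it does not set from the incumbent."""
    return {**incumbent, **patch}


def full_signature(config: Dict[str, object]) -> str:
    """Normalized signature of a full config; key order does not matter."""
    return json.dumps(config, sort_keys=True)


def is_repeat(incumbent: Dict[str, object], patch: Dict[str, object], seen: Set[str]) -> bool:
    """Check novelty on the materialized full config, not on the patch."""
    return full_signature(materialize(incumbent, patch)) in seen


incumbent = {"tp": 4, "mns": 128, "gmu": 0.9}
seen = {full_signature(incumbent)}           # the incumbent was already measured
repeat = is_repeat(incumbent, {"gmu": 0.9}, seen)   # same full config as before
novel = is_repeat(incumbent, {"gmu": 0.95}, seen)   # a new full config
```

Here the patch `{"gmu": 0.9}` has never been proposed before, yet it materializes into the incumbent itself, so only the full-config check flags it as a repeat.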

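Limitations:

The Validator can only guarantee coverage relative to the declared grammar/operator set;
It cannot prove that no better point exists in the full raw knob space;
When the measurement ceiling is insufficient, it should report this and ask for human confirmation rather than silently synthesizing arrivals or repeating windows.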

Precise Contribution Statement

We should claim:

AITuner introduces a planner-agnostic harness that converts LLM serving
configuration tuning from black-box knob search into typed, measurement-grounded
counterfactual experiments over serving mechanisms.

This breaks down into three contributions:

Serving-stage evidence compiler: converts the workload profile, SLO violation symptoms, probe early stops, and launch failures into mechanism evidence for prefill/decode/admission/memory/launch, rather than handing the planner only a scalar score.
Typed mechanism action space: organizes raw knobs into intervention families such as topology, prefill scheduler, admission/concurrency, cache headroom, and frontier transfer, so that search happens in the mechanism space rather than in an arbitrary knob vector space.
Validator-controlled experimental loop: uses the full-config signature, constraints, failure memory, coverage, and measurement guards to control proposals and stops, so that LLM/BO/heuristic planners can only work on a legal, auditable candidate set.

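We should not claim:

the bottleneck classifier is always correct;
failed_reason_counts is a root-cause label;
the current heuristic score constants are theoretically optimal;
the harness covers the complete raw knob space;
a stop proves global optimality;
the winning config of some case was "proven" by the system.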

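Evidence Still Required

To show that the contribution is not rule accumulation, follow-up experiments must ablate families and authority rather than report only final performance:

Ablation	What it shows
classifier off / shuffled evidence	whether evidence attribution really steers the search in the right direction
mechanism space off, using raw random/BO instead	whether the mechanism action space compresses the search and improves convergence
topology-before-runtime off	whether the prior of large frontier interventions first is necessary
frontier-delta projection off	whether cross-topology runtime transfer resolves bad-start/local traps
validator off / patch signature only	whether the full-config validator avoids repeats and false progress
no-LLM deterministic planner	whether the harness is a planner-agnostic substrate
weak planner + harness vs strong planner naive	whether the harness can compensate for a gap in planner capability

The final paper wording should keep this boundary: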

Harness makes the search more structured, auditable, and measurement-efficient.
It does not replace measurement, does not prove global optimality, and does not
turn symptom labels into perfect causal diagnosis.
