From 00ba573631b8da3989a8e90f113f714a6c8d764d Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Mon, 29 Jun 2026 20:26:58 +0800
Subject: [PATCH] Document harness design contract

---
 docs/aituner-harness-design-contract.md | 307 ++++++++++++++++++++++++
 docs/aituner-roadmap.md                 |   4 +
 2 files changed, 311 insertions(+)
 create mode 100644 docs/aituner-harness-design-contract.md
diff --git a/docs/aituner-harness-design-contract.md b/docs/aituner-harness-design-contract.md
new file mode 100644
index 0000000..22ea509
--- /dev/null
+++ b/docs/aituner-harness-design-contract.md
@@ -0,0 +1,307 @@
+# AITuner Harness Design Contract
+
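+This document summarizes the design semantics of the current AITuner harness. It is neither an experiment log nor final paper text;
+its purpose is to state precisely the system contributions we can claim, how each module works, and the implicit assumptions and limitations.
+
+Core conclusion: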
+
+```text
+The contribution of the AITuner harness is neither "an LLM can tune knobs" nor "a set of hand-written expert if/else rules".
+
+The harness aims to turn black-box knob search into
+measurement-grounded, mechanism-guided, validator-controlled experiments.
+```
+
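+In other words, the planner can be an LLM, BO, a bandit, a deterministic heuristic, or a human choice.
+The harness turns observations into auditable mechanism hypotheses, generates legal candidates, and uses real measurements to confirm or reject those hypotheses.
+
+## Core Flow Diagram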
+
+```mermaid
+flowchart TD
+    A[StudySpec<br/>engine schema, tunable knobs, hardware, SLO] --> B[Run trial / probe]
+    W[Workload window<br/>prompt length, output length, arrivals, prefix/cache hints] --> C
+    B --> C[Observation schema<br/>effective config, probe result, SLO violations, launch status]
+    C --> D[Evidence compiler<br/>symptom evidence over serving stages]
+    D --> E[Mechanism hypotheses<br/>prefill, decode, admission, memory, launch]
+    E --> F[Mechanism action families<br/>topology, scheduler, concurrency, cache, frontier transfer]
+    F --> G[CandidateSet<br/>legal patches + hypotheses + expected effects]
+    G --> H[Planner backend<br/>LLM / BO / heuristic ranks candidates]
+    H --> I[Validator + materializer<br/>constraints, no-repeat full config, failure memory, stop authority]
+    I --> B
+    B --> J[Measurement verdict<br/>SLO pass, req/s/GPU, latency quantiles]
+    J --> C
+    G --> K[Stop decision<br/>only when coverage and measurement guards allow]
+```
+
+Key point: the LLM should not bypass `CandidateSet` or `Validator`. At most, the LLM is a candidate ranker or copilot,
+not the authority on legality, coverage, or stop.
+
+## Module Semantics
+
+### 1. Observation Schema
+
+The harness first converts the result of one trial/probe into a structured observation:
+
+```text
+O_t = {
+  workload summary,
+  SLO rules,
+  effective engine config,
+  best feasible probe,
+  limiting probe,
+  failed_reason_counts,
+  early_stop_reason,
+  pass_rate,
+  request_rate_per_gpu,
+  launch / OOM status
+}
+```
+
+Here, `failed_reason_counts` is defined as a multiset count of request-level SLO violation reasons:
+
+```text
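+ttft_ms>threshold       the request's TTFT exceeds the SLO
+tpot_ms>threshold       the request's TPOT exceeds the SLO
+arrival_lag_s>limit     synthetic arrivals have fallen behind and cannot catch up
+probe_elapsed_s>limit   total probe elapsed time exceeds the limit
+slo_pass_rate_unrecoverable
+                        too many failures already; the target pass rate is mathematically unreachable
+request_failed / timeout / completion mismatch
+                        request-level failure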
+```
+
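+Important limitations:
+
+- A single request can contribute to both `ttft_ms>...` and `tpot_ms>...`;
+- `failed_reason_counts` is symptom evidence, not root-cause ground truth;
+- queueing/admission evidence comes mainly from early stops at the probe scheduling layer, not from an exact decomposition of per-request latency.
+
+Documents and papers must therefore avoid saying "the failed counts prove that the root cause is TTFT/TPOT". A more accurate statement is: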
+
+```text
+failed_reason_counts gives SLO violation symptoms.
+Harness infers serving-stage hypotheses from these symptoms.
+```
+
+### 2. Evidence Compiler / Bottleneck Hypotheses
+
+The current prototype aggregates at two levels.
+
+The first level is the active bottleneck of a single probe. The current implementation uses a count majority:
+
+```text
+ttft_count = sum(count(reason startswith "ttft"))
+tpot_count = sum(count(reason startswith "tpot"))
+other_request_failed_count = non-TTFT, non-TPOT request failures
+
+if ttft_count >= max(tpot_count, other_request_failed_count):
+    active = ttft_prefill
+elif tpot_count >= max(ttft_count, other_request_failed_count):
+    active = decode_tpot
+else:
+    active = admission_or_queueing
+```
+
+This step is a heuristic. Its semantic basis is that TTFT, TPOT, and arrival lag correspond to different serving stages,
+but "majority label = root cause" does not hold.
+
+The second level is ranked hypotheses across trials. It combines the following evidence into a score:
+
+- workload prior: decode-only + TPOT SLO lends more support to the decode hypothesis; a long prompt tail lends more support to the prefill hypothesis;
+- latest probe active label;
+- historical probe evidence;
+- the TTFT/TPOT/queueing symptom ratio in `failed_reason_counts`;
+- launch failure / OOM.
+
+A more robust target design would replace the first-level hard label with a soft evidence vector:
+
+```text
+e_prefill   = normalized count of TTFT symptoms
+e_decode    = normalized count of TPOT symptoms
+e_admission = normalized count of arrival lag / elapsed / unrecoverable / request failures
+e_memory    = launch or OOM evidence
+```
+
+The candidate generator should generate top mechanism probes from the evidence distribution rather than trusting a single
+hard dominant bottleneck. The hard label in the current prototype is an engineering approximation, not the final contribution.
+
+### 3. Mechanism Space
+
+The harness does not search blindly in the Cartesian product of raw knobs:
+
+```text
+raw space = {TP, DP, EP, GMU, MNS, MBT, chunked-prefill, ...}
+```
+
+It first maps knobs onto controllable mechanisms in the serving pipeline:
+
+| Mechanism family | Example knobs | Mechanism meaning | Typical evidence |
+| --- | --- | --- | --- |
+| Topology / resource partition | TP, DP, EP, visible GPUs | Changes compute/memory distribution, replica count, and per-GPU efficiency | TTFT/TPOT pressure, topology frontier not yet covered |
+| Prefill scheduler | chunked prefill, MBT | Changes the prefill quantum and head-of-line blocking | TTFT symptoms, long prompt tail, low prefix reuse |
+| Admission / concurrency | MNS | Changes the number of active sequences and batch/admission pressure | arrival lag, pass-rate unrecoverable, concurrency underuse |
+| KV/cache headroom | GMU, block/cache knobs | Changes KV cache blocks and memory feasibility | cache pressure, launch/memory, SLO pressure that persists after the topology is settled |
+| Launch/memory feasibility | env, memory-affecting flags | Confirms whether the engine can launch and whether it OOMs | launch failure, OOM |
+| Frontier delta transfer | measured runtime delta applied to other Pareto anchors | Projects a measured runtime change onto unmeasured frontier anchors | positive runtime delta on the same topology, and other Pareto anchors exist |
+
+These families are grounded not in the winning config of any particular case but in the LLM serving pipeline:
+
+```text
+arrival/admission -> prefill -> decode -> memory/launch feasibility
+```
+
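+Every family must satisfy three constraints:
+
+1. It corresponds to an interpretable serving mechanism;
+2. It generates only candidates that are legal under the engine schema and hardware constraints;
+3. Its confirm/reject condition is decided by real measurement.
+
+Limitations:
+
+- Mechanism families are domain-specific, not engine-agnostic magic;
+- engines such as vLLM and SGLang name their knobs differently, so an adapter is needed to map engine knobs onto a shared mechanism vocabulary;
+- the families themselves have a systems basis, but the current score constants and some gates are still heuristic.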
+
+### 4. CandidateSet / Intervention Generation
+
+A candidate is more than "a patch". A legal candidate should contain:
+
+```text
+candidate = {
+  mechanism_family,
+  config_patch,
+  hypothesis,
+  expected_effects,
+  confirm_condition,
+  reject_condition,
+  effective_full_config_signature
+}
+```
+
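+The candidate order in the current prototype is roughly:
+
+1. topology candidates;
+2. frontier-delta projection candidates;
+3. runtime candidates.
+
+The runtime candidates in turn include families such as prefill scheduler, MBT, MNS, and GMU.
+
+Design assumptions:
+
+- topology/resource partition usually shifts the capacity frontier by a larger amount;
+- runtime knobs are usually local refinements within a single topology;
+- while the topology frontier is not yet covered, premature runtime hill-climbing can trap the search in a bad topology;
+- once a runtime delta has shown a measured gain on one topology, projecting that delta onto other Pareto
+  anchors is a cheaper interaction test than a full factorial grid.
+
+Limitations:
+
+- `topology-before-runtime` is a strong prior, not a theorem; it needs an ablation;
+- frontier delta transfer depends on measured history and cannot work when the history is too sparse;
+- some target steps and score constants in the current prototype are still hand-tuned heuristics.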
+
+### 5. Planner Interface
+
+The planner's responsibilities should be limited to:
+
+```text
+rank/select candidate from CandidateSet
+explain why this candidate is worth the next trial
+```
+
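+The planner should not:
+
+- construct knobs outside the schema;
+- bypass topology / memory constraints;
+- repeat an effective full config that has already been tested;
+- unilaterally decide to stop;
+- treat natural-language guesses as a measurement verdict.
+
+This is also why a no-LLM harness can work: as long as `CandidateSet` and `Validator` are informative enough,
+a deterministic planner can complete the tuning as well. The value of the LLM lies in combining evidence, explaining tradeoffs,
+and ranking when there are many candidates, not in being the sole source of tuning correctness.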
+
+### 6. Validator / Stop Authority
+
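+The validator is what keeps the harness from degenerating into prompt engineering. It is responsible for:
+
+- canonicalizing the effective full config;
+- rejecting no-ops and repeats;
+- checking legal topology / visible GPUs / tunable schema;
+- recording failure memory;
+- judging the measurement ceiling, for example whether `search.high` is too low;
+- forbidding premature stop while candidate coverage is insufficient;
+- authorizing stop only when both the coverage and measurement guards are satisfied.
+
+Important design correction: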
+
+```text
+no-repeat must use normalized effective full-config signature,
+not patch signature.
+```
+
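+This is because a runtime-only patch inherits the incumbent topology at materialization time.
+Looking only at the patch signature, `{"gmu": 0.9}` may be mistaken for a new config,
+when in actual execution it may materialize into a full config that has already been measured.
+
+Limitations:
+
+- The validator can only guarantee coverage relative to the declared grammar/operator set;
+- it cannot prove that the full raw knob space contains no better point;
+- when the measurement ceiling is insufficient, it should report this and ask for human confirmation rather than silently synthesizing arrivals or repeating windows.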
+
+## Precise Contribution Statement
+
+We should claim:
+
+```text
+AITuner introduces a planner-agnostic harness that converts LLM serving
+configuration tuning from black-box knob search into typed, measurement-grounded
+counterfactual experiments over serving mechanisms.
+```
+
+This breaks down into three contributions:
+
+1. **Serving-stage evidence compiler**
+   Converts workload profile, SLO violation symptoms, probe early stops, and launch failures
+   into mechanism evidence for prefill/decode/admission/memory/launch, instead of handing the planner only a scalar score.
+
+2. **Typed mechanism action space**
+   Organizes raw knobs into intervention families such as topology, prefill scheduler, admission/concurrency, cache headroom,
+   and frontier transfer, so that search happens in mechanism space rather than in an arbitrary knob vector space.
+
+3. **Validator-controlled experimental loop**
+   Uses full-config signatures, constraints, failure memory, coverage, and measurement guards
+   to control proposal and stop, so that LLM/BO/heuristic planners can only operate on a legal, auditable candidate set.
+
+We should not claim that:
+
+- the bottleneck classifier is always correct;
+- `failed_reason_counts` is a root-cause label;
+- the current heuristic score constants are theoretically optimal;
+- the harness covers the full raw knob space;
+- stop proves global optimality;
+- the winning config of any given case was "proven" by the system.
+
+## Evidence We Still Need
+
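+To show that the contribution is not rule accumulation, follow-up experiments must ablate families and authority rather than only report final performance:
+
+| Ablation | What it demonstrates |
+| --- | --- |
+| classifier off / shuffled evidence | whether evidence attribution really steers the search in the right direction |
+| mechanism space off, replaced by raw random/BO | whether the mechanism action space compresses the search and improves convergence |
+| topology-before-runtime off | whether the large-frontier-intervention prior is necessary |
+| frontier-delta projection off | whether cross-topology runtime transfer resolves bad-start/local traps |
+| validator off / patch signature only | whether the full-config validator avoids repeats and false progress |
+| no-LLM deterministic planner | whether the harness is a planner-agnostic substrate |
+| weak planner + harness vs strong planner naive | whether the harness can compensate for a gap in planner capability |
+
+The final paper wording should keep to this boundary: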
+
+```text
+Harness makes the search more structured, auditable, and measurement-efficient.
+It does not replace measurement, does not prove global optimality, and does not
+turn symptom labels into perfect causal diagnosis.
+```
diff --git a/docs/aituner-roadmap.md b/docs/aituner-roadmap.md
index 1adc4c7..273f8fe 100644
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -64,6 +64,10 @@ kernel、KV cache、通信和排队的闭式性能模型。更稳妥也更强的
 
 ## 关键设计点
 
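+For the current harness design semantics, module assumptions, and limitations, see the
+[AITuner Harness Design Contract](aituner-harness-design-contract.md). The roadmap only maintains
+claims and experiment priorities; the design contract precisely defines what we can and cannot say.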
+
 | 设计点 | 更强表述 | 作用 | 需要证明 |
 | --- | --- | --- | --- |
 | Observation | mechanism state | 将 workload shape、probe trace、SLO failure、launch/memory failure 结构化 | agent 看到的是可计算状态，不是自然语言日志 |