Design declarative intervention harness

2026-06-26 17:15:06 +08:00
parent 92eb186006
commit 4075c7abf0
3 changed files with 1017 additions and 4 deletions
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -39,6 +39,10 @@ M: measurement/scoring protocol
   SLO-constrained feasible frontier, req/s/GPU, latency quantiles, pass-rate guard
 ```

+The current `src/aituner/harness.py` is a prototype: it already demonstrates that a no-LLM loop and mechanism-guided
+proposals are feasible, but it still contains many rule-based heuristics and cannot serve as the final harness
+contribution. For the new target design, see [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md).
+
 The planner is replaceable:

 ```text
@@ -79,11 +83,28 @@ closed-form performance models of kernels, KV cache, communication, and queueing. The safer and also stronger
 | C5. AITuner finds a near-optimal region rather than just one feasible config | Qwen30B shows an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or expert-config comparison |
 | C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
 | C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main experimental claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
-| C8. The harness can recover from bad starting points and does not depend solely on a trusted base config | Planner rules and local regression tests added; real-hardware runs pending | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | Run no-LLM recovery on real hardware from bad starts such as `TP=8, max-num-seqs=8, gmu=0.5` |
+| C8. The harness can recover from bad starting points and does not depend solely on a trusted base config | The current rule-based fix counts only as a prototype signal, not as a final claim | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring into grammar/operators, run a random/adversarial start distribution |

 ## Highest-priority experiments

-### P0a. Bad-start recovery confirmation
+### P0a. Declarative harness redesign gate
+
+Goal: stop adding testcase-specific rules to `harness.py` and refactor the harness into a
+declarative intervention grammar + coverage-relative validator.
+
+Minimum acceptance criteria:
+
+- CandidateSet is fully enumerated and its snapshot is persisted;
+- `harness_priority` is separated from backend ranking;
+- CoverageUnit is structured, and stop cannot rely on the exact signature alone;
+- Failure invalidation has a conservative region predicate and retry/unblock conditions;
+- grammar/policy/capability all carry versions and anti-overfitting static checks;
+- LLM/BO can only select legal candidates and cannot bypass the validator.
+
+Why this has priority: unless this gate is passed first, extending the bad-start/SLO/2x2 experiments only
+validates a rule-based prototype.
+
+### P0b. Bad-start recovery confirmation after redesign

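 Goal: determine whether the harness can only start from a trusted base config, or whether it can recover
 the right direction from a clearly unreasonable initial config.
@@ -96,6 +117,9 @@ closed-form performance models of kernels, KV cache, communication, and queueing. The safer and also stronger
 | bad-runtime | `TP=2, gmu=0.5, max-num-seqs=8` | Low runtime headroom jumps back to the nominal floor |
 | combined-bad | `TP=8, gmu=0.5, max-num-seqs=8` | Topology recovery and runtime recovery can be chained |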

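+Note: this does not mean running a single hand-crafted bad case first. Once the declarative harness is in place, a random/adversarial
+start distribution must be run and the distribution of results reported.
+
 Expected figure:

 - x-axis: trial index;
@@ -103,9 +127,9 @@ closed-form performance models of kernels, KV cache, communication, and queueing. The safer and also stronger
 - line groups: trusted-start vs bad-start cases;
 - annotation: proposal family sequence, e.g. `TP downshift`, `gmu floor jump`, `gmu climb`.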

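-Launch condition: first confirm that the dash fleet has an idle 8xH20 machine; start only after user confirmation.
+Launch condition: first complete P0a; then confirm that the dash fleet has an idle 8xH20 machine; start only after user confirmation.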

-### P0b. Complete the Qwen235B decode 2x2 and compile the aggregate
+### P0c. Complete the Qwen235B decode 2x2 and compile the aggregate

 Goal: fill in the core `harness on/off x strong/weak planner` evidence and answer:

--- a/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
+++ b/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
@@ -0,0 +1,981 @@
+# Declarative Intervention Harness Design - 2026-06-26
+
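+This document redefines the AITuner harness. The goal is to keep the harness from becoming a pile of testcase-oriented
+`if/else` tuning rules, and to state explicitly which properties can be proven, which claims must be validated experimentally, and which
+claims cannot be made.
+
+The conclusion first:
+
+```text
+The harness's contribution is not "expert rules that know how to tune vLLM."
+
+The harness's contribution should be:
+turning raw knob optimization into declarative, coverage-aware experimental design.
+```
+
+In other words, the harness should not hardcode the next answer directly; it should define:
+
+- which state is observable;
+- which interventions are legal;
+- which system hypothesis each intervention tests;
+- which regions have already been measured, refuted, or covered;
+- when the planner may still pick experiments;
+- when a stop is coverage-relative sound.
+
+The planner can be an LLM, BO, a bandit, a greedy heuristic, or an oracle replay. The harness is the substrate
+these planners share, not a prompt trick for one particular planner.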
+
+## Independent review verdict
+
+The first draft of this document was independently reviewed by a fresh subagent, with the verdict:
+
+```text
+accept with major revisions
+```
+
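+The review's core warning is that moving rules out of Python `if/else` into a grammar, operators, a policy, or
+priorities may still amount to a rule-based harness in a new skin. This design must therefore additionally satisfy:
+
+- the candidate set must be fully enumerated and its snapshot persisted;
+- priorities/thresholds must have provenance, versions, and sensitivity tests;
+- coverage units must be structured and cannot rely on `signature_tested` alone;
+- failure invalidation must have a conservative region predicate, boundaries, and retry/unblock conditions;
+- the stop proof must reference the candidate set snapshot;
+- grammar/policy/capability must also pass anti-overfitting static checks;
+- LLM/BO can only select candidates and cannot covertly change legality, coverage, or stop authority.
+
+The design below incorporates these major revisions as hard requirements.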
+
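+## Current problems
+
+The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation,
+bottleneck hypotheses, candidate actions, validator stop. But the implementation is still heavily rule-based:
+
+- observation, bottleneck attribution, candidate generation, scoring, and the stop validator
+  all live in a single Python file;
+- knob-family behavior is determined by Python branches and fixed thresholds;
+- scoring uses hand-picked constants;
+- the stop validator is a guard cascade, not an explicit coverage proof;
+- each newly discovered bad case tends to add another branch, e.g. for a bad high-TP start or a low-gmu start.
+
+Such an implementation works as an engineering prototype but cannot be a systems contribution. At most it shows:
+
+```text
+We wrote a set of heuristics that work for some cases we have already seen.
+```
+
+It cannot show that:
+
+- the harness is general;
+- the harness is robust to arbitrary bad starts;
+- no untested better config exists at stop time;
+- the candidate generator covers all important interventions;
+- AITuner's gains do not come from testcase-specific rule accumulation.
+
+Therefore, further work cannot proceed by patching a rule onto each failure mode.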
+
+## Design goals
+
+The new harness must meet five goals.
+
+1. **Declarative**
+   System knowledge must live in the grammar, operators, coverage predicates, and policy config,
+   rather than being scattered across Python `if/else`.
+
+2. **Planner-agnostic**
+   The harness owns legality, candidate construction, coverage accounting, and stop
+   authority; the planner only ranks or selects within the legal candidate set.
+
+3. **Coverage-aware**
+   Stop does not mean "no rule fired." Stop must have explicit coverage semantics relative to the
+   current grammar and operator set.
+
+4. **Anti-overfitting**
+   The design must not contain model names, case ids, known winning configs, fixed workload ids, or
+   constants taken from experimental results. A case may enter the system only through `StudySpec`, engine capability, workload profile, and trial
+   measurements.
+
+5. **Auditable**
+   Every proposal must be able to answer:
+
+```text
+Which hypothesis does this trial test?
+Which intervention dimension does it change?
+What is its confirm/reject condition?
+How does coverage change after the trial ends?
+```
+
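+## Non-goals
+
+We do not pursue, and should not claim, the following:
+
+- global optimum;
+- completeness over arbitrary config spaces;
+- robustness to arbitrary serving engines/workloads/SLOs;
+- a bottleneck classifier that is always correct;
+- that no better config exists in the real world at stop time.
+
+What we pursue is a more rigorous but provable relative property:
+
+```text
+Under the declared grammar, operator set, engine constraints, and measurement budget,
+the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.
+```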
+
+## New architecture
+
+```text
+StudySpec + EngineCapability + WorkloadProfile + TrialHistory
+        |
+        v
+HarnessState
+        |
+        v
+InterventionGrammar
+        |
+        v
+OperatorRegistry -----> CandidateSet
+        |                    |
+        v                    v
+CoverageState <------ MeasuredVerdict
+        |
+        v
+CoverageValidator
+        |
+        v
+PlannerBackend
+        |
+        v
+Proposal or Stop
+```
+
+Versioned objects:
+
+| Object | Must be versioned | Outputs it appears in |
+| --- | --- | --- |
+| `GrammarVersion` | yes | candidate id、candidate set snapshot、coverage delta、stop report |
+| `PolicyVersion` | yes | candidate priority、blocked reason、stop threshold、stop report |
+| `EngineCapabilityVersion` | yes | axis domain、safe range、candidate patch、failure invalidation |
+| `PlannerBackendVersion` | yes | selected candidate、ranking trace |
+
+A candidate/stop without version numbers cannot be used as paper evidence.
+
+### HarnessState
+
+`HarnessState` is the only state entry point the planner sees. It consists of structured data:
+
+| Field | Source | Role |
+| --- | --- | --- |
+| `study` | `StudySpec` | tunable schema、topology constraints、SLO、search budget |
+| `engine_capability` | static profile / engine adapter | safe ranges、knob lattice、unsupported combinations |
+| `workload_profile` | trace summary / L-C-A profile | length/cache/arrival regime |
+| `trial_profiles` | measured results | latency、SLO failures、throughput、launch failures |
+| `incumbent` | state best | current best config and measured objective |
+| `tested_signatures` | trial history | no-repeat and coverage accounting |
+| `failure_memory` | launch/runtime failures | invalidate region or operator dimensions |
+
+Key constraints:
+
+- `HarnessState` contains no natural-language, prompt-only state;
+- every field must be serializable, snapshottable, and replayable;
+- the same `HarnessState + Grammar + Policy` must produce the same candidate set.
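The serializability and determinism constraints above can be illustrated with a canonical-JSON hash. This is a minimal sketch under stated assumptions, not the project's implementation: the `state_hash` helper and the toy values are hypothetical, and only the field names `incumbent` and `tested_signatures` come from the table above.

```python
import hashlib
import json


def state_hash(state: dict) -> str:
    """Hash a JSON-serializable state via a canonical encoding (hypothetical helper)."""
    # sort_keys plus fixed separators make the encoding independent of dict insertion order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two toy states with identical content but different key order (values are made up).
state_a = {"incumbent": {"tp": 2, "gmu": 0.9}, "tested_signatures": ["tp=2,gmu=0.9"]}
state_b = {"tested_signatures": ["tp=2,gmu=0.9"], "incumbent": {"gmu": 0.9, "tp": 2}}
state_c = {"incumbent": {"tp": 4, "gmu": 0.9}, "tested_signatures": ["tp=2,gmu=0.9"]}
```

Key order does not affect the hash, but list order does, so a real implementation would also need to fix the ordering of list-valued fields such as `tested_signatures` before hashing.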
+
+### InterventionGrammar
+
+The grammar describes the intervention space of serving tuning. It does not pick candidates; it only declares:
+
+- family;
+- axes;
+- legal value source;
+- generic operators;
+- expected effect schema;
+- risk schema;
+- coverage dimensions;
+- required evidence.
+
+Suggested initial grammar:
+
+| Family | Axes | Operators | Coverage dimension |
+| --- | --- | --- | --- |
+| `topology` | TP, DP, EP, EP enable | `step_up`, `step_down`, `redistribute`, `bracket` | topology lattice neighborhood |
+| `kv_memory` | `gpu-memory-utilization` | `jump_to_floor`, `local_climb`, `backoff_after_failure` | runtime trust region on incumbent topology |
+| `admission` | `max-num-seqs` | `raise`, `lower`, `bracket` | sequence concurrency region |
+| `batching` | `max-num-batched-tokens`, chunked prefill | `raise`, `lower`, `joint_adjust` | prefill/decode batching region |
+| `allocator` | block size / cache knobs | `switch_categorical`, `bracket` | memory layout region |
+| `failure` | failed signatures/regions | `invalidate_region`, `avoid_region` | failure memory coverage |
+
+Grammar example:
+
+```yaml
+family: kv_memory
+axes:
+  gpu-memory-utilization:
+    type: bounded_float
+    value_source: engine_capability
+    nominal_floor_key: gmu_nominal_floor
+    safe_ceiling_key: gmu_safe_ceiling
+operators:
+  - name: jump_to_floor
+    preconditions:
+      - axis_below_nominal_floor
+      - incumbent_topology_preserved
+    coverage_effect: covers_runtime_floor
+  - name: local_climb
+    preconditions:
+      - axis_at_or_above_nominal_floor
+      - no_failed_higher_target
+    coverage_effect: extends_runtime_trust_region
+```
+
+Note that `0.5 -> 0.9` is not written here. `0.9` comes from the engine capability/policy config, and
+`jump_to_floor` is a generic operator on a bounded numeric axis.
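To show how such an operator stays free of hardcoded values, here is a minimal Python sketch of `jump_to_floor`. The `BoundedAxis` class, the function signature, and the numeric floor/ceiling are assumptions for illustration; the sketch checks only the `axis_below_nominal_floor` precondition and leaves `incumbent_topology_preserved` to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundedAxis:
    """Hypothetical bounded numeric axis; floor and ceiling come from engine capability."""
    knob: str
    nominal_floor: float
    safe_ceiling: float


def jump_to_floor(axis: BoundedAxis, current: float) -> Optional[dict]:
    """Return a patch moving the axis to its nominal floor, or None if the precondition fails."""
    # Precondition from the grammar example: axis_below_nominal_floor.
    if current >= axis.nominal_floor:
        return None
    return {"flag_patch": {axis.knob: axis.nominal_floor}}


# Example axis; the 0.9 / 0.95 values are illustrative, not measured limits.
gmu = BoundedAxis("gpu-memory-utilization", nominal_floor=0.9, safe_ceiling=0.95)
other = BoundedAxis("some-other-knob", nominal_floor=0.25, safe_ceiling=1.0)
```

The same function serves any bounded axis: the target value is read from the axis declaration, never from the operator body.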
+
+Topology example:
+
+```yaml
+family: topology
+axes:
+  tensor-parallel-size:
+    type: ordered_lattice
+    value_source: topology_constraints
+  data-parallel-size:
+    type: ordered_lattice
+    value_source: topology_constraints
+operators:
+  - name: step_up
+    coverage_effect: covers_upper_neighbor
+  - name: step_down
+    coverage_effect: covers_lower_neighbor
+  - name: bracket
+    preconditions:
+      - anchor_at_boundary_or_unknown_region
+      - neighbor_uncovered
+    coverage_effect: brackets_ordered_axis
+  - name: redistribute
+    coverage_effect: covers_same_product_tp_dp_neighbor
+```
+
+Note that `TP=8 -> TP=4` is not written here. It is a `bracket` on an ordered lattice.
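A minimal sketch of `bracket` on an ordered lattice follows. The signature is hypothetical, and the sketch enforces only the `neighbor_uncovered` precondition; the `anchor_at_boundary_or_unknown_region` check is assumed to happen elsewhere.

```python
from typing import List, Set


def bracket(domain: List[int], anchor: int, covered: Set[int]) -> List[int]:
    """Return the uncovered values adjacent to `anchor` on an ordered lattice (sketch)."""
    ordered = sorted(domain)
    i = ordered.index(anchor)  # raises ValueError if the anchor is not a legal value
    neighbors = []
    if i > 0:
        neighbors.append(ordered[i - 1])
    if i < len(ordered) - 1:
        neighbors.append(ordered[i + 1])
    # Precondition from the grammar example: neighbor_uncovered.
    return [v for v in neighbors if v not in covered]
```

With a legal TP domain of `[1, 2, 4, 8]`, an anchor at the upper boundary yields only the lower neighbor. That result follows from the lattice, not from a rule about any particular TP value.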
+
+### AxisSpec and OperatorSpec
+
+The grammar must be expressed as a typed schema, not free text.
+
+`AxisSpec`:
+
+| Field | Meaning |
+| --- | --- |
+| `axis_id` | stable id, e.g. `topology.tp` |
+| `knob_keys` | knobs affected when this axis is lowered into a config patch |
+| `type` | `ordered_lattice`, `bounded_float`, `bounded_int`, `categorical`, `coupled` |
+| `domain_source` | `topology_constraints`, `engine_capability`, `policy_config`, `study_spec` |
+| `domain` | legal values or range; must be enumerable or discretizable |
+| `order` | total/partial order for an ordered axis |
+| `coupling` | legality constraints with other axes |
+| `safe_region` | floor/ceiling/default/unsupported regions, with provenance |
+
+`OperatorSpec`:
+
+| Field | Meaning |
+| --- | --- |
+| `operator_id` | stable id, e.g. `topology.bracket` |
+| `family` | topology / kv_memory / admission / batching / allocator / failure |
+| `input_axes` | axes the operator acts on |
+| `preconditions` | generic predicates; must reference state/axis/evidence, never a testcase |
+| `patch_fn` | generates a candidate patch from the state and axis value |
+| `coverage_effects` | the structured `CoverageUnit`s it produces |
+| `required_evidence` | evidence required before the operator can be enabled |
+| `risk_model` | source and version of the risk/cost terms |
+| `confirm_reject_schema` | how to update coverage from the measured verdict |
+
+Every new operator must first pass schema validation and toy-lattice completeness tests.
+
+### Generic Operators
+
+An operator is a reusable action template. An operator knows nothing about model names, case ids, or known winners.
+
+Base operators that must be supported:
+
+| Operator | Input | Output | Purpose |
+| --- | --- | --- | --- |
+| `step_up(axis)` | ordered axis | adjacent higher value | test the frontier above |
+| `step_down(axis)` | ordered axis | adjacent lower value | test the frontier below |
+| `bracket(axis)` | ordered axis + boundary/trust-region state | adjacent contrast point | build a local bracket from a biased starting point |
+| `jump_to_floor(axis)` | bounded axis + nominal floor | floor target | return from an illegal/inefficient range to the safe operating range |
+| `local_climb(axis)` | bounded axis + step policy | next value | climb locally within the trust region |
+| `backoff(axis)` | failed higher target | lower safe value | handle launch/OOM/regression |
+| `redistribute(axis_a, axis_b)` | coupled axes + product constraints | legal coupled patch | TP/DP/EP redistribution |
+| `joint_adjust(axes)` | coupled runtime axes | legal joint patch | MBT/MNS interaction |
+| `preserve(dimensions)` | incumbent | complete patch context | keep the topology/runtime anchor |
+| `block_if_seen(signature)` | tested signatures | reject candidate | no-repeat |
+| `invalidate_region(failure)` | failure profile | region predicate | avoid repeating failed regions |
+
+Operators only construct candidates; they do not make the final acceptance decision. Acceptance is decided by real measurement.
+
+### CandidateAction
+
+Every candidate must be a structured object:
+
+```json
+{
+  "candidate_id": "topology.bracket/tp:8->4/dp:1",
+  "family": "topology",
+  "operator": "bracket",
+  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
+  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
+  "targets": ["topology_efficiency", "ttft_prefill"],
+  "expected_metric_movement": {
+    "request_rate_per_gpu": "increase_or_same",
+    "ttft_p95": "may_increase",
+    "gpu_count": "decrease"
+  },
+  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
+  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
+  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
+  "risk": {
+    "launch": "low",
+    "regression": "medium",
+    "measurement_cost": "one_trial"
+  }
+}
+```
+
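+This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.
+
+### CandidateSet
+
+`CandidateSet` is the complete enumeration produced by the grammar/operators in the current state, not just the
+first few candidates the planner sees.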
+
+```text
+CandidateSet = {
+  grammar_version,
+  policy_version,
+  engine_capability_version,
+  state_hash,
+  candidates[],
+  blocked_candidates[],
+  candidate_set_hash
+}
+```
+
+Requirements:
+
+1. `OperatorRegistry.generate(state, grammar, policy)` must enumerate all eligible candidates.
+2. Every filtered candidate must go into `blocked_candidates` with a `blocked_reason`.
+3. The stop proof must reference `candidate_set_hash`.
+4. The same `HarnessState + Grammar + Policy + EngineCapability` must generate a byte-identical
+   `CandidateSet`.
+5. For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.
+
+This requirement is what keeps a "coverage-relative stop" from degenerating into "the current generator happened to produce no candidates."
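The snapshot requirements can be sketched as follows. This is one possible shape, assuming candidates are plain dicts with a `candidate_id`; `build_candidate_set` and the `block_reason` callback are hypothetical names, not the project's API.

```python
import hashlib
import json
from typing import Callable, Dict, List, Optional


def build_candidate_set(
    raw: List[dict],
    block_reason: Callable[[dict], Optional[str]],
    versions: Dict[str, str],
) -> dict:
    """Split raw candidates into eligible/blocked and attach a content hash (sketch)."""
    candidates, blocked = [], []
    for cand in sorted(raw, key=lambda c: c["candidate_id"]):  # stable order
        reason = block_reason(cand)
        if reason is None:
            candidates.append(cand)
        else:
            # Filtered candidates are recorded with a reason, never silently dropped.
            blocked.append({**cand, "blocked_reason": reason})
    snapshot = {**versions, "candidates": candidates, "blocked_candidates": blocked}
    payload = json.dumps(snapshot, sort_keys=True, separators=(",", ":")).encode("utf-8")
    snapshot["candidate_set_hash"] = hashlib.sha256(payload).hexdigest()
    return snapshot


# Toy inputs (made up): candidate "b" is blocked as an exact repeat.
raw = [{"candidate_id": "b"}, {"candidate_id": "a"}]
reason = lambda c: "exact_repeat" if c["candidate_id"] == "b" else None
snap1 = build_candidate_set(raw, reason, {"grammar_version": "g1"})
snap2 = build_candidate_set(list(reversed(raw)), reason, {"grammar_version": "g1"})
```

Sorting by `candidate_id` before hashing is what makes the snapshot independent of operator iteration order; a stop report can then cite `candidate_set_hash` as a stable reference.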
+
+### CoverageState
+
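+Coverage is the core of the new design. It records which parts of the mechanism space have been measured or refuted, not just the best config.
+
+Coverage units:
+
+| Unit | Meaning |
+| --- | --- |
+| `signature_tested` | the exact config patch has been measured |
+| `operator_covered` | an operator has been validated at least once on a given family/neighborhood |
+| `neighbor_covered` | the adjacent upper/lower point on an ordered lattice has been measured |
+| `trust_region_covered` | the local region of a runtime family on the incumbent topology has been measured |
+| `hypothesis_refuted` | the candidate's reject condition was met |
+| `region_invalidated` | failure memory rules out a whole region, not a single point |
+| `incumbent_validated` | the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions |
+
+A `CoverageUnit` cannot be a free-form string; it must be a structured key: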
+
+```json
+{
+  "unit_type": "neighbor_covered",
+  "family": "topology",
+  "axis": "topology.tp",
+  "operator": "bracket",
+  "anchor_region": {"tp": 8, "dp": 1},
+  "candidate_region": {"tp": 4, "dp": 1},
+  "bottleneck": "ttft_prefill",
+  "grammar_version": "..."
+}
+```
+
+Before stopping, `signature_tested` alone is not sufficient. Required coverage must include at least one non-trivial mechanism
+coverage unit, such as `neighbor_covered`, `trust_region_covered`, `region_invalidated`,
+or `incumbent_validated`.
+
+Coverage update:
+
+```text
+trial_result -> MeasuredVerdict -> CoverageDelta
+```
+
+Every measured trial must produce at least one `CoverageDelta`:
+
+- add `signature_tested`;
+- add family/operator coverage;
+- invalidate failure region;
+- update incumbent;
+- refute/confirm hypothesis.
+
+If a trial only consumes GPU time without changing any coverage or the incumbent, it should be marked
+`wasted_by_design`, which directly violates the acceptance criteria.
+
+More strictly:
+
+```text
+signature_tested alone is not enough.
+```
+
+A trial that only adds an exact signature, without confirming/rejecting a hypothesis, covering any
+operator/neighborhood/trust region, updating the incumbent, or producing a conservative failure
+invalidation, must be marked `non_informative_trial`.
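The `non_informative_trial` rule can be written as a small predicate. The `CoverageDelta` shape below is an assumption for illustration; in particular, hypothesis confirm/reject outcomes are assumed to be recorded as entries in `added_units`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverageDelta:
    """Hypothetical shape of a per-trial coverage delta."""
    added_units: List[str] = field(default_factory=list)  # unit_type values added by the trial
    invalidated_regions: List[str] = field(default_factory=list)
    incumbent_updated: bool = False


def is_non_informative(delta: CoverageDelta) -> bool:
    """True if the trial added nothing beyond an exact `signature_tested` unit."""
    mechanism_units = [u for u in delta.added_units if u != "signature_tested"]
    return not (mechanism_units or delta.invalidated_regions or delta.incumbent_updated)
```

A harness could run this check on every persisted delta and fail acceptance if any executed trial is flagged.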
+
+### Failure invalidation
+
+`region_invalidated` is the part most prone to overfitting or false invalidation, so it must be defined conservatively.
+
+A failure invalidation must output:
+
+```json
+{
+  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
+  "source_failure": "engine_launch_oom",
+  "covered_config_examples": [...],
+  "excluded_config_examples": [...],
+  "conservative_reason": "failure only observed at higher memory target on same topology",
+  "retry_condition": "lower gmu or different topology is tested",
+  "expires_after": "optional policy-defined TTL"
+}
+```
+
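+Extrapolating without bound from a single-point failure to a whole family is forbidden. For example, one `TP=8,gmu=0.99` OOM cannot directly
+invalidate all `TP=8` configs or all high-TP configs; at most it invalidates the conservative region with the same or higher memory pressure,
+unless later evidence enlarges that region.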
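One conservative reading of this rule, sketched in Python: a single OOM invalidates only configs on the same topology with an equal or higher memory target. The `OomRegion` class and the flat `tp`/`gmu` config keys are hypothetical, and restricting the region to the exact observed TP is an assumption of this sketch, narrower than the JSON example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OomRegion:
    """Region invalidated by one OOM: same TP, same or higher gmu (sketch)."""
    tp: int
    gmu: float

    def contains(self, config: dict) -> bool:
        # Conservative: only the observed topology, and only equal-or-higher memory targets.
        return config["tp"] == self.tp and config["gmu"] >= self.gmu


region = OomRegion(tp=8, gmu=0.99)
```

Lower `gmu` values on `TP=8` and every other topology stay eligible, which matches the retry condition that a lower gmu or a different topology may still be tested.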
+
+### CoverageValidator
+
+Stop must be defined relative to coverage, not relative to a Python guard cascade.
+
+Stop condition:
+
+```text
+stop_allowed iff
+  legality soundness holds
+  and candidate_set_snapshot is complete and persisted
+  and no candidate with harness_priority >= threshold remains uncovered
+  and incumbent has required validation coverage
+  and measurement budget/search-high condition is satisfied or no eligible candidate remains
+```
+
+The undefined notion of a `useful candidate` is deliberately avoided here. Whether a candidate is eligible is decided by the following structured conditions:
+
+```text
+eligible(candidate) iff
+  candidate is legal
+  and candidate is not exact-repeat
+  and candidate.coverage_unit is not already covered
+  and candidate.region is not invalidated
+  and candidate.required_evidence is satisfied
+  and candidate.harness_priority >= policy.min_candidate_priority
+```
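The eligibility conditions translate directly into a predicate. The flattened `Candidate` view below is hypothetical, and its string `coverage_unit` is only a stand-in for the structured key the design requires.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class Candidate:
    """Hypothetical flattened view of a candidate's eligibility inputs."""
    signature: str
    coverage_unit: str
    legal: bool
    region_invalidated: bool
    evidence_satisfied: bool
    harness_priority: float


def eligible(c: Candidate, tested: FrozenSet[str], covered: FrozenSet[str], min_priority: float) -> bool:
    """Conjunction of the six structured conditions."""
    return (
        c.legal
        and c.signature not in tested
        and c.coverage_unit not in covered
        and not c.region_invalidated
        and c.evidence_satisfied
        and c.harness_priority >= min_priority
    )


# Toy candidate (values are made up) used to exercise each condition.
base = Candidate("tp=4,dp=1", "neighbor_covered:tp:4", True, False, True, 0.6)
none = frozenset()
```

Because every condition is a field or a set lookup, a blocked candidate can always report which clause failed as its `blocked_reason`.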
+
+`required validation coverage` must also be declared in structured form. For example:
+
+```yaml
+required_validation:
+  incumbent:
+    - family: topology
+      unit_type: neighbor_covered
+      neighborhood: adjacent
+    - family: kv_memory
+      unit_type: trust_region_covered
+      when_axis_tunable: gpu-memory-utilization
+```
+
+The validator output must contain:
+
+```json
+{
+  "should_stop": true,
+  "stop_kind": "coverage_relative_stop",
+  "candidate_set_hash": "...",
+  "grammar_version": "...",
+  "policy_version": "...",
+  "engine_capability_version": "...",
+  "covered_units": [...],
+  "remaining_candidates": [],
+  "blocked_candidates": [
+    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
+  ],
+  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
+}
+```
+
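+The proof here is a relative proof:
+
+```text
+It does not prove that no better config exists in the real world;
+it proves that no uncovered high-priority experiment remains under the current grammar/operator/policy.
+```
+
+## Planner backend separation
+
+The harness produces the `CandidateSet`; the planner backend only selects.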
+
+```text
+CandidateSet = Grammar(HarnessState)
+decision = PlannerBackend.rank_or_select(CandidateSet)
+```
+
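+Planner backends can include:
+
+| Backend | Role |
+| --- | --- |
+| deterministic greedy | selects the candidate with the highest priority / score |
+| LLM ranker | reads the candidate set and evidence; may only select or combine legal candidates |
+| BO/bandit | learns candidate ranking (`backend_score`) over the intervention graph |
+| oracle replay | offline validation using existing grid/replay data |
+
+Key constraints:
+
+- the planner must not invent knobs outside the grammar;
+- the planner must not bypass the validator's stop decision;
+- the planner may only change ranking/selection, not legality/coverage;
+- given the same candidate set, a difference between LLM and BO is a backend difference, not a harness difference.
+
+Priority responsibility boundaries:
+
+- `harness_priority` belongs to the harness/policy; the validator uses it to decide whether a candidate is high-priority.
+- `backend_score` belongs to the planner backend; it is used to rank eligible candidates.
+- BO/bandit may learn `backend_score`, but cannot change `harness_priority`, coverage, or legality.
+- If learned results need to affect the stop threshold, they must enter `PolicyVersion` explicitly, and the
+  candidate set snapshot must be regenerated.
+
+Restrictions on LLM candidate combination:
+
+- by default, the LLM is not allowed to combine multiple candidates;
+- if combination is enabled, it must produce a new `CandidateAction`;
+- the combined patch must pass the legality validator again;
+- the combined coverage effect cannot simply be the union; it must be generated by the grammar's `joint_adjust`
+  operator;
+- combinations without a `joint_adjust` operator must not be executed.
+
+## Required implementation interfaces
+
+The upcoming code refactor must land these interfaces first, then migrate the existing heuristics.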
+
+```python
+class InterventionGrammar:
+    @classmethod
+    def load(cls, path: str) -> "InterventionGrammar": ...
+    def validate(self, capability: "EngineCapability") -> None: ...
+
+
+class OperatorRegistry:
+    def generate(
+        self,
+        state: "HarnessState",
+        grammar: "InterventionGrammar",
+        policy: "HarnessPolicy",
+    ) -> "CandidateSet": ...
+
+
+class InterventionOperator:
+    def apply(
+        self,
+        axis: "AxisSpec",
+        state: "HarnessState",
+        policy: "HarnessPolicy",
+    ) -> "CandidateAction | BlockedCandidate": ...
+
+
+class CoverageState:
+    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...
+
+
+class CoverageValidator:
+    def validate_stop(
+        self,
+        state: "HarnessState",
+        candidate_set: "CandidateSet",
+        coverage: "CoverageState",
+        policy: "HarnessPolicy",
+    ) -> "StopReport": ...
+
+
+class PlannerBackend:
+    def select(
+        self,
+        candidate_set: "CandidateSet",
+        state: "HarnessState",
+    ) -> "CandidateAction | StopRequest": ...
+```
+
+Required schemas:
+
+| Schema | Required fields |
+| --- | --- |
+| `CandidateSignature` | exact signature、normalized full config signature、partial patch signature、hash version |
+| `MeasuredVerdict` | candidate id、trial id、status、objective delta、confirm/reject/inconclusive/failure/regression flags、raw metric refs |
+| `CoverageDelta` | added units、invalidated regions、incumbent update、non-informative flag |
+| `BlockedCandidate` | candidate skeleton、blocked reason、blocking predicate、retry condition |
+| `StopReport` | candidate set hash、grammar/policy/capability versions、remaining/covered/blocked candidates、proof obligation |
+
+These schemas are part of the paper artifact. Without them, the experimental results cannot show that the harness is not a
+case-specific heuristic.
+
+## Anti-overfitting constraints
+
+### Forbidden
+
+The following must not appear in the harness core, grammar, or operators:
+
+- model names, e.g. Qwen30B, Qwen27B;
+- host names, e.g. dash0;
+- case ids, run ids, trace window ids;
+- known winning configs, e.g. "TP=2+gmu=0.97";
+- fixed SLO values used as control logic, e.g. "do X when TPOT=50ms";
+- special-case branches added in response to experimental results, e.g. "if current_tp > 2, force a downshift";
+- action ids or score boosts targeting a single regression test.
+
+### Allowed
+
+The following may exist, but their source and scope must be declared:
+
+- safe ranges in the engine capability, e.g. the `gpu-memory-utilization` nominal floor;
+- the legal lattice in the topology constraints, e.g. TP/DP/EP values;
+- measurement-cost / risk priors in the policy config;
+- the length/cache/arrival regime in the workload profile;
+- the SLO rule itself;
+- measured trial evidence.
+
+Priority/threshold provenance:
+
+Every priority term, threshold, and risk/cost prior must record:
+
+```text
+name
+default value
+source: engine docs / prior experiments / conservative default / calibrated profile
+scope: global / engine-family / engine-version / hardware-class
+sensitivity test
+last updated commit
+```
+
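+Adjusting a threshold for a single scenario or a single run is forbidden. Any threshold change must be accompanied by sensitivity
+tests and held-out scenario tests.
+
+### Boundary between rules and grammar
+
+Not every conditional is forbidden. What is forbidden is a testcase-specific branch; what is allowed is a generic predicate:
+
+| Forbidden | Allowed |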
+| --- | --- |
+| `if model == Qwen30B` | `if family == topology and axis.type == ordered_lattice` |
+| `if TP == 8 then TP=4` | `bracket(ordered_axis)` |
+| `if gmu == 0.5 then 0.9` | `jump_to_floor(bounded_axis)` |
+| `if TTFT4s/TPOT25 case` | `if bottleneck hypothesis targets TTFT/TPOT evidence` |
+| `score = 0.74 for this bad-start fix` | `priority = relief_prior + information_gain - risk - cost` from policy terms |
+
+## Provable properties
+
+### P1. Legality soundness
+
+For any generated candidate:
+
+```text
+candidate.patch satisfies:
+  tunable_envs/tunable_flags
+  topology constraints
+  engine capability safe constraints
+  no forbidden knob
+```
+
+How to test:
+
+- grammar-level property tests;
+- random StudySpec / topology constraints fuzzing;
+- all candidate patches pass `validate_proposal` equivalent checks.
+
+### P2. No-repeat
+
+For any eligible candidate:
+
+```text
+signature(candidate.patch) not in tested_signatures
+```
+
+If the exact signature has already been measured, the candidate must be filtered out or marked as already-covered.
+
+### P3. Coverage monotonicity
+
+After every measured trial:
+
+```text
+coverage_{t+1} >= coverage_t
+```
+
+Here `>=` means that covered units, invalidated regions, tested signatures, and incumbent evidence
+never decrease.
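The monotonicity property can be checked as a per-category subset relation. The `Coverage` alias and its category names are assumptions of this sketch; real coverage would hold structured keys rather than short strings.

```python
from typing import Dict, Set

# Hypothetical representation: category name -> set of structured keys.
Coverage = Dict[str, Set[str]]


def is_monotone(before: Coverage, after: Coverage) -> bool:
    """True if every category in `before` is a subset of the same category in `after`."""
    return all(units <= after.get(name, set()) for name, units in before.items())


# Toy coverage snapshots (keys are made up).
cov_t = {"covered_units": {"u1"}, "tested_signatures": {"s1"}}
cov_next = {"covered_units": {"u1", "u2"}, "tested_signatures": {"s1", "s2"}}
cov_bad = {"covered_units": {"u2"}, "tested_signatures": {"s1", "s2"}}
```

A property test could assert `is_monotone(coverage_t, coverage_t_plus_1)` after every simulated trial.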
+
+### P4. Coverage-relative stop soundness
+
+If the validator outputs stop:
+
+```text
+for all candidate in persisted CandidateSet(grammar, state, policy):
+  candidate.harness_priority < threshold
+  or candidate.coverage_unit already covered
+  or candidate.region invalidated
+  or candidate violates constraints
+```
+
+This is soundness relative to the grammar, not global optimality.
+
+Additional premises:
+
+- the CandidateSet must be a complete enumeration;
+- the CandidateSet snapshot must be persisted;
+- the StopReport must contain `candidate_set_hash`;
+- required coverage cannot be satisfied by `signature_tested` alone.
+
+### P5. Backend independence
+
+Given the same `HarnessState + Grammar + Policy`:
+
+```text
+CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.
+```
+
+Different backends may only change `backend_score`, candidate ranking, or tie-breaking.
+
+### P6. Auditability
+
+Every executed trial must be traceable through:
+
+```text
+proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta
+```
+
+If any link is missing, the trial cannot be used as paper evidence.
+
+## Properties that cannot be proven
+
+The paper must explicitly avoid claiming:
+
+- global optimum;
+- arbitrary bad-start completeness;
+- arbitrary workload/SLO generalization;
+- engine-independent optimality;
+- bottleneck classifier correctness;
+- stop implies no better real config.
+
+What can be said:
+
+```text
+Within the declared intervention grammar and measurement budget, AITuner provides
+coverage-relative experimental control and empirically improves convergence on the
+reported workloads.
+```
+
+## Acceptance criteria
+
+This section is the strictest gate. Until these criteria are met, the new harness should not be treated as a paper contribution.
+
+### A. Design acceptance
+
+1. The harness core may not gain new testcase-specific branches.
+2. All knob behavior must come from the grammar/operator/policy config.
+3. Magic constants must be centralized in `HarnessPolicy` or the engine capability, with their source documented.
+4. The candidate generator must output `CandidateAction`s that include the hypothesis, operator,
+   confirm/reject condition, and coverage effect.
+5. The stop validator must output a coverage-relative proof obligation.
+6. The planner backend cannot bypass the candidate set or the stop validator.
+7. The CandidateSet must be fully enumerated, persisted as a snapshot, and referenced by the stop report.
+8. `harness_priority` and `backend_score` must be separated.
+9. `CoverageUnit` must be structured, and required coverage cannot rely on the exact signature alone.
+10. Failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions.
+
+### B. Code acceptance
+
+1. `harness.py` stops growing; the new implementation is split into:
+   - `harness_state.py`
+   - `intervention_grammar.py`
+   - `operators.py`
+   - `coverage.py`
+   - `planner_backend.py`
+   - `harness_compat.py` or equivalent adapter.
+2. `CandidateAction`, `CoverageState`, `CoverageDelta`, and `HarnessPolicy` use dataclasses
+   or typed schemas.
+3. The logic of the current `_topology_candidate_actions()` and `_runtime_candidate_actions()` must be migrated
+   to the operator registry.
+4. Every candidate patch must pass the legality validator.
+5. Every executed harness proposal must persist its candidate id and coverage delta.
+6. The current regression tests may be kept as scenario tests, but must not be fixed by adding hardcoded
+   branches.
+7. The StopReport must contain `candidate_set_hash`, the grammar/policy/capability versions, and the
+   remaining/covered/blocked candidates.
+8. An inconclusive measurement must not erroneously produce `hypothesis_refuted` or `region_invalidated`.
+
+### C. Static anti-overfitting acceptance
+
+CI or tests must check that:
+
+1. The harness core contains no model name / run id / host name / known winning config strings.
+2. The harness core contains no case-specific action ids.
+3. No score boost targets a single scenario test.
+4. Grammar files reference only family/operator/axis, never a testcase.
+5. When a new scenario failure appears, the PR must state whether it adds to the grammar/operators or adjusts the policy config;
+   patching in a testcase branch directly is not allowed.
+6. Policy/config/test fixtures are also scanned for known winning configs and scenario-specific score boosts.
+7. Any new or modified policy threshold must include a sensitivity test.
+8. After adding an operator/policy, held-out scenarios must be run; passing only the scenario that triggered the change is not enough.
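The first static check could start as a simple deny-list scan over harness source text. This is a sketch only: the two patterns are illustrative assumptions built from the examples in the forbidden list, and a real deny-list would need to be maintained alongside the CI config and extended to run ids and known winning configs.

```python
import re
from typing import List, Tuple

# Hypothetical deny-list derived from the document's own examples.
FORBIDDEN_PATTERNS = [
    re.compile(r"qwen\d+b", re.IGNORECASE),     # model names such as Qwen30B
    re.compile(r"\bdash\d+\b", re.IGNORECASE),  # host names such as dash0
]


def scan_source(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, matched_text) for every forbidden token in `text`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in FORBIDDEN_PATTERNS:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0)))
    return hits


bad_source = "x = 1\nif model == 'Qwen30B':\n    host = 'dash0'\n"
good_source = "if axis.type == 'ordered_lattice':\n    pass\n"
```

A CI job would call `scan_source` on each harness core file and fail when the result is non-empty.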
+
+### D. Unit test acceptance
+
+The following grammar-level tests must be added:
+
+1. Legal candidate generation under random topology constraints.
+2. No-repeat property.
+3. Coverage monotonicity.
+4. Stop iff no uncovered high-priority candidate.
+5. Failure memory invalidates regions, not just exact signatures.
+6. Backend independence: greedy and mock-LLM rankers see the same candidate set.
+7. Bad-start property tests:
+   - for boundary starts on an ordered lattice, `bracket` generates an adjacent contrast candidate;
+   - for below-floor starts on a bounded numeric axis, `jump_to_floor` generates a floor candidate;
+   - tests must not hardcode a specific `TP=8 -> TP=4` or `0.5 -> 0.9` as the only condition; they should be parameterized.
+8. Candidate set determinism: the same input produces a byte-identical snapshot.
+9. Candidate set completeness on toy grammars: exhaustive enumeration on a small lattice shows the generator misses nothing.
+10. Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
+11. Inconclusive measurement: timeout/noisy/partial runs should not erroneously refute a hypothesis.
+12. Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
+13. Stop falsification: for a history that has already stopped, use a local grid/replay to check whether any in-grammar
+    high-priority candidate was missed.
+
+### E. Experiment acceptance
+
+Paper evidence must include at least:
+
+1. **Planner-agnostic ablation**
+   - raw greedy / BO vs grammar-guided greedy / BO;
+   - weak LLM naive vs weak LLM + harness;
+   - strong LLM naive vs strong LLM + harness.
+
+2. **Mechanism ablation**
+   - full grammar;
+   - no attribution;
+   - shuffled attribution;
+   - no grammar, raw knobs;
+   - no coverage validator;
+   - no failure memory.
+
+3. **Bad-start robustness**
+   - multiple random/adversarial starts;
+   - not just a run of `TP=8,gmu=0.5,max-num-seqs=8`;
+   - report the convergence distribution, not a single successful path.
+
+4. **Near-optimum check**
+   - at least one case with a local grid/expert comparison;
+   - report the gap between the AITuner best and the local grid best.
+
+5. **Cross-regime check**
+   - at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.
+
+### F. Paper claim acceptance
+
+Until the experiments above pass, the paper can only say:
+
+```text
+The current implementation demonstrates the feasibility of a deterministic,
+mechanism-guided proposal loop on selected cases.
+```
+
+After passing design and code acceptance, it can say:
+
+```text
+AITuner implements a planner-agnostic intervention grammar with coverage-relative
+stop authority.
+```
+
+Only after passing experiment acceptance can it say:
+
+```text
+The harness improves fixed-budget tuning and convergence robustness across the
+reported regimes, and the gain is not attributable solely to a stronger LLM or
+case-specific prompt engineering.
+```
+
+It still cannot say:
+
+```text
+The harness is complete over all configs.
+The harness guarantees global optimum.
+The harness is robust to arbitrary workloads/engines.
+```
+
+## Migration plan
+
+### Phase 0: Freeze rule growth
+
+- Stop adding new testcase-specific rules to `harness.py`.
+- File every new bad case first as a missing grammar/operator issue.
+- Mark the current rule-based harness as a prototype in the roadmap.
+
+### Phase 1: Typed state and candidate schema
+
+- Extract `TrialProfile`, `BottleneckHypothesis`, `CandidateAction`, `CoverageState`, and
+  `HarnessPolicy`.
+- Also extract `AxisSpec`, `OperatorSpec`, `CandidateSet`, `CoverageUnit`,
+  `CoverageDelta`, `MeasuredVerdict`, `BlockedCandidate`, and `StopReport`.
+- Keep existing behavior compatible; change only the structure.
+- Golden tests use the current outputs to guard against accidental regressions.
+
+### Phase 2: Operator registry
+
+- Migrate the topology, kv_memory, admission, and batching families first.
+- Each operator is responsible for generating its candidates and coverage effects.
+- The old `_topology_candidate_actions()` and `_runtime_candidate_actions()` become compat wrappers.
+
+### Phase 3: Coverage validator
+
+- Implement coverage accounting.
+- StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
+- Add coverage-relative stop tests.
+
+### Phase 4: Planner backend split
+
+- The greedy planner is the default backend.
+- The LLM backend can only rank/select candidates.
+- BO/bandit backends can be plugged in later.
+
+### Phase 5: Experiment rerun
+
+- rerun no-LLM Qwen30B;
+- rerun the 2x2;
+- run the bad-start distribution;
+- run the mechanism ablation;
+- run the local grid/expert comparison.
+
+## The key criterion
+
+If a new failure can only be fixed by:
+
+```text
+adding an if-else in the planner that targets a specific config/workload
+```
+
+then it is not a harness contribution.
+
+If a new failure is fixed by:
+
+```text
+adding or correcting a generic operator / coverage predicate / engine capability declaration,
+where the change improves a class of parameterized tests
+```
+
+only then can it possibly be a harness contribution.
+
+This criterion should gate all subsequent PRs and interpretations of experiments.
--- a/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
+++ b/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
@@ -1,5 +1,13 @@
 # No-LLM Harness Mechanism - 2026-06-25

+Status note, 2026-06-26:
+
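+This document records the no-LLM mechanism of the current rule-based prototype harness and the experimental observations so far. It shows that
+AITuner can run closed-loop without an LLM endpoint, but it does not establish the harness's completeness,
+general robustness, or final systems contribution. The final target design has changed to a declarative intervention
+grammar + coverage-relative validator; see
+[`declarative-intervention-harness-design-20260626.md`](declarative-intervention-harness-design-20260626.md).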
+
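 This document answers one core question: without calling an LLM, why can the harness still find a config automatically?

 The conclusion first: no-LLM mode does not mean "no planner." The current harness is itself a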