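# Declarative Intervention Harness Design - 2026-06-26

This document redefines the AITuner harness. The goal is to avoid turning the harness into a pile of testcase-oriented `if/else` tuning rules, and to state explicitly which properties can be proven, which claims must be validated by experiment, and which claims cannot be made.

The conclusion up front:

```text
The harness's contribution is not "expert rules know how to tune vLLM".
The harness's contribution should be:
  turning raw knob optimization into declarative, coverage-aware experimental design.
```

In other words, the harness should not hardcode the next answer. It should define:

- which state is observable;
- which interventions are legal;
- which system hypothesis each intervention tests;
- which regions have already been measured, refuted, or covered;
- when the planner may still choose further experiments;
- when a stop is coverage-relative sound.

The planner can be an LLM, BO, a bandit, a greedy heuristic, or oracle replay. The harness is the substrate shared by these planners, not a prompt trick for one particular planner.

## Independent review verdict

The first draft of this document was independently reviewed by a fresh subagent. The verdict:

```text
accept with major revisions
```

The core warning from the review: moving rules out of Python `if/else` and into a grammar, operator, policy, or priority may still be nothing more than a "reskinned rule-based harness". The design must therefore additionally satisfy:

- the candidate set must be fully enumerated and persisted as a snapshot;
- every priority/threshold must have provenance, a version, and a sensitivity test;
- coverage units must be structured and cannot rely on `signature_tested` alone;
- failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions;
- the stop proof must reference the candidate set snapshot;
- grammar/policy/capability must also pass anti-overfitting static checks;
- LLM/BO may only select candidates and must not covertly change legality, coverage, or stop authority.

The design below incorporates these major revisions as hard requirements.

## 2026-06-26 adversarial status

We attacked the current production harness with the bad-start case `TP=8, gmu=0.5, max-num-seqs=8`. The current stop guard fires `search_high_saturated_by_incumbent` right after the baseline, without generating any topology/resource-efficiency contrast.

This shows that the current implementation is not yet the final contribution. The detailed counterexample is in [Bad-start stop counterexample](bad-start-stop-counterexample-20260626.md).

The counterexample adds one hard constraint to this design:

```text
search_high_saturated_by_incumbent may be measurement evidence, but it cannot
bypass candidate-set coverage when topology/resource efficiency remains tunable.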
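```

The new CoverageValidator must therefore first prove that no uncovered high-priority candidate remains before it may authorize a stop. For the `req/s/GPU` objective, an uncovered topology/resource-efficiency contrast must block the stop, unless the StudySpec explicitly fixes the topology/GPU budget.

## Current problems

The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation, bottleneck hypotheses, candidate actions, validator stop. The implementation, however, is still heavily rule-based:

- observation, bottleneck attribution, candidate generation, scoring, and the stop validator all live in a single Python file;
- knob-family behavior is decided by Python branches and fixed thresholds;
- scoring uses hand-picked constants;
- the stop validator is a guard cascade, not an explicit coverage proof;
- every newly discovered bad case invites yet another branch, for example a bad high-TP start or a low-gmu start.

An implementation like this can serve as an engineering prototype, but it cannot be a systems contribution. At most it proves:

```text
We wrote a set of heuristics that work for some cases we have already seen.
```

It cannot prove that:

- the harness is general;
- the harness is robust to an arbitrary bad start;
- no untested, better config exists at stop time;
- the candidate generator covers every important intervention;
- AITuner's gain does not come from testcase-specific rule accumulation.

Going forward, we therefore cannot keep making progress by "adding a rule for each failure mode".

## Design goals

The new harness must meet five goals.

1. **Declarative**: System knowledge must exist as grammar, operators, coverage predicates, and policy config, not scattered across Python `if/else`.
2. **Planner-agnostic**: The harness owns legality, candidate construction, coverage accounting, and stop authority; the planner only ranks or selects within the legal candidate set.
3. **Coverage-aware**: Stop does not mean "no rule fired". Stop must have explicit coverage semantics relative to the current grammar and operator set.
4. **Anti-overfitting**: The design must not contain a model name, case id, known winning config, fixed workload id, or constants taken from experiment results. A case may enter the system only through `StudySpec`, engine capability, workload profile, and trial measurements.
5. **Auditable**: Every proposal must be able to answer:

```text
Which hypothesis does this trial test?
Which intervention dimension does it change?
What is its confirm/reject condition?
How does coverage change after the trial ends?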
```

## Non-goals

We do not pursue, and should not claim, any of the following:

- global optimum;
- completeness over an arbitrary config space;
- robustness to an arbitrary serving engine/workload/SLO;
- a bottleneck classifier that is always correct;
- that no better config exists in the real world at stop time.

What we pursue is a more rigorous, provable relative property:

```text
Under the declared grammar, operator set, engine constraints, and measurement budget,
the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.
```

## New architecture

```text
StudySpec + EngineCapability + WorkloadProfile + TrialHistory
        |
        v
   HarnessState
        |
        v
InterventionGrammar
        |
        v
OperatorRegistry -----> CandidateSet
        |                    |
        v                    v
 CoverageState <------ MeasuredVerdict
        |
        v
CoverageValidator
        |
        v
 PlannerBackend
        |
        v
 Proposal or Stop
```

Versioned objects:

| Object | Must be versioned | Outputs it appears in |
| --- | --- | --- |
| `GrammarVersion` | yes | candidate id, candidate set snapshot, coverage delta, stop report |
| `PolicyVersion` | yes | candidate priority, blocked reason, stop threshold, stop report |
| `EngineCapabilityVersion` | yes | axis domain, safe range, candidate patch, failure invalidation |
| `PlannerBackendVersion` | yes | selected candidate, ranking trace |

A candidate or stop without version numbers cannot be used as paper evidence.

### HarnessState

`HarnessState` is the only state entry point the planner sees. It consists of structured data:

| Field | Source | Purpose |
| --- | --- | --- |
| `study` | `StudySpec` | tunable schema, topology constraints, SLO, search budget |
| `engine_capability` | static profile / engine adapter | safe ranges, knob lattice, unsupported combinations |
| `workload_profile` | trace summary / L-C-A profile | length/cache/arrival regime |
| `trial_profiles` | measured results | latency, SLO failures, throughput, launch failures |
| `incumbent` | state best | current best config and measured objective |
| `tested_signatures` | trial history | no-repeat and coverage accounting |
| `failure_memory` | launch/runtime failures | invalidate region or operator dimensions |

Important constraints:

- `HarnessState` contains no natural-language, prompt-only state;
- every field must be serializable, snapshottable, and replayable;
- the same `HarnessState + Grammar + Policy` must produce the same candidate set.

### InterventionGrammar

The grammar describes the serving-tuning intervention
space. It does not select candidates; it only declares:

- family;
- axes;
- legal value source;
- generic operators;
- expected effect schema;
- risk schema;
- coverage dimensions;
- required evidence.

Suggested initial grammar:

| Family | Axes | Operators | Coverage dimension |
| --- | --- | --- | --- |
| `topology` | TP, DP, EP, EP enable | `step_up`, `step_down`, `redistribute`, `bracket` | topology lattice neighborhood |
| `kv_memory` | `gpu-memory-utilization` | `jump_to_floor`, `local_climb`, `backoff_after_failure` | runtime trust region on incumbent topology |
| `admission` | `max-num-seqs` | `raise`, `lower`, `bracket` | sequence concurrency region |
| `batching` | `max-num-batched-tokens`, chunked prefill | `raise`, `lower`, `joint_adjust` | prefill/decode batching region |
| `allocator` | block size / cache knobs | `switch_categorical`, `bracket` | memory layout region |
| `failure` | failed signatures/regions | `invalidate_region`, `avoid_region` | failure memory coverage |

Grammar example:

```yaml
family: kv_memory
axes:
  gpu-memory-utilization:
    type: bounded_float
    value_source: engine_capability
    nominal_floor_key: gmu_nominal_floor
    safe_ceiling_key: gmu_safe_ceiling
operators:
  - name: jump_to_floor
    preconditions:
      - axis_below_nominal_floor
      - incumbent_topology_preserved
    coverage_effect: covers_runtime_floor
  - name: local_climb
    preconditions:
      - axis_at_or_above_nominal_floor
      - no_failed_higher_target
    coverage_effect: extends_runtime_trust_region
```

Nothing here says `0.5 -> 0.9`. The `0.9` comes from the engine capability/policy config, and `jump_to_floor` is a generic operator on a bounded numeric axis.

Topology example:

```yaml
family: topology
axes:
  tensor-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
  data-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
operators:
  - name: step_up
    coverage_effect: covers_upper_neighbor
  - name: step_down
    coverage_effect: covers_lower_neighbor
  - name: bracket
    preconditions:
      - anchor_at_boundary_or_unknown_region
      - neighbor_uncovered
    coverage_effect: brackets_ordered_axis
  - name: redistribute
    coverage_effect: covers_same_product_tp_dp_neighbor
```

Nothing here says `TP=8 -> TP=4`. It is a `bracket` on an ordered lattice.

### AxisSpec and OperatorSpec

The grammar must be expressed as a typed schema, not as free text.

`AxisSpec`:

| Field | Meaning |
| --- | --- |
| `axis_id` | stable id, e.g. `topology.tp` |
| `knob_keys` | knobs affected when this axis is lowered to a config patch |
| `type` | `ordered_lattice`, `bounded_float`, `bounded_int`, `categorical`, `coupled` |
| `domain_source` | `topology_constraints`, `engine_capability`, `policy_config`, `study_spec` |
| `domain` | legal values or range; must be enumerable or discretizable |
| `order` | total/partial order for an ordered axis |
| `coupling` | legality constraints shared with other axes |
| `safe_region` | floor/ceiling/default/unsupported regions, with provenance |

`OperatorSpec`:

| Field | Meaning |
| --- | --- |
| `operator_id` | stable id, e.g. `topology.bracket` |
| `family` | topology / kv_memory / admission / batching / allocator / failure |
| `input_axes` | axes the operator acts on |
| `preconditions` | generic predicates; must reference state/axis/evidence, never a testcase |
| `patch_fn` | builds a candidate patch from the state and the axis value |
| `coverage_effects` | the structured `CoverageUnit`s produced |
| `required_evidence` | evidence required before the operator can be enabled |
| `risk_model` | source and version of the risk/cost terms |
| `confirm_reject_schema` | how a measured verdict updates coverage |

Any new operator must first pass schema validation and toy-lattice completeness tests.

### Generic Operators

An operator is a reusable action template. An operator knows nothing about the model name, case id, or a known winner.

Base operators that must be supported:

| Operator | Input | Output | Purpose |
| --- | --- | --- | --- |
| `step_up(axis)` | ordered axis | adjacent higher value | probe the upper frontier |
| `step_down(axis)` | ordered axis | adjacent lower value | probe the lower frontier |
| `bracket(axis)` | ordered axis + boundary/trust-region state | adjacent contrast point | build a local bracket from a biased starting point |
| `jump_to_floor(axis)` | bounded axis + nominal floor | floor target | return from an illegal/inefficient range to the safe operating range |
| `local_climb(axis)` | bounded axis + step policy | next value | local hill climbing within the trust region |
| `backoff(axis)` | failed higher target | lower safe value | handle launch/OOM/regression |
| `redistribute(axis_a, axis_b)` | coupled axes + product constraints | legal coupled patch | TP/DP/EP redistribution |
| `joint_adjust(axes)` | coupled runtime axes | legal joint patch | MBT/MNS interaction |
| `preserve(dimensions)` | incumbent | complete patch context | keep the topology/runtime anchor |
| `block_if_seen(signature)` | tested signatures | reject candidate | no-repeat |
| `invalidate_region(failure)` | failure profile | region predicate | avoid repeating failed regions |

An operator only constructs candidates; it does not make the final acceptance decision. Acceptance is decided by real measurement.

### CandidateAction

Every candidate must be a structured object:

```json
{
  "candidate_id": "topology.bracket/tp:8->4/dp:1",
  "family": "topology",
  "operator": "bracket",
  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
  "targets": ["topology_efficiency", "ttft_prefill"],
  "expected_metric_movement": {
    "request_rate_per_gpu": "increase_or_same",
    "ttft_p95": "may_increase",
    "gpu_count": "decrease"
  },
  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
  "risk": {
    "launch": "low",
    "regression": "medium",
    "measurement_cost": "one_trial"
  }
}
```

This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.

### CandidateSet

`CandidateSet` is the complete enumeration produced by the grammar/operators in the current state, not the first few candidates shown to the planner.

```text
CandidateSet = {
  grammar_version,
  policy_version,
  engine_capability_version,
  state_hash,
  candidates[],
  blocked_candidates[],
  candidate_set_hash
}
```

Requirements:

1. `OperatorRegistry.generate(state, grammar, policy)` must enumerate all eligible candidates.
2. Every filtered candidate must go into `blocked_candidates` with a `blocked_reason`.
3. The stop proof must reference `candidate_set_hash`.
4. The same `HarnessState + Grammar + Policy + EngineCapability` must generate a byte-identical `CandidateSet`.
5. For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.

This is the key safeguard that keeps a "coverage-relative stop" from degenerating into "the current generator happened not to produce any candidate".

### CoverageState

Coverage is the core of the new design. It records which parts of the mechanism space have been measured or refuted, not only the best config.

Coverage units:

| Unit | Meaning |
| --- | --- |
| `signature_tested` | the exact config patch has been tested |
| `operator_covered` | an operator has been validated at least once on some family/neighborhood |
| `neighbor_covered` | the adjacent upper/lower point on an ordered lattice has been tested |
| `trust_region_covered` | the local region of a runtime family on the incumbent topology has been tested |
| `hypothesis_refuted` | the candidate's reject condition was satisfied |
| `region_invalidated` | failure memory rules out a whole region, not a single point |
| `incumbent_validated` | the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions |

A `CoverageUnit` cannot be a free-form string; it must be a structured key:

```json
{
  "unit_type": "neighbor_covered",
  "family": "topology",
  "axis": "topology.tp",
  "operator": "bracket",
  "anchor_region": {"tp": 8, "dp": 1},
  "candidate_region": {"tp": 4, "dp": 1},
  "bottleneck": "ttft_prefill",
  "grammar_version": "..."
}
```

A stop cannot rely on `signature_tested` alone. Required coverage must include at least one non-trivial mechanism coverage unit, such as `neighbor_covered`, `trust_region_covered`, `region_invalidated`, or `incumbent_validated`.

Coverage update:

```text
trial_result -> MeasuredVerdict -> CoverageDelta
```

Every measured trial must produce at least one `CoverageDelta`:

- add `signature_tested`;
- add family/operator coverage;
- invalidate a failure region;
- update the incumbent;
- refute/confirm a hypothesis.

If a trial only burns GPU time without changing any coverage or the incumbent, it should be marked `wasted_by_design`, which directly violates the acceptance criteria.

More strictly:

```text
signature_tested alone is not enough.
```

A trial that only adds an exact signature, without confirming/rejecting a hypothesis, without covering any operator/neighborhood/trust region, without updating the incumbent, and without producing a conservative failure invalidation, must be marked `non_informative_trial`.

### Failure invalidation

`region_invalidated` is the part most prone to overfitting or over-pruning, so it must be defined conservatively.

A failure invalidation must output:

```json
{
  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
  "source_failure": "engine_launch_oom",
  "covered_config_examples": [...],
  "excluded_config_examples": [...],
  "conservative_reason": "failure only observed at higher memory target on same topology",
  "retry_condition": "lower gmu or different topology is tested",
  "expires_after": "optional policy-defined TTL"
}
```

Unbounded extrapolation from a single-point failure to an entire family is forbidden. For example, one OOM at `TP=8,gmu=0.99` cannot directly invalidate all of `TP=8` or all high TP values; at most it invalidates a conservative region with the same or higher memory pressure, unless later evidence widens that region.

### CoverageValidator

Stop must be defined relative to coverage, not relative to a Python guard cascade.

Stop condition:

```text
stop_allowed iff
  legality soundness holds
  and candidate_set_snapshot is complete and persisted
  and no candidate with harness_priority >= threshold remains uncovered
  and incumbent has required validation coverage
  and measurement budget/search-high condition is satisfied
      or no eligible candidate remains
```

The undefined term `useful candidate` is deliberately not used here. Whether a candidate is eligible is decided by these structured conditions:

```text
eligible(candidate) iff
  candidate is legal
  and candidate is not exact-repeat
  and candidate.coverage_unit is not already covered
  and candidate.region is not invalidated
  and candidate.required_evidence is satisfied
  and candidate.harness_priority >= policy.min_candidate_priority
```

`required validation coverage` must also be declared structurally. For example:

```yaml
required_validation:
  incumbent:
    - family: topology
      unit_type: neighbor_covered
      neighborhood: adjacent
    - family: kv_memory
      unit_type: trust_region_covered
      when_axis_tunable: gpu-memory-utilization
```

The validator output must contain:

```json
{
  "should_stop": true,
  "stop_kind": "coverage_relative_stop",
  "candidate_set_hash": "...",
  "grammar_version": "...",
  "policy_version": "...",
  "engine_capability_version": "...",
  "covered_units": [...],
  "remaining_candidates": [],
  "blocked_candidates": [
    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
  ],
  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
}
```

The proof here is a relative proof:

```text
It does not prove that no better config exists in the real world;
it proves that under the current grammar/operator/policy no uncovered high-priority experiment remains.
```

## Planner backend separation

The harness produces the `CandidateSet`; the planner backend only selects.

```text
CandidateSet = Grammar(HarnessState)
decision = PlannerBackend.rank_or_select(CandidateSet)
```

Planner backends may include:

| Backend | Role |
| --- | --- |
| deterministic greedy | picks the candidate with the highest priority / score |
| LLM ranker | reads the candidate set and evidence; may only select or combine legal candidates |
| BO/bandit | learns candidate priority over the intervention graph |
| oracle replay | offline validation using existing grid/replay data |

Key constraints:

- the planner must not invent knobs outside the grammar;
- the planner must not bypass the validator stop;
- the planner may only change ranking/selection, not legality/coverage;
- under the same candidate set, a difference between LLM and BO is a backend difference, not a harness difference.

Priority responsibilities:

- `harness_priority` belongs to the harness/policy and is used by the validator to decide whether a candidate is high-priority.
- `backend_score` belongs to the planner backend and is used to rank eligible candidates.
- BO/bandit may learn `backend_score`, but cannot change `harness_priority`, coverage, or legality.
- If learned results need to influence the stop threshold, they must enter `PolicyVersion` explicitly, and the candidate set snapshot must be regenerated.

Limits on LLM-combined candidates:

- by default the LLM is not allowed to combine multiple candidates;
- if combination is enabled, the combination must produce a new `CandidateAction`;
- the combined patch must pass the legality validator again;
- the combined coverage effect cannot simply be the union; it must be produced by a `joint_adjust` operator in the grammar;
- a combination without a `joint_adjust` operator must not be executed.

## Required implementation interfaces

The upcoming refactor must land these interfaces first, and only then migrate the existing heuristics.

```python
class InterventionGrammar:
    @classmethod
    def load(cls, path: str) -> "InterventionGrammar": ...

    # Checks the declared axes/operators against the engine capability.
    def validate(self, capability: "EngineCapability") -> None: ...


class OperatorRegistry:
    # Must enumerate every eligible candidate; filtered ones are recorded
    # as blocked candidates inside the returned CandidateSet.
    def generate(
        self,
        state: "HarnessState",
        grammar: "InterventionGrammar",
        policy: "HarnessPolicy",
    ) -> "CandidateSet": ...
class InterventionOperator:
    def apply(
        self,
        axis: "AxisSpec",
        state: "HarnessState",
        policy: "HarnessPolicy",
    ) -> "CandidateAction | BlockedCandidate": ...


class CoverageState:
    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...


class CoverageValidator:
    def validate_stop(
        self,
        state: "HarnessState",
        candidate_set: "CandidateSet",
        coverage: "CoverageState",
        policy: "HarnessPolicy",
    ) -> "StopReport": ...


class PlannerBackend:
    # Ranks or selects within the candidate set; never changes legality.
    def select(
        self,
        candidate_set: "CandidateSet",
        state: "HarnessState",
    ) -> "CandidateAction | StopRequest": ...
```

Required schemas:

| Schema | Required fields |
| --- | --- |
| `CandidateSignature` | exact signature, normalized full config signature, partial patch signature, hash version |
| `MeasuredVerdict` | candidate id, trial id, status, objective delta, confirm/reject/inconclusive/failure/regression flags, raw metric refs |
| `CoverageDelta` | added units, invalidated regions, incumbent update, non-informative flag |
| `BlockedCandidate` | candidate skeleton, blocked reason, blocking predicate, retry condition |
| `StopReport` | candidate set hash, grammar/policy/capability versions, remaining/covered/blocked candidates, proof obligation |

These schemas are part of the paper artifact. Without them, experiment results cannot show that the harness is not a case-specific heuristic.

## Anti-overfitting constraints

### Prohibited

The following must not appear in the harness core, grammar, or operators:

- model names, e.g. Qwen30B, Qwen27B;
- host names, e.g. dash0;
- case id, run id, trace window id;
- known winning configs, e.g. "TP=2+gmu=0.97";
- fixed SLO values used as control logic, e.g. "do X when TPOT=50ms";
- special-case branches added in response to experiment results, e.g. "if current_tp > 2, force a downshift";
- an action id or score boost aimed at a single regression test.

### Allowed

The following may exist, but must declare their source and scope:

- safe ranges in the engine capability, e.g. the `gpu-memory-utilization` nominal floor;
- the legal lattice in the topology constraints, e.g. TP/DP/EP values;
- measurement-cost / risk priors in the policy config;
- the length/cache/arrival regime in the workload profile;
- the SLO rule itself;
- measured trial evidence.

Priority/threshold provenance:

All priority terms, thresholds, and risk/cost priors
must record:

```text
name
default value
source: engine docs / prior experiments / conservative default / calibrated profile
scope: global / engine-family / engine-version / hardware-class
sensitivity test
last updated commit
```

Adjusting a threshold for a single scenario or a single run is forbidden. Changing a threshold requires running a sensitivity test and held-out scenario tests.

### Boundary between rules and grammar

Not every conditional is forbidden. What is forbidden is a testcase-specific branch. What is allowed is a generic predicate:

| Forbidden | Allowed |
| --- | --- |
| `if model == Qwen30B` | `if family == topology and axis.type == ordered_lattice` |
| `if TP == 8 then TP=4` | `bracket(ordered_axis)` |
| `if gmu == 0.5 then 0.9` | `jump_to_floor(bounded_axis)` |
| `if TTFT4s/TPOT25 case` | `if bottleneck hypothesis targets TTFT/TPOT evidence` |
| `score = 0.74 for this bad-start fix` | `priority = relief_prior + information_gain - risk - cost` from policy terms |

## Provable properties

### P1. Legality soundness

For any generated candidate:

```text
candidate.patch satisfies:
  tunable_envs/tunable_flags
  topology constraints
  engine capability safe constraints
  no forbidden knob
```

How to test:

- grammar-level property tests;
- random StudySpec / topology constraints fuzzing;
- all candidate patches pass checks equivalent to `validate_proposal`.

### P2. No-repeat

For any candidate:

```text
signature(candidate.patch) not in tested_signatures
```

If the exact signature has already been tested, the candidate must be filtered or marked already-covered.

### P3. Coverage monotonicity

After every measured trial:

```text
coverage_{t+1} >= coverage_t
```

Here `>=` means that covered units, invalidated regions, tested signatures, and incumbent evidence never decrease.

### P4. Coverage-relative stop soundness

If the validator outputs stop:

```text
for all candidate in persisted CandidateSet(grammar, state, policy):
  candidate.harness_priority < threshold
  or candidate.coverage_unit already covered
  or candidate.region invalidated
  or candidate violates constraints
```

This is soundness relative to the grammar, not global optimality.

Additional premises:

- the CandidateSet must be a complete enumeration;
- the CandidateSet snapshot must be persisted;
- the StopReport must contain `candidate_set_hash`;
- required coverage cannot be satisfied by `signature_tested` alone.

### P5. Backend independence

Given the same `HarnessState + Grammar + Policy`:

```text
CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.
```

Different backends may only change `backend_score`, candidate ranking, or the tie-break.

### P6. Auditability

Every executed trial must be traceable through:

```text
proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta
```

If any link is missing, the trial cannot be used as paper evidence.

## Properties that cannot be proven

The paper must explicitly avoid claiming:

- global optimum;
- arbitrary bad-start completeness;
- arbitrary workload/SLO generalization;
- engine-independent optimality;
- bottleneck classifier correctness;
- stop implies no better real config.

What can be said:

```text
Within the declared intervention grammar and measurement budget, AITuner provides
coverage-relative experimental control and empirically improves convergence on the
reported workloads.
```

## Acceptance criteria

This section is the strictest gate. Until these criteria pass, the new harness should not be treated as a paper contribution.

### A. Design acceptance

1. The harness core must not gain any new testcase-specific branch.
2. All knob behavior must come from grammar/operator/policy config.
3. Magic constants must be centralized in `HarnessPolicy` or the engine capability, with their source documented.
4. The candidate generator must output `CandidateAction` objects that include the hypothesis, operator, confirm/reject condition, and coverage effect.
5. The stop validator must output a coverage-relative proof obligation.
6. The planner backend cannot bypass the candidate set or the stop validator.
7. The CandidateSet must be fully enumerated, persisted as a snapshot, and referenced by the stop report.
8. `harness_priority` and `backend_score` must be separate.
9. `CoverageUnit` must be structured, and required coverage cannot rest on the exact signature alone.
10. Failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions.

### B. Code acceptance

1. `harness.py` stops growing; the new implementation is split into:
   - `harness_state.py`
   - `intervention_grammar.py`
   - `operators.py`
   - `coverage.py`
   - `planner_backend.py`
   - `harness_compat.py` or an equivalent adapter.
2. `CandidateAction`, `CoverageState`, `CoverageDelta`, and `HarnessPolicy` use dataclasses or a typed schema.
3. The logic of the current `_topology_candidate_actions()` and `_runtime_candidate_actions()` must be migrated to the operator registry.
4. Every candidate patch must pass the legality validator.
5. Every executed harness proposal must persist its candidate id and coverage delta.
6. The current regression tests may be kept as scenario tests, but must not be fixed by adding hardcoded branches.
7. The StopReport must contain `candidate_set_hash`, the grammar/policy/capability versions, and the remaining/covered/blocked candidates.
8. An inconclusive measurement must not wrongly produce `hypothesis_refuted` or `region_invalidated`.

### C. Static anti-overfitting acceptance

CI or tests must check that:

1. The harness core contains no model name / run id / host name / known winning config strings.
2. The harness core contains no case-specific action id.
3. No score boost targets a single scenario test.
4. Grammar files reference only family/operator/axis, never a testcase.
5. When a new scenario failure appears, the PR must state whether it adds to the grammar/operators or adjusts the policy config; adding a testcase branch directly is not allowed.
6. Policy, config, and test fixtures are also scanned for known winning configs and scenario-specific score boosts.
7. Any new or modified policy threshold must come with a sensitivity test.
8. After adding an operator/policy, held-out scenarios must be run; passing only on the scenario that triggered the change is not enough.

### D. Unit test acceptance

The following grammar-level tests must be added:

1. Legal candidate generation under random topology constraints.
2. No-repeat property.
3. Coverage monotonicity.
4. Stop iff no uncovered high-priority candidate.
5. Failure memory invalidates regions, not just exact signatures.
6. Backend independence: greedy and mock-LLM rankers see the same candidate set.
7. Bad-start property tests:
   - for boundary starts on an ordered lattice, `bracket` generates the adjacent contrast candidate;
   - for below-floor starts on a bounded numeric axis, `jump_to_floor` generates the floor candidate;
   - tests must not use a concrete `TP=8 -> TP=4` or `0.5 -> 0.9` as the only condition; they should be parameterized.
8. Candidate set determinism: the same input yields a byte-identical snapshot.
9. Candidate set completeness on toy grammars: exhaustively prove on a small lattice that the generator misses nothing.
10. Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
11. Inconclusive measurement: a timeout/noisy/partial run must not wrongly refute a hypothesis.
12. Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
13. Stop falsification: for a history that has already stopped, use local grid/replay to check whether a high-priority candidate inside the grammar was missed.

### E. Experiment acceptance

Paper evidence must include at least:

1. **Planner-agnostic ablation**
   - raw greedy / BO vs grammar-guided greedy / BO;
   - weak LLM naive vs weak LLM + harness;
   - strong LLM naive vs strong LLM + harness.
2. **Mechanism ablation**
   - full grammar;
   - no attribution;
   - shuffled attribution;
   - no grammar, raw knobs;
   - no coverage validator;
   - no failure memory.
3. **Bad-start robustness**
   - multiple random/adversarial starts;
   - not only `TP=8,gmu=0.5,max-num-seqs=8`;
   - report the convergence distribution rather than a single successful path.
4. **Near-optimum check**
   - at least one case with a local grid/expert comparison;
   - report the gap between the AITuner best and the local grid best.
5. **Cross-regime check**
   - at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.

### F. Paper claim acceptance

Before the experiments above pass, the paper can only say:

```text
The current implementation demonstrates the feasibility of a deterministic,
mechanism-guided proposal loop on selected cases.
```

After design and code acceptance pass, it can say:

```text
AITuner implements a planner-agnostic intervention grammar with coverage-relative
stop authority.
```

Only after experiment acceptance passes can it say:

```text
The harness improves fixed-budget tuning and convergence robustness across the
reported regimes, and the gain is not attributable solely to a stronger LLM or
case-specific prompt engineering.
```

It still cannot say:

```text
The harness is complete over all configs.
The harness guarantees global optimum.
The harness is robust to arbitrary workloads/engines.
```

## Migration plan

### Phase 0: Freeze rule growth

- Stop adding new testcase-specific rules to `harness.py`.
- File every new bad case first as a missing grammar/operator issue.
- Mark the current rule-based harness as a prototype in the roadmap.

### Phase 1: Typed state and candidate schema

- Extract `TrialProfile`, `BottleneckHypothesis`, `CandidateAction`, `CoverageState`, and `HarnessPolicy`.
- Also extract `AxisSpec`, `OperatorSpec`, `CandidateSet`, `CoverageUnit`, `CoverageDelta`, `MeasuredVerdict`, `BlockedCandidate`, and `StopReport`.
- Keep existing behavior compatible; change structure only.
- Golden tests pin the current outputs to prevent accidental regression.

### Phase 2: Operator registry

- Migrate the four families topology, kv_memory, admission, and batching first.
- Each operator is responsible for generating the candidate and its coverage effect.
- The old `_topology_candidate_actions()` and `_runtime_candidate_actions()` become compat wrappers.

### Phase 3: Coverage validator

- Implement coverage accounting.
- StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
- Add coverage-relative stop tests.

### Phase 4: Planner backend split

- The greedy planner is the default backend.
- The LLM backend may only rank/select candidates.
- A BO/bandit backend can be plugged in later.

### Phase 5: Experiment rerun

- rerun no-LLM Qwen30B;
- rerun 2x2;
- run the bad-start distribution;
- run the mechanism ablation;
- run the local grid/expert comparison.

## The most critical criterion

If a new failure can only be fixed like this:

```text
Add an if-else in the planner that targets a specific config/workload
```

then it is not a harness contribution.

If a new failure is fixed like this:

```text
Add or correct a generic operator / coverage predicate / engine capability declaration,
and the change improves a whole class of parameterized tests
```

only then can it be a harness contribution.

This criterion should serve as the gate for every subsequent PR and every interpretation of experiment results.
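
To make the distinction above concrete, here is a minimal sketch of what "a generic operator plus a class of parameterized tests" could look like for the `bracket` and `jump_to_floor` operators. Everything in it is an illustrative assumption rather than the actual AITuner code: the `Candidate` dataclass is a trimmed-down stand-in for `CandidateAction`, `Tested` is a simplified stand-in for `tested_signatures`, and `check_bad_start_properties` is a hypothetical helper. Note that no concrete value such as `TP=8` or `0.9` appears inside the operators themselves; values arrive only through the axis domain or the floor argument.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

# (axis_id, value) pairs already measured; a simplified, hypothetical
# stand-in for the document's tested_signatures.
Tested = FrozenSet[Tuple[str, float]]


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, trimmed-down stand-in for CandidateAction."""
    operator: str
    axis_id: str
    anchor: float
    target: float


def bracket(axis_id: str, domain: Sequence[float], anchor: float,
            tested: Tested) -> List[Candidate]:
    """Generic bracket: contrast a boundary anchor with its one neighbor."""
    values = sorted(domain)
    if len(values) < 2 or anchor not in values:
        return []
    idx = values.index(anchor)
    if 0 < idx < len(values) - 1:
        return []  # interior anchors are left to step_up / step_down
    neighbor = values[1] if idx == 0 else values[-2]
    if (axis_id, neighbor) in tested:
        return []  # no-repeat: the contrast point was already measured
    return [Candidate("bracket", axis_id, anchor, neighbor)]


def jump_to_floor(axis_id: str, current: float, nominal_floor: float,
                  tested: Tested) -> List[Candidate]:
    """Generic jump_to_floor: move a below-floor value up to the floor."""
    if current >= nominal_floor or (axis_id, nominal_floor) in tested:
        return []
    return [Candidate("jump_to_floor", axis_id, current, nominal_floor)]


def check_bad_start_properties() -> int:
    """Parameterized checks over several domains; returns cases checked."""
    checked = 0
    lattices = [[1, 2], [1, 2, 4, 8], [2, 4, 8, 16, 32]]
    for domain in lattices:
        # Both boundary starts must yield exactly the adjacent value.
        for anchor, expected in ((domain[0], domain[1]),
                                 (domain[-1], domain[-2])):
            got = bracket("axis", domain, anchor, frozenset())
            assert [c.target for c in got] == [expected]
            checked += 1
    for floor in (0.8, 0.9):
        # A candidate is produced only for starts strictly below the floor.
        for start in (0.3, 0.5, floor):
            got = jump_to_floor("axis", start, floor, frozenset())
            assert bool(got) == (start < floor)
            checked += 1
    return checked


if __name__ == "__main__":
    print(check_bad_start_properties(), "parameterized cases passed")
```

Under these assumptions, a regression on a new bad start would be addressed by changing the operator or its preconditions and re-running the whole parameterized family, never by adding a branch keyed on one specific config.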