gahow/aituner

Gahow Wang 4075c7abf0 Design declarative intervention harness

2026-06-26 17:15:06 +08:00

Declarative Intervention Harness Design - 2026-06-26

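This document redefines the AITuner harness. The goal is to keep the harness from turning into a pile of testcase-oriented if/else tuning rules, and to state explicitly which properties can be proven, which claims must be validated experimentally, and which claims cannot be made.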

The conclusion up front:

The harness's contribution is not "expert rules know how to tune vLLM".

The harness's contribution should be:
turning raw knob optimization into declarative, coverage-aware experimental design.

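In other words, the harness should not hardcode the next answer; it should define:

which states are observable;
which interventions are legal;
which system hypothesis each intervention tests;
which regions have already been measured, refuted, or covered;
when the planner may keep choosing experiments;
when a stop is coverage-relative sound.

The planner can be an LLM, BO, a bandit, a greedy heuristic, or oracle replay. The harness is the substrate these planners share, not a prompt trick for one particular planner.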

Independent review verdict

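The first draft of this document was independently reviewed by a fresh subagent. The verdict:

accept with major revisions

The core warning of the review: moving rules out of Python if/else and into a grammar, operator, policy, or priority can still amount to a "re-skinned rule-based harness". This design must therefore additionally satisfy:

the candidate set must be fully enumerated and persisted as a snapshot;
every priority/threshold must carry provenance, a version, and a sensitivity test;
coverage units must be structured and cannot rely on signature_tested alone;
failure invalidation must have a conservative region predicate, boundaries, and retry/unblock conditions;
the stop proof must reference the candidate set snapshot;
grammar/policy/capability must also pass anti-overfitting static checks;
the LLM/BO may only select candidates and must not silently change legality, coverage, or stop authority.

The design below incorporates these major revisions as hard requirements.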

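Current problems

The current src/aituner/harness.py already has some of the right abstract vocabulary: observation, bottleneck hypotheses, candidate actions, validator stop. But the implementation is still heavily rule-based:

observation, bottleneck attribution, candidate generation, scoring, and the stop validator all live in a single Python file;
the behavior of each knob family is decided by Python branches and fixed thresholds;
scoring uses hand-picked constants;
the stop validator is a guard cascade, not an explicit coverage proof;
every newly found bad case invites yet another branch, for example bad high-TP start or low-gmu start.

Such an implementation works as an engineering prototype but cannot be a systems contribution. At most it proves:

We wrote a set of heuristics that work for some cases we have already seen.

It cannot prove that:

the harness is general;
the harness is robust to arbitrary bad starts;
no untested better config exists at stop time;
the candidate generator covers all important interventions;
AITuner's gains do not come from testcase-specific rule accumulation.

Therefore, future work must not proceed by "patching a rule onto each failure mode".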

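Design goals

The new harness must meet five goals.

Declarative: system knowledge must live in the grammar, operators, coverage predicates, and policy config, not scattered across Python if/else.
Planner-agnostic: the harness owns legality, candidate construction, coverage accounting, and stop authority; the planner only ranks or selects within the legal candidate set.
Coverage-aware: stop does not mean "no rule fired". Stop must have explicit coverage semantics relative to the current grammar and operator set.
Anti-overfitting: the design must not contain a model name, case id, known winning config, fixed workload id, or experiment-result constant. A case may enter the system only through StudySpec, engine capability, workload profile, and trial measurements.
Auditable: every proposal must be able to answer:

Which hypothesis does this trial test?
Which intervention dimension does it change?
What is its confirm/reject condition?
How does coverage change after the trial ends?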

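Non-goals

We do not pursue, and should not claim, any of the following:

global optimum;
completeness over an arbitrary config space;
robustness to arbitrary serving engines/workloads/SLOs;
a bottleneck classifier that is always correct;
that no better config exists in the real world at stop time.

What we pursue is a stricter but provable relative property:

Under the declared grammar, operator set, engine constraints, and measurement budget,
the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.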

New architecture

StudySpec + EngineCapability + WorkloadProfile + TrialHistory
        |
        v
HarnessState
        |
        v
InterventionGrammar
        |
        v
OperatorRegistry -----> CandidateSet
        |                    |
        v                    v
CoverageState <------ MeasuredVerdict
        |
        v
CoverageValidator
        |
        v
PlannerBackend
        |
        v
Proposal or Stop

Versioned objects:

Object	Must be versioned	Outputs that record it
`GrammarVersion`	yes	candidate id, candidate set snapshot, coverage delta, stop report
`PolicyVersion`	yes	candidate priority, blocked reason, stop threshold, stop report
`EngineCapabilityVersion`	yes	axis domain, safe range, candidate patch, failure invalidation
`PlannerBackendVersion`	yes	selected candidate, ranking trace

A candidate/stop without version numbers cannot be used as paper evidence.

HarnessState

HarnessState is the only state entry point the planner sees. It consists of structured data:

Field	Source	Purpose
`study`	`StudySpec`	tunable schema, topology constraints, SLO, search budget
`engine_capability`	static profile / engine adapter	safe ranges, knob lattice, unsupported combinations
`workload_profile`	trace summary / L-C-A profile	length/cache/arrival regime
`trial_profiles`	measured results	latency, SLO failures, throughput, launch failures
`incumbent`	state best	current best config and measured objective
`tested_signatures`	trial history	no-repeat and coverage accounting
`failure_memory`	launch/runtime failures	invalidate region or operator dimensions

Important constraints:

HarnessState contains no natural-language, prompt-only state;
every field must be serializable, snapshottable, and replayable;
the same HarnessState + Grammar + Policy must produce the same candidate set.
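The serializability and replay constraints above can be illustrated with a canonical hash over the state. This is a minimal sketch, not the project's implementation: `ToyHarnessState` and `state_hash` are hypothetical names, and the state is reduced to three fields for brevity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyHarnessState:
    """Hypothetical, heavily reduced stand-in for HarnessState."""
    incumbent: dict
    tested_signatures: tuple = ()
    failure_memory: tuple = ()


def state_hash(state: ToyHarnessState) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that dict insertion
    # order cannot leak into the hash.
    payload = json.dumps(asdict(state), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two states with the same content but different key order hash identically.
a = ToyHarnessState(incumbent={"tp": 8, "dp": 1}, tested_signatures=("tp=8,dp=1",))
b = ToyHarnessState(incumbent={"dp": 1, "tp": 8}, tested_signatures=("tp=8,dp=1",))
assert state_hash(a) == state_hash(b)
```

A hash like this is one way to make "same state in, same candidate set out" checkable in a replay test.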

InterventionGrammar

The grammar describes the intervention space of serving tuning. It does not select candidates; it only declares:

family;
axes;
legal value source;
generic operators;
expected effect schema;
risk schema;
coverage dimensions;
required evidence.

Suggested initial grammar:

Family	Axes	Operators	Coverage dimension
`topology`	TP, DP, EP, EP enable	`step_up`, `step_down`, `redistribute`, `bracket`	topology lattice neighborhood
`kv_memory`	`gpu-memory-utilization`	`jump_to_floor`, `local_climb`, `backoff_after_failure`	runtime trust region on incumbent topology
`admission`	`max-num-seqs`	`raise`, `lower`, `bracket`	sequence concurrency region
`batching`	`max-num-batched-tokens`, chunked prefill	`raise`, `lower`, `joint_adjust`	prefill/decode batching region
`allocator`	block size / cache knobs	`switch_categorical`, `bracket`	memory layout region
`failure`	failed signatures/regions	`invalidate_region`, `avoid_region`	failure memory coverage

Grammar example:

family: kv_memory
axes:
  gpu-memory-utilization:
    type: bounded_float
    value_source: engine_capability
    nominal_floor_key: gmu_nominal_floor
    safe_ceiling_key: gmu_safe_ceiling
operators:
  - name: jump_to_floor
    preconditions:
      - axis_below_nominal_floor
      - incumbent_topology_preserved
    coverage_effect: covers_runtime_floor
  - name: local_climb
    preconditions:
      - axis_at_or_above_nominal_floor
      - no_failed_higher_target
    coverage_effect: extends_runtime_trust_region

Nothing here says 0.5 -> 0.9. The 0.9 comes from the engine capability/policy config, and jump_to_floor is a generic operator on a bounded numeric axis.

Topology example:

family: topology
axes:
  tensor-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
  data-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
operators:
  - name: step_up
    coverage_effect: covers_upper_neighbor
  - name: step_down
    coverage_effect: covers_lower_neighbor
  - name: bracket
    preconditions:
      - anchor_at_boundary_or_unknown_region
      - neighbor_uncovered
    coverage_effect: brackets_ordered_axis
  - name: redistribute
    coverage_effect: covers_same_product_tp_dp_neighbor

Nothing here says TP=8 -> TP=4. It is a bracket on an ordered lattice.

AxisSpec and OperatorSpec

The grammar must be expressed as a typed schema, not just free text.

AxisSpec:

Field	Meaning
`axis_id`	stable id, e.g. `topology.tp`
`knob_keys`	knobs affected when this axis is lowered to a config patch
`type`	`ordered_lattice`, `bounded_float`, `bounded_int`, `categorical`, `coupled`
`domain_source`	`topology_constraints`, `engine_capability`, `policy_config`, `study_spec`
`domain`	legal values or range; must be enumerable or discretizable
`order`	total/partial order for an ordered axis
`coupling`	legality constraints with other axes
`safe_region`	floor/ceiling/default/unsupported regions, with provenance

OperatorSpec:

Field	Meaning
`operator_id`	stable id, e.g. `topology.bracket`
`family`	topology / kv_memory / admission / batching / allocator / failure
`input_axes`	axes the operator acts on
`preconditions`	generic predicates; must reference state/axis/evidence, never a testcase
`patch_fn`	generates the candidate patch from state and axis value
`coverage_effects`	the structured `CoverageUnit`s produced
`required_evidence`	evidence required before the operator can be enabled
`risk_model`	source and version of the risk/cost term
`confirm_reject_schema`	how coverage is updated from the measured verdict

Every new operator must first pass schema validation and toy-lattice completeness tests.

Generic Operators

An operator is a reusable action template. An operator knows nothing about model names, case ids, or known winners.

Base operators that must be supported:

Operator	Input	Output	Purpose
`step_up(axis)`	ordered axis	adjacent higher value	test the upper frontier
`step_down(axis)`	ordered axis	adjacent lower value	test the lower frontier
`bracket(axis)`	ordered axis + boundary/trust-region state	adjacent contrast point	build a local bracket from a biased starting point
`jump_to_floor(axis)`	bounded axis + nominal floor	floor target	return from an illegal/inefficient range to the safe operating range
`local_climb(axis)`	bounded axis + step policy	next value	climb locally within the trust region
`backoff(axis)`	failed higher target	lower safe value	handle launch/OOM/regression
`redistribute(axis_a, axis_b)`	coupled axes + product constraints	legal coupled patch	TP/DP/EP redistribution
`joint_adjust(axes)`	coupled runtime axes	legal joint patch	MBT/MNS interaction
`preserve(dimensions)`	incumbent	complete patch context	keep the topology/runtime anchor
`block_if_seen(signature)`	tested signatures	reject candidate	no-repeat
`invalidate_region(failure)`	failure profile	region predicate	avoid repeating a failed region

Operators only construct candidates; they do not make the final acceptance decision. Acceptance is decided by real measurement.
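A few of the operators in the table can be sketched as pure functions over an axis domain. This is an illustrative sketch only: the function signatures are assumptions (the design specifies operators through `OperatorSpec`, not these signatures), and the axis values in the usage lines stand in for whatever `topology_constraints` or `engine_capability` would supply.

```python
from typing import List, Optional, Sequence, Set


def step_up(values: Sequence[int], current: int) -> Optional[int]:
    """Adjacent higher value on an ordered axis, or None at the top."""
    i = values.index(current)
    return values[i + 1] if i + 1 < len(values) else None


def step_down(values: Sequence[int], current: int) -> Optional[int]:
    """Adjacent lower value on an ordered axis, or None at the bottom."""
    i = values.index(current)
    return values[i - 1] if i > 0 else None


def bracket(values: Sequence[int], current: int, covered: Set[int]) -> List[int]:
    """Adjacent contrast points that have not been covered yet."""
    neighbors = [step_down(values, current), step_up(values, current)]
    return [v for v in neighbors if v is not None and v not in covered]


def jump_to_floor(current: float, nominal_floor: float) -> Optional[float]:
    """Floor target if the axis sits below its nominal floor, else None."""
    return nominal_floor if current < nominal_floor else None


# Placeholder domain and floor; in the design these come from configuration,
# never from the operator itself.
tp_axis = [1, 2, 4, 8]
assert bracket(tp_axis, 8, covered=set()) == [4]
assert jump_to_floor(0.5, nominal_floor=0.9) == 0.9
```

None of the functions mentions a concrete model or a known winning value; the same code serves any ordered or bounded axis.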

CandidateAction

Every candidate must be a structured object:

{
  "candidate_id": "topology.bracket/tp:8->4/dp:1",
  "family": "topology",
  "operator": "bracket",
  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
  "targets": ["topology_efficiency", "ttft_prefill"],
  "expected_metric_movement": {
    "request_rate_per_gpu": "increase_or_same",
    "ttft_p95": "may_increase",
    "gpu_count": "decrease"
  },
  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
  "risk": {
    "launch": "low",
    "regression": "medium",
    "measurement_cost": "one_trial"
  }
}

This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.

CandidateSet

CandidateSet is the complete enumeration that the grammar/operators produce in the current state, not the first few candidates shown to the planner.

CandidateSet = {
  grammar_version,
  policy_version,
  engine_capability_version,
  state_hash,
  candidates[],
  blocked_candidates[],
  candidate_set_hash
}

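Requirements:

OperatorRegistry.generate(state, grammar, policy) must enumerate all eligible candidates.
Every filtered candidate must go into blocked_candidates with a blocked_reason.
The stop proof must reference candidate_set_hash.
The same HarnessState + Grammar + Policy + EngineCapability must generate a byte-identical CandidateSet.
For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.

This is the key safeguard that keeps a "coverage-relative stop" from degenerating into "the current generator happened not to produce a candidate".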
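The requirements above (full enumeration, blocked candidates kept with a reason, a reproducible hash) can be sketched on a one-axis toy lattice. `build_candidate_set` is a hypothetical helper, not the `OperatorRegistry.generate` interface, and only the `exact_repeat` block reason is modeled.

```python
import hashlib
import json


def build_candidate_set(values, current, tested, grammar_version):
    """Enumerate both lattice neighbors of `current`; never drop one silently."""
    i = values.index(current)
    candidates, blocked = [], []
    for j in (i - 1, i + 1):
        if not 0 <= j < len(values):
            continue  # the lattice has no neighbor on this side
        entry = {"candidate_id": f"axis.step/{current}->{values[j]}", "target": values[j]}
        if values[j] in tested:
            # Filtered candidates are recorded, not discarded.
            blocked.append({**entry, "blocked_reason": "exact_repeat"})
        else:
            candidates.append(entry)
    body = {
        "grammar_version": grammar_version,
        "candidates": candidates,
        "blocked_candidates": blocked,
    }
    # Hash the canonical serialization so identical inputs give identical hashes.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["candidate_set_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


snapshot = build_candidate_set([1, 2, 4, 8], current=4, tested={2}, grammar_version="g1")
```

Because the blocked list is part of the hashed body, a stop report that cites `candidate_set_hash` also commits to why each filtered candidate was filtered.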

CoverageState

Coverage is the core of the new design. It records "which parts of the mechanism space have been measured or refuted", not just the best config.

Coverage units:

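Unit	Meaning
`signature_tested`	the exact config patch has been measured
`operator_covered`	an operator has been validated at least once on a given family/neighborhood
`neighbor_covered`	the adjacent upper/lower point on the ordered lattice has been measured
`trust_region_covered`	the local region of a runtime family on the incumbent topology has been measured
`hypothesis_refuted`	the candidate's reject condition was met
`region_invalidated`	failure memory rules out a whole region, not a single point
`incumbent_validated`	the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions

A CoverageUnit cannot be a free-form string; it must be a structured key: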

{
  "unit_type": "neighbor_covered",
  "family": "topology",
  "axis": "topology.tp",
  "operator": "bracket",
  "anchor_region": {"tp": 8, "dp": 1},
  "candidate_region": {"tp": 4, "dp": 1},
  "bottleneck": "ttft_prefill",
  "grammar_version": "..."
}

Stop must not rely on signature_tested alone. Required coverage must include at least one non-trivial mechanism coverage unit, such as neighbor_covered, trust_region_covered, region_invalidated, or incumbent_validated.

Coverage update:

trial_result -> MeasuredVerdict -> CoverageDelta

Every measured trial must produce at least one CoverageDelta:

add signature_tested;
add family/operator coverage;
invalidate failure region;
update incumbent;
refute/confirm hypothesis.

If a trial only burns GPU time without changing any coverage or the incumbent, it should be marked wasted_by_design, which directly violates the acceptance criteria.

More strictly:

signature_tested alone is not enough.

A trial that only adds an exact signature, without confirming or rejecting a hypothesis, covering any operator/neighborhood/trust region, updating the incumbent, or producing a conservative failure invalidation, must be marked non_informative_trial.
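The non-informative rule can be written as a predicate over the delta. The sketch below is an assumption-laden illustration: the field names on `CoverageDelta` are hypothetical (the document only lists the required content of the schema), and `hypothesis_resolved` stands for "confirmed or refuted".

```python
from dataclasses import dataclass, field


@dataclass
class CoverageDelta:
    """Hypothetical shape; the design only fixes what a delta must contain."""
    added_units: list = field(default_factory=list)
    invalidated_regions: list = field(default_factory=list)
    incumbent_updated: bool = False
    hypothesis_resolved: bool = False  # hypothesis confirmed or refuted

    @property
    def non_informative(self) -> bool:
        # signature_tested does not count as mechanism coverage on its own.
        mechanism_units = [
            u for u in self.added_units if u["unit_type"] != "signature_tested"
        ]
        return not (
            mechanism_units
            or self.invalidated_regions
            or self.incumbent_updated
            or self.hypothesis_resolved
        )


only_signature = CoverageDelta(added_units=[{"unit_type": "signature_tested"}])
with_neighbor = CoverageDelta(added_units=[{"unit_type": "neighbor_covered"}])
assert only_signature.non_informative and not with_neighbor.non_informative
```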

Failure invalidation

region_invalidated is the part most prone to overfitting or to wrongly killing candidates, so it must be defined conservatively.

A failure invalidation must output:

{
  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
  "source_failure": "engine_launch_oom",
  "covered_config_examples": [...],
  "excluded_config_examples": [...],
  "conservative_reason": "failure only observed at higher memory target on same topology",
  "retry_condition": "lower gmu or different topology is tested",
  "expires_after": "optional policy-defined TTL"
}

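Unbounded extrapolation from a single-point failure to a whole family is forbidden. For example, one OOM at TP=8,gmu=0.99 cannot directly invalidate all of TP=8 or all high TP; at most it invalidates a conservative region with the same or higher memory pressure, unless later evidence enlarges that region.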
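A conservative region predicate of the kind shown in the JSON example can be sketched as a small class. This is a hypothetical illustration: `InvalidatedRegion` and its fields are not part of the design, and the two-axis form simply mirrors the example predicate `topology.tp>=8 AND kv_memory.gmu>=0.99`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvalidatedRegion:
    """Region at or above the observed failure point on both axes."""
    min_tp: int        # topology at which the failure was observed
    min_gmu: float     # memory target at which the failure was observed
    source_failure: str

    def contains(self, config: dict) -> bool:
        # Both conditions must hold, so one failure never blocks a whole family.
        return config["tp"] >= self.min_tp and config["gmu"] >= self.min_gmu


region = InvalidatedRegion(min_tp=8, min_gmu=0.99, source_failure="engine_launch_oom")
assert region.contains({"tp": 8, "gmu": 0.99})
assert not region.contains({"tp": 8, "gmu": 0.90})   # lower gmu stays testable
assert not region.contains({"tp": 4, "gmu": 0.99})   # other topology stays testable
```

The two negative cases correspond to the retry condition in the example above: a lower gmu or a different topology remains open for testing.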

CoverageValidator

Stop must be defined relative to coverage, not relative to a Python guard cascade.

Stop condition:

stop_allowed iff
  legality soundness holds
  and candidate_set_snapshot is complete and persisted
  and no candidate with harness_priority >= threshold remains uncovered
  and incumbent has required validation coverage
  and measurement budget/search-high condition is satisfied or no eligible candidate remains

The undefined notion of a "useful candidate" is deliberately avoided here. Whether a candidate is eligible is decided by the following structured conditions:

eligible(candidate) iff
  candidate is legal
  and candidate is not exact-repeat
  and candidate.coverage_unit is not already covered
  and candidate.region is not invalidated
  and candidate.required_evidence is satisfied
  and candidate.harness_priority >= policy.min_candidate_priority

Required validation coverage must also be declared structurally. For example:

required_validation:
  incumbent:
    - family: topology
      unit_type: neighbor_covered
      neighborhood: adjacent
    - family: kv_memory
      unit_type: trust_region_covered
      when_axis_tunable: gpu-memory-utilization

The validator output must contain:

{
  "should_stop": true,
  "stop_kind": "coverage_relative_stop",
  "candidate_set_hash": "...",
  "grammar_version": "...",
  "policy_version": "...",
  "engine_capability_version": "...",
  "covered_units": [...],
  "remaining_candidates": [],
  "blocked_candidates": [
    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
  ],
  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
}

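The proof here is a relative proof:

it does not prove that no better config exists in the real world;
it proves that no uncovered high-priority experiment remains under the current grammar/operator/policy.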
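The eligibility predicate and the relative stop check can be sketched together. This is a simplified illustration under stated assumptions: `Candidate` flattens each structured condition into a boolean, `validate_stop` is a hypothetical free function rather than the `CoverageValidator` interface, and the budget/search-high clause of the stop condition is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    candidate_id: str
    harness_priority: float
    legal: bool = True
    exact_repeat: bool = False
    unit_covered: bool = False
    region_invalidated: bool = False
    evidence_satisfied: bool = True


def eligible(c: Candidate, min_priority: float) -> bool:
    return (
        c.legal
        and not c.exact_repeat
        and not c.unit_covered
        and not c.region_invalidated
        and c.evidence_satisfied
        and c.harness_priority >= min_priority
    )


def validate_stop(candidates, min_priority, incumbent_validated, candidate_set_hash):
    remaining = [c.candidate_id for c in candidates if eligible(c, min_priority)]
    should_stop = bool(incumbent_validated) and not remaining
    return {
        "should_stop": should_stop,
        "stop_kind": "coverage_relative_stop" if should_stop else None,
        "candidate_set_hash": candidate_set_hash,  # ties the verdict to the snapshot
        "remaining_candidates": remaining,
    }


covered = Candidate("topology.step_down", 0.8, unit_covered=True)
low = Candidate("kv_memory.local_climb", 0.2)
report = validate_stop([covered, low], 0.5, incumbent_validated=True, candidate_set_hash="...")
assert report["should_stop"]
```

The report says nothing about configs outside the enumerated set, which is exactly the relative scope described above.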

Planner backend separation

The harness produces the CandidateSet; the planner backend only selects.

CandidateSet = Grammar(HarnessState)
decision = PlannerBackend.rank_or_select(CandidateSet)

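Planner backends can include:

Backend	Role
deterministic greedy	selects the candidate with the highest priority / score
LLM ranker	reads the candidate set and evidence; may only select or combine legal candidates
BO/bandit	learns candidate priority on the intervention graph
oracle replay	offline validation using existing grid/replay data

Key constraints:

the planner must not invent knobs outside the grammar;
the planner must not bypass the validator stop;
the planner may only change ranking/selection, not legality/coverage;
for the same candidate set, a difference between LLM and BO is a backend difference, not a harness difference.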

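Priority responsibility boundary:

harness_priority belongs to the harness/policy and is used by the validator to decide whether a candidate is high-priority.
backend_score belongs to the planner backend and is used to rank the eligible candidates.
BO/bandit may learn backend_score but cannot change harness_priority, coverage, or legality.
If learned results need to influence the stop threshold, they must enter PolicyVersion explicitly, and the candidate set snapshot must be regenerated.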
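The split between `harness_priority` and `backend_score` can be sketched with a frozen candidate and a scoring callback. The names `select` and `learned_scores` are hypothetical, and the scores are made-up placeholders; the sketch only shows that two backends can disagree on the pick while seeing the same immutable candidates.

```python
import dataclasses
from typing import Callable, Sequence


@dataclasses.dataclass(frozen=True)
class Candidate:
    candidate_id: str
    harness_priority: float  # owned by harness/policy; read-only for backends


def select(eligible: Sequence[Candidate], backend_score: Callable[[Candidate], float]) -> Candidate:
    # A backend can only pick among the eligible candidates it was handed.
    return max(eligible, key=backend_score)


eligible = (Candidate("a", 0.7), Candidate("b", 0.4))

# Greedy backend: reuses harness_priority as its score.
greedy_pick = select(eligible, lambda c: c.harness_priority)

# Learned backend: its own score table (placeholder values), same candidates.
learned_scores = {"a": 0.1, "b": 0.9}
bandit_pick = select(eligible, lambda c: learned_scores[c.candidate_id])

assert greedy_pick.candidate_id == "a" and bandit_pick.candidate_id == "b"
```

Making the candidate frozen means an attempt to assign `harness_priority` from backend code raises `dataclasses.FrozenInstanceError` instead of silently changing stop behavior.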

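Restrictions on LLM-combined candidates:

by default the LLM is not allowed to combine multiple candidates;
if combination is enabled, the combination must produce a new CandidateAction;
the combined patch must go through the legality validator again;
the combined coverage effect cannot simply be a union; it must be generated by a joint_adjust operator in the grammar;
a combination with no joint_adjust operator must not be executed.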

Required implementation interfaces

Subsequent refactoring must land these interfaces first, then migrate the existing heuristics.

class InterventionGrammar:
    @classmethod
    def load(cls, path: str) -> "InterventionGrammar": ...
    def validate(self, capability: "EngineCapability") -> None: ...


class OperatorRegistry:
    def generate(
        self,
        state: "HarnessState",
        grammar: "InterventionGrammar",
        policy: "HarnessPolicy",
    ) -> "CandidateSet": ...


class InterventionOperator:
    def apply(
        self,
        axis: "AxisSpec",
        state: "HarnessState",
        policy: "HarnessPolicy",
    ) -> "CandidateAction | BlockedCandidate": ...


class CoverageState:
    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...


class CoverageValidator:
    def validate_stop(
        self,
        state: "HarnessState",
        candidate_set: "CandidateSet",
        coverage: "CoverageState",
        policy: "HarnessPolicy",
    ) -> "StopReport": ...


class PlannerBackend:
    def select(
        self,
        candidate_set: "CandidateSet",
        state: "HarnessState",
    ) -> "CandidateAction | StopRequest": ...

Required schemas:

Schema	Required fields
`CandidateSignature`	exact signature, normalized full config signature, partial patch signature, hash version
`MeasuredVerdict`	candidate id, trial id, status, objective delta, confirm/reject/inconclusive/failure/regression flags, raw metric refs
`CoverageDelta`	added units, invalidated regions, incumbent update, non-informative flag
`BlockedCandidate`	candidate skeleton, blocked reason, blocking predicate, retry condition
`StopReport`	candidate set hash, grammar/policy/capability versions, remaining/covered/blocked candidates, proof obligation

These schemas are part of the paper artifact. Without them, the experimental results cannot show that the harness is not a case-specific heuristic.

Anti-overfitting constraints

Forbidden

The following must not appear in the harness core, grammar, or operators:

model names, e.g. Qwen30B, Qwen27B;
host names, e.g. dash0;
case id, run id, trace window id;
known winning configs, e.g. "TP=2+gmu=0.97";
fixed SLO values used as control logic, e.g. "do X when TPOT=50ms";
special-case branches added in response to experiment results, e.g. "if current_tp > 2, force a downshift";
action ids or score boosts that target a single regression test.

Allowed

The following may exist, but must declare their source and scope:

safe ranges in engine capability, e.g. the gpu-memory-utilization nominal floor;
the legal lattice in topology constraints, e.g. TP/DP/EP values;
measurement-cost / risk priors in policy config;
the length/cache/arrival regime in the workload profile;
the SLO rule itself;
measured trial evidence.

Priority/threshold provenance:

Every priority term, threshold, and risk/cost prior must record:

name
default value
source: engine docs / prior experiments / conservative default / calibrated profile
scope: global / engine-family / engine-version / hardware-class
sensitivity test
last updated commit

Adjusting a threshold for a single scenario or a single run is forbidden. Changing a threshold requires running the sensitivity test and held-out scenario tests.

Boundary between rule and grammar

Not every conditional is forbidden. What is forbidden is a testcase-specific branch. What is allowed is a generic predicate:

Forbidden	Allowed
`if model == Qwen30B`	`if family == topology and axis.type == ordered_lattice`
`if TP == 8 then TP=4`	`bracket(ordered_axis)`
`if gmu == 0.5 then 0.9`	`jump_to_floor(bounded_axis)`
`if TTFT4s/TPOT25 case`	`if bottleneck hypothesis targets TTFT/TPOT evidence`
`score = 0.74 for this bad-start fix`	`priority = relief_prior + information_gain - risk - cost` from policy terms
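The last row of the table, priority assembled from policy terms rather than a per-case constant, can be sketched as follows. Everything here is illustrative: `PolicyTerm` is a hypothetical record carrying a subset of the provenance fields listed earlier, and the numeric values are placeholders, not calibrated numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTerm:
    name: str
    value: float
    source: str  # e.g. "conservative default", "calibrated profile"
    scope: str   # e.g. "global", "engine-family"


def harness_priority(terms: dict) -> float:
    # priority = relief_prior + information_gain - risk - cost
    return (
        terms["relief_prior"].value
        + terms["information_gain"].value
        - terms["risk"].value
        - terms["cost"].value
    )


# Placeholder values for illustration only.
terms = {
    t.name: t
    for t in (
        PolicyTerm("relief_prior", 0.5, "conservative default", "global"),
        PolicyTerm("information_gain", 0.25, "conservative default", "global"),
        PolicyTerm("risk", 0.125, "conservative default", "engine-family"),
        PolicyTerm("cost", 0.125, "conservative default", "global"),
    )
}
assert harness_priority(terms) == 0.5
```

Each term can be varied independently in a sensitivity test, which a single fused constant like `score = 0.74` does not allow.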

Provable properties

P1. Legality soundness

For any generated candidate:

candidate.patch satisfies:
  tunable_envs/tunable_flags
  topology constraints
  engine capability safe constraints
  no forbidden knob

Testing approach:

grammar-level property tests;
random StudySpec / topology constraints fuzzing;
all candidate patches pass validate_proposal equivalent checks.

P2. No-repeat

For any candidate:

signature(candidate.patch) not in tested_signatures

If an exact signature has already been measured, the candidate must be filtered or marked already-covered.

P3. Coverage monotonicity

After every measured trial:

coverage_{t+1} >= coverage_t

where >= means that covered units, invalidated regions, tested signatures, and incumbent evidence never shrink.
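If coverage is modeled as a set of hashable unit keys, P3 reduces to a subset check. This is a minimal sketch under that modeling assumption; `apply_delta` and `is_monotone` are hypothetical helpers, and real `CoverageUnit` keys would be structured records rather than the tuples used here.

```python
def apply_delta(coverage: frozenset, added_units) -> frozenset:
    # Coverage only grows: the update is a set union, never a removal.
    return coverage | frozenset(added_units)


def is_monotone(before: frozenset, after: frozenset) -> bool:
    # coverage_{t+1} >= coverage_t, read as "before is a subset of after".
    return before <= after


before = frozenset({("signature_tested", "tp=8,dp=1")})
after = apply_delta(before, [("neighbor_covered", "topology.tp", "8->4")])
assert is_monotone(before, after)
```

A property test can apply random sequences of deltas and assert `is_monotone` after each step.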

P4. Coverage-relative stop soundness

If the validator outputs stop:

for all candidate in persisted CandidateSet(grammar, state, policy):
  candidate.harness_priority < threshold
  or candidate.coverage_unit already covered
  or candidate.region invalidated
  or candidate violates constraints

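This is soundness relative to the grammar, not global optimality.

Additional preconditions:

the CandidateSet must be a complete enumeration;
the CandidateSet snapshot must be persisted;
the StopReport must contain candidate_set_hash;
required coverage cannot be satisfied by signature_tested alone.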

P5. Backend independence

Given the same HarnessState + Grammar + Policy:

CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.

Different backends may only change backend_score, candidate ranking, or tie-breaks.

P6. Auditability

Every executed trial must be traceable through:

proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta

If any link is missing, the trial cannot be used as paper evidence.

Properties that cannot be proven

The paper must explicitly avoid claiming:

global optimum;
arbitrary bad-start completeness;
arbitrary workload/SLO generalization;
engine-independent optimality;
bottleneck classifier correctness;
stop implies no better real config.

It may say:

Within the declared intervention grammar and measurement budget, AITuner provides
coverage-relative experimental control and empirically improves convergence on the
reported workloads.

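Acceptance criteria

This section is the strictest gate. Until these criteria are met, the new harness should not be treated as a paper contribution.

A. Design acceptance

The harness core must not gain any new testcase-specific branch.
All knob behavior must come from grammar/operator/policy config.
Magic constants must be centralized in HarnessPolicy or engine capability, with their source documented.
The candidate generator must output CandidateAction objects containing hypothesis, operator, confirm/reject condition, and coverage effect.
The stop validator must output a coverage-relative proof obligation.
The planner backend must not bypass the candidate set or the stop validator.
The CandidateSet must be fully enumerated, persisted as a snapshot, and referenced by the stop report.
harness_priority and backend_score must be separate.
CoverageUnit must be structured, and required coverage cannot rely on the exact signature alone.
Failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions.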

B. Code acceptance

harness.py stops growing; the new implementation is split into:
- harness_state.py
- intervention_grammar.py
- operators.py
- coverage.py
- planner_backend.py
- harness_compat.py or equivalent adapter.
CandidateAction, CoverageState, CoverageDelta, and HarnessPolicy use dataclasses or a typed schema.
The logic of the current _topology_candidate_actions() and _runtime_candidate_actions() must be migrated into the operator registry.
Every candidate patch must pass the legality validator.
Every executed harness proposal must persist its candidate id and coverage delta.
The current regression tests may be kept as scenario tests, but must not be fixed by adding hardcoded branches.
StopReport must contain candidate_set_hash, grammar/policy/capability versions, and remaining/covered/blocked candidates.
An inconclusive measurement must not wrongly produce hypothesis_refuted or region_invalidated.

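C. Static anti-overfitting acceptance

CI or tests must check that:

The harness core contains no model name / run id / host name / known winning config strings.
The harness core contains no case-specific action id.
No score boost targets a single scenario test.
Grammar files reference only family/operator/axis, never a testcase.
When a new scenario failure appears, the PR must state whether it extends the grammar/operators or adjusts policy config; adding a testcase branch directly is not allowed.
Policy/config/test fixtures are also scanned for known winning configs and scenario-specific score boosts.
Any new or modified policy threshold must come with a sensitivity test.
After adding an operator/policy, held-out scenarios must be run; passing only the scenario that triggered the change is not enough.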
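The string-level checks in this list could start from a simple line scanner. The sketch below is one possible shape, not the project's CI: `FORBIDDEN_PATTERNS` and `scan_source` are hypothetical, and the three regexes are derived only from the examples in the forbidden list (model names like Qwen30B, host names like dash0, configs like "TP=2+gmu=0.97"). A real check would need a maintained pattern list and would have to exclude the scanner's own file.

```python
import re

# Patterns built from the document's own examples; assumed, not exhaustive.
FORBIDDEN_PATTERNS = {
    "model_name": re.compile(r"Qwen\d+B", re.IGNORECASE),
    "host_name": re.compile(r"\bdash\d+\b"),
    "known_winning_config": re.compile(
        r"TP\s*=\s*\d+\s*\+\s*gmu\s*=\s*0\.\d+", re.IGNORECASE
    ),
}


def scan_source(text: str):
    """Return (line_number, label) for every forbidden pattern hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in FORBIDDEN_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


sample = "x = 1\nif model == 'Qwen30B':\n    pass\n"
assert scan_source(sample) == [(2, "model_name")]
assert scan_source("bracket(ordered_axis)") == []
```

A scanner like this catches literal strings only; branches such as "if current_tp > 2" still need review or property tests to detect.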

D. Unit test acceptance

The following grammar-level tests must be added:

Legal candidate generation under random topology constraints.
No-repeat property.
Coverage monotonicity.
Stop iff no uncovered high-priority candidate.
Failure memory invalidates regions, not just exact signatures.
Backend independence: greedy and the mock-LLM ranker see the same candidate set.
Bad-start property tests:
- for boundary starts on an ordered lattice, bracket generates the adjacent contrast candidate;
- for below-floor starts on a bounded numeric axis, jump_to_floor generates the floor candidate;
- tests must not use the concrete TP=8 -> TP=4 or 0.5 -> 0.9 as their only condition; they should be parameterized.
Candidate set determinism: the same input yields a byte-identical snapshot.
Candidate set completeness on toy grammars: exhaustive enumeration on a small lattice proves the generator misses nothing.
Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
Inconclusive measurement: a timeout/noisy/partial run must not wrongly refute a hypothesis.
Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
Stop falsification: for a history that has already stopped, use local grid/replay to check whether a high-priority candidate inside the grammar was missed.

E. Experiment acceptance

Paper evidence must include at least:

Planner-agnostic ablation
- raw greedy / BO vs grammar-guided greedy / BO;
- weak LLM naive vs weak LLM + harness;
- strong LLM naive vs strong LLM + harness.
Mechanism ablation
- full grammar;
- no attribution;
- shuffled attribution;
- no grammar, raw knobs;
- no coverage validator;
- no failure memory.
Bad-start robustness
- multiple random/adversarial starts;
- not just TP=8,gmu=0.5,max-num-seqs=8;
- report the convergence distribution, not a single successful path.
Near-optimum check
- at least one case with a local grid/expert comparison;
- report the gap between the AITuner best and the local grid best.
Cross-regime check
- at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.

F. Paper claim acceptance

Before the experiments above pass, the paper may only say:

The current implementation demonstrates the feasibility of a deterministic,
mechanism-guided proposal loop on selected cases.

After passing design and code acceptance, it may say:

AITuner implements a planner-agnostic intervention grammar with coverage-relative
stop authority.

Only after passing experiment acceptance may it say:

The harness improves fixed-budget tuning and convergence robustness across the
reported regimes, and the gain is not attributable solely to a stronger LLM or
case-specific prompt engineering.

It still cannot say:

The harness is complete over all configs.
The harness guarantees global optimum.
The harness is robust to arbitrary workloads/engines.

Migration plan

Phase 0: Freeze rule growth

Stop adding new testcase-specific rules to harness.py.
Every new bad case is first written up as a missing grammar/operator issue.
Mark the current rule-based harness as a prototype in the roadmap.

Phase 1: Typed state and candidate schema

Extract TrialProfile, BottleneckHypothesis, CandidateAction, CoverageState, and HarnessPolicy.
Also extract AxisSpec, OperatorSpec, CandidateSet, CoverageUnit, CoverageDelta, MeasuredVerdict, BlockedCandidate, and StopReport.
Existing behavior stays compatible; only the structure changes.
Golden tests use the current outputs to guard against accidental regressions.

Phase 2: Operator registry

Migrate the four families topology, kv_memory, admission, and batching first.
Each operator is responsible for generating candidates and coverage effects.
The old _topology_candidate_actions() and _runtime_candidate_actions() become compat wrappers.

Phase 3: Coverage validator

Implement coverage accounting.
StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
Add coverage-relative stop tests.

Phase 4: Planner backend split

The greedy planner is the default backend.
The LLM backend may only rank/select candidates.
BO/bandit backends can be plugged in later.

Phase 5: Experiment rerun

Rerun no-LLM Qwen30B;
rerun 2x2;
run the bad-start distribution;
run the mechanism ablation;
run the local grid/expert comparison.

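The most important criterion

If a new failure can only be fixed like this:

adding an if-else in the planner that targets a specific config/workload

then it is not a harness contribution.

If a new failure is fixed like this:

adding or correcting a generic operator / coverage predicate / engine capability declaration,
such that the change improves a whole class of parameterized tests

only then might it be a harness contribution.

This criterion should gate all subsequent PRs and interpretations of experiments.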
