Declarative Intervention Harness Design - 2026-06-26

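This document redefines the AITuner harness. The goal is to keep the harness from becoming a pile of testcase-oriented if/else tuning rules, and to state explicitly which properties can be proven, which claims must be validated by experiment, and which claims must not be made.

The conclusion up front:

The harness's contribution is not that "expert rules know how to tune vLLM".

The harness's contribution should be:
turning raw knob optimization into declarative, coverage-aware experimental design.

In other words, the harness should not hardcode the next answer directly; it should define:

  • which state is observable;
  • which interventions are legal;
  • which system hypothesis each intervention tests;
  • which regions have already been measured, refuted, or covered;
  • when the planner may still select experiments;
  • when a stop is coverage-relative sound.

The planner can be an LLM, BO, a bandit, a greedy heuristic, or oracle replay. The harness is the substrate these planners share, not a prompt trick for one particular planner.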

Independent review verdict

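The first draft of this document was independently reviewed by a fresh subagent. The verdict:

accept with major revisions

The reviewer's core warning: moving rules out of Python if/else and into a grammar, operator, policy, or priority can still amount to a "re-skinned rule-based harness". The design must therefore additionally satisfy:

  • the candidate set must be fully enumerated and persisted as a snapshot;
  • every priority/threshold must have provenance, a version, and a sensitivity test;
  • coverage units must be structured and cannot rely on signature_tested alone;
  • failure invalidation must have a conservative region predicate, boundaries, and retry/unblock conditions;
  • the stop proof must reference the candidate set snapshot;
  • grammar/policy/capability must also pass static anti-overfitting checks;
  • LLM/BO may only select candidates and must not covertly change legality, coverage, or stop authority.

The design below incorporates these major revisions as hard requirements.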

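Current problems

The current src/aituner/harness.py already has some of the right abstract vocabulary: observation, bottleneck hypotheses, candidate actions, validator stop. But the implementation is still heavily rule-based:

  • observation, bottleneck attribution, candidate generation, scoring, and the stop validator all live in a single Python file;
  • knob-family behavior is decided by Python branches and fixed thresholds;
  • scoring uses hand-picked constants;
  • the stop validator is a guard cascade, not an explicit coverage proof;
  • each newly discovered bad case tends to add another branch, for example a bad high-TP start or a low-gmu start.

An implementation like this can serve as an engineering prototype, but it cannot be a systems contribution. At most it shows:

We wrote a set of heuristics that work for some cases we have already seen.

It cannot show that:

  • the harness is general;
  • the harness is robust to arbitrary bad starts;
  • no untested better config exists at stop time;
  • the candidate generator covers all important interventions;
  • AITuner's gains do not come from testcase-specific rule accumulation.

Future work therefore cannot proceed by "adding a rule per failure mode".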

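Design goals

The new harness must meet five goals.

  1. Declarative: system knowledge must live in the grammar, operators, coverage predicates, and policy config, not scattered across Python if/else branches.

  2. Planner-agnostic: the harness owns legality, candidate construction, coverage accounting, and stop authority; the planner only ranks or selects within the legal candidate set.

  3. Coverage-aware: stop does not mean "no rule fired". Stop must have explicit coverage semantics relative to the current grammar and operator set.

  4. Anti-overfitting: the design must not contain a model name, case id, known winning config, fixed workload id, or constants derived from experimental results. A case may enter the system only through StudySpec, engine capability, workload profile, and trial measurements.

  5. Auditable: every proposal must be able to answer:

Which hypothesis does this trial test?
Which intervention dimension does it change?
What is its confirm/reject condition?
How does coverage change after the trial ends?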

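Non-goals

We do not pursue, and should not claim, any of the following:

  • global optimum;
  • completeness over an arbitrary config space;
  • robustness to arbitrary serving engines/workloads/SLOs;
  • a bottleneck classifier that is always correct;
  • that no better config exists in the real world at stop time.

What we pursue instead are stricter but provable relative properties:

Under the declared grammar, operator set, engine constraints, and measurement budget,
the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.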

New architecture

StudySpec + EngineCapability + WorkloadProfile + TrialHistory
        |
        v
HarnessState
        |
        v
InterventionGrammar
        |
        v
OperatorRegistry -----> CandidateSet
        |                    |
        v                    v
CoverageState <------ MeasuredVerdict
        |
        v
CoverageValidator
        |
        v
PlannerBackend
        |
        v
Proposal or Stop

Versioned objects:

| Object | Must be versioned | Appears in which outputs |
| --- | --- | --- |
| GrammarVersion | yes | candidate id, candidate set snapshot, coverage delta, stop report |
| PolicyVersion | yes | candidate priority, blocked reason, stop threshold, stop report |
| EngineCapabilityVersion | yes | axis domain, safe range, candidate patch, failure invalidation |
| PlannerBackendVersion | yes | selected candidate, ranking trace |

A candidate or stop decision without version numbers cannot be used as paper evidence.

HarnessState

HarnessState is the only state entry point the planner sees. It consists of structured data:

| Field | Source | Role |
| --- | --- | --- |
| study | StudySpec | tunable schema, topology constraints, SLO, search budget |
| engine_capability | static profile / engine adapter | safe ranges, knob lattice, unsupported combinations |
| workload_profile | trace summary / L-C-A profile | length/cache/arrival regime |
| trial_profiles | measured results | latency, SLO failures, throughput, launch failures |
| incumbent | state best | current best config and measured objective |
| tested_signatures | trial history | no-repeat and coverage accounting |
| failure_memory | launch/runtime failures | invalidate region or operator dimensions |

Important constraints:

  • HarnessState contains no natural-language, prompt-only state;
  • every field must be serializable, snapshottable, and replayable;
  • the same HarnessState + Grammar + Policy must produce the same candidate set.
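The serializability and determinism constraints above can be illustrated with a canonical hash over the state. This is a minimal sketch, not the AITuner implementation: `ToyHarnessState` and `state_hash` are hypothetical names, and the state is reduced to three fields for brevity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyHarnessState:
    """Hypothetical, heavily reduced stand-in for HarnessState."""
    study: dict
    tested_signatures: tuple = ()
    failure_memory: tuple = ()


def state_hash(state: ToyHarnessState) -> str:
    """Hash a canonical JSON rendering so equal states hash equally."""
    # sort_keys and fixed separators remove dict-ordering and whitespace noise.
    payload = json.dumps(asdict(state), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two states that differ only in dict insertion order, plus one that differs in content.
a = ToyHarnessState(study={"budget": 8, "gpus": 8}, tested_signatures=("tp=4,dp=1",))
b = ToyHarnessState(study={"gpus": 8, "budget": 8}, tested_signatures=("tp=4,dp=1",))
c = ToyHarnessState(study={"gpus": 8, "budget": 8})
```

A hash of this kind could serve as the `state_hash` field of a candidate set snapshot, provided every real field is reduced to JSON-compatible values first.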

InterventionGrammar

The grammar describes the intervention space of serving tuning. It does not select candidates; it only declares:

  • family;
  • axes;
  • legal value source;
  • generic operators;
  • expected effect schema;
  • risk schema;
  • coverage dimensions;
  • required evidence.

Suggested initial grammar:

| Family | Axes | Operators | Coverage dimension |
| --- | --- | --- | --- |
| topology | TP, DP, EP, EP enable | step_up, step_down, redistribute, bracket | topology lattice neighborhood |
| kv_memory | gpu-memory-utilization | jump_to_floor, local_climb, backoff_after_failure | runtime trust region on incumbent topology |
| admission | max-num-seqs | raise, lower, bracket | sequence concurrency region |
| batching | max-num-batched-tokens, chunked prefill | raise, lower, joint_adjust | prefill/decode batching region |
| allocator | block size / cache knobs | switch_categorical, bracket | memory layout region |
| failure | failed signatures/regions | invalidate_region, avoid_region | failure memory coverage |

Grammar example:

family: kv_memory
axes:
  gpu-memory-utilization:
    type: bounded_float
    value_source: engine_capability
    nominal_floor_key: gmu_nominal_floor
    safe_ceiling_key: gmu_safe_ceiling
operators:
  - name: jump_to_floor
    preconditions:
      - axis_below_nominal_floor
      - incumbent_topology_preserved
    coverage_effect: covers_runtime_floor
  - name: local_climb
    preconditions:
      - axis_at_or_above_nominal_floor
      - no_failed_higher_target
    coverage_effect: extends_runtime_trust_region

Nothing here says 0.5 -> 0.9. The value 0.9 comes from engine capability/policy config; jump_to_floor is a generic operator on a bounded numeric axis.

Topology example:

family: topology
axes:
  tensor-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
  data-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
operators:
  - name: step_up
    coverage_effect: covers_upper_neighbor
  - name: step_down
    coverage_effect: covers_lower_neighbor
  - name: bracket
    preconditions:
      - anchor_at_boundary_or_unknown_region
      - neighbor_uncovered
    coverage_effect: brackets_ordered_axis
  - name: redistribute
    coverage_effect: covers_same_product_tp_dp_neighbor

Nothing here says TP=8 -> TP=4. It is a bracket on an ordered lattice.
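To make the "generic operator on an ordered lattice" idea concrete, here is a minimal sketch of a bracket operator. The function name, its signature, and the lower-neighbor-first tie-break are assumptions for illustration; the real operator would also check preconditions such as anchor_at_boundary_or_unknown_region.

```python
from typing import Optional, Sequence, Set


def bracket(domain: Sequence[int], anchor: int, covered: Set[int]) -> Optional[int]:
    """Return an uncovered neighbor of `anchor` on the ordered axis, or None."""
    values = sorted(set(domain))
    i = values.index(anchor)  # raises ValueError if the anchor is not a legal value
    # Assumed tie-break: try the lower neighbor first, then the upper one.
    neighbors = []
    if i > 0:
        neighbors.append(values[i - 1])
    if i + 1 < len(values):
        neighbors.append(values[i + 1])
    for value in neighbors:
        if value not in covered:
            return value
    return None
```

The operator never names a specific value: the legal domain arrives as data, so the same code yields the lower neighbor for an upper-boundary start and the upper neighbor for a lower-boundary start.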

AxisSpec and OperatorSpec

The grammar must be expressed as a typed schema, not just free text.

AxisSpec

| Field | Meaning |
| --- | --- |
| axis_id | stable id, for example topology.tp |
| knob_keys | knobs affected when this axis is lowered to a config patch |
| type | ordered_lattice, bounded_float, bounded_int, categorical, coupled |
| domain_source | topology_constraints, engine_capability, policy_config, study_spec |
| domain | legal values or range; must be enumerable or discretizable |
| order | total/partial order for an ordered axis |
| coupling | legality constraints with other axes |
| safe_region | floor/ceiling/default/unsupported regions, with provenance |

OperatorSpec

| Field | Meaning |
| --- | --- |
| operator_id | stable id, for example topology.bracket |
| family | topology / kv_memory / admission / batching / allocator / failure |
| input_axes | axes the operator acts on |
| preconditions | generic predicates; must reference state/axis/evidence, never a testcase |
| patch_fn | generates a candidate patch from state and axis value |
| coverage_effects | the structured CoverageUnit entries produced |
| required_evidence | evidence required before the operator can be enabled |
| risk_model | source and version of the risk/cost terms |
| confirm_reject_schema | how coverage is updated from the measured verdict |

Any new operator must first pass schema validation and toy-lattice completeness tests.

Generic Operators

An operator is a reusable action template. An operator knows nothing about model name, case id, or a known winner.

Base operators that must be supported:

| Operator | Input | Output | Purpose |
| --- | --- | --- | --- |
| step_up(axis) | ordered axis | adjacent higher value | probe the upper frontier |
| step_down(axis) | ordered axis | adjacent lower value | probe the lower frontier |
| bracket(axis) | ordered axis + boundary/trust-region state | adjacent contrast point | build a local bracket from a biased starting point |
| jump_to_floor(axis) | bounded axis + nominal floor | floor target | return from an illegal/inefficient range to the safe operating range |
| local_climb(axis) | bounded axis + step policy | next value | climb locally within the trust region |
| backoff(axis) | failed higher target | lower safe value | handle launch/OOM/regression |
| redistribute(axis_a, axis_b) | coupled axes + product constraints | legal coupled patch | TP/DP/EP redistribution |
| joint_adjust(axes) | coupled runtime axes | legal joint patch | MBT/MNS interaction |
| preserve(dimensions) | incumbent | complete patch context | keep the topology/runtime anchor |
| block_if_seen(signature) | tested signatures | reject candidate | no-repeat |
| invalidate_region(failure) | failure profile | region predicate | avoid repeated failure regions |

Operators only construct candidates; they do not make the final acceptance decision. Acceptance is decided by real measurement.

CandidateAction

Every candidate must be a structured object:

{
  "candidate_id": "topology.bracket/tp:8->4/dp:1",
  "family": "topology",
  "operator": "bracket",
  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
  "targets": ["topology_efficiency", "ttft_prefill"],
  "expected_metric_movement": {
    "request_rate_per_gpu": "increase_or_same",
    "ttft_p95": "may_increase",
    "gpu_count": "decrease"
  },
  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
  "risk": {
    "launch": "low",
    "regression": "medium",
    "measurement_cost": "one_trial"
  }
}

This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.

CandidateSet

CandidateSet is the complete enumeration produced by the grammar/operators in the current state, not the first few candidates the planner happens to see.

CandidateSet = {
  grammar_version,
  policy_version,
  engine_capability_version,
  state_hash,
  candidates[],
  blocked_candidates[],
  candidate_set_hash
}

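Requirements:

  1. OperatorRegistry.generate(state, grammar, policy) must enumerate all eligible candidates.
  2. Every filtered candidate must go into blocked_candidates with a blocked_reason.
  3. The stop proof must reference candidate_set_hash.
  4. The same HarnessState + Grammar + Policy + EngineCapability must generate a byte-identical CandidateSet.
  5. For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.

This is what keeps a "coverage-relative stop" from degenerating into "the current generator happened not to produce any candidates".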
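A toy version of these requirements, under stated assumptions: a single ordered axis, a hypothetical `generate_candidate_set` helper, and made-up version strings. It shows full enumeration, blocked candidates with reasons, and a hash taken over a canonical serialization so that input ordering does not change the snapshot.

```python
import hashlib
import json


def generate_candidate_set(domain, anchor, tested, versions):
    """Enumerate every non-anchor value; nothing is silently dropped."""
    candidates, blocked = [], []
    for value in sorted(set(domain)):
        if value == anchor:
            continue
        candidate_id = f"toy.move/{anchor}->{value}"
        if value in tested:
            # Filtered candidates are recorded with a reason, not discarded.
            blocked.append({"candidate_id": candidate_id, "blocked_reason": "exact_repeat"})
        else:
            candidates.append({"candidate_id": candidate_id, "patch": {"toy-axis": value}})
    snapshot = dict(versions, candidates=candidates, blocked_candidates=blocked)
    # Canonical JSON makes the hash independent of dict ordering.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    snapshot["candidate_set_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return snapshot


versions = {"grammar_version": "g-toy", "policy_version": "p-toy"}
first = generate_candidate_set([1, 2, 4, 8], anchor=8, tested={2}, versions=versions)
second = generate_candidate_set([8, 4, 2, 1], anchor=8, tested={2}, versions=versions)
```

On a toy domain like this, completeness can be checked by counting: candidates plus blocked candidates must equal the number of non-anchor values.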

CoverageState

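Coverage is the core of the new design. It records "which parts of the mechanism space have been tested or refuted", not just the best config.

Coverage units:

| Unit | Meaning |
| --- | --- |
| signature_tested | the exact config patch has been tested |
| operator_covered | an operator has been validated at least once on a given family/neighborhood |
| neighbor_covered | the adjacent upper/lower point on an ordered lattice has been tested |
| trust_region_covered | the local region of a runtime family on the incumbent topology has been tested |
| hypothesis_refuted | the candidate's reject condition was met |
| region_invalidated | failure memory rules out a region rather than a single point |
| incumbent_validated | the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions |

A CoverageUnit cannot be a free-form string; it must be a structured key: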

{
  "unit_type": "neighbor_covered",
  "family": "topology",
  "axis": "topology.tp",
  "operator": "bracket",
  "anchor_region": {"tp": 8, "dp": 1},
  "candidate_region": {"tp": 4, "dp": 1},
  "bottleneck": "ttft_prefill",
  "grammar_version": "..."
}

Stop must not rely on signature_tested alone. Required coverage must include at least one non-trivial mechanism coverage unit, for example neighbor_covered, trust_region_covered, region_invalidated, or incumbent_validated.

Coverage update:

trial_result -> MeasuredVerdict -> CoverageDelta

Every measured trial must produce at least one CoverageDelta:

  • add signature_tested;
  • add family/operator coverage;
  • invalidate failure region;
  • update incumbent;
  • refute/confirm hypothesis.

If a trial only consumes GPU time and changes neither coverage nor the incumbent, it should be marked wasted_by_design, which directly violates the acceptance criteria.

More strictly:

signature_tested alone is not enough.

A trial that only adds an exact signature, without confirming or rejecting a hypothesis, covering any operator/neighborhood/trust region, updating the incumbent, or producing a conservative failure invalidation, must be marked non_informative_trial.
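The non-informative rule can be sketched as a small predicate over a coverage delta. `ToyCoverageDelta` and `MECHANISM_UNITS` are hypothetical names, and unit types are reduced to plain strings here even though the design requires structured keys.

```python
from dataclasses import dataclass, field

# Unit types from the coverage table that count as mechanism coverage.
# signature_tested is deliberately excluded.
MECHANISM_UNITS = frozenset({
    "operator_covered",
    "neighbor_covered",
    "trust_region_covered",
    "hypothesis_refuted",
    "region_invalidated",
    "incumbent_validated",
})


@dataclass
class ToyCoverageDelta:
    """Hypothetical, reduced stand-in for CoverageDelta."""
    added_units: list = field(default_factory=list)  # unit_type strings
    incumbent_updated: bool = False
    hypothesis_confirmed: bool = False

    @property
    def non_informative(self) -> bool:
        """True when the trial added nothing beyond an exact signature."""
        has_mechanism = any(unit in MECHANISM_UNITS for unit in self.added_units)
        return not (has_mechanism or self.incumbent_updated or self.hypothesis_confirmed)
```

Deriving the flag from the delta, rather than letting a caller set it, keeps the non_informative_trial label from being skipped by accident.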

Failure invalidation

region_invalidated is the part most prone to overfitting or to wrongly ruling out good configs, so it must be defined conservatively.

A failure invalidation must output:

{
  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
  "source_failure": "engine_launch_oom",
  "covered_config_examples": [...],
  "excluded_config_examples": [...],
  "conservative_reason": "failure only observed at higher memory target on same topology",
  "retry_condition": "lower gmu or different topology is tested",
  "expires_after": "optional policy-defined TTL"
}

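Extrapolating from a single-point failure to an entire family without bounds is forbidden. For example, one OOM at TP=8,gmu=0.99 cannot directly invalidate all TP=8 configs or all high-TP configs. At most it invalidates a conservative region with the same or higher memory pressure, unless later evidence enlarges that region.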
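One way to read "conservative region" is sketched below. It assumes the region is limited to the same topology with an equal or higher memory target, which is narrower than the example predicate in the JSON above; the function name and the `tp`/`dp`/`gmu` dict keys are illustrative only.

```python
def in_invalidated_region(config: dict, failure: dict) -> bool:
    """True if `config` falls inside the conservative region around `failure`.

    Assumed region: same tp/dp as the failed config and a memory target that
    is at least as high. Anything else stays eligible for measurement.
    """
    same_topology = config["tp"] == failure["tp"] and config["dp"] == failure["dp"]
    return same_topology and config["gmu"] >= failure["gmu"]


failure = {"tp": 8, "dp": 1, "gmu": 0.99}
```

With this predicate, a lower memory target on the failed topology and any other topology remain testable, which matches the retry condition in the example record.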

CoverageValidator

Stop must be defined relative to coverage, not relative to a Python guard cascade.

Stop condition:

stop_allowed iff
  legality soundness holds
  and candidate_set_snapshot is complete and persisted
  and no candidate with harness_priority >= threshold remains uncovered
  and incumbent has required validation coverage
  and measurement budget/search-high condition is satisfied or no eligible candidate remains

The undefined notion of a "useful candidate" is not used here. Whether a candidate is eligible is decided by the following structured conditions:

eligible(candidate) iff
  candidate is legal
  and candidate is not exact-repeat
  and candidate.coverage_unit is not already covered
  and candidate.region is not invalidated
  and candidate.required_evidence is satisfied
  and candidate.harness_priority >= policy.min_candidate_priority

Required validation coverage must also be declared structurally. For example:

required_validation:
  incumbent:
    - family: topology
      unit_type: neighbor_covered
      neighborhood: adjacent
    - family: kv_memory
      unit_type: trust_region_covered
      when_axis_tunable: gpu-memory-utilization

The validator output must include:

{
  "should_stop": true,
  "stop_kind": "coverage_relative_stop",
  "candidate_set_hash": "...",
  "grammar_version": "...",
  "policy_version": "...",
  "engine_capability_version": "...",
  "covered_units": [...],
  "remaining_candidates": [],
  "blocked_candidates": [
    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
  ],
  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
}

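The proof here is a relative proof:

It does not prove that no better config exists in the real world;
it proves that no uncovered high-priority experiment remains under the current grammar/operator/policy.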
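The eligibility and stop rules can be sketched as two pure functions. This is a simplified illustration: `ToyCandidate` and `stop_report` are hypothetical, each candidate carries precomputed boolean flags, and the budget/search-high and snapshot-persistence clauses of the stop condition are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCandidate:
    """Hypothetical candidate with precomputed legality and evidence flags."""
    candidate_id: str
    coverage_unit: str
    harness_priority: float
    legal: bool = True
    exact_repeat: bool = False
    region_invalidated: bool = False
    evidence_satisfied: bool = True


def eligible(c: ToyCandidate, covered_units: set, min_priority: float) -> bool:
    """Mirror of the eligible(candidate) conditions, one clause per line."""
    return (
        c.legal
        and not c.exact_repeat
        and c.coverage_unit not in covered_units
        and not c.region_invalidated
        and c.evidence_satisfied
        and c.harness_priority >= min_priority
    )


def stop_report(candidates, covered_units, min_priority, incumbent_validated):
    """Stop only if the incumbent is validated and no eligible candidate remains."""
    remaining = sorted(
        c.candidate_id for c in candidates if eligible(c, covered_units, min_priority)
    )
    return {
        "should_stop": incumbent_validated and not remaining,
        "remaining_candidates": remaining,
    }


cands = [
    ToyCandidate("topology.bracket/lower", "neighbor:lower", 0.75),
    ToyCandidate("kv_memory.local_climb", "trust_region:gmu", 0.25),
]
```

Because both functions depend only on the candidate set, the coverage state, and the policy threshold, a planner backend has no way to influence the stop decision.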

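Planner backend separation

The harness produces the CandidateSet; the planner backend only selects.

CandidateSet = Grammar(HarnessState)
decision = PlannerBackend.rank_or_select(CandidateSet)

Planner backends can include:

| Backend | Role |
| --- | --- |
| deterministic greedy | selects the highest priority / score candidate |
| LLM ranker | reads the candidate set and evidence; may only select or combine legal candidates |
| BO/bandit | learns candidate priority over the intervention graph |
| oracle replay | offline validation using existing grid/replay data |

Key constraints:

  • the planner must not invent knobs outside the grammar;
  • the planner must not bypass the validator's stop;
  • the planner may only change ranking/selection, not legality/coverage;
  • for the same candidate set, any difference between LLM and BO is a backend difference, not a harness difference.

Priority responsibilities:

  • harness_priority belongs to the harness/policy and is what the validator uses to decide whether a candidate is high-priority.
  • backend_score belongs to the planner backend and is used to rank eligible candidates.
  • BO/bandit may learn backend_score but must not change harness_priority, coverage, or legality.
  • If learned results need to influence the stop threshold, they must enter PolicyVersion explicitly, and the candidate set snapshot must be regenerated.

Limits on LLM candidate combination:

  • by default, the LLM is not allowed to combine multiple candidates;
  • if combination is enabled, the combination must produce a new CandidateAction;
  • the combined patch must go through the legality validator again;
  • the combined coverage effect cannot simply be the union; it must be generated by a joint_adjust operator in the grammar;
  • a combination with no joint_adjust operator must not be executed.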

Required implementation interfaces

The upcoming code refactor must land these interfaces first, then migrate the existing heuristics.

class InterventionGrammar:
    @classmethod
    def load(cls, path: str) -> "InterventionGrammar": ...
    def validate(self, capability: "EngineCapability") -> None: ...


class OperatorRegistry:
    def generate(
        self,
        state: "HarnessState",
        grammar: "InterventionGrammar",
        policy: "HarnessPolicy",
    ) -> "CandidateSet": ...


class InterventionOperator:
    def apply(
        self,
        axis: "AxisSpec",
        state: "HarnessState",
        policy: "HarnessPolicy",
    ) -> "CandidateAction | BlockedCandidate": ...


class CoverageState:
    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...


class CoverageValidator:
    def validate_stop(
        self,
        state: "HarnessState",
        candidate_set: "CandidateSet",
        coverage: "CoverageState",
        policy: "HarnessPolicy",
    ) -> "StopReport": ...


class PlannerBackend:
    def select(
        self,
        candidate_set: "CandidateSet",
        state: "HarnessState",
    ) -> "CandidateAction | StopRequest": ...

Required schemas:

| Schema | Required fields |
| --- | --- |
| CandidateSignature | exact signature, normalized full config signature, partial patch signature, hash version |
| MeasuredVerdict | candidate id, trial id, status, objective delta, confirm/reject/inconclusive/failure/regression flags, raw metric refs |
| CoverageDelta | added units, invalidated regions, incumbent update, non-informative flag |
| BlockedCandidate | candidate skeleton, blocked reason, blocking predicate, retry condition |
| StopReport | candidate set hash, grammar/policy/capability versions, remaining/covered/blocked candidates, proof obligation |

These schemas are part of the paper artifact. Without them, the experimental results cannot show that the harness is not a case-specific heuristic.

Anti-overfitting constraints

Forbidden

The following must not appear in the harness core, grammar, or operators:

  • model names, for example Qwen30B, Qwen27B;
  • host names, for example dash0;
  • case id, run id, trace window id;
  • known winning configs, for example "TP=2+gmu=0.97";
  • fixed SLO values used as control logic, for example "do X when TPOT=50ms";
  • special-case branches added in response to experimental results, for example "if current_tp > 2, force a downshift";
  • an action id or score boost aimed at a single regression test.

Allowed

The following may exist, but their source and scope must be declared:

  • safe ranges in engine capability, for example the gpu-memory-utilization nominal floor;
  • the legal lattice in topology constraints, for example TP/DP/EP values;
  • measurement-cost / risk priors in policy config;
  • the length/cache/arrival regime in the workload profile;
  • the SLO rule itself;
  • measured trial evidence.

Priority/threshold provenance

Every priority term, threshold, and risk/cost prior must record:

name
default value
source: engine docs / prior experiments / conservative default / calibrated profile
scope: global / engine-family / engine-version / hardware-class
sensitivity test
last updated commit

Adjusting a threshold for a single scenario or a single run is forbidden. Changing a threshold requires running a sensitivity test and held-out scenario tests.

Boundary between rules and grammar

Not every conditional is forbidden. What is forbidden is the testcase-specific branch. What is allowed is the generic predicate:

| Forbidden | Allowed |
| --- | --- |
| if model == Qwen30B | if family == topology and axis.type == ordered_lattice |
| if TP == 8 then TP=4 | bracket(ordered_axis) |
| if gmu == 0.5 then 0.9 | jump_to_floor(bounded_axis) |
| if TTFT4s/TPOT25 case | if bottleneck hypothesis targets TTFT/TPOT evidence |
| score = 0.74 for this bad-start fix | priority = relief_prior + information_gain - risk - cost from policy terms |
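The last row of the table, combined with the provenance requirement, might look like the following sketch. `PolicyTerm` and `harness_priority` are hypothetical, the record is cut down to value/source/scope, and the numeric values are placeholders rather than calibrated priors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTerm:
    """Hypothetical policy term; a real record would also carry version data."""
    value: float
    source: str  # e.g. "conservative default"
    scope: str   # e.g. "global"


def harness_priority(terms: dict) -> float:
    """priority = relief_prior + information_gain - risk - cost."""
    for name in ("relief_prior", "information_gain", "risk", "cost"):
        term = terms[name]
        # Refuse to score with a term whose provenance is missing.
        if not term.source or not term.scope:
            raise ValueError(f"policy term {name!r} lacks provenance")
    return (
        terms["relief_prior"].value
        + terms["information_gain"].value
        - terms["risk"].value
        - terms["cost"].value
    )


terms = {
    "relief_prior": PolicyTerm(0.5, "conservative default", "global"),
    "information_gain": PolicyTerm(0.25, "conservative default", "global"),
    "risk": PolicyTerm(0.125, "conservative default", "global"),
    "cost": PolicyTerm(0.125, "conservative default", "global"),
}
```

Failing loudly on a term without source or scope is one way to keep an unexplained constant from entering the score.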

可证明性质

P1. Legality soundness

For any generated candidate:

candidate.patch satisfies:
  tunable_envs/tunable_flags
  topology constraints
  engine capability safe constraints
  no forbidden knob

How to test:

  • grammar-level property tests;
  • fuzzing over random StudySpec / topology constraints;
  • all candidate patches pass checks equivalent to validate_proposal.

P2. No-repeat

For any candidate:

signature(candidate.patch) not in tested_signatures

If the exact signature has already been tested, the candidate must be filtered or marked already-covered.

P3. Coverage monotonicity

After every measured trial:

coverage_{t+1} >= coverage_t

Here >= means that covered units, invalidated regions, tested signatures, and incumbent evidence do not decrease.
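Read as a set relation, the monotonicity property is a per-component subset check. The sketch below assumes coverage is stored as a dict of sets keyed by component name; that representation and the `is_monotone` name are illustrative, not taken from the codebase.

```python
def is_monotone(before: dict, after: dict) -> bool:
    """True if every coverage component in `before` is a subset of `after`."""
    # For sets, `<=` is the subset test; a missing component counts as empty.
    return all(units <= after.get(name, frozenset()) for name, units in before.items())


before = {"covered_units": {"neighbor:lower"}, "tested_signatures": {"s1"}}
grown = {"covered_units": {"neighbor:lower", "trust_region:gmu"}, "tested_signatures": {"s1", "s2"}}
shrunk = {"covered_units": set(), "tested_signatures": {"s1", "s2"}}
```

A property test could call this check after every simulated trial and fail on the first coverage component that loses an element.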

P4. Coverage-relative stop soundness

If the validator outputs stop:

for all candidate in persisted CandidateSet(grammar, state, policy):
  candidate.harness_priority < threshold
  or candidate.coverage_unit already covered
  or candidate.region invalidated
  or candidate violates constraints

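This is soundness relative to the grammar, not global optimality.

Additional preconditions:

  • the CandidateSet must be a complete enumeration;
  • the CandidateSet snapshot must be persisted;
  • the StopReport must include candidate_set_hash;
  • required coverage cannot be satisfied by signature_tested alone.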

P5. Backend independence

Given the same HarnessState + Grammar + Policy:

CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.

Different backends may only change backend_score, candidate ranking, or tie-breaks.

P6. Auditability

Every executed trial must be traceable through:

proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta

If any link is missing, the trial cannot be used as paper evidence.

Properties that cannot be proven

The paper must explicitly avoid claiming:

  • global optimum;
  • arbitrary bad-start completeness;
  • arbitrary workload/SLO generalization;
  • engine-independent optimality;
  • bottleneck classifier correctness;
  • stop implies no better real config.

It can say:

Within the declared intervention grammar and measurement budget, AITuner provides
coverage-relative experimental control and empirically improves convergence on the
reported workloads.

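Acceptance criteria

This section is the strictest gate. Until these criteria are met, the new harness should not be treated as a paper contribution.

A. Design acceptance

  1. The harness core must not gain new testcase-specific branches.
  2. All knob behavior must come from grammar/operator/policy config.
  3. Magic constants must be centralized in HarnessPolicy or engine capability, with their source documented.
  4. The candidate generator must output CandidateAction, including hypothesis, operator, confirm/reject condition, and coverage effect.
  5. The stop validator must output a coverage-relative proof obligation.
  6. The planner backend must not bypass the candidate set or the stop validator.
  7. The CandidateSet must be fully enumerated, persisted as a snapshot, and referenced by the stop report.
  8. harness_priority and backend_score must be kept separate.
  9. CoverageUnit must be structured, and required coverage cannot rely on the exact signature alone.
  10. Failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions.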

B. Code acceptance

  1. harness.py stops growing; the new implementation is split into:
    • harness_state.py
    • intervention_grammar.py
    • operators.py
    • coverage.py
    • planner_backend.py
    • harness_compat.py or equivalent adapter.
  2. CandidateAction, CoverageState, CoverageDelta, and HarnessPolicy use dataclasses or typed schemas.
  3. The logic in the current _topology_candidate_actions() and _runtime_candidate_actions() must be migrated to the operator registry.
  4. Every candidate patch must pass the legality validator.
  5. Every executed harness proposal must persist its candidate id and coverage delta.
  6. The current regression tests may be kept as scenario tests, but they must not be fixed by adding hardcoded branches.
  7. StopReport must include candidate_set_hash, grammar/policy/capability versions, and remaining/covered/blocked candidates.
  8. An inconclusive measurement must not erroneously produce hypothesis_refuted or region_invalidated.

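C. Static anti-overfitting acceptance

CI or tests must check:

  1. The harness core contains no model name / run id / host name / known winning config strings.
  2. The harness core contains no case-specific action id.
  3. No score boost targets a single scenario test.
  4. Grammar files reference only family/operator/axis, never a testcase.
  5. When a new scenario failure appears, the PR must state whether it extends the grammar/operator or adjusts policy config; adding a testcase branch directly is not allowed.
  6. Policy/config/test fixtures must also be scanned for known winning configs and scenario-specific score boosts.
  7. Adding or changing a policy threshold must come with a sensitivity test.
  8. After adding an operator/policy, held-out scenarios must be run; passing only on the scenario that triggered the change is not enough.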
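Checks 1 and 6 could start from a line-based pattern scan like the sketch below. The `scan_for_forbidden` helper and the two regexes are assumptions; a real CI job would load its pattern list from a reviewed config file and would likely need an allowlist for documentation and scenario tests.

```python
import re


def scan_for_forbidden(text: str, patterns: list) -> list:
    """Return (line_number, pattern) pairs for every forbidden match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in patterns:
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, pattern))
    return hits


# Illustrative patterns only: a model-name shape and a host-name shape.
patterns = [r"qwen\d+b", r"\bdash\d+\b"]
clean = "if axis.type == 'ordered_lattice':\n    return bracket(axis)\n"
dirty = "if model == 'Qwen30B':\n    tp = 4\n"
```

A string scan only catches literal leaks; it does not detect a case-specific constant hidden behind a neutral name, so it complements the held-out scenario runs rather than replacing them.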

D. Unit test acceptance

New grammar-level tests must be added:

  1. Legal candidate generation under random topology constraints.
  2. No-repeat property.
  3. Coverage monotonicity.
  4. Stop iff no uncovered high-priority candidate.
  5. Failure memory invalidates regions, not just exact signatures.
  6. Backend independence: greedy vs mock-LLM ranker produce the same candidate set.
  7. Bad-start property tests:
    • for boundary starts on an ordered lattice, bracket generates the adjacent contrast candidate;
    • for below-floor starts on a bounded numeric axis, jump_to_floor generates the floor candidate;
    • tests must not hardcode TP=8 -> TP=4 or 0.5 -> 0.9 as the only condition; they should be parameterized.
  8. Candidate set determinism: the same input yields a byte-identical snapshot.
  9. Candidate set completeness on toy grammars: exhaustively show on a small lattice that the generator misses nothing.
  10. Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
  11. Inconclusive measurement: a timeout/noisy/partial run must not erroneously refute a hypothesis.
  12. Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
  13. Stop falsification: for a history that has already stopped, use a local grid/replay to check whether a high-priority candidate inside the grammar was missed.

E. Experiment acceptance

Paper evidence must include at least:

  1. Planner-agnostic ablation

    • raw greedy / BO vs grammar-guided greedy / BO;
    • weak LLM naive vs weak LLM + harness;
    • strong LLM naive vs strong LLM + harness.
  2. Mechanism ablation

    • full grammar;
    • no attribution;
    • shuffled attribution;
    • no grammar, raw knobs;
    • no coverage validator;
    • no failure memory.
  3. Bad-start robustness

    • multiple random/adversarial starts;
    • not only TP=8,gmu=0.5,max-num-seqs=8;
    • report the convergence distribution, not a single successful path.
  4. Near-optimum check

    • at least one case with a local grid/expert comparison;
    • report the gap between the AITuner best and the local grid best.
  5. Cross-regime check

    • at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.

F. Paper claim acceptance

Before the experiments above pass, the paper may only say:

The current implementation demonstrates the feasibility of a deterministic,
mechanism-guided proposal loop on selected cases.

After design and code acceptance pass, it may say:

AITuner implements a planner-agnostic intervention grammar with coverage-relative
stop authority.

Only after experiment acceptance passes may it say:

The harness improves fixed-budget tuning and convergence robustness across the
reported regimes, and the gain is not attributable solely to a stronger LLM or
case-specific prompt engineering.

It still must not say:

The harness is complete over all configs.
The harness guarantees global optimum.
The harness is robust to arbitrary workloads/engines.

Migration plan

Phase 0: Freeze rule growth

  • Stop adding new testcase-specific rules to harness.py.
  • File every new bad case first as a missing grammar/operator issue.
  • Mark the current rule-based harness as a prototype in the roadmap.

Phase 1: Typed state and candidate schema

  • Extract TrialProfile, BottleneckHypothesis, CandidateAction, CoverageState, and HarnessPolicy.
  • Also extract AxisSpec, OperatorSpec, CandidateSet, CoverageUnit, CoverageDelta, MeasuredVerdict, BlockedCandidate, and StopReport.
  • Keep existing behavior compatible; change only the structure.
  • Golden tests use the current outputs to prevent accidental regression.

Phase 2: Operator registry

  • Migrate the topology, kv_memory, admission, and batching families first.
  • Each operator is responsible for generating its candidate and coverage effect.
  • _topology_candidate_actions() and _runtime_candidate_actions() become compat wrappers.

Phase 3: Coverage validator

  • Implement coverage accounting.
  • StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
  • Add coverage-relative stop tests.

Phase 4: Planner backend split

  • The greedy planner is the default backend.
  • The LLM backend may only rank/select candidates.
  • BO/bandit backends can be added later.

Phase 5: Experiment rerun

  • rerun no-LLM Qwen30B;
  • rerun the 2x2;
  • run the bad-start distribution;
  • run the mechanism ablation;
  • run the local grid/expert comparison.

The most important criterion

If a new failure can only be fixed like this:

add an if-else in the planner that targets a specific config/workload

then it is not a harness contribution.

If a new failure is fixed like this:

add or correct a generic operator / coverage predicate / engine capability declaration,
and the change improves a whole class of parameterized tests

only then can it be a harness contribution.

This criterion should serve as the gate for all subsequent PRs and for the interpretation of experiments.