Declarative Intervention Harness Design - 2026-06-26
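This document redefines the AITuner harness. The goal is to keep the harness from becoming a pile of testcase-oriented if/else tuning rules, and to state clearly which properties can be proven, which claims must be validated by experiment, and which claims must not be made.
The conclusion up front:
The harness's contribution is not "expert rules know how to tune vLLM."
The harness's contribution should be:
Turning raw knob optimization into declarative, coverage-aware experimental design.
In other words, the harness should not hardcode the next answer; it should define:
- which state is observable;
- which interventions are legal;
- which system hypothesis each intervention tests;
- which regions have already been measured, refuted, or covered;
- when the planner may continue selecting experiments;
- when a stop is coverage-relative sound.
The planner can be an LLM, BO, a bandit, a greedy heuristic, or oracle replay. The harness is the substrate shared by these planners, not a prompt trick for one particular planner.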
Independent review verdict
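The first draft of this document was independently reviewed by a fresh subagent. The verdict:
accept with major revisions
The core warning of the review: moving rules out of Python if/else and into a grammar, operator, policy, or priority may still amount to a "rule-based harness in a new skin." This design must therefore additionally satisfy:
- the candidate set must be fully enumerated and persisted as a snapshot;
- every priority/threshold must have provenance, a version, and a sensitivity test;
- coverage units must be structured and cannot rely on signature_tested alone;
- failure invalidation must have a conservative region predicate, boundaries, and retry/unblock conditions;
- a stop proof must reference the candidate set snapshot;
- grammar/policy/capability must also pass anti-overfitting static checks;
- LLM/BO may only select candidates and must not covertly change legality, coverage, or stop authority.
The design below incorporates these major revisions as hard requirements.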
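Current problems
The current src/aituner/harness.py already has some of the right abstract vocabulary: observation, bottleneck hypotheses, candidate actions, validator stop. But the implementation is still heavily rule-based:
- observation, bottleneck attribution, candidate generation, scoring, and the stop validator all live in a single Python file;
- knob-family behavior is decided by Python branches and fixed thresholds;
- scoring uses hand-picked constants;
- the stop validator is a guard cascade, not an explicit coverage proof;
- each newly discovered bad case invites yet another branch, for example a bad high-TP start or a low-gmu start.
Such an implementation is acceptable as an engineering prototype but cannot be a systems contribution. At most it proves:
We wrote a set of heuristics that work for some cases we have already seen.
It cannot prove that:
- the harness is general;
- the harness is robust to arbitrary bad starts;
- no untested better config exists at stop time;
- the candidate generator covers all important interventions;
- AITuner's gains do not come from testcase-specific rule accumulation.
Going forward, we therefore cannot keep advancing by "patching a rule for each failure mode."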
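Design goals
The new harness must meet five goals.
- Declarative: system knowledge must live in the grammar, operators, coverage predicates, and policy config, not be scattered across Python if/else branches.
- Planner-agnostic: the harness owns legality, candidate construction, coverage accounting, and stop authority; the planner only ranks or selects over the legal candidate set.
- Coverage-aware: a stop is not "no rule fired." A stop must have explicit coverage semantics relative to the current grammar and operator set.
- Anti-overfitting: the design must not contain model names, case ids, known winning configs, fixed workload ids, or constants derived from experimental results. A case may enter the system only through StudySpec, engine capability, workload profile, and trial measurements.
- Auditable: every proposal must be able to answer:
  - Which hypothesis does this trial test?
  - Which intervention dimension does it change?
  - What is its confirm/reject condition?
  - How does coverage change after the trial?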
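Non-goals
We do not pursue, and should not claim, the following:
- global optimum;
- completeness over an arbitrary config space;
- robustness to arbitrary serving engines/workloads/SLOs;
- a bottleneck classifier that is always correct;
- that no better config exists in the real world at stop time.
What we pursue is a stricter but provable relative property:
Under the declared grammar, operator set, engine constraints, and measurement budget, the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.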
New architecture
StudySpec + EngineCapability + WorkloadProfile + TrialHistory
        |
        v
  HarnessState
        |
        v
InterventionGrammar
        |
        v
OperatorRegistry -----> CandidateSet
        |                    |
        v                    v
 CoverageState <------ MeasuredVerdict
        |
        v
CoverageValidator
        |
        v
 PlannerBackend
        |
        v
Proposal or Stop
Versioned objects:
| Object | Must be versioned | Appears in which outputs |
|---|---|---|
| GrammarVersion | yes | candidate id, candidate set snapshot, coverage delta, stop report |
| PolicyVersion | yes | candidate priority, blocked reason, stop threshold, stop report |
| EngineCapabilityVersion | yes | axis domain, safe range, candidate patch, failure invalidation |
| PlannerBackendVersion | yes | selected candidate, ranking trace |
A candidate or stop without version numbers cannot be used as paper evidence.
HarnessState
HarnessState is the only state entry point the planner sees. It consists of structured data:
| Field | Source | Purpose |
|---|---|---|
| study | StudySpec | tunable schema, topology constraints, SLO, search budget |
| engine_capability | static profile / engine adapter | safe ranges, knob lattice, unsupported combinations |
| workload_profile | trace summary / L-C-A profile | length/cache/arrival regime |
| trial_profiles | measured results | latency, SLO failures, throughput, launch failures |
| incumbent | state best | current best config and measured objective |
| tested_signatures | trial history | no-repeat and coverage accounting |
| failure_memory | launch/runtime failures | invalidate region or operator dimensions |
Important constraints:
- HarnessState contains no natural-language, prompt-only state;
- all fields must be serializable, snapshot-able, and replayable;
- the same HarnessState + Grammar + Policy must produce the same candidate set.
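The serializability and determinism constraints above suggest hashing a canonical serialization of the state. The following is a minimal sketch under stated assumptions: `ToyHarnessState` and `state_hash` are hypothetical names, and the three fields are a heavily reduced stand-in for the real HarnessState.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyHarnessState:
    """Hypothetical, reduced stand-in for HarnessState (illustration only)."""
    incumbent: dict
    tested_signatures: tuple = ()
    failure_memory: tuple = ()


def state_hash(state: ToyHarnessState) -> str:
    # Canonical JSON: sorted keys and fixed separators, so that dict
    # insertion order cannot leak into the hash.
    payload = json.dumps(asdict(state), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two states that differ only in dict key order hash identically.
a = ToyHarnessState(incumbent={"tp": 4, "dp": 1}, tested_signatures=("tp=4,dp=1",))
b = ToyHarnessState(incumbent={"dp": 1, "tp": 4}, tested_signatures=("tp=4,dp=1",))
```

A hash of this kind could serve as the state_hash recorded in a candidate set snapshot, provided every field is itself JSON-serializable.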
InterventionGrammar
The grammar describes the intervention space of serving tuning. It does not select candidates; it only declares:
- family;
- axes;
- legal value source;
- generic operators;
- expected effect schema;
- risk schema;
- coverage dimensions;
- required evidence.
Suggested initial grammar:
| Family | Axes | Operators | Coverage dimension |
|---|---|---|---|
| topology | TP, DP, EP, EP enable | step_up, step_down, redistribute, bracket | topology lattice neighborhood |
| kv_memory | gpu-memory-utilization | jump_to_floor, local_climb, backoff_after_failure | runtime trust region on incumbent topology |
| admission | max-num-seqs | raise, lower, bracket | sequence concurrency region |
| batching | max-num-batched-tokens, chunked prefill | raise, lower, joint_adjust | prefill/decode batching region |
| allocator | block size / cache knobs | switch_categorical, bracket | memory layout region |
| failure | failed signatures/regions | invalidate_region, avoid_region | failure memory coverage |
Grammar example:
family: kv_memory
axes:
  gpu-memory-utilization:
    type: bounded_float
    value_source: engine_capability
    nominal_floor_key: gmu_nominal_floor
    safe_ceiling_key: gmu_safe_ceiling
operators:
  - name: jump_to_floor
    preconditions:
      - axis_below_nominal_floor
      - incumbent_topology_preserved
    coverage_effect: covers_runtime_floor
  - name: local_climb
    preconditions:
      - axis_at_or_above_nominal_floor
      - no_failed_higher_target
    coverage_effect: extends_runtime_trust_region
Note that 0.5 -> 0.9 is not written here. The 0.9 comes from engine capability/policy config, and jump_to_floor is a generic operator on a bounded numeric axis.
Topology example:
family: topology
axes:
  tensor-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
  data-parallel-size:
    type: ordered_lattice
    value_source: topology_constraints
operators:
  - name: step_up
    coverage_effect: covers_upper_neighbor
  - name: step_down
    coverage_effect: covers_lower_neighbor
  - name: bracket
    preconditions:
      - anchor_at_boundary_or_unknown_region
      - neighbor_uncovered
    coverage_effect: brackets_ordered_axis
  - name: redistribute
    coverage_effect: covers_same_product_tp_dp_neighbor
Note that TP=8 -> TP=4 is not written here. It is a bracket on an ordered lattice.
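To make the "generic operator" point concrete, here is a minimal sketch of bracket and jump_to_floor written against axis types rather than concrete values. The class names `OrderedAxis` and `BoundedAxis`, the function signatures, and the lower-neighbor-first tie-break are assumptions for illustration, not the harness API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrderedAxis:
    axis_id: str
    values: tuple  # legal values in ascending order, e.g. from topology constraints


@dataclass(frozen=True)
class BoundedAxis:
    axis_id: str
    nominal_floor: float  # supplied by engine capability, never hardcoded here
    safe_ceiling: float


def bracket(axis: OrderedAxis, current, covered: frozenset):
    """Return an adjacent, not-yet-covered contrast point, or None.

    Assumption: when both neighbors exist, the lower one is tried first.
    """
    i = axis.values.index(current)
    neighbors = []
    if i > 0:
        neighbors.append(axis.values[i - 1])
    if i + 1 < len(axis.values):
        neighbors.append(axis.values[i + 1])
    for value in neighbors:
        if value not in covered:
            return value
    return None


def jump_to_floor(axis: BoundedAxis, current: float) -> Optional[float]:
    """Propose the nominal floor only when the axis sits below it."""
    if current < axis.nominal_floor:
        return axis.nominal_floor
    return None
```

Neither function mentions a specific TP or memory value; the concrete numbers enter only through the axis instances, which is what lets the same operator be exercised by parameterized tests.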
AxisSpec and OperatorSpec
The grammar must be expressed as a typed schema, not free text.
AxisSpec:
| Field | Meaning |
|---|---|
| axis_id | stable id, e.g. topology.tp |
| knob_keys | the knobs affected when this axis is lowered to a config patch |
| type | ordered_lattice, bounded_float, bounded_int, categorical, coupled |
| domain_source | topology_constraints, engine_capability, policy_config, study_spec |
| domain | legal values or range; must be enumerable or discretizable |
| order | total/partial order for an ordered axis |
| coupling | legality constraints with other axes |
| safe_region | floor/ceiling/default/unsupported regions, with provenance |
OperatorSpec:
| Field | Meaning |
|---|---|
| operator_id | stable id, e.g. topology.bracket |
| family | topology / kv_memory / admission / batching / allocator / failure |
| input_axes | the axes the operator acts on |
| preconditions | generic predicates; must reference state/axis/evidence, never a testcase |
| patch_fn | generates a candidate patch from the state and axis value |
| coverage_effects | the structured CoverageUnit(s) produced |
| required_evidence | which evidence is needed before the operator can be enabled |
| risk_model | source and version of the risk/cost terms |
| confirm_reject_schema | how to update coverage from the measured verdict |
Any new operator must first pass schema validation and toy-lattice completeness tests.
Generic Operators
An operator is a reusable action template. An operator knows nothing about model names, case ids, or known winners.
Base operators that must be supported:
| Operator | Input | Output | Purpose |
|---|---|---|---|
| step_up(axis) | ordered axis | adjacent higher value | test the upper frontier |
| step_down(axis) | ordered axis | adjacent lower value | test the lower frontier |
| bracket(axis) | ordered axis + boundary/trust-region state | adjacent contrast point | build a local bracket from a biased starting point |
| jump_to_floor(axis) | bounded axis + nominal floor | floor target | return from an illegal/inefficient range to the safe operating range |
| local_climb(axis) | bounded axis + step policy | next value | local hill climbing inside the trust region |
| backoff(axis) | failed higher target | lower safe value | handle launch/OOM/regression |
| redistribute(axis_a, axis_b) | coupled axes + product constraints | legal coupled patch | TP/DP/EP redistribution |
| joint_adjust(axes) | coupled runtime axes | legal joint patch | MBT/MNS interaction |
| preserve(dimensions) | incumbent | complete patch context | keep the topology/runtime anchor |
| block_if_seen(signature) | tested signatures | reject candidate | no-repeat |
| invalidate_region(failure) | failure profile | region predicate | avoid repeating failed regions |
Operators only construct candidates; they do not make the final acceptance decision. Acceptance is decided by real measurement.
CandidateAction
Every candidate must be a structured object:
{
  "candidate_id": "topology.bracket/tp:8->4/dp:1",
  "family": "topology",
  "operator": "bracket",
  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
  "targets": ["topology_efficiency", "ttft_prefill"],
  "expected_metric_movement": {
    "request_rate_per_gpu": "increase_or_same",
    "ttft_p95": "may_increase",
    "gpu_count": "decrease"
  },
  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
  "risk": {
    "launch": "low",
    "regression": "medium",
    "measurement_cost": "one_trial"
  }
}
This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.
CandidateSet
CandidateSet is the complete enumeration produced by the grammar/operators in the current state, not just the first few candidates the planner sees.
CandidateSet = {
  grammar_version,
  policy_version,
  engine_capability_version,
  state_hash,
  candidates[],
  blocked_candidates[],
  candidate_set_hash
}
Requirements:
- OperatorRegistry.generate(state, grammar, policy) must enumerate all eligible candidates.
- Every filtered candidate must go into blocked_candidates with a blocked_reason.
- A stop proof must reference candidate_set_hash.
- The same HarnessState + Grammar + Policy + EngineCapability must generate a byte-identical CandidateSet.
- For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.
This is the key to keeping a "coverage-relative stop" from degenerating into "the current generator happened not to generate a candidate."
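The enumeration and blocking requirements can be illustrated on a one-axis toy lattice. This is a sketch only: `generate_candidate_set`, the `toy.step` id scheme, and the two blocked reasons are hypothetical, and a real registry would iterate over every operator and axis in the grammar.

```python
import hashlib
import json


def generate_candidate_set(values, current, tested, invalidated):
    """Enumerate every adjacent value on a toy ordered axis.

    Nothing is silently dropped: a neighbor is either a candidate or a
    blocked candidate with an explicit reason.
    """
    i = values.index(current)
    neighbor_idx = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
    candidates, blocked = [], []
    for j in neighbor_idx:
        cid = f"toy.step/{current}->{values[j]}"
        if values[j] in tested:
            blocked.append({"candidate_id": cid, "blocked_reason": "exact_repeat"})
        elif values[j] in invalidated:
            blocked.append({"candidate_id": cid, "blocked_reason": "region_invalidated"})
        else:
            candidates.append({"candidate_id": cid, "value": values[j]})
    body = {"candidates": candidates, "blocked_candidates": blocked}
    # Hash a canonical serialization so identical inputs give identical snapshots.
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()
    return {**body, "candidate_set_hash": digest}
```

On a lattice this small, a completeness test can simply assert that candidates plus blocked candidates account for every neighbor of the anchor.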
CoverageState
Coverage is the core of the new design. It records "which parts of the mechanism space have been measured or refuted," not just the best config.
Coverage units:
| Unit | Meaning |
|---|---|
| signature_tested | the exact config patch has been tested |
| operator_covered | an operator has been validated at least once on a given family/neighborhood |
| neighbor_covered | the adjacent upper/lower point on an ordered lattice has been tested |
| trust_region_covered | the local region of a runtime family on the incumbent topology has been tested |
| hypothesis_refuted | the candidate's reject condition was met |
| region_invalidated | failure memory rules out a region rather than a single point |
| incumbent_validated | the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions |
A CoverageUnit cannot be a free-form string; it must be a structured key:
{
  "unit_type": "neighbor_covered",
  "family": "topology",
  "axis": "topology.tp",
  "operator": "bracket",
  "anchor_region": {"tp": 8, "dp": 1},
  "candidate_region": {"tp": 4, "dp": 1},
  "bottleneck": "ttft_prefill",
  "grammar_version": "..."
}
A stop cannot rely on signature_tested alone. Required coverage must include at least one non-trivial mechanism coverage unit, such as neighbor_covered, trust_region_covered, region_invalidated, or incumbent_validated.
Coverage update:
trial_result -> MeasuredVerdict -> CoverageDelta
Every measured trial must produce at least one CoverageDelta:
- add signature_tested;
- add family/operator coverage;
- invalidate failure region;
- update incumbent;
- refute/confirm hypothesis.
If a trial only burns GPU time without changing any coverage or the incumbent, it should be marked wasted_by_design, which directly violates the acceptance criteria.
More strictly:
signature_tested alone is not enough.
A trial that only adds an exact signature, without confirming or rejecting a hypothesis, covering any operator/neighborhood/trust region, updating the incumbent, or producing a conservative failure invalidation, must be marked non_informative_trial.
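The non_informative_trial rule can be written as a predicate over a coverage delta. The sketch below assumes a simplified `CoverageDelta` whose field names are hypothetical and whose added units are reduced to their unit_type strings for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageDelta:
    """Simplified, hypothetical delta; the real schema carries structured units."""
    added_units: list = field(default_factory=list)          # unit_type strings
    invalidated_regions: list = field(default_factory=list)
    incumbent_updated: bool = False
    hypothesis_resolved: bool = False                        # confirmed or refuted

    @property
    def non_informative(self) -> bool:
        # signature_tested alone does not count as mechanism coverage.
        mechanism_units = [u for u in self.added_units if u != "signature_tested"]
        return not (
            mechanism_units
            or self.invalidated_regions
            or self.incumbent_updated
            or self.hypothesis_resolved
        )
```

A trial whose delta evaluates to non_informative would then be flagged rather than silently counted toward coverage.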
Failure invalidation
region_invalidated is the part most prone to overfitting or to wrongly ruling out good configs, so it must be defined conservatively.
A failure invalidation must output:
{
  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
  "source_failure": "engine_launch_oom",
  "covered_config_examples": [...],
  "excluded_config_examples": [...],
  "conservative_reason": "failure only observed at higher memory target on same topology",
  "retry_condition": "lower gmu or different topology is tested",
  "expires_after": "optional policy-defined TTL"
}
Extrapolating without bounds from a single-point failure to an entire family is forbidden. For example, one OOM at TP=8, gmu=0.99 cannot directly invalidate all TP=8 configs or all high-TP configs; at most it invalidates the conservative region with the same or higher memory pressure, unless later evidence enlarges that region.
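One conservative reading of the rule above is "same topology, memory target at or above the failed one." The sketch below encodes only that reading; `in_invalidated_region` and the flat `tp`/`dp`/`gmu` keys are assumptions, and the real region predicate may be broader or carry an expiry.

```python
def in_invalidated_region(config: dict, failure: dict) -> bool:
    """Conservative OOM region: identical topology and gmu >= the failed gmu.

    Assumption: memory pressure is approximated by gmu alone on a fixed topology.
    """
    return (
        config["tp"] == failure["tp"]
        and config["dp"] == failure["dp"]
        and config["gmu"] >= failure["gmu"]
    )


# The single observed failure point.
oom = {"tp": 8, "dp": 1, "gmu": 0.99}
```

Under this predicate a lower gmu on the same topology, or the same gmu on a different topology, stays eligible, which matches the retry condition in the example record.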
CoverageValidator
A stop must be relative to coverage, not relative to a Python guard cascade.
Stop condition:
stop_allowed iff
  legality soundness holds
  and candidate_set_snapshot is complete and persisted
  and no candidate with harness_priority >= threshold remains uncovered
  and incumbent has required validation coverage
  and measurement budget/search-high condition is satisfied or no eligible candidate remains
The undefined notion of a "useful candidate" is deliberately avoided here. Whether a candidate is eligible is determined by the following structured conditions:
eligible(candidate) iff
  candidate is legal
  and candidate is not exact-repeat
  and candidate.coverage_unit is not already covered
  and candidate.region is not invalidated
  and candidate.required_evidence is satisfied
  and candidate.harness_priority >= policy.min_candidate_priority
Required validation coverage must also be declared in structured form. For example:
required_validation:
  incumbent:
    - family: topology
      unit_type: neighbor_covered
      neighborhood: adjacent
    - family: kv_memory
      unit_type: trust_region_covered
      when_axis_tunable: gpu-memory-utilization
The validator output must include:
{
  "should_stop": true,
  "stop_kind": "coverage_relative_stop",
  "candidate_set_hash": "...",
  "grammar_version": "...",
  "policy_version": "...",
  "engine_capability_version": "...",
  "covered_units": [...],
  "remaining_candidates": [],
  "blocked_candidates": [
    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
  ],
  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
}
The proof here is a relative proof:
It does not prove that no better config exists in the real world;
it proves that no uncovered high-priority experiment remains under the current grammar/operators/policy.
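The eligibility predicate and the core of the stop check translate almost directly into code. This sketch makes several simplifying assumptions: candidates are plain dicts with precomputed boolean fields, coverage units are `(unit_type, axis)` tuples, and the budget and snapshot-persistence clauses of the stop condition are omitted. The function names are hypothetical.

```python
def eligible(c: dict, covered: set, min_priority: float) -> bool:
    """Mirror of the eligible(candidate) conditions, one clause per line."""
    return (
        c["legal"]
        and not c["exact_repeat"]
        and c["coverage_unit"] not in covered
        and not c["region_invalidated"]
        and c["evidence_satisfied"]
        and c["harness_priority"] >= min_priority
    )


def stop_report(candidates: list, covered: set, required: set, min_priority: float) -> dict:
    remaining = [c["candidate_id"] for c in candidates if eligible(c, covered, min_priority)]
    missing = sorted(required - covered)
    return {
        "should_stop": not remaining and not missing,
        "remaining_candidates": remaining,
        "missing_required_coverage": missing,
    }


unit = ("neighbor_covered", "topology.tp")
cand = {
    "candidate_id": "topology.bracket/tp",
    "legal": True,
    "exact_repeat": False,
    "coverage_unit": unit,
    "region_invalidated": False,
    "evidence_satisfied": True,
    "harness_priority": 0.6,
}
before = stop_report([cand], covered=set(), required={unit}, min_priority=0.5)
after = stop_report([cand], covered={unit}, required={unit}, min_priority=0.5)
```

Here `before` reports one remaining candidate and refuses to stop, while `after` permits the stop because the only candidate's unit is covered and the required coverage is present.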
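Planner backend separation
The harness produces the CandidateSet; the planner backend only selects.
CandidateSet = Grammar(HarnessState)
decision = PlannerBackend.rank_or_select(CandidateSet)
Planner backends may include:
| Backend | Role |
|---|---|
| deterministic greedy | selects the highest priority / score candidate |
| LLM ranker | reads the candidate set and evidence; may only select or combine legal candidates |
| BO/bandit | learns candidate priority (as backend_score) over the intervention graph |
| oracle replay | offline validation using existing grid/replay data |
Key constraints:
- the planner must not invent knobs outside the grammar;
- the planner must not bypass a validator stop;
- the planner may only change ranking/selection, not legality/coverage;
- under the same candidate set, differences between LLM and BO are backend differences, not harness differences.
Priority responsibility boundaries:
- harness_priority belongs to the harness/policy and is used by the validator to decide whether a candidate is high-priority.
- backend_score belongs to the planner backend and is used to rank eligible candidates.
- BO/bandit may learn backend_score but must not change harness_priority, coverage, or legality.
- If learned results need to influence the stop threshold, they must enter PolicyVersion explicitly, and the candidate set snapshot must be regenerated.
Restrictions on LLM-combined candidates:
- by default the LLM is not allowed to combine multiple candidates;
- if combination is enabled, the combination must produce a new CandidateAction;
- the combined patch must pass the legality validator again;
- the combined coverage effect cannot simply be the union; it must be generated by a joint_adjust operator in the grammar;
- a combination without a joint_adjust operator must not be executed.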
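The separation between selection and legality can be enforced mechanically by checking every backend choice against the candidate set the harness produced. In this sketch, `checked_selection` and `greedy_backend` are hypothetical helpers and candidates are plain dicts.

```python
def checked_selection(candidate_set: dict, backend_choice: str) -> dict:
    """Accept a backend's choice only if it names a harness-generated candidate."""
    by_id = {c["candidate_id"]: c for c in candidate_set["candidates"]}
    if backend_choice not in by_id:
        raise ValueError(f"backend selected unknown candidate: {backend_choice!r}")
    return by_id[backend_choice]


def greedy_backend(candidate_set: dict) -> str:
    # backend_score is the backend's own ranking signal; harness_priority is
    # read-only from the backend's point of view and is not consulted here.
    best = max(candidate_set["candidates"], key=lambda c: c["backend_score"])
    return best["candidate_id"]


toy_set = {
    "candidates": [
        {"candidate_id": "a", "harness_priority": 0.9, "backend_score": 0.1},
        {"candidate_id": "b", "harness_priority": 0.5, "backend_score": 0.7},
    ]
}
chosen = checked_selection(toy_set, greedy_backend(toy_set))
```

An LLM or BO backend would replace `greedy_backend` only; the check in `checked_selection` stays the same, so a backend that invents a candidate id is rejected before any trial runs.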
Required implementation interfaces
Subsequent refactoring must land these interfaces first, then migrate the existing heuristics.
class InterventionGrammar:
    @classmethod
    def load(cls, path: str) -> "InterventionGrammar": ...
    def validate(self, capability: "EngineCapability") -> None: ...

class OperatorRegistry:
    def generate(
        self,
        state: "HarnessState",
        grammar: "InterventionGrammar",
        policy: "HarnessPolicy",
    ) -> "CandidateSet": ...

class InterventionOperator:
    def apply(
        self,
        axis: "AxisSpec",
        state: "HarnessState",
        policy: "HarnessPolicy",
    ) -> "CandidateAction | BlockedCandidate": ...

class CoverageState:
    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...

class CoverageValidator:
    def validate_stop(
        self,
        state: "HarnessState",
        candidate_set: "CandidateSet",
        coverage: "CoverageState",
        policy: "HarnessPolicy",
    ) -> "StopReport": ...

class PlannerBackend:
    def select(
        self,
        candidate_set: "CandidateSet",
        state: "HarnessState",
    ) -> "CandidateAction | StopRequest": ...
Required schemas:
| Schema | Required fields |
|---|---|
| CandidateSignature | exact signature, normalized full config signature, partial patch signature, hash version |
| MeasuredVerdict | candidate id, trial id, status, objective delta, confirm/reject/inconclusive/failure/regression flags, raw metric refs |
| CoverageDelta | added units, invalidated regions, incumbent update, non-informative flag |
| BlockedCandidate | candidate skeleton, blocked reason, blocking predicate, retry condition |
| StopReport | candidate set hash, grammar/policy/capability versions, remaining/covered/blocked candidates, proof obligation |
These schemas are part of the paper artifact. Without them, the experimental results cannot show that the harness is not a case-specific heuristic.
Anti-overfitting constraints
Prohibited
The following must not appear in the harness core, grammar, or operators:
- model names, e.g. Qwen30B, Qwen27B;
- host names, e.g. dash0;
- case id, run id, trace window id;
- known winning configs, e.g. "TP=2+gmu=0.97";
- fixed SLO values used as control logic, e.g. "do X when TPOT=50ms";
- special-case branches added in response to experimental results, e.g. "if current_tp > 2, force a downshift";
- action ids or score boosts targeting a single regression test.
Allowed
The following may exist, but must declare their source and scope:
- safe ranges in engine capability, e.g. the gpu-memory-utilization nominal floor;
- legal lattices in topology constraints, e.g. TP/DP/EP values;
- measurement-cost / risk priors in policy config;
- length/cache/arrival regime in the workload profile;
- the SLO rule itself;
- measured trial evidence.
Priority/threshold provenance:
Every priority term, threshold, and risk/cost prior must record:
name
default value
source: engine docs / prior experiments / conservative default / calibrated profile
scope: global / engine-family / engine-version / hardware-class
sensitivity test
last updated commit
Adjusting a threshold for a single scenario or a single run is forbidden. Changing a threshold requires running a sensitivity test and held-out scenario tests.
Boundary between rule and grammar
Not every conditional is forbidden. What is forbidden is a testcase-specific branch. What is allowed is a generic predicate:
| Prohibited | Allowed |
|---|---|
| if model == Qwen30B | if family == topology and axis.type == ordered_lattice |
| if TP == 8 then TP=4 | bracket(ordered_axis) |
| if gmu == 0.5 then 0.9 | jump_to_floor(bounded_axis) |
| if TTFT4s/TPOT25 case | if bottleneck hypothesis targets TTFT/TPOT evidence |
| score = 0.74 for this bad-start fix | priority = relief_prior + information_gain - risk - cost from policy terms |
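The last table row, together with the provenance requirement, can be sketched as a priority computed only from named policy terms that each carry a source and scope. `PolicyTerm` and `harness_priority` below are hypothetical names, and the numeric values are placeholders chosen for the example, not calibrated settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTerm:
    name: str
    value: float
    source: str  # e.g. "conservative default" or "calibrated profile"
    scope: str   # e.g. "global" or "engine-family"


def harness_priority(terms: dict) -> float:
    """priority = relief_prior + information_gain - risk - cost."""
    return (
        terms["relief_prior"].value
        + terms["information_gain"].value
        - terms["risk"].value
        - terms["cost"].value
    )


# Placeholder values for illustration only.
example_terms = {
    name: PolicyTerm(name, value, source="conservative default", scope="global")
    for name, value in [
        ("relief_prior", 0.5),
        ("information_gain", 0.25),
        ("risk", 0.125),
        ("cost", 0.125),
    ]
}
```

Because every number lives in a named term with provenance, a sensitivity test can perturb one term at a time and check whether the stop decision changes.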
Provable properties
P1. Legality soundness
For any generated candidate:
candidate.patch satisfies:
  tunable_envs/tunable_flags
  topology constraints
  engine capability safe constraints
  no forbidden knob
Testing:
- grammar-level property tests;
- random StudySpec / topology constraints fuzzing;
- all candidate patches pass checks equivalent to validate_proposal.
P2. No-repeat
For any eligible candidate:
signature(candidate.patch) not in tested_signatures
If an exact signature has already been tested, the candidate must be filtered or marked already-covered.
P3. Coverage monotonicity
After every measured trial:
coverage_{t+1} >= coverage_t
where >= means that covered units, invalidated regions, tested signatures, and incumbent evidence never decrease.
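If coverage is modeled as a dictionary of sets, the >= relation in P3 becomes a per-key subset check. This is a minimal sketch under that modeling assumption; `is_monotone` and the key names are hypothetical.

```python
def is_monotone(before: dict, after: dict) -> bool:
    """True if every tracked set in `before` is a subset of the same set in `after`."""
    return all(before[key] <= after.get(key, set()) for key in before)


cov_t = {
    "covered_units": {("neighbor_covered", "topology.tp")},
    "tested_signatures": {"tp=8,dp=1"},
}
cov_t1 = {
    "covered_units": {("neighbor_covered", "topology.tp"),
                      ("trust_region_covered", "kv_memory.gmu")},
    "tested_signatures": {"tp=8,dp=1", "tp=4,dp=1"},
}
```

A property test could apply this check after each simulated trial and fail as soon as any set shrinks.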
P4. Coverage-relative stop soundness
If the validator outputs stop:
for all candidate in persisted CandidateSet(grammar, state, policy):
  candidate.harness_priority < threshold
  or candidate.coverage_unit already covered
  or candidate.region invalidated
  or candidate violates constraints
This is soundness relative to the grammar, not global optimality.
Additional preconditions:
- the CandidateSet must be a complete enumeration;
- the CandidateSet snapshot must be persisted;
- the StopReport must include candidate_set_hash;
- required coverage cannot be satisfied by signature_tested alone.
P5. Backend independence
Given the same HarnessState + Grammar + Policy:
CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.
Different backends may only change backend_score, candidate ranking, or tie-breaks.
P6. Auditability
Every executed trial must be traceable through:
proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta
If any link is missing, the trial may not be used as paper evidence.
Properties that cannot be proven
The paper must explicitly avoid claiming:
- global optimum;
- arbitrary bad-start completeness;
- arbitrary workload/SLO generalization;
- engine-independent optimality;
- bottleneck classifier correctness;
- stop implies no better real config.
What can be said:
Within the declared intervention grammar and measurement budget, AITuner provides
coverage-relative experimental control and empirically improves convergence on the
reported workloads.
Acceptance criteria
This section is the strictest gate. Until these criteria are met, the new harness should not be treated as a paper contribution.
A. Design acceptance
- The harness core must not gain new testcase-specific branches.
- All knob behavior must come from grammar/operator/policy config.
- Magic constants must be centralized in HarnessPolicy or engine capability, with their source documented.
- The candidate generator must output CandidateAction, including hypothesis, operator, confirm/reject condition, and coverage effect.
- The stop validator must output a coverage-relative proof obligation.
- The planner backend cannot bypass the candidate set or the stop validator.
- The CandidateSet must be fully enumerated, persisted as a snapshot, and referenced by the stop report.
- harness_priority and backend_score must be separate.
- CoverageUnit must be structured, and required coverage cannot rely on exact signatures alone.
- Failure invalidation must carry a conservative region predicate, boundaries, and retry/unblock conditions.
B. Code acceptance
- harness.py stops growing; the new implementation is split into:
  - harness_state.py
  - intervention_grammar.py
  - operators.py
  - coverage.py
  - planner_backend.py
  - harness_compat.py or an equivalent adapter.
- CandidateAction, CoverageState, CoverageDelta, and HarnessPolicy use dataclasses or a typed schema.
- The logic of the current _topology_candidate_actions() and _runtime_candidate_actions() must be migrated to the operator registry.
- All candidate patches must pass the legality validator.
- Every executed harness proposal must persist its candidate id and coverage delta.
- Current regression tests may be kept as scenario tests, but must not be fixed by adding hardcoded branches.
- The StopReport must include candidate_set_hash, grammar/policy/capability versions, and remaining/covered/blocked candidates.
- An inconclusive measurement must not wrongly produce hypothesis_refuted or region_invalidated.
C. Static anti-overfitting acceptance
CI or tests must check:
- The harness core contains no model name / run id / host name / known winning config strings.
- The harness core contains no case-specific action id.
- No score boost may target a single scenario test.
- Grammar files may reference only family/operator/axis, never a testcase.
- When a new scenario failure appears, the PR must state whether it adds a grammar/operator or adjusts policy config; adding a testcase branch directly is not allowed.
- Policy/config/test fixtures must also be scanned for known winning configs and scenario-specific score boosts.
- Any new or modified policy threshold must come with a sensitivity test.
- After adding an operator/policy, held-out scenarios must be run; passing only on the scenario that triggered the change is not enough.
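The first of these static checks could be approximated by a pattern scan over harness source files. The sketch below is illustrative only: `scan_source` is a hypothetical helper, and the three patterns are assumptions derived from the prohibited examples in this document (a Qwen-style model name, the host name dash0, and a TP+gmu winning-config string); a real CI check would load its patterns from a reviewed list.

```python
import re

# Hypothetical deny-list built from the examples named in the prohibited items.
FORBIDDEN_PATTERNS = [
    r"Qwen\d+B",        # model names
    r"\bdash0\b",       # host names
    r"TP=\d+\+gmu=",    # known winning config strings
]


def scan_source(text: str, patterns=FORBIDDEN_PATTERNS) -> list:
    """Return (line_number, pattern) for every forbidden match in `text`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in patterns:
            if re.search(pattern, line):
                hits.append((lineno, pattern))
    return hits


bad_source = 'if model == "Qwen30B":\n    force_downshift()\n'
good_source = "if axis.type == 'ordered_lattice':\n    bracket(axis)\n"
```

A string scan like this cannot detect a testcase-specific branch that avoids the literal names, so it complements rather than replaces held-out scenario tests.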
D. Unit test acceptance
The following grammar-level tests must be added:
- Legal candidate generation under random topology constraints.
- No-repeat property.
- Coverage monotonicity.
- Stop iff no uncovered high-priority candidate.
- Failure memory invalidates regions, not just exact signatures.
- Backend independence: greedy and mock-LLM ranker see the same candidate set.
- Bad-start property tests:
  - for boundary starts on an ordered lattice, bracket generates an adjacent contrast candidate;
  - for below-floor starts on a bounded numeric axis, jump_to_floor generates a floor candidate;
  - tests must not use a concrete TP=8 -> TP=4 or 0.5 -> 0.9 as the only condition; they should be parameterized.
- Candidate set determinism: the same input produces a byte-identical snapshot.
- Candidate set completeness on toy grammars: exhaustively show on a small lattice that the generator misses nothing.
- Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
- Inconclusive measurement: a timeout/noisy/partial run must not wrongly refute a hypothesis.
- Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
- Stop falsification: for a history that has already stopped, use local grid/replay to check whether a high-priority candidate inside the grammar was missed.
E. Experimental acceptance
Paper evidence must include at least:
- Planner-agnostic ablation
  - raw greedy / BO vs grammar-guided greedy / BO;
  - weak LLM naive vs weak LLM + harness;
  - strong LLM naive vs strong LLM + harness.
- Mechanism ablation
  - full grammar;
  - no attribution;
  - shuffled attribution;
  - no grammar, raw knobs;
  - no coverage validator;
  - no failure memory.
- Bad-start robustness
  - multiple random/adversarial starts;
  - not just a run from TP=8,gmu=0.5,max-num-seqs=8;
  - report the convergence distribution rather than a single successful path.
- Near-optimum check
  - at least one case with a local grid/expert comparison;
  - report the gap between the AITuner best and the local grid best.
- Cross-regime check
  - at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.
F. Paper claim acceptance
Until the experiments above pass, the paper may only say:
The current implementation demonstrates the feasibility of a deterministic,
mechanism-guided proposal loop on selected cases.
After passing design and code acceptance, it may say:
AITuner implements a planner-agnostic intervention grammar with coverage-relative
stop authority.
Only after passing experimental acceptance may it say:
The harness improves fixed-budget tuning and convergence robustness across the
reported regimes, and the gain is not attributable solely to a stronger LLM or
case-specific prompt engineering.
It still cannot say:
The harness is complete over all configs.
The harness guarantees global optimum.
The harness is robust to arbitrary workloads/engines.
Migration plan
Phase 0: Freeze rule growth
- Stop adding new testcase-specific rules to harness.py.
- Write every new bad case up first as a missing grammar/operator issue.
- Mark the current rule-based harness as a prototype in the roadmap.
Phase 1: Typed state and candidate schema
- Extract TrialProfile, BottleneckHypothesis, CandidateAction, CoverageState, HarnessPolicy.
- Also extract AxisSpec, OperatorSpec, CandidateSet, CoverageUnit, CoverageDelta, MeasuredVerdict, BlockedCandidate, StopReport.
- Keep existing behavior compatible; change only the structure.
- Golden tests use current outputs to prevent accidental regression.
Phase 2: Operator registry
- Migrate the topology, kv_memory, admission, and batching families first.
- Each operator is responsible for generating candidates and coverage effects.
- The old _topology_candidate_actions() and _runtime_candidate_actions() become compat wrappers.
Phase 3: Coverage validator
- Implement coverage accounting.
- StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
- Add coverage-relative stop tests.
Phase 4: Planner backend split
- Greedy planner as the default backend.
- The LLM backend may only rank/select candidates.
- BO/bandit backends can be plugged in later.
Phase 5: Experiment rerun
- rerun no-LLM Qwen30B;
- rerun 2x2;
- run the bad-start distribution;
- run the mechanism ablation;
- run the local grid/expert comparison.
The most important criterion
If a new failure can only be fixed by:
adding an if-else in the planner that targets a specific config/workload
then it is not a harness contribution.
If a new failure is fixed by:
adding or correcting a generic operator / coverage predicate / engine capability declaration,
with the change improving a class of parameterized tests
only then might it be a harness contribution.
This criterion should serve as the gate for all subsequent PRs and experiment interpretation.