From 4075c7abf0f02537c55e20c68c90e44a65cf0414 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Fri, 26 Jun 2026 17:15:06 +0800
Subject: [PATCH] Design declarative intervention harness

---
 docs/aituner-roadmap.md                       |  32 +-
 ...ve-intervention-harness-design-20260626.md | 981 ++++++++++++++++++
 .../no-llm-harness-mechanism-20260625.md      |   8 +
 3 files changed, 1017 insertions(+), 4 deletions(-)
 create mode 100644 docs/harness-ablation/declarative-intervention-harness-design-20260626.md

diff --git a/docs/aituner-roadmap.md b/docs/aituner-roadmap.md
index 719feb0..1bc6123 100644
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -39,6 +39,10 @@ M: measurement/scoring protocol
    SLO-constrained feasible frontier, req/s/GPU, latency quantiles, pass-rate guard
 ```
 
+The current `src/aituner/harness.py` is a prototype: it already shows that the no-LLM loop and
+mechanism-guided proposals are feasible, but it still contains many rule-based heuristics and cannot serve as the final harness
+contribution. For the new target design, see [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md).
+
 The planner is replaceable:
 
 ```text
@@ -79,11 +83,28 @@ closed-form performance model of kernel, KV cache, communication, and queueing. The safer and also stronger
 | C5. AITuner finds a near-optimal region, not just one feasible config | Qwen30B shows an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or expert-config comparison |
 | C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
 | C7. An engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main experimental claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
-| C8. The harness can recover from bad starting points and does not depend only on a trusted base config | Planner rules and local regression tests added; real-hardware runs pending | [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | Run no-LLM real-hardware recovery runs from bad starts such as `TP=8, max-num-seqs=8, gmu=0.5` |
+| C8. The harness can recover from bad starting points and does not depend only on a trusted base config | The current rule-based fix counts only as a prototype signal, not as a final claim | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring into grammar/operator form, run a random/adversarial start distribution |
 
 ## Highest-priority experiments
 
-### P0a. Bad-start recovery confirmation
+### P0a. Declarative harness redesign gate
+
+Goal: stop adding testcase-specific rules to `harness.py` and refactor the harness into a
+declarative intervention grammar + coverage-relative validator.
+
+Minimum acceptance criteria:
+
+- CandidateSet is fully enumerated and its snapshot is persisted;
+- `harness_priority` is separated from backend ranking;
+- CoverageUnit is structured, and stop cannot rely on exact signatures alone;
+- Failure invalidation has a conservative region predicate and retry/unblock conditions;
+- grammar/policy/capability all carry versions and anti-overfitting static checks;
+- LLM/BO can only select legal candidates and cannot bypass the validator.
+
+Why this comes first: without this gate, extending the bad-start/SLO/2x2 experiments only validates a
+rule-based prototype.
+
+### P0b. Bad-start recovery confirmation after redesign
 
 Goal: determine whether the harness can only start from a trusted base config, or can also recover toward the right direction from a clearly unreasonable initial config.
 
@@ -96,6 +117,9 @@ closed-form performance model of kernel, KV cache, communication, and queueing. The safer and also stronger
 | bad-runtime | `TP=2, gmu=0.5, max-num-seqs=8` | Low runtime headroom jumps back to the nominal floor |
 | combined-bad | `TP=8, gmu=0.5, max-num-seqs=8` | Topology recovery and runtime recovery can be chained |
 
+Note: this does not mean running a single hand-picked bad case first. Once the declarative harness is in place, a random/adversarial
+start distribution must be run and the distributional results reported.
+
 Expected figure:
 
 - x-axis: trial index;
@@ -103,9 +127,9 @@ closed-form performance model of kernel, KV cache, communication, and queueing. The safer and also stronger
 - line groups: trusted-start vs bad-start cases;
 - annotation: proposal family sequence, e.g. `TP downshift`, `gmu floor jump`, `gmu climb`.
 
-Launch condition: first confirm the dash fleet has an idle 8xH20 machine; start only after user confirmation.
+Launch condition: complete P0a first; then confirm the dash fleet has an idle 8xH20 machine; start only after user confirmation.
 
-### P0b. Complete the Qwen235B decode 2x2 and compile the aggregate
+### P0c. Complete the Qwen235B decode 2x2 and compile the aggregate
 
 Goal: fill in the core `harness on/off x strong/weak planner` evidence and answer:
 
diff --git a/docs/harness-ablation/declarative-intervention-harness-design-20260626.md b/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
new file mode 100644
index 0000000..c6e7fa2
--- /dev/null
+++ b/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
@@ -0,0 +1,981 @@
+# Declarative Intervention Harness Design - 2026-06-26
+
+This document redefines the AITuner harness. The goal is to keep the harness from becoming a pile of testcase-oriented
+`if/else` tuning rules, and to state clearly which properties can be proven, which claims must be validated experimentally, and which
+claims cannot be made.
+
+The conclusion up front:
+
+```text
+The harness's contribution is not "expert rules know how to tune vLLM".
+
+The harness's contribution should be:
+turning raw knob optimization into declarative, coverage-aware experimental design.
+```
+
+In other words, the harness should not hardcode the next answer; it should define:
+
+- which state is observable;
+- which interventions are legal;
+- which system hypothesis each intervention tests;
+- which regions have already been measured, refuted, or covered;
+- when the planner may still select further experiments;
+- when a stop is coverage-relative sound.
+
+The planner can be an LLM, BO, a bandit, a greedy heuristic, or oracle replay. The harness is the substrate these
+planners share, not a prompt trick for any one planner.
+
+## Independent review verdict
+
+The first draft of this document was independently reviewed by a fresh subagent, whose verdict was:
+
+```text
+accept with major revisions
+```
+
+The review's core warning: moving rules out of Python `if/else` and into grammar, operator, policy, or
+priority may still amount to a "reskinned rule-based harness". The design must therefore additionally satisfy:
+
+- the candidate set must be fully enumerated and its snapshot persisted;
+- priorities/thresholds must have provenance, a version, and a sensitivity test;
+- coverage units must be structured and cannot rely on `signature_tested` alone;
+- failure invalidation must have a conservative region predicate, bounds, and retry/unblock conditions;
+- the stop proof must reference the candidate set snapshot;
+- grammar/policy/capability must also pass anti-overfitting static checks;
+- LLM/BO can only select candidates and cannot covertly change legality, coverage, or stop authority.
+
+The design below incorporates these major revisions as hard requirements.
+
+## Current problems
+
+The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation,
+bottleneck hypotheses, candidate actions, validator stop. But the implementation is still heavily rule-based:
+
+- observation, bottleneck attribution, candidate generation, scoring, and the stop validator
+  all live in a single Python file;
+- knob family behavior is determined by Python branches and fixed thresholds;
+- scoring uses hand-picked constants;
+- the stop validator is a guard cascade, not an explicit coverage proof;
+- every newly found bad case invites yet another branch, e.g. a bad high-TP start or a low-gmu start.
+
+This kind of implementation works as an engineering prototype but cannot be a systems contribution. At most it proves:
+
+```text
+We wrote a set of heuristics that work for some cases we have already seen.
+```
+
+It cannot prove that:
+
+- the harness is general;
+- the harness is robust to arbitrary bad starts;
+- no untested better config exists at stop time;
+- the candidate generator covers every important intervention;
+- AITuner's gains do not come from testcase-specific rule accumulation.
+
+Going forward, we therefore cannot keep making progress by "patching a rule for each failure mode".
+
+## Design goals
+
+The new harness must meet five goals.
+
+1. **Declarative**
+   System knowledge must exist as grammar, operators, coverage predicates, and policy config,
+   not scattered across Python `if/else`.
+
+2. **Planner-agnostic**
+   The harness owns legality, candidate construction, coverage accounting, and stop
+   authority; the planner only ranks or selects within the legal candidate set.
+
+3. **Coverage-aware**
+   Stop does not mean "no rule fired". Stop must have explicit coverage semantics relative to the current grammar and
+   operator set.
+
+4. **Anti-overfitting**
+   The design must not contain model names, case ids, known winning configs, fixed workload ids, or constants taken from
+   experimental results. Cases may enter the system only through `StudySpec`, engine capability, workload profile, and trial
+   measurements.
+
+5. **Auditable**
+   Every proposal must be able to answer:
+
+```text
+Which hypothesis does this trial test?
+Which intervention dimension does it change?
+What is its confirm/reject condition?
+How does coverage change after the trial ends?
+```
+
+## Non-goals
+
+We do not pursue, and should not claim, any of the following:
+
+- global optimum;
+- completeness over arbitrary config spaces;
+- robustness across arbitrary serving engines/workloads/SLOs;
+- a bottleneck classifier that is always correct;
+- that no better config exists in the real world at stop time.
+
+What we pursue is a stricter but provable relative property:
+
+```text
+Under the declared grammar, operator set, engine constraints, and measurement budget,
+the harness's candidate/stop behavior is legal, coverable, auditable, and planner-agnostic.
+```
+
+## New architecture
+
+```text
+StudySpec + EngineCapability + WorkloadProfile + TrialHistory
+        |
+        v
+HarnessState
+        |
+        v
+InterventionGrammar
+        |
+        v
+OperatorRegistry -----> CandidateSet
+        |                    |
+        v                    v
+CoverageState <------ MeasuredVerdict
+        |
+        v
+CoverageValidator
+        |
+        v
+PlannerBackend
+        |
+        v
+Proposal or Stop
+```
+
+Versioned objects:
+
+| Object | Must be versioned | Outputs it appears in |
+| --- | --- | --- |
+| `GrammarVersion` | yes | candidate id, candidate set snapshot, coverage delta, stop report |
+| `PolicyVersion` | yes | candidate priority, blocked reason, stop threshold, stop report |
+| `EngineCapabilityVersion` | yes | axis domain, safe range, candidate patch, failure invalidation |
+| `PlannerBackendVersion` | yes | selected candidate, ranking trace |
+
+A candidate/stop without version numbers cannot be used as paper evidence.
+
+### HarnessState
+
+`HarnessState` is the only state entry point the planner sees. It consists of structured data:
+
+| Field | Source | Role |
+| --- | --- | --- |
+| `study` | `StudySpec` | tunable schema, topology constraints, SLO, search budget |
+| `engine_capability` | static profile / engine adapter | safe ranges, knob lattice, unsupported combinations |
+| `workload_profile` | trace summary / L-C-A profile | length/cache/arrival regime |
+| `trial_profiles` | measured results | latency, SLO failures, throughput, launch failures |
+| `incumbent` | state best | current best config and measured objective |
+| `tested_signatures` | trial history | no-repeat and coverage accounting |
+| `failure_memory` | launch/runtime failures | invalidate region or operator dimensions |
+
+Key constraints:
+
+- `HarnessState` contains no natural-language, prompt-only state;
+- every field must be serializable, snapshottable, and replayable;
+- the same `HarnessState + Grammar + Policy` must produce the same candidate set.
+
+### InterventionGrammar
+
+The grammar describes the intervention space for serving tuning. It does not select candidates; it only declares:
+
+- family;
+- axes;
+- legal value source;
+- generic operators;
+- expected effect schema;
+- risk schema;
+- coverage dimensions;
+- required evidence.
+
+Proposed initial grammar:
+
+| Family | Axes | Operators | Coverage dimension |
+| --- | --- | --- | --- |
+| `topology` | TP, DP, EP, EP enable | `step_up`, `step_down`, `redistribute`, `bracket` | topology lattice neighborhood |
+| `kv_memory` | `gpu-memory-utilization` | `jump_to_floor`, `local_climb`, `backoff_after_failure` | runtime trust region on incumbent topology |
+| `admission` | `max-num-seqs` | `raise`, `lower`, `bracket` | sequence concurrency region |
+| `batching` | `max-num-batched-tokens`, chunked prefill | `raise`, `lower`, `joint_adjust` | prefill/decode batching region |
+| `allocator` | block size / cache knobs | `switch_categorical`, `bracket` | memory layout region |
+| `failure` | failed signatures/regions | `invalidate_region`, `avoid_region` | failure memory coverage |
+
+Grammar example:
+
+```yaml
+family: kv_memory
+axes:
+  gpu-memory-utilization:
+    type: bounded_float
+    value_source: engine_capability
+    nominal_floor_key: gmu_nominal_floor
+    safe_ceiling_key: gmu_safe_ceiling
+operators:
+  - name: jump_to_floor
+    preconditions:
+      - axis_below_nominal_floor
+      - incumbent_topology_preserved
+    coverage_effect: covers_runtime_floor
+  - name: local_climb
+    preconditions:
+      - axis_at_or_above_nominal_floor
+      - no_failed_higher_target
+    coverage_effect: extends_runtime_trust_region
+```
+
+Nothing here says `0.5 -> 0.9`. The `0.9` comes from the engine capability/policy config, and
+`jump_to_floor` is a generic operator on a bounded numeric axis.
+
+Topology example:
+
+```yaml
+family: topology
+axes:
+  tensor-parallel-size:
+    type: ordered_lattice
+    value_source: topology_constraints
+  data-parallel-size:
+    type: ordered_lattice
+    value_source: topology_constraints
+operators:
+  - name: step_up
+    coverage_effect: covers_upper_neighbor
+  - name: step_down
+    coverage_effect: covers_lower_neighbor
+  - name: bracket
+    preconditions:
+      - anchor_at_boundary_or_unknown_region
+      - neighbor_uncovered
+    coverage_effect: brackets_ordered_axis
+  - name: redistribute
+    coverage_effect: covers_same_product_tp_dp_neighbor
+```
+
+Nothing here says `TP=8 -> TP=4`. It is a `bracket` on an ordered lattice.
+
+### AxisSpec and OperatorSpec
+
+The grammar must be expressed as a typed schema, not just free text.
+
+`AxisSpec`:
+
+| Field | Meaning |
+| --- | --- |
+| `axis_id` | stable id, e.g. `topology.tp` |
+| `knob_keys` | knobs affected when this axis is lowered to a config patch |
+| `type` | `ordered_lattice`, `bounded_float`, `bounded_int`, `categorical`, `coupled` |
+| `domain_source` | `topology_constraints`, `engine_capability`, `policy_config`, `study_spec` |
+| `domain` | legal values or range; must be enumerable or discretizable |
+| `order` | total/partial order for an ordered axis |
+| `coupling` | legality constraints with other axes |
+| `safe_region` | floor/ceiling/default/unsupported regions, with provenance |
+
+`OperatorSpec`:
+
+| Field | Meaning |
+| --- | --- |
+| `operator_id` | stable id, e.g. `topology.bracket` |
+| `family` | topology / kv_memory / admission / batching / allocator / failure |
+| `input_axes` | axes the operator acts on |
+| `preconditions` | generic predicates; must reference state/axis/evidence, never a testcase |
+| `patch_fn` | generates a candidate patch from state and axis value |
+| `coverage_effects` | the structured `CoverageUnit`s produced |
+| `required_evidence` | evidence required before the operator can be enabled |
+| `risk_model` | source and version of the risk/cost terms |
+| `confirm_reject_schema` | how coverage is updated from a measured verdict |
+
+Any new operator must first pass schema validation and toy-lattice completeness tests.
+
+### Generic Operators
+
+An operator is a reusable action template. It has no knowledge of model names, case ids, or known winners.
+
+Base operators that must be supported:
+
+| Operator | Input | Output | Purpose |
+| --- | --- | --- | --- |
+| `step_up(axis)` | ordered axis | adjacent higher value | test the upper frontier |
+| `step_down(axis)` | ordered axis | adjacent lower value | test the lower frontier |
+| `bracket(axis)` | ordered axis + boundary/trust-region state | adjacent contrast point | build a local bracket from a biased starting point |
+| `jump_to_floor(axis)` | bounded axis + nominal floor | floor target | return from an illegal/inefficient range to the safe operating range |
+| `local_climb(axis)` | bounded axis + step policy | next value | climb locally within the trust region |
+| `backoff(axis)` | failed higher target | lower safe value | handle launch/OOM/regression |
+| `redistribute(axis_a, axis_b)` | coupled axes + product constraints | legal coupled patch | TP/DP/EP redistribution |
+| `joint_adjust(axes)` | coupled runtime axes | legal joint patch | MBT/MNS interaction |
+| `preserve(dimensions)` | incumbent | complete patch context | keep the topology/runtime anchor |
+| `block_if_seen(signature)` | tested signatures | reject candidate | no-repeat |
+| `invalidate_region(failure)` | failure profile | region predicate | avoid re-entering failed regions |
+
+Operators only construct candidates; they do not make the final acceptance decision. Acceptance is decided by real measurement.
+
+### CandidateAction
+
+Every candidate must be a structured object:
+
+```json
+{
+  "candidate_id": "topology.bracket/tp:8->4/dp:1",
+  "family": "topology",
+  "operator": "bracket",
+  "patch": {"flag_patch": {"tensor-parallel-size": 4}},
+  "hypothesis": "Current TP may be an inefficient boundary point; lower neighbor tests topology efficiency.",
+  "targets": ["topology_efficiency", "ttft_prefill"],
+  "expected_metric_movement": {
+    "request_rate_per_gpu": "increase_or_same",
+    "ttft_p95": "may_increase",
+    "gpu_count": "decrease"
+  },
+  "confirm_condition": "SLO-feasible req/s/GPU improves over incumbent or reveals same-family efficiency",
+  "reject_condition": "SLO feasibility or req/s/GPU regresses beyond tolerance",
+  "coverage_effect": ["covers_lower_neighbor(tp,dp)", "brackets_topology_axis(tp)"],
+  "risk": {
+    "launch": "low",
+    "regression": "medium",
+    "measurement_cost": "one_trial"
+  }
+}
+```
+
+This is stricter than "try TP=4": the trial is bound to a hypothesis, a metric movement, and a coverage update.
+
+### CandidateSet
+
+`CandidateSet` is the complete enumeration produced by the grammar/operators in the current state, not just the top few
+candidates the planner sees.
+
+```text
+CandidateSet = {
+  grammar_version,
+  policy_version,
+  engine_capability_version,
+  state_hash,
+  candidates[],
+  blocked_candidates[],
+  candidate_set_hash
+}
+```
+
+Requirements:
+
+1. `OperatorRegistry.generate(state, grammar, policy)` must enumerate all eligible candidates.
+2. Every filtered candidate must go into `blocked_candidates` with a `blocked_reason`.
+3. The stop proof must reference `candidate_set_hash`.
+4. The same `HarnessState + Grammar + Policy + EngineCapability` must generate a byte-identical
+   `CandidateSet`.
+5. For toy grammars, it must be possible to verify exhaustively that the generator misses no candidate.
+
+This is the key to keeping a "coverage-relative stop" from degenerating into "the current generator happened to produce no candidates".
+
+### CoverageState
+
+Coverage is the core of the new design. It records "which parts of the mechanism space have been tested or refuted", not just the best config.
+
+Coverage units:
+
+| Unit | Meaning |
+| --- | --- |
+| `signature_tested` | the exact config patch has been tested |
+| `operator_covered` | an operator has been validated at least once on a given family/neighborhood |
+| `neighbor_covered` | the adjacent upper/lower point on an ordered lattice has been tested |
+| `trust_region_covered` | the local region of a runtime family on the incumbent topology has been tested |
+| `hypothesis_refuted` | the candidate's reject condition was met |
+| `region_invalidated` | failure memory rules out a region rather than a single point |
+| `incumbent_validated` | the incumbent has been validated by counterfactual experiments along the topology/runtime/failure dimensions |
+
+A `CoverageUnit` cannot be a free-form string; it must be a structured key:
+
+```json
+{
+  "unit_type": "neighbor_covered",
+  "family": "topology",
+  "axis": "topology.tp",
+  "operator": "bracket",
+  "anchor_region": {"tp": 8, "dp": 1},
+  "candidate_region": {"tp": 4, "dp": 1},
+  "bottleneck": "ttft_prefill",
+  "grammar_version": "..."
+}
+```
+
+A stop cannot rest on `signature_tested` alone. Required coverage must include at least one non-trivial mechanism
+coverage unit, such as `neighbor_covered`, `trust_region_covered`, `region_invalidated`,
+or `incumbent_validated`.
+
+Coverage update:
+
+```text
+trial_result -> MeasuredVerdict -> CoverageDelta
+```
+
+Every measured trial must produce at least one `CoverageDelta`:
+
+- add `signature_tested`;
+- add family/operator coverage;
+- invalidate failure region;
+- update incumbent;
+- refute/confirm hypothesis.
+
+If a trial only burns GPU time and changes neither coverage nor the incumbent, it should be marked
+`wasted_by_design`, which directly violates the acceptance criteria.
+
+More strictly:
+
+```text
+signature_tested alone is not enough.
+```
+
+A trial that only adds an exact signature, without confirming/rejecting a hypothesis, covering any
+operator/neighborhood/trust region, updating the incumbent, or producing a conservative failure
+invalidation, must be marked `non_informative_trial`.
+
+### Failure invalidation
+
+`region_invalidated` is the part most prone to overfitting or false exclusion, so it must be defined conservatively.
+
+A failure invalidation must output:
+
+```json
+{
+  "region_predicate": "topology.tp>=8 AND kv_memory.gmu>=0.99",
+  "source_failure": "engine_launch_oom",
+  "covered_config_examples": [...],
+  "excluded_config_examples": [...],
+  "conservative_reason": "failure only observed at higher memory target on same topology",
+  "retry_condition": "lower gmu or different topology is tested",
+  "expires_after": "optional policy-defined TTL"
+}
+```
+
+Extrapolating without bounds from a single-point failure to an entire family is forbidden. For example, one `TP=8,gmu=0.99` OOM cannot directly
+invalidate all of `TP=8` or all high-TP configs; at most it invalidates a conservative region with the same or higher memory pressure,
+unless later evidence widens that region.
+
+### CoverageValidator
+
+Stop must be defined relative to coverage, not relative to a Python guard cascade.
+
+Stop condition:
+
+```text
+stop_allowed iff
+  legality soundness holds
+  and candidate_set_snapshot is complete and persisted
+  and no candidate with harness_priority >= threshold remains uncovered
+  and incumbent has required validation coverage
+  and measurement budget/search-high condition is satisfied or no eligible candidate remains
+```
+
+We avoid the undefined notion of a `useful candidate` here. Whether a candidate is eligible is determined by the following structured conditions:
+
+```text
+eligible(candidate) iff
+  candidate is legal
+  and candidate is not exact-repeat
+  and candidate.coverage_unit is not already covered
+  and candidate.region is not invalidated
+  and candidate.required_evidence is satisfied
+  and candidate.harness_priority >= policy.min_candidate_priority
+```
+
+`required validation coverage` must also be declared in structured form. For example:
+
+```yaml
+required_validation:
+  incumbent:
+    - family: topology
+      unit_type: neighbor_covered
+      neighborhood: adjacent
+    - family: kv_memory
+      unit_type: trust_region_covered
+      when_axis_tunable: gpu-memory-utilization
+```
+
+The validator output must include:
+
+```json
+{
+  "should_stop": true,
+  "stop_kind": "coverage_relative_stop",
+  "candidate_set_hash": "...",
+  "grammar_version": "...",
+  "policy_version": "...",
+  "engine_capability_version": "...",
+  "covered_units": [...],
+  "remaining_candidates": [],
+  "blocked_candidates": [
+    {"candidate_id": "...", "reason": "region_invalidated_by_launch_failure"}
+  ],
+  "proof_obligation": "No uncovered candidate above priority threshold under grammar version X."
+}
+```
+
+The proof here is a relative proof:
+
+```text
+It does not prove that no better config exists in the real world;
+it proves that no uncovered high-priority experiment remains under the current grammar/operator/policy.
+```
+
+## Planner backend separation
+
+The harness produces the `CandidateSet`; the planner backend only selects.
+
+```text
+CandidateSet = Grammar(HarnessState)
+decision = PlannerBackend.rank_or_select(CandidateSet)
+```
+
+Planner backends can include:
+
+| Backend | Role |
+| --- | --- |
+| deterministic greedy | selects the candidate with the highest priority / score |
+| LLM ranker | reads the candidate set and evidence; may only select or combine legal candidates |
+| BO/bandit | learns a candidate ranking (`backend_score`) over the intervention graph |
+| oracle replay | offline validation using existing grid/replay data |
+
+Key constraints:
+
+- the planner must not invent knobs outside the grammar;
+- the planner must not bypass the validator's stop;
+- the planner may change only ranking/selection, never legality/coverage;
+- given the same candidate set, any difference between LLM and BO is a backend difference, not a harness difference.
+
+Priority responsibility boundaries:
+
+- `harness_priority` belongs to the harness/policy; the validator uses it to decide whether a candidate is high-priority.
+- `backend_score` belongs to the planner backend; it is used to rank eligible candidates.
+- BO/bandit may learn `backend_score` but cannot change `harness_priority`, coverage, or legality.
+- If learned results need to influence the stop threshold, they must enter `PolicyVersion` explicitly, and the
+  candidate set snapshot must be regenerated.
+
+Restrictions on LLM candidate combination:
+
+- by default, the LLM may not combine multiple candidates;
+- if combination is enabled, the combination must generate a new `CandidateAction`;
+- the combined patch must pass the legality validator again;
+- the combined coverage effect cannot simply be a union; it must be produced by a `joint_adjust`
+  operator in the grammar;
+- a combination with no `joint_adjust` operator may not be executed.
+
+## Required implementation interfaces
+
+The upcoming refactor must land these interfaces first, then migrate the existing heuristics.
+
+```python
+class InterventionGrammar:
+    @classmethod
+    def load(cls, path: str) -> "InterventionGrammar": ...
+    def validate(self, capability: "EngineCapability") -> None: ...
+
+
+class OperatorRegistry:
+    def generate(
+        self,
+        state: "HarnessState",
+        grammar: "InterventionGrammar",
+        policy: "HarnessPolicy",
+    ) -> "CandidateSet": ...
+
+
+class InterventionOperator:
+    def apply(
+        self,
+        axis: "AxisSpec",
+        state: "HarnessState",
+        policy: "HarnessPolicy",
+    ) -> "CandidateAction | BlockedCandidate": ...
+
+
+class CoverageState:
+    def apply(self, verdict: "MeasuredVerdict") -> "CoverageDelta": ...
+
+
+class CoverageValidator:
+    def validate_stop(
+        self,
+        state: "HarnessState",
+        candidate_set: "CandidateSet",
+        coverage: "CoverageState",
+        policy: "HarnessPolicy",
+    ) -> "StopReport": ...
+
+
+class PlannerBackend:
+    def select(
+        self,
+        candidate_set: "CandidateSet",
+        state: "HarnessState",
+    ) -> "CandidateAction | StopRequest": ...
+```
+
+Required schemas:
+
+| Schema | Required fields |
+| --- | --- |
+| `CandidateSignature` | exact signature, normalized full config signature, partial patch signature, hash version |
+| `MeasuredVerdict` | candidate id, trial id, status, objective delta, confirm/reject/inconclusive/failure/regression flags, raw metric refs |
+| `CoverageDelta` | added units, invalidated regions, incumbent update, non-informative flag |
+| `BlockedCandidate` | candidate skeleton, blocked reason, blocking predicate, retry condition |
+| `StopReport` | candidate set hash, grammar/policy/capability versions, remaining/covered/blocked candidates, proof obligation |
+
+These schemas are part of the paper artifact. Without them, experimental results cannot show that the harness is not a
+case-specific heuristic.
+
+## Anti-overfitting constraints
+
+### Forbidden
+
+The following must not appear in the harness core, grammar, or operators:
+
+- model names, e.g. Qwen30B, Qwen27B;
+- host names, e.g. dash0;
+- case id, run id, trace window id;
+- known winning configs, e.g. "TP=2+gmu=0.97";
+- fixed SLO values used as control logic, e.g. "do X when TPOT=50ms";
+- special-case branches patched in from experimental results, e.g. "if current_tp > 2, force a downshift";
+- action ids or score boosts targeting a single regression test.
+
+### Allowed
+
+The following may exist, but their source and scope must be declared:
+
+- safe ranges in the engine capability, e.g. the `gpu-memory-utilization` nominal floor;
+- the legal lattice in the topology constraints, e.g. TP/DP/EP values;
+- measurement-cost / risk priors in the policy config;
+- the length/cache/arrival regime in the workload profile;
+- the SLO rule itself;
+- measured trial evidence.
+
+Priority/threshold provenance:
+
+Every priority term, threshold, and risk/cost prior must record:
+
+```text
+name
+default value
+source: engine docs / prior experiments / conservative default / calibrated profile
+scope: global / engine-family / engine-version / hardware-class
+sensitivity test
+last updated commit
+```
+
+Adjusting a threshold for a single scenario or a single run is forbidden. Any threshold change requires running a sensitivity
+test and held-out scenario tests.
+
+### The boundary between rules and grammar
+
+Not every conditional is forbidden. What is forbidden is a testcase-specific branch; what is allowed is a generic predicate:
+
+| Forbidden | Allowed |
+| --- | --- |
+| `if model == Qwen30B` | `if family == topology and axis.type == ordered_lattice` |
+| `if TP == 8 then TP=4` | `bracket(ordered_axis)` |
+| `if gmu == 0.5 then 0.9` | `jump_to_floor(bounded_axis)` |
+| `if TTFT4s/TPOT25 case` | `if bottleneck hypothesis targets TTFT/TPOT evidence` |
+| `score = 0.74 for this bad-start fix` | `priority = relief_prior + information_gain - risk - cost` from policy terms |
+
+## Provable properties
+
+### P1. Legality soundness
+
+For any generated candidate:
+
+```text
+candidate.patch satisfies:
+  tunable_envs/tunable_flags
+  topology constraints
+  engine capability safe constraints
+  no forbidden knob
+```
+
+How to test:
+
+- grammar-level property tests;
+- random StudySpec / topology constraints fuzzing;
+- all candidate patches pass `validate_proposal` equivalent checks.
+
+### P2. No-repeat
+
+For any candidate:
+
+```text
+signature(candidate.patch) not in tested_signatures
+```
+
+If the exact signature has already been tested, the candidate must be filtered or marked already-covered.
+
+### P3. Coverage monotonicity
+
+After every measured trial:
+
+```text
+coverage_{t+1} >= coverage_t
+```
+
+Here `>=` means that covered units, invalidated regions, tested signatures, and incumbent evidence
+never decrease.
+
+### P4. Coverage-relative stop soundness
+
+If the validator outputs stop:
+
+```text
+for all candidate in persisted CandidateSet(grammar, state, policy):
+  candidate.harness_priority < threshold
+  or candidate.coverage_unit already covered
+  or candidate.region invalidated
+  or candidate violates constraints
+```
+
+This is soundness relative to the grammar, not global optimality.
+
+Additional preconditions:
+
+- the CandidateSet must be a complete enumeration;
+- the CandidateSet snapshot must be persisted;
+- the StopReport must contain `candidate_set_hash`;
+- required coverage cannot be satisfied by `signature_tested` alone.
+
+### P5. Backend independence
+
+Given the same `HarnessState + Grammar + Policy`:
+
+```text
+CandidateSet, harness_priority, and CoverageValidator outputs are identical across planner backends.
+```
+
+Different backends may change only `backend_score`, candidate ranking, or tie-breaking.
+
+### P6. Auditability
+
+Every executed trial must be traceable through:
+
+```text
+proposal -> candidate_id -> operator -> hypothesis -> confirm/reject -> coverage_delta
+```
+
+A trial missing any link in this chain cannot be used as paper evidence.
+
+## Properties that cannot be proven
+
+The paper must explicitly avoid claiming:
+
+- global optimum;
+- arbitrary bad-start completeness;
+- arbitrary workload/SLO generalization;
+- engine-independent optimality;
+- bottleneck classifier correctness;
+- stop implies no better real config.
+
+What can be said:
+
+```text
+Within the declared intervention grammar and measurement budget, AITuner provides
+coverage-relative experimental control and empirically improves convergence on the
+reported workloads.
+```
+
+## Acceptance criteria
+
+This section is the strictest gate. Until these criteria pass, the new harness should not be treated as a paper contribution.
+
+### A. Design acceptance
+
+1. No new testcase-specific branches may be added to the harness core.
+2. All knob behavior must come from grammar/operator/policy config.
+3. Magic constants must be centralized in `HarnessPolicy` or the engine capability, with their source stated.
+4. The candidate generator must output `CandidateAction`, including the hypothesis, operator,
+   confirm/reject condition, and coverage effect.
+5. The stop validator must output a coverage-relative proof obligation.
+6. The planner backend cannot bypass the candidate set or the stop validator.
+7. The CandidateSet must be fully enumerated, its snapshot persisted, and referenced by the stop report.
+8. `harness_priority` and `backend_score` must be separated.
+9. `CoverageUnit` must be structured, and required coverage cannot rest on exact signatures alone.
+10. Failure invalidation must carry a conservative region predicate, bounds, and retry/unblock conditions.
+
+### B. Code acceptance
+
+1. `harness.py` stops growing; the new implementation is split into:
+   - `harness_state.py`
+   - `intervention_grammar.py`
+   - `operators.py`
+   - `coverage.py`
+   - `planner_backend.py`
+   - `harness_compat.py` or equivalent adapter.
+2. `CandidateAction`, `CoverageState`, `CoverageDelta`, and `HarnessPolicy` use dataclasses
+   or a typed schema.
+3. The logic of the current `_topology_candidate_actions()` and `_runtime_candidate_actions()` must be migrated
+   into the operator registry.
+4. Every candidate patch must pass the legality validator.
+5. Every executed harness proposal must persist its candidate id and coverage delta.
+6. The current regression tests may be kept as scenario tests, but must not be fixed by adding hardcoded
+   branches.
+7. The StopReport must contain `candidate_set_hash`, grammar/policy/capability versions, and
+   remaining/covered/blocked candidates.
+8. An inconclusive measurement must not wrongly produce `hypothesis_refuted` or `region_invalidated`.
+
+### C. Static anti-overfitting acceptance
+
+CI or tests must check that:
+
+1. The harness core contains no model name / run id / host name / known winning config strings.
+2. The harness core contains no case-specific action id.
+3. No score boost targets a single scenario test.
+4. Grammar files reference only family/operator/axis, never a testcase.
+5. When a new scenario failure appears, the PR must state whether it adds a grammar/operator or adjusts policy config;
+   directly adding a testcase branch is not allowed.
+6. Policy/config/test fixtures are also scanned for known winning configs and scenario-specific score boosts.
+7. Any new or modified policy threshold must come with a sensitivity test.
+8. After adding an operator/policy, held-out scenarios must be run; passing only on the scenario that triggered the change is not enough.
+
+### D. Unit test acceptance
+
+The following grammar-level tests must be added:
+
+1. Legal candidate generation under random topology constraints.
+2. No-repeat property.
+3. Coverage monotonicity.
+4. Stop iff no uncovered high-priority candidate.
+5. Failure memory invalidates regions, not just exact signatures.
+6. Backend independence: greedy vs mock-LLM ranker produce the same candidate set.
+7. Bad-start property tests:
+   - for boundary starts on an ordered lattice, `bracket` generates an adjacent contrast candidate;
+   - for below-floor starts on a bounded numeric axis, `jump_to_floor` generates a floor candidate;
+   - tests must not hardcode `TP=8 -> TP=4` or `0.5 -> 0.9` as the only condition; they should be parameterized.
+8. Candidate set determinism: the same input yields a byte-identical snapshot.
+9. Candidate set completeness on toy grammars: exhaustively show on a small lattice that the generator misses nothing.
+10. Combined candidate legality: if combination is enabled, the combined patch and coverage must be legal.
+11. Inconclusive measurement: a timeout/noisy/partial run must not wrongly refute a hypothesis.
+12. Threshold sensitivity: the stop decision should not be highly fragile to small changes in priority constants.
+13. Stop falsification: for a history that has already stopped, use local grid/replay to check whether an in-grammar
+   high-priority candidate was missed.
+
+### E. Experimental acceptance
+
+Paper evidence must include at least:
+
+1. **Planner-agnostic ablation**
+   - raw greedy / BO vs grammar-guided greedy / BO;
+   - weak LLM naive vs weak LLM + harness;
+   - strong LLM naive vs strong LLM + harness.
+
+2. **Mechanism ablation**
+   - full grammar;
+   - no attribution;
+   - shuffled attribution;
+   - no grammar, raw knobs;
+   - no coverage validator;
+   - no failure memory.
+
+3. **Bad-start robustness**
+   - multiple random/adversarial starts;
+   - not just `TP=8,gmu=0.5,max-num-seqs=8`;
+   - report the convergence distribution, not a single successful path.
+
+4. **Near-optimum check**
+   - at least one case with a local grid/expert comparison;
+   - report the gap between the AITuner best and the local grid best.
+
+5. **Cross-regime check**
+   - at least two different regimes: long-prefill/tight-TTFT and decode/admission-heavy.
+
+### F. Paper claim acceptance
+
+Before the experiments above pass, the paper can only say:
+
+```text
+The current implementation demonstrates the feasibility of a deterministic,
+mechanism-guided proposal loop on selected cases.
+```
+
+After passing design and code acceptance, it can say:
+
+```text
+AITuner implements a planner-agnostic intervention grammar with coverage-relative
+stop authority.
+```
+
+Only after passing experimental acceptance can it say:
+
+```text
+The harness improves fixed-budget tuning and convergence robustness across the
+reported regimes, and the gain is not attributable solely to a stronger LLM or
+case-specific prompt engineering.
+```
+
+It still cannot say:
+
+```text
+The harness is complete over all configs.
+The harness guarantees global optimum.
+The harness is robust to arbitrary workloads/engines.
+```
+
+## Migration plan
+
+### Phase 0: Freeze rule growth
+
+- Stop adding new testcase-specific rules to `harness.py`.
+- File every new bad case first as a missing grammar/operator issue.
+- Mark the current rule-based harness as a prototype in the roadmap.
+
+### Phase 1: Typed state and candidate schema
+
+- Extract `TrialProfile`, `BottleneckHypothesis`, `CandidateAction`, `CoverageState`, and
+  `HarnessPolicy`.
+- Also extract `AxisSpec`, `OperatorSpec`, `CandidateSet`, `CoverageUnit`,
+  `CoverageDelta`, `MeasuredVerdict`, `BlockedCandidate`, and `StopReport`.
+- Keep existing behavior compatible; change structure only.
+- Golden tests use the current outputs to prevent accidental regression.
+
+### Phase 2: Operator registry
+
+- Migrate the four families topology, kv_memory, admission, and batching first.
+- Each operator is responsible for generating its candidate and coverage effect.
+- The old `_topology_candidate_actions()` and `_runtime_candidate_actions()` become compat wrappers.
+
+### Phase 3: Coverage validator
+
+- Implement coverage accounting.
+- StopPolicy first reproduces the current guard results, then replaces the family-count heuristic.
+- Add coverage-relative stop tests.
+
+### Phase 4: Planner backend split
+
+- The greedy planner is the default backend.
+- The LLM backend can only rank/select candidates.
+- A BO/bandit backend can be plugged in later.
+
+### Phase 5: Experiment rerun
+
+- rerun no-LLM Qwen30B;
+- rerun the 2x2;
+- run the bad-start distribution;
+- run the mechanism ablation;
+- run the local grid/expert comparison.
+
+## The decisive criterion
+
+If a new failure can only be fixed like this:
+
+```text
+add an if-else in the planner targeting a specific config/workload
+```
+
+then it is not a harness contribution.
+
+If a new failure is fixed like this:
+
+```text
+add or correct a generic operator / coverage predicate / engine capability declaration,
+and the change improves a whole class of parameterized tests
+```
+
+only then can it be a harness contribution.
+
+This criterion should gate every subsequent PR and every interpretation of experiments.
diff --git a/docs/harness-ablation/no-llm-harness-mechanism-20260625.md b/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
index 50f728c..3f123e0 100644
--- a/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
+++ b/docs/harness-ablation/no-llm-harness-mechanism-20260625.md
@@ -1,5 +1,13 @@
 # No-LLM Harness Mechanism - 2026-06-25
 
+Status note, 2026-06-26:
+
+This document records the no-LLM mechanism of the current rule-based prototype harness and the experimental observations so far. It shows that
+AITuner can run closed-loop without an LLM endpoint, but it does not establish the harness's completeness,
+general robustness, or final systems contribution. The final target design has been changed to a declarative intervention
+grammar + coverage-relative validator; see
+[`declarative-intervention-harness-design-20260626.md`](declarative-intervention-harness-design-20260626.md).
+
 This document answers one core question: if no LLM is called, why can the harness still find a configuration automatically?
 
 The conclusion up front: no-LLM mode does not mean "no planner". The current harness is itself a