Document auto search high policy

2026-06-26 19:53:30 +08:00
parent 384cb58f1f
commit 95ad124a1b
3 changed files with 68 additions and 1 deletions
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -99,6 +99,9 @@ declarative intervention grammar + coverage-relative validator.
 - CoverageUnit is structured; stop must not rely on the exact signature alone;
 - `search_high_saturated_by_incumbent` must not bypass CandidateSet coverage; for a `req/s/GPU`
  objective, tuning must continue while a topology/resource-efficiency contrast is uncovered;
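+- Add an `auto_search_high` measurement policy: it may raise the ceiling automatically within the
+  existing trace; if `search.high=1.0` is still insufficient, it must report
+  `measurement_ceiling_insufficient` and wait for human confirmation, and must not silently repeat windows or synthesize arrivals;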
 - Failure invalidation has a conservative region predicate and retry/unblock conditions;
 - grammar, policy, and capability each carry a version and anti-overfitting static checks;
 - LLM/BO may only select legal candidates and cannot bypass the validator.
--- a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
+++ b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
@@ -15,6 +15,12 @@ robust, but to attack the current implementation and confirm whether it will, from an obviously unreasonable initial
 This is not a problem to be patched with a `TP=8 -> TP=4` special-case rule. It exposes a more basic stop-authority
 problem: measurement saturation must not bypass the coverage-relative candidate set.

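+This counterexample also exposes a gap in the measurement policy: when `search.high` is too small, tuning is
+right-censored by the offered-load ceiling. `auto_search_high` should be added later, but it may only auto-calibrate
+within the sampling space of the existing trace; if `search.high=1.0` still cannot reach the true capacity frontier,
+the system must proactively report that the measurement ceiling is insufficient and wait for a human to confirm
+whether to switch traces, increase trace density, or enable an additional load-generation method.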
+
 ## Experiment setup

 Machine: `dash1`, 8x H20.
@@ -138,6 +144,20 @@ authority.
 5. For a `req/s/GPU` objective, required coverage must include at least one topology or
   resource-efficiency contrast, unless StudySpec explicitly fixes the GPU budget and topology.

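+Measurement policy constraints:
+
+1. `auto_search_high` may raise `search.high` automatically based on the trace's sampling threshold and the
+   target GPU scale, so that a low ceiling does not make every config saturate prematurely.
+2. Auto-calibration must not exceed the trace's native upper bound. Under the current `sampling_u` semantics,
+   `search.high=1.0` means the full trace.
+3. If the full trace is still easily saturated by the incumbent, the validator must not pretend the search is complete;
+   it should emit `measurement_ceiling_insufficient` or record that fact as a blocker in the stop proof.
+4. The system must not automatically use repeated windows, synthetic arrivals, or replay scaling to enlarge the workload,
+   unless StudySpec explicitly enables it or a human confirms that the experiment targets a synthetic/offline stress regime.
+5. `measurement_ceiling_insufficient` and `eligible_candidates_remain` are different problems: the former says the
+   load ceiling is insufficient, the latter says mechanism coverage is incomplete. If either is present, the result
+   must not be written up as a bad-start robustness success.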
+
 This also shows that the current repair direction cannot be:

 ```text
--- a/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
+++ b/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
@@ -68,6 +68,18 @@ remains tunable.
 stop. For a `req/s/GPU` objective, an uncovered topology/resource-efficiency contrast must block
 stop, unless StudySpec explicitly fixes the topology/GPU budget.

+In addition, the measurement policy needs to be handled separately:
+
+```text
+auto_search_high may raise the offered-load ceiling inside the existing trace,
+but it must not synthesize extra load without human approval.
+```
+
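+If `search.high=1.0` still cannot bring a given topology to the capacity frontier, the harness should report
+`measurement_ceiling_insufficient` rather than automatically repeating windows, synthesizing arrivals, or changing
+replay semantics. This report can block strong claims, and it can also ask a human to choose a new trace or an
+explicit stress-test mode.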
+
 ## Current problems

 The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation,
@@ -488,7 +500,8 @@ stop_allowed iff
  and candidate_set_snapshot is complete and persisted
  and no candidate with harness_priority >= threshold remains uncovered
  and incumbent has required validation coverage
-  and measurement budget/search-high condition is satisfied or no eligible candidate remains
+  and measurement budget/search-high condition is not right-censored
+  and no eligible candidate remains
 ```

 The undefined term `useful candidate` is not used here. Whether a candidate is eligible is determined by the following structured conditions:
@@ -542,6 +555,37 @@ Validator output must include:
 but to prove that, under the current grammar/operator/policy, no uncovered high-priority experiment remains.
 ```

+### Measurement ceiling policy
+
+`search.high` is the offered-load ceiling for workload measurement, not the final objective.
+It is therefore managed by MeasurementPolicy and must not directly become a stop authority.
+
+Proposed schema:
+
+```yaml
+search:
+  low: 0.0
+  high: 0.125
+  auto_high:
+    enabled: true
+    max_sampling_u: 1.0
+    target_per_gpu_headroom: 1.5
+    require_human_confirmation_beyond_trace: true
+```
+
+Behavior:
+
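+1. If `auto_high.enabled=true`, preflight estimates the required offered-load ceiling from the trace threshold
+   distribution, the window length, and the GPU budget, and raises `search.high` to a value no greater than
+   `max_sampling_u`.
+2. Under the current `sampling_u` semantics, `max_sampling_u=1.0` is the full trace, and the ceiling must not be raised past it automatically.
+3. If the incumbent is still feasible at `search.high=1.0` and close to the upper bound, the system emits
+   `measurement_ceiling_insufficient`.
+4. `measurement_ceiling_insufficient` is neither a tuning success nor a proof of a global optimum;
+   it only states that the current trace cannot discriminate a higher offered-load frontier.
+5. Repeating windows, synthesizing arrivals, changing the replay scale, or using a denser trace must each be declared
+   explicitly by StudySpec or wait for human confirmation. The harness must not silently change workload semantics.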
+
 ## Planner backend separation

 The harness produces a `CandidateSet`; the planner backend only selects from it.
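
Reviewer sketch (not part of the commit): the "Measurement ceiling policy" behavior added in the last hunk could be expressed roughly as below. Only the config keys (`enabled`, `max_sampling_u`, `target_per_gpu_headroom`, `require_human_confirmation_beyond_trace`) and the strings `measurement_ceiling_insufficient` and `eligible_candidates_remain` come from the docs. The class name, function names, the `near_ceiling` threshold, and the assumption that the preflight estimate arrives as a single `estimated_needed_u` number are all hypothetical and do not describe `src/aituner/harness.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AutoHigh:
    """Mirrors the proposed `search.auto_high` schema (field names from the doc)."""
    enabled: bool = True
    max_sampling_u: float = 1.0
    target_per_gpu_headroom: float = 1.5
    require_human_confirmation_beyond_trace: bool = True


def resolve_search_high(configured_high: float,
                        estimated_needed_u: float,
                        auto: AutoHigh) -> float:
    """Raise search.high toward the preflight estimate, never past the trace limit.

    `estimated_needed_u` is a hypothetical preflight output, assumed to
    already account for `target_per_gpu_headroom`.
    """
    if not auto.enabled:
        return configured_high
    capped = min(estimated_needed_u, auto.max_sampling_u)
    # Only ever raise the ceiling; a smaller estimate leaves the config alone.
    return max(configured_high, capped)


def measurement_status(search_high: float,
                       incumbent_feasible_at_high: bool,
                       incumbent_load_fraction: float,
                       auto: AutoHigh,
                       near_ceiling: float = 0.95) -> Optional[str]:
    """Report a ceiling problem when the full trace cannot stress the incumbent.

    `near_ceiling` is an assumed threshold for "close to the upper bound".
    Below the trace limit the caller can still raise search.high, so no
    status is reported there.
    """
    at_trace_limit = search_high >= auto.max_sampling_u
    if (at_trace_limit and incumbent_feasible_at_high
            and incumbent_load_fraction >= near_ceiling):
        return "measurement_ceiling_insufficient"
    return None


def stop_blockers(status: Optional[str],
                  eligible_candidates_remain: bool) -> List[str]:
    """Collect the two independent blockers; either one forbids a success claim."""
    blockers = []
    if status is not None:
        blockers.append(status)
    if eligible_candidates_remain:
        blockers.append("eligible_candidates_remain")
    return blockers
```

With the schema defaults, `resolve_search_high(0.125, 2.0, AutoHigh())` stops at `1.0` rather than scaling the replay, and a non-empty `stop_blockers(...)` result is what a validator would surface for human confirmation instead of declaring the search finished.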