Add declarative harness prototype

2026-06-26 18:07:02 +08:00
parent 4075c7abf0
commit 384cb58f1f
5 changed files with 752 additions and 1 deletions
--- a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
+++ b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
@@ -0,0 +1,176 @@
+# Bad-start stop counterexample - 2026-06-26
+
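+This document records a deliberately constructed adversarial bad-start test. Its purpose is not to show that the harness is
+already robust, but to attack the current implementation and check whether it recovers from a clearly unreasonable initial configuration.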
+
+Conclusion:
+
+```text
+The current production/prototype harness cannot yet support a bad-start robustness claim.
+
+From a bad start with high GPU count and high TP, it is stopped early by search_high_saturated_by_incumbent,
+without ever testing a topology/resource-efficiency contrast.
+```
+
+This is not a problem to patch with a special-case `TP=8 -> TP=4` rule. It exposes a more basic stop-authority
+problem: measurement saturation must not bypass the coverage-relative candidate set.
+
+## Experiment setup
+
+Machine: `dash1`, 8x H20.
+
+Goal: start from a deliberately unreasonable initial configuration:
+
+```text
+tensor-parallel-size = 8
+data-parallel-size = 1
+gpu-memory-utilization = 0.5
+max-num-seqs = 8
+LLM endpoint disabled
+```
+
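+Expected behavior:
+
+- the harness should not stop merely because the baseline is feasible;
+- it should generate at least one topology/resource-efficiency contrast candidate;
+- for the `req/s/GPU` objective, the 8-GPU incumbent must be validated by a lower-GPU or neighboring-topology probe.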
+
+## Run A: low search.high
+
+The first run keeps the original `search.high=0.125`.
+
+Result:
+
+```text
+trial-0001 completed
+harness-stop-0002
+tuning_stop_reason = harness_stop
+validator reason = search_high_saturated_by_incumbent
+best request_rate = 1.0333 total
+best request_rate_per_gpu = 0.1292
+pass_rate = 1.0
+```
+
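+Interpretation: the offered-load ceiling in this run is too low, so the baseline easily saturates `search.high`.
+The run therefore cannot distinguish "the configuration is genuinely good enough" from "the measurement ceiling is too low".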
+
+## Run B: corrected high search ceiling
+
+The second run raises `search.high` to `1.0` and keeps the same bad-start configuration, with `max_trials=3`.
+
+Remote artifacts:
+
+```text
+session = adv_badcase_corr_casea_20260626T095356Z
+store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
+spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
+log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log
+```
+
+The result is still a stop right after the baseline:
+
+```text
+trial-0001 completed
+harness-stop-0002
+no harness-proposal-0002.json
+tuning_stop_reason = harness_stop
+validator reason = search_high_saturated_by_incumbent
+best sampling_u = 0.9375
+best request_rate = 8.033333333333333
+best request_rate_per_gpu = 1.0041666666666667
+pass_rate = 1.0
+```
+
+Probe trace:
+
+| sampling_u | request_rate | feasible |
+| --- | ---: | --- |
+| 0.5 | 4.6000 | true |
+| 0.75 | 6.5167 | true |
+| 0.875 | 7.5000 | true |
+| 0.9375 | 8.0333 | true |
+
+The stop fires because the current guard computes:
+
+```text
+binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
+                        = 0.0625
+threshold_gap_to_high = 1.0 - 0.9375
+                      = 0.0625
+```
+
+The gap to `search.high` equals the probe resolution, so the current implementation treats the incumbent as having saturated `search.high`.
+
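+## Why this is a counterexample
+
+The current objective is SLO-constrained `req/s/GPU`, not total throughput at a fixed 8 GPUs.
+An 8-GPU incumbent saturating the offered-load ceiling does not prove that:
+
+- no lower-TP / lower-GPU configuration achieves a higher `req/s/GPU`;
+- the current topology is optimal for resource efficiency;
+- the runtime knobs have entered a suitable trust region;
+- the no-LLM harness can recover from a bad start.
+
+So this stop is unsound, at least with respect to the bad-start robustness claim.
+
+More formally: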
+
+```text
+search_high_saturated_by_incumbent
+  does not imply
+incumbent_validated(topology/resource-efficiency)
+```
+
+When the objective includes resource efficiency and parallel-size/topology is still tunable,
+`search_high_saturated_by_incumbent` can serve only as measurement evidence, not as stop
+authority on its own.
+
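+## Constraints on the new harness design
+
+This counterexample directly constrains the declarative harness:
+
+1. Before any stop, the complete `CandidateSet` must be generated and persisted.
+2. The stop proof must reference `candidate_set_hash`.
+3. If an uncovered high-priority topology/resource-efficiency candidate exists, the validator
+   must return `eligible_candidates_remain`, even if the incumbent saturates `search.high`.
+4. `search.high` saturation may only update measurement coverage; it cannot substitute for
+   `incumbent_validated`.
+5. For the `req/s/GPU` objective, required coverage must include at least one topology or
+   resource-efficiency contrast, unless the StudySpec explicitly fixes the GPU budget and topology.
+
+It also means the repair cannot take the form of: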
+
+```text
+if tp == 8 and gmu == 0.5: try tp = 4
+```
+
+The right direction is:
+
+```text
+ordered topology lattice + resource-efficiency objective
+  -> candidate set includes lower/redistributed topology contrast
+  -> stop is blocked until that coverage unit is measured or invalidated
+```
+
+## Current verdict
+
+Current production harness:
+
+```text
+prototype, not yet fundamental
+```
+
+New declarative prototype:
+
+```text
+promising substrate, but not production-proven
+```
+
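+It already exercises the minimal interfaces for `CandidateSet`, `CoverageUnit`, failure regions, and coverage-relative stop end to end,
+but it is not yet wired into the real tuning loop and has not yet demonstrated convergence over a bad-start distribution.
+
+The next P0 gate is therefore: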
+
+```text
+Implement coverage-relative stop authority first, then rerun the bad-start distribution.
+```
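
Reviewer sketch (not part of the commit): the guard arithmetic from Run B and the stop-authority constraints listed in the document above can be illustrated together. This is a minimal sketch under stated assumptions, not the prototype's actual API. The `Candidate` class, `validate_stop`, the candidate names `tp8_dp1` / `tp4_dp1`, the reason `required_contrast_missing`, and the values `low=0.0`, `max_probes=4`, `tolerance=0.01` are all hypothetical. Only the resolution formula, `candidate_set_hash`, `eligible_candidates_remain`, `incumbent_validated`, and `search_high_saturated_by_incumbent` come from the document.

```python
import hashlib
import json
from dataclasses import dataclass

CONTRAST_KINDS = {"topology", "resource_efficiency"}


def binary_probe_resolution(low, high, max_probes, tolerance):
    # Formula as quoted in the document's guard computation.
    return max(tolerance, (high - low) / 2 ** max_probes)


def saturates_search_high(best_u, low, high, max_probes, tolerance):
    # Assumption: the guard fires when the gap to `high` is at most one resolution step.
    return (high - best_u) <= binary_probe_resolution(low, high, max_probes, tolerance)


@dataclass(frozen=True)
class Candidate:
    name: str            # hypothetical identifier, e.g. "tp4_dp1"
    kind: str            # "topology", "resource_efficiency", or anything else
    high_priority: bool


def candidate_set_hash(candidates):
    # Hash a canonical (sorted) serialization so a stop proof can reference it.
    rows = sorted([c.name, c.kind, c.high_priority] for c in candidates)
    return hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()


def validate_stop(candidates, covered, saturated, topology_fixed):
    """Decide stop/continue from coverage; saturation is recorded as evidence only.

    `covered` holds names of candidates that were measured or invalidated.
    """
    proof = {
        "candidate_set_hash": candidate_set_hash(candidates),
        "evidence": ["search_high_saturated_by_incumbent"] if saturated else [],
    }
    # Contrast candidates are not eligible when the StudySpec fixes topology and GPU budget.
    eligible = [c for c in candidates
                if not (topology_fixed and c.kind in CONTRAST_KINDS)]
    uncovered = [c.name for c in eligible if c.high_priority and c.name not in covered]
    if uncovered:
        return {**proof, "decision": "continue",
                "reason": "eligible_candidates_remain", "uncovered": uncovered}
    has_contrast = any(c.kind in CONTRAST_KINDS and c.name in covered for c in eligible)
    if not topology_fixed and not has_contrast:
        # Hypothetical reason name: required coverage lacks any covered contrast.
        return {**proof, "decision": "continue", "reason": "required_contrast_missing"}
    return {**proof, "decision": "stop", "reason": "incumbent_validated"}


# Run B numbers: high=1.0, best sampling_u=0.9375; low, max_probes, tolerance are assumed.
saturated = saturates_search_high(0.9375, low=0.0, high=1.0, max_probes=4, tolerance=0.01)
candidates = [
    Candidate("tp8_dp1", "baseline", False),
    Candidate("tp4_dp1", "topology", True),
]
result = validate_stop(candidates, covered={"tp8_dp1"}, saturated=saturated,
                       topology_fixed=False)
print(saturated, result["decision"], result["reason"])
```

In this sketch the `saturated` flag never changes the decision. It only lands in the proof's `evidence` list, which is one way to keep measurement saturation from acting as stop authority.
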
--- a/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
+++ b/docs/harness-ablation/declarative-intervention-harness-design-20260626.md
@@ -46,6 +46,28 @@ priority, it may still be merely a "re-skinned rule-based harness". Therefore,

 The design below already incorporates these major revisions as hard requirements.

+## 2026-06-26 adversarial status
+
+We attacked the current production harness with the bad-start case `TP=8, gmu=0.5, max-num-seqs=8`.
+The result shows that the current stop guard fires
+`search_high_saturated_by_incumbent` right after the baseline, without generating a topology/resource-efficiency contrast.
+This shows that the current implementation is not yet the final contribution.
+
+For the detailed counterexample, see
+[Bad-start stop counterexample](bad-start-stop-counterexample-20260626.md).
+
+The counterexample adds one hard constraint to this design:
+
+```text
+search_high_saturated_by_incumbent may be measurement evidence,
+but it cannot bypass candidate-set coverage when topology/resource efficiency
+remains tunable.
+```
+
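+The new CoverageValidator must therefore prove that no uncovered high-priority candidate remains before it may authorize
+a stop. For the `req/s/GPU` objective, an uncovered topology/resource-efficiency contrast must block
+the stop, unless the StudySpec explicitly fixes the topology/GPU budget.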
+
 ## Current problems

 The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation,