# Bad-start stop counterexample - 2026-06-26

This note records a deliberately constructed adversarial bad-start test. Its purpose is not to show that the harness is already robust. It attacks the current implementation to check whether it recovers from an obviously unreasonable initial configuration.

Conclusion:

```text
The current production/prototype harness cannot yet support a bad-start robustness claim.
From a high-GPU, high-TP bad start, search_high_saturated_by_incumbent stops it early,
before any topology/resource-efficiency contrast has been tested.
```

This should not be fixed with a special-case `TP=8 -> TP=4` rule. It exposes a more fundamental stop-authority problem: measurement saturation must not bypass the coverage-relative candidate set.

## Experimental setup

Machine: `dash1`, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled
```

Expected behavior:

- the harness should not stop merely because the baseline is feasible;
- it should generate at least one topology/resource-efficiency contrast candidate;
- for a `req/s/GPU` objective, the 8-GPU incumbent must be validated against a lower-GPU or neighboring-topology probe.

## Run A: low search.high

The first run kept the original `search.high=0.125`.

Results:

```text
trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0
```

Interpretation: the offered-load ceiling in this run is too low, so the baseline easily saturates `search.high`. The run therefore cannot distinguish "the configuration really is good enough" from "the measurement ceiling is too low".

## Run B: corrected high search ceiling

The second run raised `search.high` to `1.0`, kept the same bad-start configuration, and set `max_trials=3`.

Remote artifacts:

```text
session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log
```

The run still stopped immediately after the baseline:

```text
trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0
```

Probe trace:

| sampling_u | request_rate | feasible |
| --- | ---: | --- |
| 0.5 | 4.6000 | true |
| 0.75 | 6.5167 | true |
| 0.875 | 7.5000 | true |
| 0.9375 | 8.0333 | true |

The stop is triggered by the current guard computation:

```text
binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes) = 0.0625
threshold_gap_to_high = 1.0 - 0.9375 = 0.0625
```

The remaining gap to `search.high` does not exceed the probe resolution (0.0625 <= 0.0625). The current implementation therefore concludes that the incumbent has saturated `search.high`.

## Why this is a counterexample

The current objective is SLO-constrained `req/s/GPU`, not total throughput on a fixed 8 GPUs. An 8-GPU incumbent that saturates the offered-load ceiling does not prove that:

- no lower-TP / lower-GPU configuration achieves a higher `req/s/GPU`;
- the current topology is optimal for resource efficiency;
- the runtime knobs have entered a suitable trust region;
- the no-LLM harness can recover from a bad start.

The stop is therefore unsound, at least with respect to the bad-start robustness claim.

More formally:

```text
search_high_saturated_by_incumbent does not imply incumbent_validated(topology/resource-efficiency)
```

When the objective includes resource efficiency and parallel-size/topology is still tunable, `search_high_saturated_by_incumbent` can serve only as measurement evidence. It cannot serve as stop authority on its own.

## Constraints on the new harness design

This counterexample directly constrains the declarative harness:

1. A complete `CandidateSet` must be generated and persisted before any stop.
2. The stop proof must reference `candidate_set_hash`.
3. If any uncovered high-priority topology/resource-efficiency candidate exists, the validator must return `eligible_candidates_remain`, even when the incumbent saturates `search.high`.
4. `search.high` saturation may only update measurement coverage. It cannot substitute for `incumbent_validated`.
5. For a `req/s/GPU` objective, required coverage must include at least one topology or resource-efficiency contrast, unless the StudySpec explicitly fixes both the GPU budget and the topology.

It also means the repair must not look like this:

```text
if tp == 8 and gmu == 0.5:
    try tp = 4
```

The right direction is:

```text
ordered topology lattice + resource-efficiency objective
  -> candidate set includes lower/redistributed topology contrast
  -> stop is blocked until that coverage unit is measured or invalidated
```

## Current verdict

Current production harness:

```text
prototype, not yet fundamental
```

New declarative prototype:

```text
promising substrate, but not production-proven
```

The prototype has the minimal interfaces working end to end: `CandidateSet`, `CoverageUnit`, failure regions, and coverage-relative stop. It is not yet wired into the real tuning loop, and it has not yet shown convergence over a bad-start distribution.

The next P0 gate is therefore:

```text
Implement coverage-relative stop authority first, then rerun the bad-start distribution.
```
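The Run B stop reduces to a few lines of arithmetic. The sketch below reproduces that arithmetic under stated assumptions. These are inferred from the probe trace and are not confirmed by the harness source:

- the probe search is a plain bisection on `sampling_u` over `[0.0, 1.0]`;
- `max_probes = 4`;
- `tolerance` is at most 0.0625 (0.01 is used here);
- saturation is flagged when the gap is `<=` the resolution.

The helper names `bisect_probes` and `saturation_guard` are hypothetical.

```python
# Hypothetical reconstruction of the Run B probe search and saturation guard.
# Assumed, not confirmed by the harness source: plain bisection on sampling_u over
# [search_low, search_high] = [0.0, 1.0], max_probes = 4, tolerance <= 0.0625,
# and saturation flagged when the gap to search_high is <= the probe resolution.


def bisect_probes(search_low, search_high, max_probes, feasible):
    """Probe sampling_u by bisection; return the probed points and the best feasible one."""
    low, high = search_low, search_high
    probes, best = [], None
    for _ in range(max_probes):
        mid = (low + high) / 2
        probes.append(mid)
        if feasible(mid):
            best, low = mid, mid  # feasible: move the lower bound up
        else:
            high = mid  # infeasible: move the upper bound down
    return probes, best


def saturation_guard(search_low, search_high, max_probes, tolerance, best_u):
    """Mirror of the guard quoted in Run B; returns (resolution, gap, saturated)."""
    binary_probe_resolution = max(tolerance, (search_high - search_low) / 2**max_probes)
    threshold_gap_to_high = search_high - best_u
    saturated = threshold_gap_to_high <= binary_probe_resolution
    return binary_probe_resolution, threshold_gap_to_high, saturated


# Every probe in the Run B trace was feasible.
probes, best_u = bisect_probes(0.0, 1.0, 4, feasible=lambda u: True)
print(probes)  # [0.5, 0.75, 0.875, 0.9375]

# tolerance=0.01 is a hypothetical value; any tolerance <= 0.0625 gives the same result.
resolution, gap, saturated = saturation_guard(0.0, 1.0, 4, tolerance=0.01, best_u=best_u)
print(resolution, gap, saturated)  # 0.0625 0.0625 True

# Per-GPU rate of the 8-GPU incumbent, rounded to 4 decimals.
print(round(8.033333333333333 / 8, 4))  # 1.0042
```

The probe points match the trace exactly. Once every probe is feasible, the guard fires by construction, because the last bisection step always lands one resolution below `search.high`. This is why saturation says something about the measurement range and nothing about the topology.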
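Design constraints 1 to 4 can be illustrated with a small validator sketch. `CandidateSet`, `CoverageUnit`, `candidate_set_hash`, and `eligible_candidates_remain` are names taken from this note. Everything else is hypothetical, including their fields, the `validate_stop` function, and the status values. The sketch does not describe the prototype's actual interface. It shows one possible rule: saturation marks only the measurement-ceiling unit as covered, and an uncovered high-priority unit blocks the stop.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CoverageUnit:
    # Hypothetical shape: one unit of evidence the study must cover before it may stop.
    unit_id: str
    kind: str      # e.g. "measurement_ceiling" or "topology_contrast"
    priority: str  # "high" or "low"


@dataclass
class CandidateSet:
    units: list
    # unit_id -> "uncovered" | "measured" | "invalidated"; a missing entry means "uncovered".
    status: dict = field(default_factory=dict)

    def candidate_set_hash(self) -> str:
        # Stable digest over the unit definitions, so a stop proof can cite exactly this set.
        payload = json.dumps(sorted([u.unit_id, u.kind, u.priority] for u in self.units))
        return hashlib.sha256(payload.encode()).hexdigest()


def validate_stop(cs: CandidateSet, search_high_saturated: bool) -> dict:
    # Saturation is recorded as measurement coverage only; it never covers other units.
    if search_high_saturated:
        for u in cs.units:
            if u.kind == "measurement_ceiling":
                cs.status[u.unit_id] = "measured"
    remaining = [
        u.unit_id
        for u in cs.units
        if u.priority == "high" and cs.status.get(u.unit_id, "uncovered") == "uncovered"
    ]
    decision = "eligible_candidates_remain" if remaining else "stop"
    return {
        "decision": decision,
        "candidate_set_hash": cs.candidate_set_hash(),
        "remaining": remaining,
    }


# Run B shape: the incumbent saturates search.high, but a TP=4 contrast is still uncovered.
cs = CandidateSet(units=[
    CoverageUnit("ceiling", "measurement_ceiling", "high"),
    CoverageUnit("tp4_dp1", "topology_contrast", "high"),
])
first = validate_stop(cs, search_high_saturated=True)
print(first["decision"], first["remaining"])  # eligible_candidates_remain ['tp4_dp1']

# Only after the contrast is measured (or invalidated) does the same saturation allow a stop.
cs.status["tp4_dp1"] = "measured"
second = validate_stop(cs, search_high_saturated=True)
print(second["decision"])  # stop
```

The hash covers only the unit definitions and not their status. The stop proof therefore pins which candidate set was evaluated, while coverage progress is tracked separately.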
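The "ordered topology lattice" direction can be sketched as a generic neighborhood rule. Unlike the rejected `if tp == 8 and gmu == 0.5` special case, this rule never mentions a specific incumbent. The function `topology_contrasts`, the TP lattice `(1, 2, 4, 8)`, and the "one step down" neighborhood are all hypothetical choices made for illustration. The sketch also assumes that each TP choice divides the next, so that GPU counts stay integral.

```python
# Hypothetical neighborhood rule over an ordered TP lattice; not the harness's actual generator.
# Assumes each TP choice divides the next (as with powers of two), so GPU counts stay integral.


def topology_contrasts(tp, dp, total_gpus, tp_choices=(1, 2, 4, 8)):
    """Return lower-GPU and redistributed contrasts one step down the TP lattice."""
    ordered = sorted(tp_choices)
    idx = ordered.index(tp)  # raises ValueError if tp is not on the lattice
    if idx == 0:
        return []  # already at the smallest TP: no lower step exists
    lower_tp = ordered[idx - 1]
    # Same DP, fewer GPUs: tests whether req/s/GPU improves with a smaller footprint.
    candidates = [("lower_gpu", lower_tp, dp)]
    # Same GPU count, more DP replicas: tests a redistribution of the same budget.
    redistributed_dp = (tp * dp) // lower_tp
    if lower_tp * redistributed_dp <= total_gpus:
        candidates.append(("redistributed", lower_tp, redistributed_dp))
    return candidates


# The bad start from this note: TP=8, DP=1 on an 8-GPU machine.
print(topology_contrasts(8, 1, total_gpus=8))
# [('lower_gpu', 4, 1), ('redistributed', 4, 2)]
```

Each returned tuple would become a high-priority coverage unit. Under the constraints above, the stop stays blocked until each such unit is measured or invalidated.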