
Bad-start stop counterexample - 2026-06-26

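This document records a deliberately constructed adversarial bad-start test. Its purpose is not to prove that the harness is already robust, but to attack the current implementation and check whether it can recover from an obviously unreasonable initial configuration.

Conclusion:

The current production/prototype harness cannot yet support a bad-start robustness claim.

On a bad start with a high GPU count and high TP, it is stopped early by search_high_saturated_by_incumbent
without testing any topology/resource-efficiency contrast.

This is not a problem to be patched with a TP=8 -> TP=4 special-case rule. It exposes a more fundamental stop-authority problem: measurement saturation must not bypass the coverage-relative candidate set.

The counterexample also exposes a gap in the measurement policy: when search.high is too small, tuning is right-censored by the offered-load ceiling. A later change should add auto_search_high, but it may only calibrate automatically within the existing trace sampling space. If search.high=1.0 still cannot push the load to the true capacity frontier, the system must proactively report that the measurement ceiling is insufficient and wait for a human to confirm whether to switch traces, increase trace density, or enable an additional load-generation method.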

Experiment setup

Machine: dash1, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:

tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled

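Expected behavior:

  • the harness should not stop merely because the baseline is feasible;
  • it should generate at least one topology/resource-efficiency contrast candidate;
  • under the req/s/GPU objective, the 8-GPU incumbent must be validated by a lower-GPU or neighboring-topology probe.

Run A: low search.high

The first run keeps the original search.high=0.125.

Result: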

trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0

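Interpretation: this run's offered-load ceiling is too low, so the baseline easily saturates search.high. The run therefore cannot distinguish "the configuration really is good enough" from "the measurement ceiling is too low".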

Run B: corrected high search ceiling

The second run raises search.high to 1.0, keeps the same bad-start configuration, and sets max_trials=3.

Remote artifacts:

session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log

The result is still a stop right after the baseline:

trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0

Probe trace

| sampling_u | request_rate | feasible |
| --- | --- | --- |
| 0.5 | 4.6000 | true |
| 0.75 | 6.5167 | true |
| 0.875 | 7.5000 | true |
| 0.9375 | 8.0333 | true |

The stop is triggered by the current guard's computation:

binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
                        = 0.0625
threshold_gap_to_high = 1.0 - 0.9375
                      = 0.0625

The gap to search.high equals the probe resolution, so the current implementation concludes that the incumbent has saturated search.high.
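The guard arithmetic above can be reproduced with a short sketch. The formula for binary_probe_resolution is taken from this document. The values low=0.0, max_probes=4 and tolerance=0.01, the `gap <= resolution` comparison, and the function names are assumptions chosen to match the four probes in the table. They are not read from the harness source.

```python
def probe_sequence(low, high, max_probes, is_feasible):
    """Binary-search sampling_u; return all probed points and the best feasible one."""
    probes, best = [], None
    for _ in range(max_probes):
        mid = (low + high) / 2
        probes.append(mid)
        if is_feasible(mid):
            best, low = mid, mid  # feasible: raise the lower bound
        else:
            high = mid            # infeasible: lower the upper bound
    return probes, best


def saturation_guard(best_u, low, high, max_probes, tolerance):
    """Return (resolution, gap, saturated) for the incumbent's best sampling_u."""
    resolution = max(tolerance, (high - low) / 2 ** max_probes)
    gap = high - best_u
    # Assumed comparison: saturated when the gap is within one probe resolution.
    return resolution, gap, gap <= resolution


# Every probe is feasible in Run B, so the search climbs toward search.high.
probes, best = probe_sequence(0.0, 1.0, 4, lambda u: True)
resolution, gap, saturated = saturation_guard(best, 0.0, 1.0, 4, tolerance=0.01)
print(probes)                      # [0.5, 0.75, 0.875, 0.9375]
print(resolution, gap, saturated)  # 0.0625 0.0625 True
```

With only four probes the search can never get closer to search.high than 0.0625, so under these assumptions an always-feasible incumbent always trips the guard.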

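Why this is a counterexample

The current objective is SLO-constrained req/s/GPU, not total throughput on a fixed 8 GPUs. An 8-GPU incumbent saturating the offered-load ceiling does not prove that:

  • no lower-TP / lower-GPU configuration achieves higher req/s/GPU;
  • the current topology is optimal in resource efficiency;
  • the runtime knobs have already entered a suitable trust region;
  • the no-LLM harness can recover from a bad start.

The stop is therefore unsound, at least relative to the bad-start robustness claim.

More formally: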

search_high_saturated_by_incumbent
  does not imply
incumbent_validated(topology/resource-efficiency)

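When the objective includes resource efficiency and parallel-size/topology is still tunable, search_high_saturated_by_incumbent can serve only as measurement evidence; it cannot act as stop authority on its own.

Constraints on the new harness design

This counterexample directly constrains the declarative harness:

  1. Before a stop, the complete CandidateSet must be generated and persisted.
  2. The stop proof must reference candidate_set_hash.
  3. If an uncovered high-priority topology/resource-efficiency candidate exists, the validator must return eligible_candidates_remain, even if the incumbent saturates search.high.
  4. search.high saturation may only update measurement coverage; it cannot substitute for incumbent_validated.
  5. Under the req/s/GPU objective, required coverage must include at least one topology or resource-efficiency contrast, unless the StudySpec explicitly fixes the GPU budget and topology.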
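A minimal sketch of how these five constraints could fit together in a validator. The names candidate_set_hash, eligible_candidates_remain and incumbent_validated come from this document. The Candidate fields, the validate_stop signature, the CONTRAST_KINDS set and the required_coverage_missing reason are hypothetical, introduced only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

CONTRAST_KINDS = {"topology", "resource_efficiency"}  # hypothetical labels


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str            # e.g. "topology", "resource_efficiency", "runtime"
    high_priority: bool
    covered: bool        # True once measured or invalidated


def candidate_set_hash(candidates):
    """Stable hash over the full candidate set, independent of list order."""
    rows = sorted((asdict(c) for c in candidates), key=lambda row: row["name"])
    payload = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate_stop(candidates, search_high_saturated, topology_fixed):
    """Saturation is recorded as evidence only; coverage decides the stop."""
    proof = {
        "candidate_set_hash": candidate_set_hash(candidates),
        "measurement_evidence": (
            ["search_high_saturated_by_incumbent"] if search_high_saturated else []
        ),
    }
    contrasts = [c for c in candidates if c.kind in CONTRAST_KINDS]
    uncovered = [c.name for c in contrasts if c.high_priority and not c.covered]
    if not topology_fixed and not contrasts:
        # Hypothetical reason: rule 5 requires at least one contrast candidate.
        return {**proof, "allow_stop": False, "reason": "required_coverage_missing"}
    if not topology_fixed and uncovered:
        return {**proof, "allow_stop": False,
                "reason": "eligible_candidates_remain", "uncovered": uncovered}
    return {**proof, "allow_stop": True, "reason": "incumbent_validated"}


candidates = [
    Candidate("tp4_dp1", "topology", high_priority=True, covered=False),
    Candidate("gmu_0.9", "runtime", high_priority=False, covered=False),
]
verdict = validate_stop(candidates, search_high_saturated=True, topology_fixed=False)
print(verdict["allow_stop"], verdict["reason"])  # False eligible_candidates_remain
```

In this sketch the saturation flag never influences allow_stop. It only appears in measurement_evidence, which is the separation that rule 4 asks for.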

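Measurement policy constraints:

  1. auto_search_high may automatically raise search.high based on the trace's sampling threshold and the target GPU scale, so that a low ceiling does not make every config saturate prematurely.
  2. Automatic calibration must not exceed the trace's native upper bound. Under the current sampling_u semantics, search.high=1.0 means the full trace.
  3. If the full trace is still easily saturated by the incumbent, the validator must not pretend the search is complete; it should emit measurement_ceiling_insufficient or record that fact as a blocking item in the stop proof.
  4. The system must not automatically use repeated windows, synthetic arrivals, or replay scaling to enlarge the workload, unless the StudySpec explicitly enables it or a human confirms that the experiment is meant to test a synthetic/offline stress regime.
  5. measurement_ceiling_insufficient and eligible_candidates_remain are different problems: the former says the load ceiling is insufficient, the latter says mechanism coverage is incomplete. If either is present, the result must not be written up as a bad-start robustness success.

This also means the repair direction must not be: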

if tp == 8 and gmu == 0.5: try tp = 4

The correct direction should be:

ordered topology lattice + resource-efficiency objective
  -> candidate set includes lower/redistributed topology contrast
  -> stop is blocked until that coverage unit is measured or invalidated
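One way to read this direction is as a neighbor rule on an ordered TP lattice, rather than a rule keyed to specific values. The sketch below is illustrative only: the lattice levels, the function names and the "lower_gpu" / "redistributed" labels are assumptions, not the harness's actual candidate generator.

```python
def topology_contrasts(tp, dp, tp_levels=(1, 2, 4, 8)):
    """Return neighboring topologies one step down the ordered TP lattice."""
    idx = tp_levels.index(tp)
    if idx == 0:
        return []  # already at the lowest TP level; no lower contrast exists
    lower_tp = tp_levels[idx - 1]
    return [
        # Fewer GPUs in total: tests resource efficiency directly.
        {"tp": lower_tp, "dp": dp, "kind": "lower_gpu"},
        # Same GPU count, shifted from TP to DP: tests redistribution.
        {"tp": lower_tp, "dp": dp * tp // lower_tp, "kind": "redistributed"},
    ]


def stop_blocked(contrasts, measured, invalidated):
    """Stop stays blocked until every contrast is measured or invalidated."""
    resolved = set(measured) | set(invalidated)
    return any((c["tp"], c["dp"]) not in resolved for c in contrasts)


contrasts = topology_contrasts(tp=8, dp=1)
print([(c["tp"], c["dp"], c["kind"]) for c in contrasts])
# [(4, 1, 'lower_gpu'), (4, 2, 'redistributed')]
print(stop_blocked(contrasts, measured=[(4, 1)], invalidated=[]))  # True
```

The same function yields TP2 contrasts for a TP4 start, so no TP=8 special case is needed.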

Current verdict

Current production harness:

prototype, not yet fundamental

New declarative prototype:

promising substrate, but not production-proven

It has exercised the minimal interfaces for CandidateSet, CoverageUnit, failure region, and coverage-relative stop end to end, but it is not yet wired into the real tuning loop and has not yet demonstrated convergence over a bad-start distribution.

The next P0 gate is therefore:

Implement coverage-relative stop authority first, then rerun the bad-start distribution.

2026-06-26 implementation validation

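Commit c8a0f98 implements the first slice of the production fix:

  • a search.auto_high schema, disabled by default and compatible with old configs;
  • at trial materialization, the effective search.high is resolved within the existing trace sampling space;
  • trial_spec.json and result.json record the auto-high / measurement evidence;
  • search_high_saturated_by_incumbent is downgraded to measurement evidence;
  • for studies with a req/s/GPU objective and variable topology, high saturation cannot directly authorize a stop;
  • when the GPU product is fixed but TP/DP redistribution is tunable, the topology still counts as variable;
  • when the auto-high ceiling is below search.low, no invalid search interval is generated.

Local validation: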

PYTHONPATH=src python3 -m unittest discover -s tests
Ran 143 tests OK

dash1 validation

run label = adversarial-badstart-autohigh-c8a0f98-20260626T122622Z
git sha   = c8a0f9870eac5438fb19be8edf1534a893723ab9
machine   = dash1, 8x H20

The spec still uses the bad start:

tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true

Auto-high resolution

original_high       = 1.0
effective_high      = 0.9979913161468553
trace_max_sampling_u = 0.9979913161468553
reason              = search_high_lowered_to_trace_ceiling
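The resolution reported here is consistent with clamping the requested ceiling to the largest sampling_u present in the trace. The sketch below covers only that clamping case and the guard against an empty interval. The function name, the search.low value of 0.0, and every reason string except search_high_lowered_to_trace_ceiling are assumptions.

```python
def resolve_auto_high(original_high, low, trace_max_sampling_u):
    """Clamp search.high to the trace ceiling; refuse an empty search interval."""
    effective_high = min(original_high, trace_max_sampling_u)
    if effective_high <= low:
        # Hypothetical reason string: no valid [low, high] interval exists.
        return {"effective_high": None,
                "reason": "auto_high_ceiling_below_search_low"}
    if effective_high < original_high:
        reason = "search_high_lowered_to_trace_ceiling"
    else:
        reason = "search_high_unchanged"  # hypothetical reason string
    return {"effective_high": effective_high, "reason": reason}


resolved = resolve_auto_high(
    original_high=1.0,
    low=0.0,  # assumed; the document does not state search.low for this run
    trace_max_sampling_u=0.9979913161468553,
)
print(resolved)
```

Returning no interval when the ceiling falls below search.low mirrors the last item in the commit's change list, which forbids generating an invalid search interval.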

Result:

| trial | config patch | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | --- | --- | --- | --- |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | tensor-parallel-size=4 | 0.810867944369 | 6.95 | 1.7375 | 0.9784 |
| trial-0003 | tensor-parallel-size=8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |

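Key conclusions:

The old failure is fixed:
after the baseline, harness-stop-0002/search_high_saturated_by_incumbent is no longer produced.

The new implementation produces harness-proposal-0002 and tests the TP4 topology contrast.
TP4 raises the best req/s/GPU from 1.0000 to 1.7375.

This shows that the first slice of the fix resolves the "measurement saturation bypasses topology coverage" problem.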
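The per-GPU figures in the result table follow from dividing request_rate by the GPU count. The sketch assumes the GPU count is TP x DP, which matches the reported numbers; the helper name is hypothetical.

```python
def req_per_gpu(request_rate, tp, dp):
    """SLO-constrained requests per second, normalized by GPUs used (assumed TP * DP)."""
    return request_rate / (tp * dp)


baseline = req_per_gpu(8.00, tp=8, dp=1)  # trial-0001: 8 GPUs
tp4 = req_per_gpu(6.95, tp=4, dp=1)       # trial-0002: 4 GPUs
print(round(baseline, 4), round(tp4, 4))  # 1.0 1.7375
print(round(tp4 / baseline - 1, 4))       # 0.7375
```

TP4 serves a lower total request rate (6.95 versus 8.00) but uses half the GPUs, which is why it wins under the req/s/GPU objective while losing on total throughput.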

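However, trial-0003 exposes a new blocker:

The current no-repeat check is still based on the patch signature rather than a normalized full-config signature.

For this study's base config, tensor-parallel-size=8 is a no-op equivalent to the baseline TP8, yet the system still executes it as a new proposal. The next P0 slice must therefore implement:

  1. a normalized full-config signature;
  2. a CandidateSet snapshot containing both eligible and blocked candidates;
  3. a blocked reason, for example blocked_noop_equivalent_to_tested_full_config;
  4. both measurement_ceiling_* and eligible_candidates_remain presented together in the stop/report.

The current verdict is therefore updated to: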

P0 measurement/stop-order slice: passed.
P0 full coverage-relative harness: not yet passed.

2026-06-26 normalized full-config validation

Commit 48911b6 fixes the new blocker exposed in the previous section: no-repeat no longer compares only the patch signature; it compares the normalized effective full config.

Implementation semantics:

effective_config =
  normalize(base_envs + env_patch,
            base_flags + flag_patch)

no_repeat_signature = stable_json(effective_config)

As a result, the validator sees the following two proposals as the same full config:

baseline patch: {}
noop patch:    {"tensor-parallel-size": 8}
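A runnable sketch of the semantics above, assuming that normalization amounts to merging each patch over its base and that stable_json is key-sorted JSON. The real normalize step may also canonicalize value types or aliases, which this sketch does not attempt. The function signatures and the base_flags values (taken from the bad-start spec) are illustrative.

```python
import json


def effective_config(base_envs, env_patch, base_flags, flag_patch):
    """Merge patches over the base config; patch values win on conflict."""
    return {
        "envs": {**base_envs, **env_patch},
        "flags": {**base_flags, **flag_patch},
    }


def stable_json(obj):
    """Serialize with sorted keys so equal configs yield identical strings."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def no_repeat_signature(base_envs, env_patch, base_flags, flag_patch):
    return stable_json(effective_config(base_envs, env_patch, base_flags, flag_patch))


base_flags = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "gpu-memory-utilization": 0.5,
    "max-num-seqs": 8,
}
baseline_sig = no_repeat_signature({}, {}, base_flags, {})
noop_sig = no_repeat_signature({}, {}, base_flags, {"tensor-parallel-size": 8})
tp4_sig = no_repeat_signature({}, {}, base_flags, {"tensor-parallel-size": 4})

print(baseline_sig == noop_sig)  # True: the no-op patch repeats the baseline
print(baseline_sig == tp4_sig)   # False: TP4 is a genuinely new full config
```

A patch-level signature would treat `{}` and `{"tensor-parallel-size": 8}` as different proposals, which is the retest that the previous validation run executed as trial-0003.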

Local validation:

PYTHONPATH=src python3 -m unittest discover -s tests
Ran 145 tests OK

dash1 validation

run label = adversarial-badstart-fullsig-48911b6-20260626T133112Z
git sha   = 48911b658bbf052d70d952d1cdf55ad6b50ba7a5
machine   = dash1, 8x H20

The spec still uses the same adversarial bad start:

tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
LLM endpoint disabled

Result:

| trial | proposal | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | --- | --- | --- | --- |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | tensor-parallel-size=4 | 0.810867944369 | 6.95 | 1.7375 | 0.9832 |
| trial-0003 | tensor-parallel-size=4, gpu-memory-utilization=0.9 | 0.935616858887 | 8.00 | 2.0000 | 1.0000 |

Key observation:

Old trial-0003:
  {"tensor-parallel-size": 8}
  -> equivalent to the baseline, but still executed

New trial-0003:
  {"tensor-parallel-size": 4, "gpu-memory-utilization": 0.9}
  -> continues testing KV/cache headroom on the already validated TP4 topology

This shows that the normalized full-config signature now prevents patch-level no-op retests.

Mechanism:

  1. The baseline TP8 saturating the search ceiling is recorded only as measurement evidence.
  2. Because the objective is req/s/GPU and the topology/resource-efficiency contrast is still uncovered, the validator does not allow a stop.
  3. The harness first tests the adjacent lower-TP topology; TP4 raises req/s/GPU from 1.0 to 1.7375.
  4. No-repeat uses the full-config signature to block the equivalent TP8 patch.
  5. On the settled TP4 topology, the harness goes on to test runtime headroom; gmu=0.9 raises req/s/GPU to 2.0.

The current verdict is updated to:

P0 measurement/stop-order slice: passed.
P0 normalized full-config no-repeat slice: passed.
P0 single adversarial bad-start recovery: passed for this case.
P0 distribution-level bad-start robustness: not yet proven.