# Bad-start stop counterexample - 2026-06-26
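This document records a deliberately constructed adversarial bad-start test. Its purpose is not to prove that the harness is already robust, but to attack the current implementation and check whether it recovers from an obviously unreasonable initial configuration.

Conclusion: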
```text
The current production/prototype harness cannot yet support a bad-start robustness claim.
At a bad starting point with high GPU count and high TP, it is stopped early by
search_high_saturated_by_incumbent without testing a topology/resource-efficiency contrast.
```
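This is not a problem that calls for a special-case `TP=8 -> TP=4` rule. It exposes a more fundamental stop-authority problem: measurement saturation must not bypass the coverage-relative candidate set.

The counterexample also exposes a gap in the measurement policy: when `search.high` is too small, tuning is right-truncated by the offered-load ceiling. `auto_search_high` should be added later, but it may only auto-calibrate within the existing trace sampling space. If `search.high=1.0` still cannot reach the true capacity frontier, the system must proactively report that the measurement ceiling is insufficient and wait for a human to confirm whether to switch traces, increase trace density, or enable an additional load-generation method.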
## Experimental setup
Machine: `dash1`, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:
```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled
```
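Expected behavior:
- the harness should not stop merely because the baseline is feasible;
- it should at least generate a topology/resource-efficiency contrast candidate;
- for the `req/s/GPU` objective, the 8-GPU incumbent must be validated by a lower-GPU or neighboring-topology probe.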
## Run A: low search.high
The first run keeps the original `search.high=0.125`.

Result:
```text
trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0
```
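Interpretation: the offered-load ceiling in this run is too low, so the baseline easily saturates `search.high`. The run therefore cannot distinguish "the configuration is genuinely good enough" from "the measurement ceiling is too low".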
## Run B: corrected high search ceiling
The second run raises `search.high` to `1.0` and keeps the same bad-start configuration, with `max_trials=3`.

Remote artifacts:
```text
session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log
```
The result is still a stop right after the baseline:
```text
trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0
```
Probe trace:

| sampling_u | request_rate | feasible |
| --- | ---: | --- |
| 0.5 | 4.6000 | true |
| 0.75 | 6.5167 | true |
| 0.875 | 7.5000 | true |
| 0.9375 | 8.0333 | true |

The stop is triggered by the computation in the current guard:
```text
binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
                        = 0.0625
threshold_gap_to_high   = 1.0 - 0.9375
                        = 0.0625
```
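Because the gap to the ceiling is no larger than the probe resolution, the current implementation considers the incumbent to have saturated `search.high`.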
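
The guard arithmetic can be reproduced with a minimal sketch. The function name, the `<=` comparison, and the inputs `low=0.0`, `tolerance=0.01`, and `max_probes=4` are assumptions chosen to be consistent with the four-probe trace and the 0.0625 resolution reported for Run B; they are not read from the harness source.

```python
def saturates_search_high(best_u, low, high, tolerance, max_probes):
    """Return (saturated, resolution, gap) for a binary-search probe result.

    Assumption: the guard treats the incumbent as saturating `search.high`
    when the gap to the ceiling is no larger than the probe resolution.
    """
    resolution = max(tolerance, (high - low) / 2 ** max_probes)
    gap = high - best_u
    return gap <= resolution, resolution, gap


# Hypothetical inputs matching Run B: four probes over [0.0, 1.0].
saturated, resolution, gap = saturates_search_high(
    best_u=0.9375, low=0.0, high=1.0, tolerance=0.01, max_probes=4
)
print(saturated, resolution, gap)  # True 0.0625 0.0625
```

Under these assumptions, 0.9375 is the highest point a four-probe binary search can reach when every probe is feasible, so the guard fires no matter how much headroom the configuration actually has.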
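## Why this is a counterexample
The current objective is SLO-constrained `req/s/GPU`, not total throughput at a fixed 8 GPUs. An 8-GPU incumbent that saturates the offered-load ceiling does not prove that:
- no lower-TP / lower-GPU configuration achieves a higher `req/s/GPU`;
- the current topology is optimal in resource efficiency;
- the runtime knobs have entered a suitable trust region;
- the no-LLM harness can recover from a bad start.

So this stop is unsound, at least with respect to the bad-start robustness claim.

More formally: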
```text
search_high_saturated_by_incumbent
does not imply
incumbent_validated(topology/resource-efficiency)
```
When the objective includes resource efficiency and parallel-size/topology is still tunable, `search_high_saturated_by_incumbent` can serve only as measurement evidence; it cannot act as stop authority on its own.
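## Constraints on the new harness design
This counterexample directly constrains the declarative harness:
1. Before a stop, the complete `CandidateSet` must be generated and persisted.
2. The stop proof must reference `candidate_set_hash`.
3. If an uncovered high-priority topology/resource-efficiency candidate exists, the validator must return `eligible_candidates_remain`, even if the incumbent saturates `search.high`.
4. `search.high` saturation may only update measurement coverage; it cannot substitute for `incumbent_validated`.
5. For a `req/s/GPU` objective, required coverage must include at least one topology or resource-efficiency contrast, unless the StudySpec explicitly fixes the GPU budget and topology.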
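
The stop constraints listed here can be sketched as a small validator. This is an illustration under stated assumptions, not the harness implementation: the `Candidate` fields, the hash layout, and the function names are hypothetical, and only the decision labels `eligible_candidates_remain` and `incumbent_validated` and the `candidate_set_hash` field come from this document.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str       # e.g. "topology" or "runtime"
    priority: str   # "high" or "low"
    covered: bool   # True once measured or invalidated


def candidate_set_hash(candidates):
    """Hash the full candidate set in a stable order (layout is hypothetical)."""
    rows = sorted((asdict(c) for c in candidates), key=lambda row: row["name"])
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode("utf-8")).hexdigest()


def validate_stop(candidates, saturates_search_high):
    """Decide whether a stop is authorized, relative to candidate coverage."""
    uncovered = [c.name for c in candidates if c.priority == "high" and not c.covered]
    return {
        "candidate_set_hash": candidate_set_hash(candidates),
        # Saturation is recorded as evidence only; it never decides the outcome.
        "measurement_evidence": {"search_high_saturated_by_incumbent": saturates_search_high},
        "uncovered": uncovered,
        "decision": "eligible_candidates_remain" if uncovered else "incumbent_validated",
    }


candidates = [
    Candidate("baseline-tp8", "topology", "high", covered=True),
    Candidate("tp4-contrast", "topology", "high", covered=False),
]
proof = validate_stop(candidates, saturates_search_high=True)
print(proof["decision"])  # eligible_candidates_remain
```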
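
Measurement policy constraints:
1. `auto_search_high` may automatically raise `search.high` based on the trace's sampling threshold and the target GPU scale, so that a low ceiling does not make every config saturate prematurely.
2. Auto-calibration must not exceed the trace's native upper bound. Under the current `sampling_u` semantics, `search.high=1.0` means the complete trace.
3. If the incumbent still easily saturates the complete trace, the validator must not pretend the search is complete; it should emit `measurement_ceiling_insufficient` or record that fact as a blocking item in the stop proof.
4. The system must not automatically use repeated windows, synthetic arrivals, or replay scaling to enlarge the workload, unless the StudySpec explicitly enables it or a human confirms that the experiment is meant to measure a synthetic/offline stress regime.
5. `measurement_ceiling_insufficient` and `eligible_candidates_remain` are different problems: the former says the load ceiling is insufficient, the latter says mechanism coverage is incomplete. If either is present, the result must not be reported as a bad-start robustness success.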
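
The distinction between the two blockers can be made concrete with a short sketch. The function name, its parameters, and the rule used to detect an insufficient ceiling are assumptions for illustration; only the two blocker labels are taken from this document.

```python
def robustness_blockers(saturates_high, effective_high, trace_max_u, uncovered_units):
    """Collect blockers that forbid reporting a bad-start robustness success.

    Assumption: the ceiling counts as insufficient when the incumbent saturates
    a search ceiling that already sits at the trace's native upper bound.
    """
    blockers = []
    if saturates_high and effective_high >= trace_max_u:
        blockers.append("measurement_ceiling_insufficient")  # load ceiling problem
    if uncovered_units:
        blockers.append("eligible_candidates_remain")        # coverage problem
    return blockers


both = robustness_blockers(True, 1.0, 1.0, ["tp4-contrast"])
clean = robustness_blockers(False, 1.0, 1.0, [])
print(both)   # ['measurement_ceiling_insufficient', 'eligible_candidates_remain']
print(clean)  # []
```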

This also shows that the repair direction must not be:
```text
if tp == 8 and gmu == 0.5: try tp = 4
```
The right direction should be:
```text
ordered topology lattice + resource-efficiency objective
-> candidate set includes lower/redistributed topology contrast
-> stop is blocked until that coverage unit is measured or invalidated
```
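A minimal sketch of how lower and redistributed topology contrasts might be enumerated from an ordered lattice. The power-of-two TP/DP lattice, the one-step TP adjacency rule, the function name, and the `gpu_budget` parameter are all assumptions for illustration; the lattice actually used by the harness may differ.

```python
SIZES = [1, 2, 4, 8, 16]  # assumed ordered lattice of parallel sizes


def topology_contrasts(tp, dp, gpu_budget):
    """Enumerate (tp, dp, kind) neighbors that use fewer GPUs ("lower") or
    the same number of GPUs in a different split ("redistributed").

    `tp` must be one of SIZES; only TP values one lattice step away from
    the incumbent (or equal to it) are considered.
    """
    current_gpus = tp * dp
    contrasts = []
    for cand_tp in SIZES:
        if abs(SIZES.index(cand_tp) - SIZES.index(tp)) > 1:
            continue
        for cand_dp in SIZES:
            gpus = cand_tp * cand_dp
            if (cand_tp, cand_dp) == (tp, dp) or gpus > gpu_budget:
                continue
            if gpus < current_gpus:
                contrasts.append((cand_tp, cand_dp, "lower"))
            elif gpus == current_gpus:
                contrasts.append((cand_tp, cand_dp, "redistributed"))
    return contrasts


neighbors = topology_contrasts(tp=8, dp=1, gpu_budget=8)
print(neighbors)  # [(4, 1, 'lower'), (4, 2, 'redistributed')]
```

For the bad start used here (TP8, DP1 on 8 GPUs), this sketch yields a TP4 probe on 4 GPUs and a TP4/DP2 redistribution, with no rule that mentions `tp == 8` or `gmu == 0.5`.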
## Current verdict
Current production harness:
```text
prototype, not yet fundamental
```
New declarative prototype:
```text
promising substrate, but not production-proven
```
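It has already exercised the minimal interfaces for `CandidateSet`, `CoverageUnit`, failure regions, and coverage-relative stop end to end, but it has not yet been wired into the real tuning loop, nor has it demonstrated convergence over a bad-start distribution.

The next P0 gate is therefore: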
```text
Implement coverage-relative stop authority first, then rerun the bad-start distribution.
```
## 2026-06-26 implementation validation
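Commit `c8a0f98` implements the first slice of the production fix:
- a `search.auto_high` schema, disabled by default and compatible with old configs;
- the effective `search.high` is resolved within the existing trace sampling space at trial materialization time;
- auto-high / measurement evidence is written to `trial_spec.json` and `result.json`;
- `search_high_saturated_by_incumbent` is downgraded to measurement evidence;
- for studies with a `req/s/GPU` objective and tunable topology, high saturation cannot directly authorize a stop;
- when the GPU product is fixed but TP/DP redistribution is tunable, the topology still counts as tunable;
- no illegal search interval is generated when the auto-high ceiling is below `search.low`.

Local validation: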
```text
PYTHONPATH=src python3 -m unittest discover -s tests
Ran 143 tests OK
```
dash1 validation:
```text
run label = adversarial-badstart-autohigh-c8a0f98-20260626T122622Z
git sha = c8a0f9870eac5438fb19be8edf1534a893723ab9
machine = dash1, 8x H20
```
The spec still uses the bad start:
```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
```
Auto-high resolution:
```text
original_high = 1.0
effective_high = 0.9979913161468553
trace_max_sampling_u = 0.9979913161468553
reason = search_high_lowered_to_trace_ceiling
```
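The resolution step can be sketched as a clamp to the trace ceiling. This is a hedged illustration: the function name, the returned dictionary layout, the `search_low=0.0` input, and the `search_high_unchanged` label are assumptions. Only `search_high_lowered_to_trace_ceiling`, the field names in the block above, and the numeric values come from this run. The sketch covers only the lowering case seen here, not the raising of a too-small `search.high`.

```python
def resolve_search_high(original_high, search_low, trace_max_sampling_u):
    """Clamp search.high to the trace's native ceiling.

    Returns no interval (None) when the ceiling is not above search.low,
    so that an illegal search interval is never produced.
    """
    effective_high = min(original_high, trace_max_sampling_u)
    if effective_high < original_high:
        reason = "search_high_lowered_to_trace_ceiling"
    else:
        reason = "search_high_unchanged"  # hypothetical label
    interval = (search_low, effective_high) if effective_high > search_low else None
    return {
        "original_high": original_high,
        "effective_high": effective_high,
        "trace_max_sampling_u": trace_max_sampling_u,
        "reason": reason,
        "interval": interval,
    }


resolved = resolve_search_high(1.0, 0.0, 0.9979913161468553)
print(resolved["reason"])  # search_high_lowered_to_trace_ceiling
```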
Result:

| trial | config patch | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | ---: | ---: | ---: | ---: |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9784 |
| trial-0003 | `tensor-parallel-size=8` | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |

Key conclusions:
```text
The old failure is fixed:
the baseline is no longer followed by harness-stop-0002/search_high_saturated_by_incumbent.
The new implementation produces harness-proposal-0002 and tests the TP4 topology contrast.
TP4 raises the best req/s/GPU from 1.0000 to 1.7375.
```
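The per-GPU figures in the table follow from dividing the total request rate by the GPU count. A small worked check, assuming the GPU count equals TP x DP (which is consistent with the table but is not stated explicitly in the run artifacts); the function name is hypothetical.

```python
def req_per_sec_per_gpu(request_rate, tp, dp=1):
    """Objective value, assuming the config occupies tp * dp GPUs."""
    return request_rate / (tp * dp)


baseline_tp8 = req_per_sec_per_gpu(8.00, tp=8)  # trial-0001
contrast_tp4 = req_per_sec_per_gpu(6.95, tp=4)  # trial-0002
print(baseline_tp8, contrast_tp4)  # 1.0 1.7375
```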
This shows that the first slice fixes the "measurement saturation bypasses topology coverage" problem.

However, trial-0003 exposes a new blocker:
```text
The current no-repeat check is still based on the patch signature rather than a normalized full-config signature.
```
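`tensor-parallel-size=8` is a no-op against this study's base config and is equivalent to the baseline TP8, yet the system still executed it as a new proposal. The next P0 slice must therefore implement:
1. a normalized full-config signature;
2. a CandidateSet snapshot that includes both eligible and blocked candidates;
3. a blocked reason, for example `blocked_noop_equivalent_to_tested_full_config`;
4. presentation of both `measurement_ceiling_*` and `eligible_candidates_remain` in the stop/report output.

The current verdict is therefore updated to: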
```text
P0 measurement/stop-order slice: passed.
P0 full coverage-relative harness: not yet passed.
```
## 2026-06-26 normalized full-config validation
Commit `48911b6` fixes the new blocker exposed in the previous section: no-repeat no longer compares only the patch signature, but the normalized effective full config.

Implementation semantics:
```text
effective_config =
    normalize(base_envs + env_patch,
              base_flags + flag_patch)
no_repeat_signature = stable_json(effective_config)
```
The following two proposals are therefore the same full config from the validator's point of view:
```text
baseline patch: {}
noop patch: {"tensor-parallel-size": 8}
```
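A minimal sketch of how such a signature could make the baseline and the no-op patch collide. The `normalize` rule (stringifying keys and values), the use of `json.dumps` with `sort_keys=True` as the stable JSON encoding, and the function names are assumptions; the base flags are the bad-start values from this document.

```python
import json


def normalize(envs, flags):
    """Hypothetical normalization: string keys and string values, so that
    8 and "8" produce the same effective config."""
    return {
        "envs": {str(k): str(v) for k, v in envs.items()},
        "flags": {str(k): str(v) for k, v in flags.items()},
    }


def no_repeat_signature(base_envs, base_flags, env_patch, flag_patch):
    """Signature of the effective full config, not of the patch."""
    effective = normalize({**base_envs, **env_patch}, {**base_flags, **flag_patch})
    # sort_keys makes the encoding independent of dict insertion order.
    return json.dumps(effective, sort_keys=True, separators=(",", ":"))


base_flags = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "gpu-memory-utilization": 0.5,
    "max-num-seqs": 8,
}
baseline = no_repeat_signature({}, base_flags, {}, {})
noop = no_repeat_signature({}, base_flags, {}, {"tensor-parallel-size": 8})
tp4 = no_repeat_signature({}, base_flags, {}, {"tensor-parallel-size": 4})
print(baseline == noop, baseline == tp4)  # True False
```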
Local validation:
```text
PYTHONPATH=src python3 -m unittest discover -s tests
Ran 145 tests OK
```
dash1 validation:
```text
run label = adversarial-badstart-fullsig-48911b6-20260626T133112Z
git sha = 48911b658bbf052d70d952d1cdf55ad6b50ba7a5
machine = dash1, 8x H20
```
The spec still uses the same adversarial bad start:
```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
LLM endpoint disabled
```
Result:

| trial | proposal | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | ---: | ---: | ---: | ---: |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9832 |
| trial-0003 | `tensor-parallel-size=4`, `gpu-memory-utilization=0.9` | 0.935616858887 | 8.00 | 2.0000 | 1.0000 |

Key observation:
```text
Old trial-0003:
  {"tensor-parallel-size": 8}
  -> equivalent to the baseline, but still executed
New trial-0003:
  {"tensor-parallel-size": 4, "gpu-memory-utilization": 0.9}
  -> continues testing KV/cache headroom on the already validated TP4 topology
```
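This shows that the normalized full-config signature now prevents patch-level no-ops from being re-tested.

Mechanism:
1. the baseline TP8 saturating the search ceiling is recorded only as measurement evidence;
2. because the objective is `req/s/GPU` and the topology/resource-efficiency contrast is still uncovered, the validator does not allow a stop;
3. the harness first tests the adjacent lower-TP topology, and TP4 raises `req/s/GPU` from `1.0` to `1.7375`;
4. no-repeat uses the full-config signature to block the equivalent TP8 patch;
5. the harness continues testing runtime headroom on the settled TP4 topology, and `gmu=0.9` raises `req/s/GPU` to `2.0`.

The current verdict is updated to: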
```text
P0 measurement/stop-order slice: passed.
P0 normalized full-config no-repeat slice: passed.
P0 single adversarial bad-start recovery: passed for this case.
P0 distribution-level bad-start robustness: not yet proven.
```