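# Bad-start stop counterexample - 2026-06-26

This document records a deliberately constructed adversarial bad-start test. Its purpose is not to
show that the harness is already robust, but to attack the current implementation and check whether
it can recover from an obviously unreasonable initial configuration.

Conclusion:

```text
The current production/prototype harness cannot yet support a bad-start robustness claim.

From a bad start with high GPU count and high TP, it is stopped early by
search_high_saturated_by_incumbent, without testing any topology/resource-efficiency contrast.
```

This is not a problem that calls for a special-case `TP=8 -> TP=4` rule. It exposes a more basic
stop-authority problem: measurement saturation must not bypass the coverage-relative candidate set.

The counterexample also exposes a gap in the measurement policy: when `search.high` is too small,
tuning is right-truncated by the offered-load ceiling. `auto_search_high` should be added later, but
it may only auto-calibrate within the existing trace sampling space. If `search.high=1.0` still
cannot push load to the true capacity frontier, the system must actively report that the measurement
ceiling is insufficient and wait for a human to confirm whether to switch traces, increase trace
density, or enable an additional load-generation method.
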
## Experimental setup

Machine: `dash1`, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled
```

Expected behavior:

- the harness should not stop merely because the baseline is feasible;
- it should generate at least one topology/resource-efficiency contrast candidate;
- for a `req/s/GPU` objective, the 8-GPU incumbent must be validated against a lower-GPU or
  neighboring-topology probe.

## Run A: low search.high

The first run keeps the original `search.high=0.125`.

Result:

```text
trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0
```

Interpretation: the offered-load ceiling in this run is too low, so the baseline saturates
`search.high` easily. The run therefore cannot distinguish "the configuration is genuinely good
enough" from "the measurement ceiling is too low".

## Run B: corrected high search ceiling

The second run raises `search.high` to `1.0` and keeps the same bad-start configuration, with
`max_trials=3`.

Remote artifacts:

```text
session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log
```

The run still stops right after the baseline:

```text
trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0
```

Probe trace:

| sampling_u | request_rate | feasible |
| --- | ---: | --- |
| 0.5 | 4.6000 | true |
| 0.75 | 6.5167 | true |
| 0.875 | 7.5000 | true |
| 0.9375 | 8.0333 | true |

The stop fires because the current guard computes:

```text
binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
                        = 0.0625
threshold_gap_to_high   = 1.0 - 0.9375
                        = 0.0625
```

The remaining gap to `search.high` equals the probe resolution, so the current implementation
concludes that the incumbent has saturated `search.high`.

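
The guard arithmetic above can be replayed in a few lines. This is a minimal sketch, not the harness code: the function names, `low=0.0`, `max_probes=4`, `tolerance=0.01`, and the `<=` comparison are assumptions chosen to be consistent with the Run B probe trace and the two `0.0625` values.

```python
from typing import Callable, List


def binary_probe_resolution(low: float, high: float, max_probes: int, tolerance: float) -> float:
    """Finest sampling_u step that max_probes bisection steps can resolve."""
    return max(tolerance, (high - low) / 2 ** max_probes)


def bisect_probes(low: float, high: float, max_probes: int,
                  feasible: Callable[[float], bool]) -> List[float]:
    """Return the probed sampling_u values; a feasible probe raises the lower bound."""
    probes = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        probes.append(mid)
        if feasible(mid):
            low = mid
        else:
            high = mid
    return probes


def saturates_high(best_u: float, high: float, resolution: float) -> bool:
    """Assumed guard: the gap to high is no larger than one resolution step."""
    return (high - best_u) <= resolution


# Assumptions: low=0.0, max_probes=4, tolerance=0.01; every probe is feasible, as in Run B.
probes = bisect_probes(0.0, 1.0, 4, feasible=lambda u: True)
resolution = binary_probe_resolution(0.0, 1.0, 4, tolerance=0.01)
print(probes)                                                    # [0.5, 0.75, 0.875, 0.9375]
print(resolution, saturates_high(probes[-1], 1.0, resolution))   # 0.0625 True
```

Under these assumptions, a configuration that passes every probe always lands exactly one resolution step below `search.high`, so the guard fires for any configuration that never fails, regardless of how much headroom it actually has.
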
## Why this is a counterexample

The current objective is SLO-constrained `req/s/GPU`, not total throughput at a fixed 8 GPUs.
An 8-GPU incumbent saturating the offered-load ceiling does not prove that:

- no lower-TP / lower-GPU configuration achieves a higher `req/s/GPU`;
- the current topology is the most resource-efficient one;
- the runtime knobs have reached a suitable trust region;
- the no-LLM harness can recover from a bad start.

The stop is therefore unsound, at least with respect to the bad-start robustness claim.

More formally:

```text
search_high_saturated_by_incumbent
  does not imply
incumbent_validated(topology/resource-efficiency)
```

When the objective includes resource efficiency and parallel-size/topology is still tunable,
`search_high_saturated_by_incumbent` can serve only as measurement evidence; it cannot act as stop
authority on its own.

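
To make the gap concrete, the arithmetic below takes the Run B baseline (8.0333 req/s on 8 GPUs) and computes the total request rate a 4-GPU configuration would need in order to match it on `req/s/GPU`. The function name and the choice of a 4-GPU candidate are illustrative assumptions, not harness code.

```python
def break_even_total_rate(incumbent_rate: float, incumbent_gpus: int, candidate_gpus: int) -> float:
    """Total req/s a candidate must sustain to match the incumbent's req/s/GPU."""
    return incumbent_rate / incumbent_gpus * candidate_gpus


# Run B baseline: 8.033333333333333 req/s on 8 GPUs (about 1.0042 req/s/GPU).
needed = break_even_total_rate(8.033333333333333, 8, 4)
print(round(needed, 4))  # 4.0167
```

A 4-GPU candidate only has to sustain about 4.02 req/s within the SLO to tie the incumbent. The saturated 8-GPU measurement says nothing about whether that is reachable, which is why it cannot validate the incumbent.
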
## Constraints on the new harness design

This counterexample directly constrains the declarative harness:

1. A complete `CandidateSet` must be generated and persisted before any stop.
2. The stop proof must reference `candidate_set_hash`.
3. If an uncovered high-priority topology/resource-efficiency candidate exists, the validator
   must return `eligible_candidates_remain`, even when the incumbent saturates `search.high`.
4. `search.high` saturation may only update measurement coverage; it cannot substitute for
   `incumbent_validated`.
5. For a `req/s/GPU` objective, required coverage must include at least one topology or
   resource-efficiency contrast, unless the StudySpec explicitly fixes both the GPU budget and the
   topology.

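
The constraints above could be sketched roughly as follows. This is a hypothetical illustration only: the `Candidate` fields, the hashing scheme, and the shape of the returned dict are assumptions; only the names `candidate_set_hash`, `eligible_candidates_remain`, `incumbent_validated`, and `search_high_saturated_by_incumbent` come from this document.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str            # assumed labels, e.g. "topology" or "runtime"
    high_priority: bool
    covered: bool        # True once the candidate is measured or invalidated


def candidate_set_hash(candidates: List[Candidate]) -> str:
    """Stable hash over the full candidate set (assumed scheme: sorted JSON + SHA-256)."""
    payload = [asdict(c) for c in sorted(candidates, key=lambda c: c.name)]
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def validate_stop(candidates: List[Candidate], search_high_saturated: bool) -> Dict[str, object]:
    """Saturation is recorded as evidence; only coverage decides whether stop is allowed."""
    uncovered = [c.name for c in candidates if c.high_priority and not c.covered]
    proof: Dict[str, object] = {
        "candidate_set_hash": candidate_set_hash(candidates),
        "measurement_evidence": {"search_high_saturated_by_incumbent": search_high_saturated},
    }
    if uncovered:
        return {"stop": False, "reason": "eligible_candidates_remain",
                "uncovered": uncovered, **proof}
    return {"stop": True, "reason": "incumbent_validated", **proof}


pending = [Candidate("tp4_dp1", "topology", high_priority=True, covered=False)]
print(validate_stop(pending, search_high_saturated=True)["reason"])  # eligible_candidates_remain
```

The point of the design is that `search_high_saturated` never appears in a branch condition: it is carried in the proof as evidence, while the stop decision depends only on coverage of the persisted candidate set.
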
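Measurement policy constraints:

1. `auto_search_high` may raise `search.high` automatically based on the trace's sampling threshold
   and the target GPU scale, so that a low ceiling does not make every config saturate prematurely.
2. Auto-calibration must not exceed the trace's native ceiling. Under the current `sampling_u`
   semantics, `search.high=1.0` means the full trace.
3. If the incumbent still saturates the full trace easily, the validator must not pretend the
   search is complete; it should emit `measurement_ceiling_insufficient` or list that fact as a
   blocker in the stop proof.
4. The system must not automatically use repeated windows, synthetic arrivals, or replay scaling to
   enlarge the workload, unless the StudySpec explicitly enables it or a human confirms that the
   experiment is meant to test a synthetic/offline stress regime.
5. `measurement_ceiling_insufficient` and `eligible_candidates_remain` are different problems: the
   former says the load ceiling is insufficient, the latter says mechanism coverage is incomplete.
   If either is present, the result cannot be reported as a bad-start robustness success.
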
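
Items 3 and 5 treat the two blockers as independent. A minimal sketch of that separation, assuming a hypothetical `MeasurementState` record and hypothetical function names (only the two blocker strings come from this document):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MeasurementState:
    effective_high: float            # search.high actually used, in sampling_u units
    trace_max_sampling_u: float      # native ceiling of the trace
    incumbent_saturates_high: bool


def stop_blockers(state: MeasurementState, uncovered_candidates: List[str]) -> List[str]:
    """Collect load-ceiling and coverage blockers without letting one mask the other."""
    blockers = []
    at_trace_ceiling = state.effective_high >= state.trace_max_sampling_u
    if state.incumbent_saturates_high and at_trace_ceiling:
        # The trace cannot offer more load, so the capacity frontier was not reached.
        blockers.append("measurement_ceiling_insufficient")
    if uncovered_candidates:
        blockers.append("eligible_candidates_remain")
    return blockers


def can_claim_bad_start_success(state: MeasurementState,
                                uncovered_candidates: List[str]) -> Tuple[bool, List[str]]:
    """A robustness success may be reported only when no blocker is present."""
    blockers = stop_blockers(state, uncovered_candidates)
    return (not blockers, blockers)


state = MeasurementState(effective_high=1.0, trace_max_sampling_u=1.0,
                         incumbent_saturates_high=True)
print(can_claim_bad_start_success(state, ["tp4_dp1"]))
```

With a saturated full trace and one uncovered topology candidate, both blockers are reported together; clearing the coverage blocker alone still leaves the measurement ceiling blocker in place.
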
This also shows that the repair must not take the form:

```text
if tp == 8 and gmu == 0.5: try tp = 4
```

The right direction is:

```text
ordered topology lattice + resource-efficiency objective
  -> candidate set includes lower/redistributed topology contrast
  -> stop is blocked until that coverage unit is measured or invalidated
```

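
One way to read "lower/redistributed topology contrast" is as the neighbors of the incumbent one step down the TP axis. The sketch below is an assumption about what such a lattice step could look like (power-of-two TP, halving step, hypothetical function names), not the harness's actual candidate generator.

```python
from typing import List, Set, Tuple

Topology = Tuple[int, int]  # (tensor_parallel_size, data_parallel_size)


def topology_contrasts(tp: int, dp: int) -> List[Topology]:
    """Neighbors one halving step down the TP axis (assumes power-of-two TP)."""
    if tp <= 1:
        return []
    return [
        (tp // 2, dp),       # lower: half the GPUs
        (tp // 2, dp * 2),   # redistributed: same GPU product, TP traded for DP
    ]


def stop_blocked(contrasts: List[Topology], measured: Set[Topology],
                 invalidated: Set[Topology]) -> bool:
    """Stop stays blocked until at least one contrast is measured or invalidated."""
    return not any(c in measured or c in invalidated for c in contrasts)


contrasts = topology_contrasts(8, 1)
print(contrasts)                                                   # [(4, 1), (4, 2)]
print(stop_blocked(contrasts, measured=set(), invalidated=set()))  # True
```

Nothing in this rule mentions `TP=8` or `gmu=0.5`: the same step applies to any incumbent, which is what distinguishes it from the special-case repair rejected above.
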
## Current verdict

Current production harness:

```text
prototype, not yet fundamental
```

New declarative prototype:

```text
promising substrate, but not production-proven
```

It already exercises the minimal interfaces for `CandidateSet`, `CoverageUnit`, failure region, and
coverage-relative stop end to end, but it is not yet wired into the real tuning loop and has not yet
demonstrated convergence over a bad-start distribution.

The next P0 gate is therefore:

```text
Implement coverage-relative stop authority first, then rerun the bad-start distribution.
```

## 2026-06-26 implementation validation

Commit `c8a0f98` implements the first slice of the production fix:

- a `search.auto_high` schema, off by default and compatible with existing configs;
- at trial materialization, the effective `search.high` is resolved within the existing trace
  sampling space;
- `trial_spec.json` and `result.json` record the auto-high / measurement evidence;
- `search_high_saturated_by_incumbent` is demoted to measurement evidence;
- for studies with a `req/s/GPU` objective and a tunable topology, high saturation cannot directly
  authorize a stop;
- a fixed GPU product with tunable TP/DP redistribution still counts as a tunable topology;
- when the auto-high ceiling falls below `search.low`, no invalid search interval is generated.

Local validation:

```text
PYTHONPATH=src python3 -m unittest discover -s tests
Ran 143 tests OK
```

dash1 validation:

```text
run label = adversarial-badstart-autohigh-c8a0f98-20260626T122622Z
git sha = c8a0f9870eac5438fb19be8edf1534a893723ab9
machine = dash1, 8x H20
```

The spec still uses the bad start:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
```

Auto-high resolution:

```text
original_high = 1.0
effective_high = 0.9979913161468553
trace_max_sampling_u = 0.9979913161468553
reason = search_high_lowered_to_trace_ceiling
```

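
The resolution record above is consistent with clamping `search.high` to the trace ceiling. The following sketch reproduces only that lowering case plus the `search.low` guard from the commit summary; it does not model raising a low `search.high`. The function name, the `HighResolution` type, `search_low=0.0`, and every reason string other than `search_high_lowered_to_trace_ceiling` are assumptions.

```python
from typing import NamedTuple, Optional


class HighResolution(NamedTuple):
    original_high: float
    effective_high: Optional[float]   # None when no valid search interval exists
    trace_max_sampling_u: float
    reason: str


def resolve_effective_high(search_low: float, search_high: float,
                           trace_max_sampling_u: float) -> HighResolution:
    """Clamp search.high to the trace ceiling without producing an invalid interval."""
    ceiling = min(search_high, trace_max_sampling_u)
    if ceiling <= search_low:
        # Hypothetical reason string: refuse to emit an interval with high <= low.
        return HighResolution(search_high, None, trace_max_sampling_u,
                              "trace_ceiling_below_search_low")
    if ceiling < search_high:
        return HighResolution(search_high, ceiling, trace_max_sampling_u,
                              "search_high_lowered_to_trace_ceiling")
    # Hypothetical reason string for the untouched case.
    return HighResolution(search_high, search_high, trace_max_sampling_u,
                          "search_high_unchanged")


# search_low=0.0 is an assumption; the other values come from the run above.
resolved = resolve_effective_high(0.0, 1.0, 0.9979913161468553)
print(resolved.effective_high, resolved.reason)
```

Returning `None` instead of a clamped value in the degenerate case keeps the caller from silently searching an empty or inverted interval.
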
Results:

| trial | config patch | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | ---: | ---: | ---: | ---: |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9784 |
| trial-0003 | `tensor-parallel-size=8` | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |

Key conclusion:

```text
The old failure is fixed:
  the baseline is no longer followed by harness-stop-0002/search_high_saturated_by_incumbent.

The new implementation emits harness-proposal-0002 and tests the TP4 topology contrast.
TP4 raises the best req/s/GPU from 1.0000 to 1.7375.
```

This shows that the first slice fixes the "measurement saturation bypasses topology coverage"
problem.

However, trial-0003 exposes a new blocker:

```text
no-repeat is still keyed on the patch signature, not on a normalized full-config signature.
```

For this study's base config, `tensor-parallel-size=8` is a no-op equivalent to the baseline TP8,
yet the system still executes it as a new proposal. The next P0 slice must therefore implement:

1. a normalized full-config signature;
2. a CandidateSet snapshot that includes both eligible and blocked candidates;
3. a blocked reason, for example `blocked_noop_equivalent_to_tested_full_config`;
4. stop/report output that presents both `measurement_ceiling_*` and `eligible_candidates_remain`.

The current verdict is therefore updated to:

```text
P0 measurement/stop-order slice: passed.
P0 full coverage-relative harness: not yet passed.
```

## 2026-06-26 normalized full-config validation

Commit `48911b6` fixes the new blocker exposed in the previous section: no-repeat no longer compares
only the patch signature, but the normalized effective full config.

Implementation semantics:

```text
effective_config =
  normalize(base_envs + env_patch,
            base_flags + flag_patch)

no_repeat_signature = stable_json(effective_config)
```

The validator therefore treats the following two proposals as the same full config:

```text
baseline patch: {}
noop patch:     {"tensor-parallel-size": 8}
```

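
The semantics above can be illustrated with a short sketch. It assumes that patches are plain dict overlays, that normalization means stringifying values, and that `stable_json` is key-sorted compact JSON; the actual `normalize` in commit `48911b6` may do more.

```python
import json
from typing import Any, Dict, Mapping


def normalize(base_envs: Mapping[str, Any], env_patch: Mapping[str, Any],
              base_flags: Mapping[str, Any], flag_patch: Mapping[str, Any]) -> Dict[str, Dict[str, str]]:
    """Overlay each patch on its base, then stringify values so 8 and "8" compare equal."""
    envs = {**base_envs, **env_patch}
    flags = {**base_flags, **flag_patch}
    return {
        "envs": {k: str(v) for k, v in envs.items()},
        "flags": {k: str(v) for k, v in flags.items()},
    }


def no_repeat_signature(base_envs: Mapping[str, Any], env_patch: Mapping[str, Any],
                        base_flags: Mapping[str, Any], flag_patch: Mapping[str, Any]) -> str:
    """Stable JSON of the effective config: sorted keys, no insignificant whitespace."""
    effective = normalize(base_envs, env_patch, base_flags, flag_patch)
    return json.dumps(effective, sort_keys=True, separators=(",", ":"))


base_flags = {"tensor-parallel-size": 8, "data-parallel-size": 1,
              "gpu-memory-utilization": 0.5, "max-num-seqs": 8}

baseline = no_repeat_signature({}, {}, base_flags, {})
noop = no_repeat_signature({}, {}, base_flags, {"tensor-parallel-size": 8})
tp4 = no_repeat_signature({}, {}, base_flags, {"tensor-parallel-size": 4})
print(baseline == noop, baseline == tp4)  # True False
```

Because the signature is computed after the overlay, any patch that restates a base value collapses onto the baseline, while a patch signature alone would treat `{}` and `{"tensor-parallel-size": 8}` as distinct.
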
Local validation:

```text
PYTHONPATH=src python3 -m unittest discover -s tests
Ran 145 tests OK
```

dash1 validation:

```text
run label = adversarial-badstart-fullsig-48911b6-20260626T133112Z
git sha = 48911b658bbf052d70d952d1cdf55ad6b50ba7a5
machine = dash1, 8x H20
```

The spec still uses the same adversarial bad start:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
LLM endpoint disabled
```

Results:

| trial | proposal | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | ---: | ---: | ---: | ---: |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9832 |
| trial-0003 | `tensor-parallel-size=4`, `gpu-memory-utilization=0.9` | 0.935616858887 | 8.00 | 2.0000 | 1.0000 |

Key observation:

```text
Old trial-0003:
  {"tensor-parallel-size": 8}
  -> equivalent to the baseline, but still executed

New trial-0003:
  {"tensor-parallel-size": 4, "gpu-memory-utilization": 0.9}
  -> continues testing KV/cache headroom on the validated TP4 topology
```

This shows that the normalized full-config signature now prevents re-testing patch-level no-ops.

Mechanism:

1. The baseline TP8 saturating the search ceiling is recorded only as measurement evidence.
2. Because the objective is `req/s/GPU` and the topology/resource-efficiency contrast is still
   uncovered, the validator does not allow a stop.
3. The harness first tests the adjacent lower-TP topology; TP4 raises `req/s/GPU` from `1.0` to
   `1.7375`.
4. no-repeat uses the full-config signature to block the equivalent TP8 patch.
5. The harness continues testing runtime headroom on the settled TP4 topology; `gmu=0.9` raises
   `req/s/GPU` to `2.0`.

The current verdict is updated to:

```text
P0 measurement/stop-order slice: passed.
P0 normalized full-config no-repeat slice: passed.
P0 single adversarial bad-start recovery: passed for this case.
P0 distribution-level bad-start robustness: not yet proven.
```