Add declarative harness prototype
@@ -83,7 +83,7 @@ a closed-form performance model of kernels, KV cache, communication, and queueing. A safer and also stronger
| C5. AITuner finds a near-optimal region, not just a single feasible config | Qwen30B shows an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or expert-config comparison |
| C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
| C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main-experiment claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | Once the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
| C8. The harness can recover from bad initial points and does not rely only on a trusted base config | The current rule-based fix is only a prototype signal, not a final claim | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring into grammar/operator, run a random/adversarial start distribution |
| C8. The harness can recover from bad initial points and does not rely only on a trusted base config | A counterexample has been found; this cannot be claimed yet | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring into grammar/operator + coverage-relative stop, run a random/adversarial start distribution |

## Highest-priority experiments
@@ -97,6 +97,8 @@ declarative intervention grammar + coverage-relative validator.
- CandidateSet is fully enumerated and its snapshot is persisted;
- `harness_priority` is kept separate from backend ranking;
- CoverageUnit is structured; stop cannot depend only on the exact signature;
- `search_high_saturated_by_incumbent` cannot bypass CandidateSet coverage; for the `req/s/GPU`
  objective, tuning must continue while the topology/resource-efficiency contrast is uncovered;
- Failure invalidation has a conservative region predicate and retry/unblock conditions;
- grammar/policy/capability all carry a version and anti-overfitting static checks;
- LLM/BO may only select legal candidates and cannot bypass the validator.
176
docs/harness-ablation/bad-start-stop-counterexample-20260626.md
Normal file
@@ -0,0 +1,176 @@
# Bad-start stop counterexample - 2026-06-26

This document records a deliberately constructed adversarial bad-start test. Its purpose is not to show that the harness is already
robust, but to attack the current implementation and check whether it recovers from an obviously unreasonable initial configuration.

Conclusion:

```text
The current production/prototype harness cannot yet support a bad-start robustness claim.

From a bad start with a high GPU count and high TP, it is stopped early by search_high_saturated_by_incumbent
without testing any topology/resource-efficiency contrast.
```

This is not a problem to be patched with a special-case `TP=8 -> TP=4` rule. It exposes a more fundamental stop-authority
problem: measurement saturation must not bypass the coverage-relative candidate set.

## Experimental setup

Machine: `dash1`, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled
```

Expected behavior:

- the harness should not stop merely because the baseline is feasible;
- it should at least generate a topology/resource-efficiency contrast candidate;
- for the `req/s/GPU` objective, the 8-GPU incumbent must be validated by a lower-GPU or neighboring-topology probe.

## Run A: low search.high

The first run keeps the original `search.high=0.125`.

Result:

```text
trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0
```

Interpretation: the offered-load ceiling of this run is too low, so the baseline easily saturates `search.high`.
The run therefore cannot distinguish "the configuration really is good enough" from "the measurement ceiling is too low".

## Run B: corrected high search ceiling

The second run raises `search.high` to `1.0` and keeps the same bad-start configuration, with `max_trials=3`.

Remote artifacts:

```text
session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log
```

The run still stops right after the baseline:

```text
trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0
```

Probe trace:

| sampling_u | request_rate | feasible |
| --- | ---: | --- |
| 0.5 | 4.6000 | true |
| 0.75 | 6.5167 | true |
| 0.875 | 7.5000 | true |
| 0.9375 | 8.0333 | true |

The stop is triggered because the current guard computes:

```text
binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
                        = 0.0625
threshold_gap_to_high   = 1.0 - 0.9375
                        = 0.0625
```

The gap to `search.high` equals the probe resolution, so the current implementation concludes that the incumbent has already saturated `search.high`.
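
The arithmetic above can be reproduced with a short sketch. It assumes `low = 0.0`, `max_probes = 4`, a `tolerance` below `0.0625`, and a `<=` comparison in the guard. None of these values appear in the run logs; they are chosen only to be consistent with the probe trace and the reported resolution, and the function names are hypothetical.

```python
def binary_probe_points(low: float, high: float, max_probes: int) -> list[float]:
    """Midpoints visited by a binary search when every probe is feasible."""
    points = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        points.append(mid)
        low = mid  # feasible probe: move the lower bound up
    return points


def binary_probe_resolution(low: float, high: float, max_probes: int, tolerance: float) -> float:
    """Resolution formula quoted in the guard computation above."""
    return max(tolerance, (high - low) / 2 ** max_probes)


# Assumed values (not taken from the run logs).
LOW, HIGH, MAX_PROBES, TOLERANCE = 0.0, 1.0, 4, 0.01

probes = binary_probe_points(LOW, HIGH, MAX_PROBES)
resolution = binary_probe_resolution(LOW, HIGH, MAX_PROBES, TOLERANCE)
gap_to_high = HIGH - probes[-1]
saturated = gap_to_high <= resolution  # assumed comparison operator

print(probes)       # [0.5, 0.75, 0.875, 0.9375]
print(resolution)   # 0.0625
print(gap_to_high)  # 0.0625
print(saturated)    # True
```

Under these assumptions, any run in which every probe is feasible ends with a gap no larger than the resolution, so the guard fires whatever the quality of the configuration. That is the measurement-only nature of this signal.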

## Why this is a counterexample

The current objective is SLO-constrained `req/s/GPU`, not total throughput on a fixed 8 GPUs.
An 8-GPU incumbent saturating the offered-load ceiling does not prove that:

- no lower-TP / lower-GPU configuration achieves a higher `req/s/GPU`;
- the current topology is optimal in resource efficiency;
- the runtime knobs have reached a suitable trust region;
- the no-LLM harness can recover from a bad start.

The stop is therefore unsound, at least with respect to the bad-start robustness claim.

More formally:

```text
search_high_saturated_by_incumbent
does not imply
incumbent_validated(topology/resource-efficiency)
```

When the objective includes resource efficiency and parallel-size/topology is still tunable,
`search_high_saturated_by_incumbent` can serve only as measurement evidence, not as a standalone stop
authority.
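
To illustrate why saturation on 8 GPUs says little about `req/s/GPU`, the sketch below computes the total request rate that a lower-GPU configuration would need in order to match the Run B incumbent per GPU. Only the incumbent numbers come from the run; the alternative GPU counts are hypothetical and were not measured.

```python
INCUMBENT_TOTAL_RATE = 8.033333333333333  # req/s, from Run B
INCUMBENT_GPUS = 8                        # TP=8, DP=1

incumbent_per_gpu = INCUMBENT_TOTAL_RATE / INCUMBENT_GPUS


def break_even_total_rate(gpu_count: int) -> float:
    """Total req/s a config on `gpu_count` GPUs needs to match the incumbent per GPU."""
    return incumbent_per_gpu * gpu_count


for gpus in (4, 2, 1):  # hypothetical, unmeasured alternatives
    print(f"{gpus} GPU(s): needs > {break_even_total_rate(gpus):.4f} req/s total")
```

For example, a 4-GPU configuration would only need to sustain a little over 4.02 req/s under the same SLO to beat the incumbent on the stated objective. Whether it can do so is exactly what an uncovered topology contrast leaves unknown.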

## Constraints on the new harness design

This counterexample directly constrains the declarative harness:

1. The complete `CandidateSet` must be generated and persisted before any stop.
2. The stop proof must reference `candidate_set_hash`.
3. If an uncovered high-priority topology/resource-efficiency candidate exists, the validator
   must return `eligible_candidates_remain`, even if the incumbent saturates `search.high`.
4. `search.high` saturation may only update measurement coverage; it cannot substitute for
   `incumbent_validated`.
5. For the `req/s/GPU` objective, required coverage must include at least one topology or
   resource-efficiency contrast, unless the StudySpec explicitly fixes the GPU budget and topology.
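
The constraints above can be sketched as a small validator. This is a minimal illustration, not the code in `src/aituner/harness.py`: the `Candidate` and `StopDecision` classes, their fields, the `"topology_contrast"` label, the `"required_coverage_missing"` and `"stop_authorized"` reason strings, and the SHA-256 hashing scheme are all assumptions. Only `candidate_set_hash`, `eligible_candidates_remain`, and `search_high_saturated_by_incumbent` are names taken from the text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    coverage_unit: str   # hypothetical label, e.g. "topology_contrast"
    high_priority: bool
    covered: bool        # True once measured or invalidated


@dataclass(frozen=True)
class StopDecision:
    stop: bool
    reason: str
    candidate_set_hash: str
    evidence: tuple


def candidate_set_hash(candidates) -> str:
    """Hash a canonical JSON snapshot of the full candidate set (assumed scheme)."""
    snapshot = sorted((asdict(c) for c in candidates), key=lambda c: c["name"])
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def validate_stop(candidates, search_high_saturated, topology_fixed=False) -> StopDecision:
    """Coverage decides whether to stop; saturation is recorded as evidence only."""
    digest = candidate_set_hash(candidates)
    evidence = ("search_high_saturated_by_incumbent",) if search_high_saturated else ()
    has_contrast = any(c.coverage_unit == "topology_contrast" for c in candidates)
    if not topology_fixed and not has_contrast:
        return StopDecision(False, "required_coverage_missing", digest, evidence)
    if any(c.high_priority and not c.covered for c in candidates):
        return StopDecision(False, "eligible_candidates_remain", digest, evidence)
    return StopDecision(True, "stop_authorized", digest, evidence)


candidates = [
    Candidate("tp8_dp1_baseline", "incumbent", True, True),
    Candidate("tp4_dp1", "topology_contrast", True, False),
]
decision = validate_stop(candidates, search_high_saturated=True)
print(decision.stop, decision.reason)
```

In this sketch the saturation flag never appears in a branch condition. It is carried along in `evidence`, which mirrors the requirement that it may inform the stop proof but cannot authorize the stop.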

This also means the repair must not take the form:

```text
if tp == 8 and gmu == 0.5: try tp = 4
```

The right direction is:

```text
ordered topology lattice + resource-efficiency objective
  -> candidate set includes lower/redistributed topology contrast
  -> stop is blocked until that coverage unit is measured or invalidated
```
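
One way to read "ordered topology lattice" is as the set of (TP, DP) pairs that fit the GPU budget, ordered by GPU count. The sketch below enumerates contrast candidates for the 8-GPU bad start. It is an illustration, not the harness's actual grammar. It assumes power-of-two sizes, and its definitions of "lower" (fewer GPUs) and "redistributed" (same GPU count, different split) are also assumptions. It ignores whether the model fits in memory.

```python
from itertools import product


def topology_lattice(gpu_budget: int):
    """All (tp, dp) pairs with power-of-two sizes that fit within the budget."""
    sizes = [1 << k for k in range(gpu_budget.bit_length()) if (1 << k) <= gpu_budget]
    return sorted(
        ((tp, dp) for tp, dp in product(sizes, repeat=2) if tp * dp <= gpu_budget),
        key=lambda t: (t[0] * t[1], t[0]),  # order by GPU count, then by TP
    )


def contrast_candidates(incumbent, gpu_budget):
    """Split the non-incumbent lattice points into lower-GPU and redistributed sets."""
    tp0, dp0 = incumbent
    used = tp0 * dp0
    lower, redistributed = [], []
    for tp, dp in topology_lattice(gpu_budget):
        if (tp, dp) == incumbent:
            continue
        if tp * dp < used:
            lower.append((tp, dp))
        elif tp * dp == used:
            redistributed.append((tp, dp))
    return lower, redistributed


lower, redistributed = contrast_candidates(incumbent=(8, 1), gpu_budget=8)
print("lower-GPU contrasts:", lower)
print("redistributed contrasts:", redistributed)
```

Because the candidates are derived from the lattice and the objective, not from the specific bad start, the same generator applies to any incumbent. That is the difference from the `tp == 8` special case above.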

## Current verdict

Current production harness:

```text
prototype, not yet fundamental
```

New declarative prototype:

```text
promising substrate, but not production-proven
```

It already exercises the minimal interfaces for `CandidateSet`, `CoverageUnit`, failure region, and coverage-relative stop
end to end, but it is not yet wired into the real tuning loop and has not yet demonstrated convergence over a bad-start distribution.

The next P0 gate is therefore:

```text
Implement coverage-relative stop authority first, then rerun the bad-start distribution.
```
@@ -46,6 +46,28 @@ priority, it may still be just a "re-skinned rule-based harness". Therefore,

The design below already incorporates these major revisions as hard requirements.

## 2026-06-26 adversarial status

We have attacked the current production harness with the `TP=8, gmu=0.5, max-num-seqs=8` bad-start case.
The result shows that the current stop guard fires `search_high_saturated_by_incumbent` right after the baseline
without generating any topology/resource-efficiency contrast.
This shows that the current implementation is not yet the final contribution.

For the detailed counterexample, see
[Bad-start stop counterexample](bad-start-stop-counterexample-20260626.md).

The counterexample adds one hard constraint to this design:

```text
search_high_saturated_by_incumbent may be measurement evidence,
but it cannot bypass candidate-set coverage when topology/resource efficiency
remains tunable.
```

The new CoverageValidator must therefore first prove that no uncovered high-priority candidate remains before it may authorize
a stop. For the `req/s/GPU` objective, an uncovered topology/resource-efficiency contrast must block
the stop, unless the StudySpec explicitly fixes the topology/GPU budget.

## Current problems

The current `src/aituner/harness.py` already has some of the right abstract vocabulary: observation,