Document bad-start robustness suite
@@ -83,7 +83,7 @@ closed-form performance model of kernels, KV cache, communication, and queueing. The safer and also stronger
 | C5. AITuner finds a near-optimal region, not just a single feasible config | Qwen30B shows an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or an expert-config comparison |
 | C6. AITuner moves to the appropriate frontier as SLO tightness changes | Completed for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
 | C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not a main-experiment claim for now | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
-| C8. The harness can recover from bad initial points and does not depend only on a trusted base config | A single adversarial bad start has passed first repair; distribution-level robustness cannot be claimed | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring into grammar/operator + coverage-relative stop, run a random/adversarial start distribution |
+| C8. The harness can recover from bad initial points and does not depend only on a trusted base config | A single adversarial bad start has passed first repair; distribution-level robustness cannot be claimed | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [Bad-start robustness suite](harness-ablation/bad-start-robustness-suite-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | Run the random/adversarial start distribution per the pre-registered 20-case suite |

 ## Highest-priority experiments

@@ -124,6 +124,8 @@ rule-based prototype.
 Goal: determine whether the harness can only start from a trusted base config, or whether it can also recover
 toward the right direction from a clearly unreasonable initial config.

+Pre-registered suite: see [Bad-start robustness suite](harness-ablation/bad-start-robustness-suite-20260626.md).
+
 Minimal experiment matrix:

 | Case | Initial config | What it shows |
122
docs/harness-ablation/bad-start-robustness-suite-20260626.md
Normal file
@@ -0,0 +1,122 @@
# Bad-start robustness suite - 2026-06-26

This document defines the distribution-level validation for P0 bad-start robustness. It is not a new claim result; it is the
pre-registration for the next round of experiments: fix the starts, metrics, and pass/fail criteria first, then run, so that rules are not tuned to a single case.

## Current premises

Code gates already completed:

- normalized full-config signature;
- materialized effective signature: a runtime-only proposal first inherits the incumbent topology and is then signed;
- CLI hard-veto: any LLM/manual/harness proposal that repeats an effective config is rejected before it enters a trial;
- CandidateSet audit: `candidate_set_hash`, eligible/blocked candidates, blocked reason summary;
- sidecar persistence: `harness/candidate-set-*.json`.
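
The signature and hard-veto gates above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the field names (`tp`, `dp`, `gmu`, `mns`), the rounding rule, the choice of SHA-256, and the function names are all assumptions, and the sketch assumes any field missing from a proposal is inherited from the incumbent (which covers the runtime-only case where topology is omitted).

```python
import hashlib
import json


def normalize(config: dict) -> dict:
    """Canonicalize a config: keys sorted, floats rounded so 0.9 and 0.90 match."""
    return {
        key: round(value, 4) if isinstance(value, float) else value
        for key, value in sorted(config.items())
    }


def effective_signature(proposal: dict, incumbent: dict) -> str:
    """Materialize the proposal on top of the incumbent, then hash the full config.

    Assumption: fields absent from the proposal (for example tp/dp in a
    runtime-only proposal) are inherited from the incumbent before signing.
    """
    materialized = dict(incumbent)
    materialized.update(proposal)
    payload = json.dumps(normalize(materialized), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hard_veto(proposal: dict, incumbent: dict, executed_signatures: set) -> bool:
    """Return True when the proposal would repeat an already executed effective config."""
    return effective_signature(proposal, incumbent) in executed_signatures
```

Signing the materialized config rather than the raw proposal is what lets a runtime-only proposal such as `{"gmu": 0.9}` be recognized as a repeat of a full config that was already tried.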

Single case already passed:

```text
TP8, DP1, gmu0.5, max-num-seqs8
-> TP4
-> TP4 + gmu0.9
```

This case only demonstrates sentinel recovery; it does not demonstrate distribution-level robustness.

## Experiment matrix

All runs use the same Qwen30B-A3B community vLLM 0.20 bounded replay setup, the no-LLM harness, and
`search.auto_high.enabled=true`. First run a fresh trusted-start control to obtain the reference value
`R_ref` at the same commit.

| Group | N | Initial starts | What it shows |
| --- | ---: | --- | --- |
| trusted control | 1 | trusted/default start | defines `R_ref` |
| topology-only | 4 | `(TP,DP)=(8,1),(4,2),(1,4),(2,4)`, runtime nominal | recovery is not limited to `TP8 -> TP4` |
| runtime-only | 4 | `TP2/DP1` with `gmu={0.50,0.70}` and `max-num-seqs={8,16}` | runtime floor/admission recovery |
| combined | 4 | `TP8/gmu0.70/mns16`, `TP4/DP2/gmu0.50/mns8`, `TP1/DP8/gmu0.50/mns16`, `TP2/DP4/gmu0.70/mns8` | operators can be chained |
| held-out random | 8 | fixed-seed stratified samples over legal topology x low/nominal `gmu` x low/normal `mns`, excluding the already-passed sentinel | overfit denominator |

Total: 1 control + 20 novel bad starts.
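
One way to draw the held-out random group reproducibly is sketched below. The list of legal topologies, the concrete low/nominal values, the two-per-stratum split, and the seed are illustrative assumptions; the document only fixes that sampling is stratified, uses a fixed seed, yields 8 starts, and excludes the passed sentinel.

```python
import itertools
import random

# (tp, dp, gmu, mns) of the sentinel case that already passed.
SENTINEL = (8, 1, 0.5, 8)

# Hypothetical values: the real legal-topology set and level definitions
# would come from the pre-registration itself.
LEGAL_TOPOLOGIES = [(8, 1), (4, 2), (2, 4), (1, 8), (4, 1), (2, 1)]
GMU_LEVELS = {"low": [0.5, 0.7], "nominal": [0.9]}
MNS_LEVELS = {"low": [8, 16], "normal": [64]}


def sample_held_out(seed: int, per_stratum: int = 2) -> list:
    """Draw per_stratum starts from each (gmu level, mns level) stratum."""
    rng = random.Random(seed)  # private RNG, so the draw depends only on the seed
    starts = []
    for gmu_level, mns_level in itertools.product(sorted(GMU_LEVELS), sorted(MNS_LEVELS)):
        pool = [
            (tp, dp, gmu, mns)
            for tp, dp in LEGAL_TOPOLOGIES
            for gmu in GMU_LEVELS[gmu_level]
            for mns in MNS_LEVELS[mns_level]
            if (tp, dp, gmu, mns) != SENTINEL
        ]
        starts.extend(rng.sample(pool, per_stratum))
    return starts
```

With 4 strata and 2 draws each this yields the 8 held-out starts; recording the seed alongside the resulting list makes the draw auditable before any run starts.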

## Primary metrics

- best-so-far SLO-feasible `req/s/GPU / R_ref`;
- time-to-95%-reference;
- normalized AUC over trial budget;
- final pass rate;
- executed normalized full-config repeat count;
- no-op blocked count;
- candidate family / operator sequence;
- stop reason and `candidate_set_hash`.
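
The first three metrics can be computed from a run's ordered trial results, as in the sketch below. The exact definitions are assumptions made for illustration: trials are `(req_per_gpu, slo_feasible)` pairs, infeasible trials never raise the best-so-far value, a run that stops early carries its last value forward to the end of the budget, and the AUC is normalized by the budget length.

```python
from typing import List, Optional, Tuple


def best_so_far_curve(trials: List[Tuple[float, bool]], r_ref: float, budget: int) -> List[float]:
    """Best SLO-feasible req/s/GPU so far, as a ratio of R_ref, per trial slot."""
    curve, best = [], 0.0
    for slot in range(budget):
        if slot < len(trials):
            req_per_gpu, slo_feasible = trials[slot]
            if slo_feasible:
                best = max(best, req_per_gpu / r_ref)
        curve.append(best)  # early-stopped runs carry the last value forward
    return curve


def time_to_fraction(curve: List[float], fraction: float = 0.95) -> Optional[int]:
    """1-based index of the first trial reaching the fraction of R_ref, else None."""
    for index, value in enumerate(curve, start=1):
        if value >= fraction:
            return index
    return None


def normalized_auc(curve: List[float]) -> float:
    """Mean best-so-far ratio over the trial budget."""
    return sum(curve) / len(curve)
```

Under these definitions a run that reaches the reference late gets a lower AUC than one that reaches it early, even when both end at the same best-so-far value.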

Every run must retain:

```text
state.json
proposals/*.json
harness/candidate-set-*.json
trials/trial-*/trial_spec.json
trials/trial-*/result.json
```

## Pass/fail

Run-level pass:

```text
best_so_far_req_per_gpu >= 0.95 * R_ref within 12 measured trials
pass_rate >= 0.95
executed_effective_config_repeat_count == 0
no harness stop while high-priority eligible candidates remain
```

Suite-level pass:

```text
20 / 20 novel bad starts pass.
```

If any novel start fails, distribution-level bad-start robustness cannot be claimed. After a fix, the failure
analysis must be frozen and the held-out random set must be re-drawn.
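
The run-level and suite-level criteria can be encoded as a small checker, sketched below. The `RunSummary` type and its field names are hypothetical; they simply mirror the four run-level conditions and are not taken from the harness.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_MEASURED_TRIALS = 12
MIN_PASS_RATE = 0.95
SUITE_SIZE = 20  # novel bad starts; the trusted control is not counted


@dataclass
class RunSummary:
    # 1-based measured-trial index at which 0.95 * R_ref was first reached, or None.
    trials_to_95_pct: Optional[int]
    pass_rate: float
    executed_effective_config_repeat_count: int
    # True if the harness stopped while high-priority eligible candidates remained.
    stopped_with_high_priority_eligible: bool


def run_passes(run: RunSummary) -> bool:
    """All four run-level conditions must hold."""
    return (
        run.trials_to_95_pct is not None
        and run.trials_to_95_pct <= MAX_MEASURED_TRIALS
        and run.pass_rate >= MIN_PASS_RATE
        and run.executed_effective_config_repeat_count == 0
        and not run.stopped_with_high_priority_eligible
    )


def suite_passes(runs: List[RunSummary]) -> bool:
    """The suite passes only if exactly 20 novel starts ran and every one passed."""
    return len(runs) == SUITE_SIZE and all(run_passes(run) for run in runs)
```

Keeping the thresholds as module-level constants makes it easy to verify that they were not changed between starts.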

## Overfit guards

- pre-register all starts and the random seed;
- do not count the already-passed exact `TP8,gmu0.5,mns8` sentinel in the 20-case denominator;
- do not tune thresholds between starts;
- report operator names, e.g. `topology_bracket`, `topology_redistribute`,
  `runtime_floor_jump`, `admission_recovery`, rather than case-specific actions;
- every stop must cite `candidate_set_hash` and evidence that no high-priority eligible candidate remains.

## GPU cost

Expected:

```text
21 runs * 6-8 measured trials = 126-168 trials
```

Hard cap:

```text
21 runs * 12 measured trials = 252 trials
```

Rough estimate based on the current Qwen30B bounded replay:

```text
15-35 min / measured trial
expected = 250-780 H20 GPU-hours
cap = 500-1175 H20 GPU-hours
```
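
The GPU-hour ranges can be reproduced with the arithmetic below. The figure of 8 GPUs per measured trial is an assumption inferred from the numbers (it is the value that makes the stated ranges come out); the trial counts and minutes per trial are taken from the estimate above.

```python
RUNS = 21            # 1 control + 20 novel bad starts
GPUS_PER_TRIAL = 8   # assumption: inferred from the stated ranges, not stated explicitly


def gpu_hours(trials: int, minutes_per_trial: float, gpus: int = GPUS_PER_TRIAL) -> float:
    """Total GPU-hours for a number of measured trials of a given duration."""
    return trials * minutes_per_trial / 60 * gpus


# Expected: 6-8 measured trials per run, 15-35 minutes each.
expected_low = gpu_hours(RUNS * 6, 15)    # 126 trials at the fast end
expected_high = gpu_hours(RUNS * 8, 35)   # 168 trials at the slow end

# Hard cap: 12 measured trials per run.
cap_low = gpu_hours(RUNS * 12, 15)
cap_high = gpu_hours(RUNS * 12, 35)
```

Under this assumption the unrounded values are 252 to 784 GPU-hours expected and 504 to 1176 GPU-hours at the cap, which round to the stated 250-780 and 500-1175.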

Given this cost, run a 3-case pilot first:

| Pilot case | Start | Purpose |
| --- | --- | --- |
| topology-only | `TP=4, DP=2, gmu=0.9, mns=64` | check that recovery is not limited to TP8 |
| runtime-only | `TP=2, DP=1, gmu=0.5, mns=8` | check runtime floor/admission recovery |
| combined | `TP=1, DP=8, gmu=0.5, mns=16` | check chaining of topology + runtime operators |

Start the full 20-case suite only after the pilot passes.