Document bad-start robustness suite

2026-06-26 22:19:46 +08:00
parent 2937539b49
commit bef260f183
2 changed files with 125 additions and 1 deletions

@@ -0,0 +1,122 @@
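# Bad-start robustness suite - 2026-06-26
This document defines the distribution-level validation for P0 bad-start robustness. It is not a new claim result but the
pre-registration for the next round of experiments: fix the starts, metrics, and pass/fail criteria first, then run, so that rules are not tuned against individual cases.
## Current premises
Code gates already completed: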
- normalized full-config signature;
- materialized effective signature: a runtime-only proposal first inherits the incumbent topology and is then signed;
- CLI hard-veto: LLM/manual/harness proposals that repeat an already executed effective config are rejected before they enter a trial;
- CandidateSet audit: `candidate_set_hash`, eligible/blocked candidates, and a blocked-reason summary;
- sidecar persistence: `harness/candidate-set-*.json`.
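
The interaction between the effective signature and the hard veto can be illustrated with a minimal sketch. The field names (`tp`, `dp`, `gmu`, `max_num_seqs`) and the helpers `materialize`, `signature`, and `is_vetoed` are hypothetical and chosen for illustration only; the harness's actual normalization rules are not shown in this document.

```python
import hashlib
import json

# Hypothetical topology field names; the real config schema may differ.
TOPOLOGY_KEYS = ("tp", "dp")


def materialize(proposal: dict, incumbent: dict) -> dict:
    """Fill topology fields that a runtime-only proposal omits from the incumbent."""
    effective = dict(proposal)
    for key in TOPOLOGY_KEYS:
        effective.setdefault(key, incumbent[key])
    return effective


def signature(config: dict) -> str:
    """Hash a config serialized as key-sorted, whitespace-free JSON."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_vetoed(proposal: dict, incumbent: dict, executed: set) -> bool:
    """Hard veto: reject a proposal whose effective config was already executed."""
    return signature(materialize(proposal, incumbent)) in executed


incumbent = {"tp": 4, "dp": 1, "gmu": 0.5, "max_num_seqs": 8}
executed = {signature(incumbent)}
# A runtime-only proposal that changes nothing is a no-op and gets blocked.
print(is_vetoed({"gmu": 0.5, "max_num_seqs": 8}, incumbent, executed))  # True
# Raising gmu yields a new effective config, so it may proceed.
print(is_vetoed({"gmu": 0.9, "max_num_seqs": 8}, incumbent, executed))  # False
```

Signing the materialized config rather than the raw proposal is what lets a topology-free proposal collide with an earlier full config.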
Single case that has already passed:
```text
TP8, DP1, gmu0.5, max-num-seqs8
-> TP4
-> TP4 + gmu0.9
```
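This case only demonstrates sentinel recovery; it does not demonstrate distribution-level robustness.
## Experiment matrix
All runs use the same Qwen30B-A3B community vLLM 0.20 bounded replay setup, the no-LLM harness, and
`search.auto_high.enabled=true`. Run a fresh trusted-start control first to obtain the
reference value `R_ref` at the same commit.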
| Group | N | Initial starts | What it demonstrates |
| --- | ---: | --- | --- |
| trusted control | 1 | trusted/default start | defines `R_ref` |
| topology-only | 4 | `(TP,DP)=(8,1),(4,2),(1,4),(2,4)`, runtime nominal | recovery is not limited to `TP8 -> TP4` |
| runtime-only | 4 | `TP2/DP1` with `gmu={0.50,0.70}` and `max-num-seqs={8,16}` | runtime floor/admission recovery |
| combined | 4 | `TP8/gmu0.70/mns16`, `TP4/DP2/gmu0.50/mns8`, `TP1/DP8/gmu0.50/mns16`, `TP2/DP4/gmu0.70/mns8` | operators can be chained |
| held-out random | 8 | fixed-seed stratified samples over legal topology x low/nominal `gmu` x low/normal `mns`, excluding the already-passed sentinel | overfit denominator |
Total: 1 control + 20 novel bad starts.
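
One way to draw the held-out random group reproducibly is sketched below. The legal topology list and the concrete `gmu`/`mns` levels are assumptions made for illustration (the document does not enumerate them), and `sample_held_out` is a hypothetical helper, not part of the harness.

```python
import itertools
import random

# Assumed values for illustration only; the suite's real legal set may differ.
TOPOLOGIES = [(8, 1), (4, 2), (2, 4), (1, 8), (4, 1), (2, 1)]  # (TP, DP)
GMU = {"low": 0.50, "nominal": 0.90}
MNS = {"low": 8, "normal": 64}
SENTINEL = (8, 1, 0.50, 8)  # the already-passed TP8/gmu0.5/mns8 case


def sample_held_out(seed: int, per_stratum: int = 2) -> list:
    """Draw per_stratum starts from each (gmu level, mns level) stratum."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    starts = []
    for gmu_level, mns_level in itertools.product(GMU, MNS):
        pool = [(tp, dp, GMU[gmu_level], MNS[mns_level]) for tp, dp in TOPOLOGIES]
        pool = [start for start in pool if start != SENTINEL]
        starts.extend(rng.sample(pool, per_stratum))  # without replacement
    return starts


for tp, dp, gmu, mns in sample_held_out(seed=20260626):
    print(f"TP{tp}/DP{dp}/gmu{gmu:.2f}/mns{mns}")
```

With 2 x 2 strata and two draws each, this yields the 8 held-out starts; recording the seed alongside the starts is what makes the draw part of the pre-registration.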
## Primary metrics
- best-so-far SLO-feasible `req/s/GPU / R_ref`
- time-to-95%-reference
- normalized AUC over trial budget
- final pass rate
- executed normalized full-config repeat count
- no-op blocked count
- candidate family / operator sequence
- stop reason and `candidate_set_hash`
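
The first three metrics can be computed from a per-trial log roughly as follows. This is a sketch under stated assumptions: each trial is represented as a `(slo_feasible, req_per_s_per_gpu)` pair, the AUC is taken as the mean best-so-far ratio capped at 1.0 over the trial budget, and an early-stopped run carries its last value forward. None of these definitions are fixed by this document.

```python
def best_so_far_ratios(trials, r_ref):
    """Best SLO-feasible req/s/GPU seen so far, divided by R_ref, per trial."""
    best, curve = 0.0, []
    for feasible, rate in trials:
        if feasible:  # infeasible trials never improve the best-so-far value
            best = max(best, rate)
        curve.append(best / r_ref)
    return curve


def time_to_fraction(curve, fraction=0.95):
    """1-based index of the first trial reaching the fraction, else None."""
    for index, ratio in enumerate(curve, start=1):
        if ratio >= fraction:
            return index
    return None


def normalized_auc(curve, budget):
    """Mean capped ratio over the budget; short runs carry the last value forward."""
    tail = curve[-1] if curve else 0.0
    padded = curve[:budget] + [tail] * max(0, budget - len(curve))
    return sum(min(ratio, 1.0) for ratio in padded) / budget


trials = [(True, 5.0), (False, 12.0), (True, 9.6), (True, 9.0)]
curve = best_so_far_ratios(trials, r_ref=10.0)
print(time_to_fraction(curve))  # 3
print(round(normalized_auc(curve, budget=4), 2))  # 0.73
```

Note that the second trial has the highest raw rate but is ignored because it violates the SLO.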
Every run must retain:
```text
state.json
proposals/*.json
harness/candidate-set-*.json
trials/trial-*/trial_spec.json
trials/trial-*/result.json
```
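
A retention check over a run directory could look like the sketch below. The helper name `missing_artifacts` is hypothetical; only the glob patterns come from the list above.

```python
from pathlib import Path

# Patterns copied from the retention list; each must match at least one file.
REQUIRED_PATTERNS = [
    "state.json",
    "proposals/*.json",
    "harness/candidate-set-*.json",
    "trials/trial-*/trial_spec.json",
    "trials/trial-*/result.json",
]


def missing_artifacts(run_dir) -> list:
    """Return the required patterns that match nothing under run_dir."""
    root = Path(run_dir)
    return [pattern for pattern in REQUIRED_PATTERNS if not any(root.glob(pattern))]
```

An empty return value means every required artifact class is present; it does not verify that each trial directory has both of its files.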
## Pass/fail
Run-level pass:
```text
best_so_far_req_per_gpu >= 0.95 * R_ref within 12 measured trials
pass_rate >= 0.95
executed_effective_config_repeat_count == 0
no harness stop while high-priority eligible candidates remain
```
Suite-level pass:
```text
20 / 20 novel bad starts pass.
```
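
The run-level and suite-level criteria can be expressed as a small predicate pair. The `RunSummary` fields are hypothetical names for the four run-level conditions; how the harness actually records them is not specified here.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical per-run summary; field names are illustrative."""
    best_ratio_within_12: float  # best-so-far req/s/GPU over R_ref, first 12 measured trials
    pass_rate: float
    effective_repeat_count: int
    stopped_with_eligible_high_priority: bool


def run_passes(run: RunSummary) -> bool:
    """All four run-level conditions must hold."""
    return (
        run.best_ratio_within_12 >= 0.95
        and run.pass_rate >= 0.95
        and run.effective_repeat_count == 0
        and not run.stopped_with_eligible_high_priority
    )


def suite_passes(runs: list, expected: int = 20) -> bool:
    """Suite passes only if all expected novel bad starts ran and passed."""
    return len(runs) == expected and all(run_passes(run) for run in runs)
```

Checking the run count as well as the per-run results prevents a suite with dropped starts from passing by omission.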
If any novel start fails, distribution-level bad-start robustness cannot be claimed. After a fix, the
failure analysis must be frozen and the held-out random set must be re-drawn.
## Overfit guards
- pre-register all starts and random seed
- do not count the already-passed exact `TP8,gmu0.5,mns8` sentinel in the 20-case denominator
- do not tune thresholds between starts
- report operator names, e.g. `topology_bracket`, `topology_redistribute`,
`runtime_floor_jump`, `admission_recovery`, rather than case-specific actions
- every stop must cite the `candidate_set_hash` and evidence that no high-priority eligible candidate remained
## GPU cost
Expected:
```text
21 runs * 6-8 measured trials = 126-168 trials
```
Hard cap:
```text
21 runs * 12 measured trials = 252 trials
```
Rough estimate based on the current Qwen30B bounded replay:
```text
15-35 min / measured trial
expected = 250-780 H20 GPU-hours
cap = 500-1175 H20 GPU-hours
```
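
The GPU-hour ranges above are reproduced by the arithmetic below if each measured trial is assumed to occupy 8 H20 GPUs for its full duration. That per-trial GPU count is an inference from the quoted figures, not something this document states.

```python
GPUS_PER_TRIAL = 8  # assumption: every measured trial holds a full 8-GPU node


def gpu_hours(trials: int, minutes_per_trial: float, gpus: int = GPUS_PER_TRIAL) -> float:
    """Convert a trial count and per-trial wall time into GPU-hours."""
    return trials * minutes_per_trial / 60 * gpus


expected = (gpu_hours(126, 15), gpu_hours(168, 35))
cap = (gpu_hours(252, 15), gpu_hours(252, 35))
print(expected)  # (252.0, 784.0)
print(cap)       # (504.0, 1176.0)
```

These values round to the quoted 250-780 and 500-1175 GPU-hour ranges.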
Given this cost, a 3-case pilot is recommended first:
| Pilot case | Start | Purpose |
| --- | --- | --- |
| topology-only | `TP=4, DP=2, gmu=0.9, mns=64` | check that recovery is not limited to TP8 |
| runtime-only | `TP=2, DP=1, gmu=0.5, mns=8` | check runtime floor/admission recovery |
| combined | `TP=1, DP=8, gmu=0.5, mns=16` | check chained topology + runtime recovery |
Launch the full 20-case suite only after the pilot passes.