Bad-start robustness suite - 2026-06-26
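This document defines the distribution-level validation for P0 bad-start robustness. It is not a new claim result but a pre-registration for the next round of experiments: the starts, metrics, and pass/fail criteria are fixed first and the runs come afterwards, so that rules cannot be tuned against individual cases.
Current premises
Code gates already completed: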
- normalized full-config signature;
- materialized effective signature: a runtime-only proposal first inherits the incumbent topology and is then signed;
- CLI hard-veto: LLM/manual/harness proposals may not repeat an effective config before entering a trial;
- CandidateSet audit: candidate_set_hash, eligible/blocked candidates, blocked reason summary;
- sidecar persistence: harness/candidate-set-*.json.
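The signature and hard-veto gates above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names (normalize, effective_signature, hard_veto), the config keys (tp, dp, gmu, max_num_seqs), and the choice of SHA-256 over canonical JSON are all assumptions.

```python
import hashlib
import json

# Hypothetical key names for the topology part of a config.
TOPOLOGY_KEYS = ("tp", "dp")


def normalize(config):
    """Canonical form: numbers become rounded floats so 0.9 and 0.90 compare equal."""
    out = {}
    for key, value in config.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            out[key] = round(float(value), 6)
        else:
            out[key] = value
    return out


def effective_signature(proposal, incumbent):
    """Materialize a runtime-only proposal on the incumbent topology, then hash it."""
    materialized = dict(proposal)
    for key in TOPOLOGY_KEYS:
        # A proposal that omits topology inherits it from the incumbent.
        materialized.setdefault(key, incumbent[key])
    payload = json.dumps(normalize(materialized), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hard_veto(proposal, incumbent, seen_signatures):
    """True when the proposal repeats an already executed effective config."""
    return effective_signature(proposal, incumbent) in seen_signatures


incumbent = {"tp": 4, "dp": 1, "gmu": 0.5, "max_num_seqs": 8}
seen = {effective_signature(incumbent, incumbent)}
repeat = {"gmu": 0.5, "max_num_seqs": 8}    # runtime-only, same effective config
changed = {"gmu": 0.9, "max_num_seqs": 8}   # runtime-only, new effective config
print(hard_veto(repeat, incumbent, seen), hard_veto(changed, incumbent, seen))
```

Signing the materialized config rather than the raw proposal is what lets a runtime-only proposal be recognized as a repeat of a full config that has already run.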
Single case already passed:
TP8, DP1, gmu0.5, max-num-seqs8
-> TP4
-> TP4 + gmu0.9
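This case only demonstrates sentinel recovery; it does not demonstrate distribution-level robustness.
Experiment matrix
All runs use the same Qwen30B-A3B community vLLM 0.20 bounded replay setup, the no-LLM harness, and search.auto_high.enabled=true. A fresh trusted-start control is run first to obtain the reference value R_ref at the same commit.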
| Group | N | Initial starts | What it demonstrates |
|---|---|---|---|
| trusted control | 1 | trusted/default start | defines R_ref |
| topology-only | 4 | (TP,DP)=(8,1),(4,2),(1,4),(2,4), runtime nominal | recovery is not limited to TP8 -> TP4 |
| runtime-only | 4 | TP2/DP1 with gmu={0.50,0.70} and max-num-seqs={8,16} | runtime floor/admission recovery |
| combined | 4 | TP8/gmu0.70/mns16, TP4/DP2/gmu0.50/mns8, TP1/DP8/gmu0.50/mns16, TP2/DP4/gmu0.70/mns8 | operators can be chained |
| held-out random | 8 | fixed-seed stratified samples over legal topology x low/nominal gmu x low/normal mns, excluding the already-passed sentinel | overfit denominator |
Total: 1 control + 20 novel bad starts.
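A fixed-seed stratified draw for the held-out random group could look like the sketch below. The topology list, the gmu and mns values per level, and the function name sample_held_out are illustrative assumptions; the real legal grid has to come from the harness. For brevity the sketch stratifies over gmu level x mns level only and draws topology inside each stratum.

```python
import random

# Illustrative assumptions only, not the suite's real grid.
LEGAL_TOPOLOGIES = [(8, 1), (4, 2), (2, 4), (1, 8), (4, 1), (2, 1)]
GMU_LEVELS = {"low": [0.50, 0.60], "nominal": [0.90]}
MNS_LEVELS = {"low": [8, 16], "normal": [64]}
SENTINEL = (8, 1, 0.50, 8)  # exact (TP, DP, gmu, mns) start that already passed


def sample_held_out(n, seed, excluded):
    """Draw n starts, round-robin over (gmu level, mns level) strata."""
    rng = random.Random(seed)  # private RNG, so the draw is reproducible
    strata = []
    for gmu_level in sorted(GMU_LEVELS):
        for mns_level in sorted(MNS_LEVELS):
            cells = [
                (tp, dp, gmu, mns)
                for (tp, dp) in LEGAL_TOPOLOGIES
                for gmu in GMU_LEVELS[gmu_level]
                for mns in MNS_LEVELS[mns_level]
                if (tp, dp, gmu, mns) not in excluded
            ]
            rng.shuffle(cells)
            strata.append(cells)
    picked = []
    # Take one start per stratum per pass until n starts are collected.
    while len(picked) < n and any(strata):
        for cells in strata:
            if cells and len(picked) < n:
                picked.append(cells.pop())
    return picked


held_out = sample_held_out(8, seed=20260626, excluded={SENTINEL})
for start in held_out:
    print(start)
```

Pre-registering the seed together with the grid is what makes the draw auditable: anyone can rerun it and obtain the same 8 starts.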
Primary metrics
- best-so-far SLO-feasible req/s/GPU / R_ref;
- time-to-95%-reference;
- normalized AUC over trial budget;
- final pass rate;
- executed normalized full-config repeat count;
- no-op blocked count;
- candidate family / operator sequence;
- stop reason and candidate_set_hash.
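The first three metrics can be computed from the ordered trial results along the lines of this sketch. The definitions here are assumptions where the list does not pin them down: time-to-95% is counted in measured trials, and normalized AUC is taken as the mean of the best-so-far ratio (capped at 1.0) over the full trial budget, holding the last value for unused budget.

```python
def summarize(trials, r_ref, budget=12):
    """trials: list of (req_per_gpu, slo_feasible) tuples in execution order."""
    best = 0.0
    curve = []
    time_to_95 = None
    for index, (req_per_gpu, slo_feasible) in enumerate(trials[:budget], start=1):
        # Only SLO-feasible trials may raise the best-so-far value.
        if slo_feasible:
            best = max(best, req_per_gpu)
        ratio = best / r_ref
        curve.append(ratio)
        if time_to_95 is None and ratio >= 0.95:
            time_to_95 = index
    # Pad unused budget with the final best-so-far ratio.
    last = curve[-1] if curve else 0.0
    curve += [last] * (budget - len(curve))
    auc = sum(min(r, 1.0) for r in curve) / budget
    return {"best_ratio": best / r_ref, "time_to_95": time_to_95, "normalized_auc": auc}


example = summarize(
    [(1.0, True), (3.0, False), (1.92, True), (2.0, True)],
    r_ref=2.0,
    budget=4,
)
print(example)
```

In the example the second trial has the highest throughput but is not SLO-feasible, so it does not move the best-so-far curve.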
Each run must retain:
state.json
proposals/*.json
harness/candidate-set-*.json
trials/trial-*/trial_spec.json
trials/trial-*/result.json
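A retention check over a run directory can be sketched with pathlib globbing. The helper name missing_artifacts is hypothetical; the patterns are the five listed above. The check only verifies that each pattern matches at least one file, not that every trial directory is complete.

```python
from pathlib import Path

REQUIRED_PATTERNS = [
    "state.json",
    "proposals/*.json",
    "harness/candidate-set-*.json",
    "trials/trial-*/trial_spec.json",
    "trials/trial-*/result.json",
]


def missing_artifacts(run_dir):
    """Return the required patterns that match no file under run_dir."""
    root = Path(run_dir)
    return [pattern for pattern in REQUIRED_PATTERNS if not any(root.glob(pattern))]
```

An empty return value means every required artifact class is present for that run.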
Pass/fail
Run-level pass:
best_so_far_req_per_gpu >= 0.95 * R_ref within 12 measured trials
pass_rate >= 0.95
executed_effective_config_repeat_count == 0
no harness stop while high-priority eligible candidates remain
Suite-level pass:
20 / 20 novel bad starts pass.
If any novel start fails, distribution-level bad-start robustness cannot be claimed. After a fix, the failure analysis must be frozen and the held-out random set must be re-drawn.
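The run-level and suite-level criteria translate into a small predicate, sketched below. The RunSummary fields mirror the names used in the criteria; stopped_with_high_priority_eligible is a hypothetical flag, and best_so_far_req_per_gpu is assumed to be already restricted to the first 12 measured trials.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    best_so_far_req_per_gpu: float  # best SLO-feasible value within 12 measured trials
    pass_rate: float
    executed_effective_config_repeat_count: int
    stopped_with_high_priority_eligible: bool  # hypothetical field name


def run_passes(run, r_ref):
    """All four run-level criteria must hold."""
    return (
        run.best_so_far_req_per_gpu >= 0.95 * r_ref
        and run.pass_rate >= 0.95
        and run.executed_effective_config_repeat_count == 0
        and not run.stopped_with_high_priority_eligible
    )


def suite_passes(runs, r_ref, expected=20):
    """The suite passes only if all 20 novel bad starts ran and passed."""
    return len(runs) == expected and all(run_passes(run, r_ref) for run in runs)


good = RunSummary(1.95, 0.97, 0, False)
repeated = RunSummary(1.95, 0.97, 1, False)
print(suite_passes([good] * 20, r_ref=2.0), suite_passes([good] * 19 + [repeated], r_ref=2.0))
```

Because the suite-level rule is 20 / 20, a single failing or missing run is enough to fail the suite.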
Overfit guards
- pre-register all starts and random seed;
- do not count the already-passed exact TP8,gmu0.5,mns8 sentinel in the 20-case denominator;
- do not tune thresholds between starts;
- report operator names, e.g. topology_bracket, topology_redistribute, runtime_floor_jump, admission_recovery, rather than case-specific actions;
- every stop must cite candidate_set_hash and evidence that no high-priority eligible candidate remains.
GPU cost
Expected:
21 runs * 6-8 measured trials = 126-168 trials
Hard cap:
21 runs * 12 measured trials = 252 trials
Rough estimate based on the current Qwen30B bounded replay:
15-35 min / measured trial
expected = 250-780 H20 GPU-hours
cap = 500-1175 H20 GPU-hours
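The GPU-hour ranges can be reproduced with the arithmetic below. The figure of 8 GPUs per measured trial is an assumption inferred from the numbers (it is the value that yields the stated ranges after rounding); it is not stated explicitly in this document.

```python
RUNS = 21
GPUS_PER_TRIAL = 8  # assumption: each measured trial occupies 8 H20 GPUs


def gpu_hours(trials, minutes_per_trial, gpus=GPUS_PER_TRIAL):
    """Wall-clock minutes per trial converted to total GPU-hours."""
    return trials * minutes_per_trial / 60 * gpus


# Expected: 6-8 measured trials per run, 15-35 min per trial.
expected = (gpu_hours(RUNS * 6, 15), gpu_hours(RUNS * 8, 35))
# Hard cap: 12 measured trials per run.
cap = (gpu_hours(RUNS * 12, 15), gpu_hours(RUNS * 12, 35))
print(expected, cap)  # (252.0, 784.0) (504.0, 1176.0)
```

Rounded, these match the 250-780 and 500-1175 H20 GPU-hour ranges quoted above.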
It is therefore recommended to run a 3-case pilot first:
| Pilot case | Start | Purpose |
|---|---|---|
| topology-only | TP=4, DP=2, gmu=0.9, mns=64 | check that recovery is not limited to TP8 |
| runtime-only | TP=2, DP=1, gmu=0.5, mns=8 | check runtime floor/admission recovery |
| combined | TP=1, DP=8, gmu=0.5, mns=16 | check that topology and runtime operators chain |
Launch the full 20-case suite only after the pilot passes.