Bad-start robustness suite - 2026-06-26

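This document defines the distribution-level validation for P0 bad-start robustness. It is not a new claim result but a pre-registration for the next round of experiments: starts, metrics, and pass/fail criteria are fixed before anything runs, so that rules are not tuned to a single case.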

Current premises

Code gates already in place:

normalized full-config signature;
materialized effective signature: a runtime-only proposal first inherits the incumbent topology and is then signed;
CLI hard-veto: LLM/manual/harness proposals may not repeat an effective config before entering a trial;
CandidateSet audit: candidate_set_hash, eligible/blocked candidates, blocked reason summary;
sidecar persistence: harness/candidate-set-*.json.
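
As an illustration of how the signature gates and the hard-veto could fit together, here is a minimal sketch. All function names, the config keys (`tp`, `dp`, `gmu`, `mns`), and the choice of SHA-256 over canonical JSON are assumptions made for this example, not the harness's actual implementation.

```python
import hashlib
import json

# Assumed key names for the topology part of a config (hypothetical).
TOPOLOGY_KEYS = ("tp", "dp")


def normalized_signature(config: dict) -> str:
    """Hash a full config after normalizing key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def effective_signature(proposal: dict, incumbent: dict) -> str:
    """Materialize a proposal against the incumbent, then sign it.

    A runtime-only proposal carries no topology keys, so it inherits
    the incumbent topology before the signature is computed.
    """
    materialized = {key: incumbent[key] for key in TOPOLOGY_KEYS}
    materialized.update(proposal)  # explicit proposal values win
    return normalized_signature(materialized)


def hard_veto(proposal: dict, incumbent: dict, executed: set) -> bool:
    """Return True if the proposal would repeat an executed effective config."""
    return effective_signature(proposal, incumbent) in executed


incumbent = {"tp": 4, "dp": 1, "gmu": 0.5, "mns": 8}
executed = {normalized_signature(incumbent)}
# A runtime-only proposal that changes nothing is blocked as a repeat.
print(hard_veto({"gmu": 0.5, "mns": 8}, incumbent, executed))
```

Signing the materialized config rather than the raw proposal is what lets a runtime-only no-op be detected as a repeat even though the proposal itself never mentions topology.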

Single case already passed:

TP8, DP1, gmu0.5, max-num-seqs8
  -> TP4
  -> TP4 + gmu0.9

This case demonstrates sentinel recovery only; it does not demonstrate distribution-level robustness.

Experiment matrix

All runs use the same Qwen30B-A3B community vLLM 0.20 bounded replay setup, the no-LLM harness, and search.auto_high.enabled=true. A fresh trusted-start control runs first to obtain the reference value R_ref at the same commit.

Group	N	Initial starts	What it demonstrates
trusted control	1	trusted/default start	defines `R_ref`
topology-only	4	`(TP,DP)=(8,1),(4,2),(1,4),(2,4)`, runtime nominal	recovery is not limited to `TP8 -> TP4`
runtime-only	4	`TP2/DP1` with `gmu={0.50,0.70}` and `max-num-seqs={8,16}`	runtime floor/admission recovery
combined	4	`TP8/gmu0.70/mns16`, `TP4/DP2/gmu0.50/mns8`, `TP1/DP8/gmu0.50/mns16`, `TP2/DP4/gmu0.70/mns8`	operators can be chained
held-out random	8	fixed-seed stratified samples over legal topology x low/nominal `gmu` x low/normal `mns`, excluding the already-passed sentinel	overfit denominator

Total: 1 control + 20 novel bad starts.
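
The held-out random group could be drawn along the following lines. This is a sketch only: the legal topology list, the concrete `gmu` and `mns` values per level, and the "two starts per stratum" rule are assumptions for illustration; the pre-registered values are whatever the suite actually fixes.

```python
import random
from itertools import product

# Assumed legal (TP, DP) topologies and level values (hypothetical).
TOPOLOGIES = [(8, 1), (4, 2), (2, 4), (1, 8)]
GMU_LEVELS = {"low": 0.5, "nominal": 0.9}
MNS_LEVELS = {"low": 8, "normal": 64}
# The already-passed sentinel: TP8, DP1, gmu0.5, max-num-seqs8.
SENTINEL = (8, 1, 0.5, 8)


def sample_held_out(seed: int, per_stratum: int = 2) -> list:
    """Draw starts stratified by (gmu level, mns level) with a fixed seed."""
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    starts = []
    for gmu, mns in product(GMU_LEVELS.values(), MNS_LEVELS.values()):
        pool = [
            (tp, dp, gmu, mns)
            for tp, dp in TOPOLOGIES
            if (tp, dp, gmu, mns) != SENTINEL
        ]
        starts.extend(rng.sample(pool, per_stratum))
    return starts


held_out = sample_held_out(seed=20260626)
print(len(held_out))  # 4 strata x 2 starts each
```

Recording the seed alongside the resulting list means the draw can be audited, and a redraw after a failure only requires registering a new seed.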

Primary metrics

best-so-far SLO-feasible req/s/GPU / R_ref;
time-to-95%-reference;
normalized AUC over trial budget;
final pass rate;
executed normalized full-config repeat count;
no-op blocked count;
candidate family / operator sequence;
stop reason and candidate_set_hash.
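
The first three metrics can be derived from the per-trial throughput series. The sketch below assumes that infeasible trials are recorded as `None`, that time-to-95% is counted in measured trials, and that a run stopping before the budget is padded with its final best-so-far value when computing AUC. These conventions are assumptions for illustration and should be fixed in the pre-registration itself.

```python
from typing import List, Optional


def best_so_far_curve(
    trial_values: List[Optional[float]], r_ref: float, budget: int
) -> List[float]:
    """Best-so-far SLO-feasible req/s/GPU divided by R_ref, per trial."""
    curve = []
    best = 0.0
    for value in trial_values[:budget]:
        if value is not None:  # None marks an SLO-infeasible trial
            best = max(best, value)
        curve.append(best / r_ref)
    # Pad early-stopped runs with the last value (assumed convention).
    last = curve[-1] if curve else 0.0
    curve.extend([last] * (budget - len(curve)))
    return curve


def time_to_fraction(curve: List[float], fraction: float = 0.95) -> Optional[int]:
    """1-based trial index at which the curve first reaches the fraction."""
    for index, ratio in enumerate(curve, start=1):
        if ratio >= fraction:
            return index
    return None


def normalized_auc(curve: List[float]) -> float:
    """Mean of the best-so-far ratio over the trial budget."""
    return sum(curve) / len(curve)


curve = best_so_far_curve([None, 5.0, 9.6, 9.0], r_ref=10.0, budget=4)
print(time_to_fraction(curve), normalized_auc(curve))
```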

Every run must retain:

state.json
proposals/*.json
harness/candidate-set-*.json
trials/trial-*/trial_spec.json
trials/trial-*/result.json
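
A retention check over a run directory might look like the sketch below. The glob patterns are taken from the list above; the function name and the "at least one match per pattern" rule are assumptions for illustration.

```python
from pathlib import Path

# Patterns copied from the retention list; each must match at least once.
REQUIRED_PATTERNS = [
    "state.json",
    "proposals/*.json",
    "harness/candidate-set-*.json",
    "trials/trial-*/trial_spec.json",
    "trials/trial-*/result.json",
]


def missing_artifacts(run_dir: str) -> list:
    """Return the required patterns that match nothing under run_dir."""
    root = Path(run_dir)
    return [
        pattern
        for pattern in REQUIRED_PATTERNS
        if not any(root.glob(pattern))
    ]
```

A run whose `missing_artifacts(...)` result is non-empty could be flagged before its metrics are counted, so that no pass is recorded without the evidence behind it.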

Pass/fail

Run-level pass:

best_so_far_req_per_gpu >= 0.95 * R_ref within 12 measured trials
pass_rate >= 0.95
executed_effective_config_repeat_count == 0
no harness stop while high-priority eligible candidates remain

Suite-level pass:

20 / 20 novel bad starts pass.

If any novel start fails, distribution-level bad-start robustness cannot be claimed. After a fix, the failure analysis must be frozen and the held-out random set must be redrawn.
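
The run-level and suite-level criteria can be encoded as plain predicates. In this sketch the `RunSummary` type and its field names are hypothetical; only the thresholds (0.95 of R_ref within 12 measured trials, pass rate of at least 0.95, zero executed repeats, no stop with high-priority eligible candidates remaining, 20 of 20) come from the criteria above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunSummary:
    # Measured trial at which best-so-far first reached 0.95 * R_ref,
    # or None if it never did (hypothetical field).
    trials_to_95pct: Optional[int]
    pass_rate: float
    executed_effective_config_repeat_count: int
    # True if the harness stopped while high-priority eligible
    # candidates remained (hypothetical field).
    stopped_with_eligible_high_priority: bool


def run_passes(run: RunSummary, max_trials: int = 12) -> bool:
    """Apply all four run-level criteria; every one must hold."""
    return (
        run.trials_to_95pct is not None
        and run.trials_to_95pct <= max_trials
        and run.pass_rate >= 0.95
        and run.executed_effective_config_repeat_count == 0
        and not run.stopped_with_eligible_high_priority
    )


def suite_passes(runs: List[RunSummary], expected: int = 20) -> bool:
    """The suite passes only if every novel bad start passes."""
    return len(runs) == expected and all(run_passes(r) for r in runs)
```

Checking `len(runs) == expected` guards against a suite that appears to pass only because some starts were silently dropped from the denominator.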

Overfit guards

pre-register all starts and the random seed;
do not count the already-passed exact TP8,gmu0.5,mns8 sentinel in the 20-case denominator;
do not tune thresholds between starts;
report operator names, e.g. topology_bracket, topology_redistribute, runtime_floor_jump, admission_recovery, rather than case-specific actions;
every stop must cite candidate_set_hash and evidence that no high-priority eligible candidate remained.

GPU cost

Expected:

21 runs * 6-8 measured trials = 126-168 trials

Hard cap:

21 runs * 12 measured trials = 252 trials

Rough estimate based on the current Qwen30B bounded replay:

15-35 min / measured trial
expected = 250-780 H20 GPU-hours
cap      = 500-1175 H20 GPU-hours
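
The GPU-hour ranges above can be reproduced with the arithmetic below if each measured trial is assumed to occupy 8 GPUs. That GPU count is an inference from the stated totals, not a figure given in this document, and the function name is hypothetical.

```python
def gpu_hours(
    runs: int,
    trials_per_run: int,
    minutes_per_trial: float,
    gpus_per_trial: int = 8,  # assumed; inferred from the stated totals
) -> float:
    """Total GPU-hours for a suite of runs."""
    wall_hours = runs * trials_per_run * minutes_per_trial / 60
    return wall_hours * gpus_per_trial


# Expected: 6-8 measured trials per run at 15-35 min per trial.
expected_low = gpu_hours(21, 6, 15)    # 252.0, rounded to 250 above
expected_high = gpu_hours(21, 8, 35)   # 784.0, rounded to 780 above

# Hard cap: 12 measured trials per run.
cap_low = gpu_hours(21, 12, 15)        # 504.0, rounded to 500 above
cap_high = gpu_hours(21, 12, 35)       # 1176.0, rounded to 1175 above
```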

A 3-case pilot is therefore recommended first:

Pilot case	Start	Purpose
topology-only	`TP=4, DP=2, gmu=0.9, mns=64`	checks that recovery is not limited to TP8
runtime-only	`TP=2, DP=1, gmu=0.5, mns=8`	checks runtime floor/admission recovery
combined	`TP=1, DP=8, gmu=0.5, mns=16`	checks topology + runtime chaining

Launch the full 20-case suite only after the pilot passes.
