Bad-start robustness suite - 2026-06-26

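This document defines the distribution-level validation for P0 bad-start robustness. It is not a new claim result but a pre-registration for the next round of experiments: fix the starts, metrics, and pass/fail criteria first, then run, so that rules are not tuned to individual cases.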

Current premises

Completed code gates

  • normalized full-config signature;
  • materialized effective signature: a runtime-only proposal first inherits the incumbent topology and is then signed;
  • CLI hard-veto: LLM/manual/harness proposals that repeat an effective config are rejected before they enter a trial;
  • CandidateSet audit: candidate_set_hash, eligible/blocked candidates, blocked reason summary;
  • sidecar persistence: harness/candidate-set-*.json.
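The signature and hard-veto gates listed above can be illustrated with a minimal sketch. It is not the harness's actual code: the key names (`tp`, `dp`, `gmu`, `max_num_seqs`) and the functions `materialize`, `effective_signature`, and `hard_veto` are hypothetical, and the normalization shown (sorted keys, numbers coerced to float) is only one plausible choice.

```python
import hashlib
import json

# Hypothetical topology key names; the real config schema may differ.
TOPOLOGY_KEYS = ("tp", "dp")


def materialize(proposal: dict, incumbent: dict) -> dict:
    """Fill topology keys missing from a runtime-only proposal from the incumbent."""
    effective = dict(proposal)
    for key in TOPOLOGY_KEYS:
        effective.setdefault(key, incumbent[key])
    return effective


def effective_signature(config: dict) -> str:
    """Hash a canonical JSON form so key order and 4 vs 4.0 do not change the signature."""
    normalized = {
        k: float(v) if isinstance(v, (int, float)) and not isinstance(v, bool) else v
        for k, v in config.items()
    }
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hard_veto(proposal: dict, incumbent: dict, executed_signatures: set) -> bool:
    """Return True when the proposal would repeat an already executed effective config."""
    return effective_signature(materialize(proposal, incumbent)) in executed_signatures
```

Signing after materialization is what lets a runtime-only proposal be recognized as a duplicate even though it never mentions the topology explicitly.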

Single case already passed

TP8, DP1, gmu0.5, max-num-seqs8
  -> TP4
  -> TP4 + gmu0.9

This case only demonstrates sentinel recovery; it does not demonstrate distribution-level robustness.

Experiment matrix

All runs use the same Qwen30B-A3B community vLLM 0.20 bounded replay setup, the no-LLM harness, and search.auto_high.enabled=true. First run a fresh trusted-start control to obtain the reference value R_ref at the same commit.

| Group | N | Initial starts | What it shows |
| --- | --- | --- | --- |
| trusted control | 1 | trusted/default start | defines R_ref |
| topology-only | 4 | (TP,DP)=(8,1),(4,2),(1,4),(2,4); runtime nominal | recovery is not limited to TP8 -> TP4 |
| runtime-only | 4 | TP2/DP1 with gmu={0.50,0.70} and max-num-seqs={8,16} | runtime floor/admission recovery |
| combined | 4 | TP8/gmu0.70/mns16, TP4/DP2/gmu0.50/mns8, TP1/DP8/gmu0.50/mns16, TP2/DP4/gmu0.70/mns8 | operators can be chained |
| held-out random | 8 | fixed-seed stratified samples over legal topology x low/nominal gmu x low/normal mns, excluding the already-passed sentinel | overfit denominator |

Total: 1 control + 20 novel bad starts.
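One way to draw the held-out random group reproducibly is sketched below. Everything concrete in it is an assumption made for illustration: the legal topology list, the low/nominal levels (nominal gmu 0.90 and mns 64 are guesses), stratifying by topology with two draws each, and the function name `held_out_starts`. The real strata should come from the harness.

```python
import itertools
import random

# Hypothetical strata, for illustration only.
LEGAL_TOPOLOGIES = [(8, 1), (4, 2), (2, 4), (1, 8)]  # (TP, DP)
GMU_LEVELS = [0.50, 0.90]                            # low, nominal (assumed values)
MNS_LEVELS = [8, 64]                                 # low, normal (assumed values)
SENTINEL = ((8, 1), 0.50, 8)                         # already-passed case, excluded


def held_out_starts(seed: int, per_topology: int = 2) -> list:
    """Draw the same number of runtime cells from every topology stratum."""
    rng = random.Random(seed)  # private generator, so the draw depends only on the seed
    starts = []
    for topo in LEGAL_TOPOLOGIES:
        cells = [
            (topo, gmu, mns)
            for gmu, mns in itertools.product(GMU_LEVELS, MNS_LEVELS)
            if (topo, gmu, mns) != SENTINEL
        ]
        starts.extend(rng.sample(cells, per_topology))
    return starts
```

With four topologies and two draws per topology this yields 8 starts, and re-running with the pre-registered seed reproduces the same set.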

Primary metrics

  • best-so-far SLO-feasible req/s/GPU / R_ref
  • time-to-95%-reference
  • normalized AUC over trial budget
  • final pass rate
  • executed normalized full-config repeat count
  • no-op blocked count
  • candidate family / operator sequence
  • stop reason and candidate_set_hash
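The first three metrics can be derived from the per-trial throughput series. The sketch below assumes one SLO-feasible req/s/GPU value per measured trial (`None` when the trial was infeasible) and a 12-trial budget. The AUC definition used here, the mean of the best-so-far ratio capped at 1.0 and carried forward over unused budget, is an assumption, since the list above does not define it; `progress_metrics` is a hypothetical name.

```python
from typing import Optional, Sequence


def progress_metrics(
    trial_values: Sequence[Optional[float]],
    r_ref: float,
    budget: int = 12,
    target: float = 0.95,
) -> dict:
    """Compute best-so-far ratio, time-to-target, and a normalized AUC."""
    best = 0.0
    curve = []
    time_to_target = None
    for i in range(budget):
        # Trials beyond the end of the series carry the last best value forward.
        if i < len(trial_values) and trial_values[i] is not None:
            best = max(best, trial_values[i])
        ratio = best / r_ref
        curve.append(ratio)
        if time_to_target is None and ratio >= target:
            time_to_target = i + 1  # 1-based index of the measured trial
    return {
        "best_ratio": curve[-1],
        "time_to_target": time_to_target,
        "normalized_auc": sum(min(r, 1.0) for r in curve) / budget,
    }
```

A run that never reaches the target reports `time_to_target` as `None` rather than the budget, which keeps "never reached" distinct from "reached on the last trial".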

Each run must retain:

state.json
proposals/*.json
harness/candidate-set-*.json
trials/trial-*/trial_spec.json
trials/trial-*/result.json
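A retention check over a run directory might look like the sketch below. The glob patterns are the ones listed above. The function name `missing_artifacts` and the extra rule that every `trials/trial-*` directory must hold both files are assumptions.

```python
from pathlib import Path

REQUIRED_PATTERNS = [
    "state.json",
    "proposals/*.json",
    "harness/candidate-set-*.json",
    "trials/trial-*/trial_spec.json",
    "trials/trial-*/result.json",
]


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return required patterns with no match, plus per-trial files that are absent."""
    missing = [p for p in REQUIRED_PATTERNS if not any(run_dir.glob(p))]
    for trial_dir in sorted(run_dir.glob("trials/trial-*")):
        for name in ("trial_spec.json", "result.json"):
            path = trial_dir / name
            if not path.is_file():
                missing.append(path.relative_to(run_dir).as_posix())
    return missing
```

An empty return value means the run directory is complete under these assumptions.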

Pass/fail

Run-level pass

best_so_far_req_per_gpu >= 0.95 * R_ref within 12 measured trials
pass_rate >= 0.95
executed_effective_config_repeat_count == 0
no harness stop while high-priority eligible candidates remain

Suite-level pass

20 / 20 novel bad starts pass.

If any novel start fails, distribution-level bad-start robustness cannot be claimed. After a fix, the failure analysis must be frozen and the held-out random set must be re-drawn.
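The run-level and suite-level criteria can be written as predicates. This is a sketch: the `RunSummary` field names are hypothetical, and `best_ratio_within_budget` is assumed to already be the best-so-far req/s/GPU divided by R_ref, restricted to the first 12 measured trials.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunSummary:
    best_ratio_within_budget: float              # best_so_far_req_per_gpu / R_ref
    pass_rate: float
    executed_effective_config_repeat_count: int
    stopped_with_high_priority_eligible: bool    # harness stopped while such candidates remained


def run_passes(run: RunSummary) -> bool:
    """All four run-level criteria must hold."""
    return (
        run.best_ratio_within_budget >= 0.95
        and run.pass_rate >= 0.95
        and run.executed_effective_config_repeat_count == 0
        and not run.stopped_with_high_priority_eligible
    )


def suite_passes(novel_runs: Sequence[RunSummary], expected: int = 20) -> bool:
    """The suite passes only if every one of the expected novel starts passes."""
    return len(novel_runs) == expected and all(run_passes(r) for r in novel_runs)
```

Checking the run count explicitly prevents a suite with dropped or missing runs from passing by default.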

Overfit guards

  • pre-register all starts and random seed
  • do not count the already-passed exact TP8,gmu0.5,mns8 sentinel in the 20-case denominator
  • do not tune thresholds between starts
  • report operator names, for example topology_bracket, topology_redistribute, runtime_floor_jump, admission_recovery, rather than case-specific actions
  • every stop must cite the candidate_set_hash and evidence that no high-priority eligible candidate remains.

GPU cost

Expected

21 runs * 6-8 measured trials = 126-168 trials

Hard cap

21 runs * 12 measured trials = 252 trials

Rough estimate based on the current Qwen30B bounded replay:

15-35 min / measured trial
expected = 250-780 H20 GPU-hours
cap      = 500-1175 H20 GPU-hours
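The GPU-hour ranges can be reproduced with the arithmetic below. The figure of 8 GPUs per trial is an assumption inferred from the quoted numbers and is not stated in this document. With it, the results land within rounding of the ranges above.

```python
RUNS = 21
GPUS_PER_TRIAL = 8  # assumption: each measured trial occupies 8 H20 GPUs


def gpu_hours(trials_per_run: int, minutes_per_trial: float) -> float:
    """Total GPU-hours for the whole suite at a given per-run trial count."""
    return RUNS * trials_per_run * minutes_per_trial / 60 * GPUS_PER_TRIAL


expected_low = gpu_hours(6, 15)    # 126 trials at 15 min
expected_high = gpu_hours(8, 35)   # 168 trials at 35 min
cap_low = gpu_hours(12, 15)        # 252 trials at 15 min
cap_high = gpu_hours(12, 35)       # 252 trials at 35 min

print(f"expected = {expected_low:.0f}-{expected_high:.0f} GPU-hours")
print(f"cap      = {cap_low:.0f}-{cap_high:.0f} GPU-hours")
```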

Therefore, a 3-case pilot is recommended first:

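| Pilot case | Start | Purpose |
| --- | --- | --- |
| topology-only | TP=4, DP=2, gmu=0.9, mns=64 | check that recovery is not limited to TP8 |
| runtime-only | TP=2, DP=1, gmu=0.5, mns=8 | check runtime floor/admission recovery |
| combined | TP=1, DP=8, gmu=0.5, mns=16 | check chained topology + runtime recovery |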

Launch the full 20-case suite only after the pilot passes.