Document bad-start validation results
@@ -194,3 +194,90 @@ promising substrate, but not production-proven
```text
Implement coverage-relative stop authority first, then rerun the bad-start distribution.
```

## 2026-06-26 implementation validation
Commit `c8a0f98` implements the first slice of the production fix:

- `search.auto_high` schema, disabled by default and compatible with existing configs;
- during trial materialization, the effective `search.high` is resolved within the existing trace sampling space;
- `trial_spec.json` and `result.json` record auto-high / measurement evidence;
- `search_high_saturated_by_incumbent` is downgraded to measurement evidence;
- for studies that optimize `req/s/GPU` with a variable topology, high saturation cannot directly authorize a stop;
- a fixed GPU product with adjustable TP/DP redistribution still counts as variable topology;
- when the auto-high ceiling falls below `search.low`, no invalid search interval is generated.

Local validation:

```text
PYTHONPATH=src python3 -m unittest discover -s tests
Ran 143 tests OK
```
dash1 validation:

```text
run label = adversarial-badstart-autohigh-c8a0f98-20260626T122622Z
git sha = c8a0f9870eac5438fb19be8edf1534a893723ab9
machine = dash1, 8x H20
```
The spec still uses the bad-start configuration:

```text
tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
search.auto_high.enabled = true
```
Auto-high resolution:

```text
original_high = 1.0
effective_high = 0.9979913161468553
trace_max_sampling_u = 0.9979913161468553
reason = search_high_lowered_to_trace_ceiling
```
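The resolution above can be read as clamping `search.high` to the trace ceiling. The following is a minimal sketch of that reading, not the code in `c8a0f98`: the function name, the dataclass, the value `search.low = 0.0`, and the two reason strings marked hypothetical are assumptions made for illustration. Only `search_high_lowered_to_trace_ceiling` comes from the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AutoHighResolution:
    original_high: float
    effective_high: Optional[float]  # None means no valid search interval
    reason: str


def resolve_auto_high(low, high, trace_max_sampling_u):
    """Clamp the search upper bound to the highest sampling_u the trace can supply."""
    if trace_max_sampling_u < low:
        # Ceiling below search.low: report it instead of emitting [low, ceiling].
        return AutoHighResolution(high, None, "trace_ceiling_below_search_low")  # hypothetical reason
    if trace_max_sampling_u < high:
        return AutoHighResolution(high, trace_max_sampling_u, "search_high_lowered_to_trace_ceiling")
    return AutoHighResolution(high, high, "search_high_unchanged")  # hypothetical reason


# search.low is not given in the log; 0.0 is assumed for illustration.
resolved = resolve_auto_high(low=0.0, high=1.0, trace_max_sampling_u=0.9979913161468553)
print(resolved)
```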
Results:

| trial | config patch | best sampling_u | request_rate | req/s/GPU | pass |
| --- | --- | ---: | ---: | ---: | ---: |
| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9784 |
| trial-0003 | `tensor-parallel-size=8` | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |

Key conclusions:

```text
The old failure is fixed:
after the baseline, harness-stop-0002/search_high_saturated_by_incumbent is no longer produced.

The new implementation produces harness-proposal-0002 and tests the TP4 topology contrast.
TP4 raises the best req/s/GPU from 1.0000 to 1.7375.
```
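The per-GPU figures in the table can be reproduced from `request_rate` alone. The sketch below assumes the GPU count of a trial is `tensor-parallel-size × data-parallel-size` and that trial-0002 keeps the baseline `data-parallel-size = 1`; both assumptions are consistent with the reported values but are not stated in the log. It shows why TP4 wins on `req/s/GPU` despite a lower absolute request rate.

```python
def req_per_s_per_gpu(request_rate, tensor_parallel_size, data_parallel_size=1):
    """Normalize throughput by the number of GPUs the topology occupies."""
    gpus = tensor_parallel_size * data_parallel_size  # assumed GPU count
    return request_rate / gpus


baseline_tp8 = req_per_s_per_gpu(8.00, tensor_parallel_size=8)  # trial-0001 and trial-0003
contrast_tp4 = req_per_s_per_gpu(6.95, tensor_parallel_size=4)  # trial-0002

print(f"TP8: {baseline_tp8:.4f} req/s/GPU")
print(f"TP4: {contrast_tp4:.4f} req/s/GPU")
```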
This demonstrates that the first slice fixes the problem of measurement saturation bypassing topology coverage.

However, trial-0003 exposes a new blocker:

```text
The current no-repeat check is still based on the patch signature, not a normalized full-config signature.
```
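The difference between the two signatures can be illustrated with the bad-start spec from this run. This is a sketch under stated assumptions: `patch_signature`, `full_config_signature`, and `classify` are hypothetical helpers, and "normalization" here only merges the patch into the base config and sorts keys, whereas a real implementation would likely also canonicalize defaults and value types. The blocked reason string is the one proposed in this log.

```python
import json

# Base config taken from the bad-start spec of this run.
BASE_CONFIG = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "gpu-memory-utilization": 0.5,
    "max-num-seqs": 8,
}


def patch_signature(patch):
    """Signature of the patch alone: a no-op patch still looks new."""
    return json.dumps(patch, sort_keys=True)


def full_config_signature(base, patch):
    """Signature of the merged config: a no-op patch collapses onto the base."""
    return json.dumps({**base, **patch}, sort_keys=True)


# Only the baseline (empty patch) has been tested so far.
tested_full_configs = {full_config_signature(BASE_CONFIG, {})}


def classify(patch):
    if full_config_signature(BASE_CONFIG, patch) in tested_full_configs:
        return "blocked_noop_equivalent_to_tested_full_config"
    return "eligible"


print(classify({"tensor-parallel-size": 8}))
print(classify({"tensor-parallel-size": 4}))
```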
`tensor-parallel-size=8` is a no-op against this study's base config and is equivalent to the baseline TP8, yet the system still executes it as a new proposal. The next P0 slice must therefore implement:

1. a normalized full-config signature;
2. a CandidateSet snapshot that includes both eligible and blocked candidates;
3. a blocked reason, for example `blocked_noop_equivalent_to_tested_full_config`;
4. stop/report output that presents both `measurement_ceiling_*` and `eligible_candidates_remain`.

The current verdict is therefore updated to:

```text
P0 measurement/stop-order slice: passed.
P0 full coverage-relative harness: not yet passed.
```
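One way to picture the gap between the two verdict lines is a stop decision that treats high saturation as evidence and lets candidate coverage decide. The sketch below is an interpretation, not the harness code: `StopDecision`, `decide_stop`, and the rule "stop only when no eligible candidates remain" are assumptions; the evidence strings `search_high_saturated_by_incumbent` and `eligible_candidates_remain` come from this log.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StopDecision:
    stop: bool
    evidence: List[str] = field(default_factory=list)


def decide_stop(high_saturated, metric, topology_variable, eligible_candidates):
    """Saturation alone may stop a study only when topology cannot change the metric."""
    evidence = []
    if high_saturated:
        evidence.append("search_high_saturated_by_incumbent")
    if eligible_candidates:
        evidence.append("eligible_candidates_remain")

    if not high_saturated:
        return StopDecision(False, evidence)
    if metric == "req/s/GPU" and topology_variable:
        # Assumed coverage-relative rule: untested candidates veto the stop.
        return StopDecision(not eligible_candidates, evidence)
    return StopDecision(True, evidence)


# After the baseline in this run: saturated, but TP4 has not been tried yet.
decision = decide_stop(True, "req/s/GPU", True, ["tensor-parallel-size=4"])
print(decision)
```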