diff --git a/docs/aituner-roadmap.md b/docs/aituner-roadmap.md
index aa6f17e..de14030 100644
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -102,6 +102,8 @@ declarative intervention grammar + coverage-relative validator.
 - Add an `auto_search_high` measurement policy: it may raise the ceiling automatically within the existing
   trace; if `search.high=1.0` is still insufficient, it must report `measurement_ceiling_insufficient` and wait
   for human confirmation, and must not silently repeat windows or synthesize arrivals;
+- normalized full-config signature: no-repeat must not rely on the patch signature alone; the base config and
+  a no-op patch must be recognized as the same full config;
 - Failure invalidation has a conservative region predicate and retry/unblock conditions;
 - grammar/policy/capability each carry a version and anti-overfitting static checks;
 - LLM/BO may only select legal candidates and cannot bypass the validator.
diff --git a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
index aaae5bd..861efa2 100644
--- a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
+++ b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
@@ -194,3 +194,90 @@ promising substrate, but not production-proven
 ```text
 Implement coverage-relative stop authority first, then rerun the bad-start distribution.
 ```
+
+## 2026-06-26 implementation validation
+
+Commit `c8a0f98` implements the first slice of the production fix:
+
+- `search.auto_high` schema, disabled by default and compatible with existing configs;
+- at trial materialization, the effective `search.high` is resolved within the existing trace sampling space;
+- `trial_spec.json` and `result.json` record auto-high / measurement evidence;
+- `search_high_saturated_by_incumbent` is downgraded to measurement evidence;
+- for studies that optimize `req/s/GPU` with variable topology, high saturation cannot directly authorize a stop;
+- when the GPU product is fixed but TP/DP redistribution is tunable, the topology still counts as variable;
+- when the auto-high ceiling is below `search.low`, no invalid search interval is generated.
+
+Local validation:
+
+```text
+PYTHONPATH=src python3 -m unittest discover -s tests
+Ran 143 tests OK
+```
+
+dash1 validation:
+
+```text
+run label = adversarial-badstart-autohigh-c8a0f98-20260626T122622Z
+git sha = c8a0f9870eac5438fb19be8edf1534a893723ab9
+machine = dash1, 8x H20
+```
+
+The spec still uses the bad-start config:
+
+```text
+tensor-parallel-size = 8
+data-parallel-size = 1
+gpu-memory-utilization = 0.5
+max-num-seqs = 8
+search.auto_high.enabled = true
+```
+
+Auto-high resolution:
+
+```text
+original_high = 1.0
+effective_high = 0.9979913161468553
+trace_max_sampling_u = 0.9979913161468553
+reason = search_high_lowered_to_trace_ceiling
+```
+
+Results:
+
+| trial | config patch | best sampling_u | request_rate | req/s/GPU | pass |
+| --- | --- | ---: | ---: | ---: | ---: |
+| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
+| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9784 |
+| trial-0003 | `tensor-parallel-size=8` | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
+
+Key conclusions:
+
+```text
+The old failure is fixed:
+after the baseline, harness-stop-0002/search_high_saturated_by_incumbent is no longer produced.
+
+The new implementation produces harness-proposal-0002 and tests the TP4 topology contrast.
+TP4 raises the best req/s/GPU from 1.0000 to 1.7375.
+```
+
+This demonstrates that the first slice fixes the problem of measurement saturation bypassing topology coverage.
+
+However, trial-0003 exposes a new blocker:
+
+```text
+The current no-repeat check is still based on the patch signature, not the normalized full-config signature.
+```
+
+`tensor-parallel-size=8` is a no-op against this study's base config and is equivalent to the baseline TP8,
+yet the system still executed it as a new proposal. The next P0 slice must therefore implement:
+
+1. normalized full-config signature;
+2. CandidateSet snapshot containing both eligible and blocked candidates;
+3. blocked reason, for example `blocked_noop_equivalent_to_tested_full_config`;
+4. Stop/report output that presents both `measurement_ceiling_*` and `eligible_candidates_remain`.
+
+The current verdict is therefore updated to:
+
+```text
+P0 measurement/stop-order slice: passed.
+P0 full coverage-relative harness: not yet passed.
+```
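
Review note: the normalized full-config signature requested for the next P0 slice could look roughly like the sketch below. This is not code from the repository. The names `full_config_signature` and `classify_candidates`, the flat-dict config representation, and the TP2/DP4 candidate are assumptions made for illustration; only the blocked-reason string and the bad-start values are taken from the validation notes in this diff.

```python
import hashlib
import json

# Reason string taken from the validation notes; everything else is hypothetical.
BLOCKED_NOOP = "blocked_noop_equivalent_to_tested_full_config"


def full_config_signature(base_config: dict, patch: dict) -> str:
    """Hash the merged config, so a no-op patch hashes like the base config.

    Assumes a flat dict whose values are JSON-serializable and consistently
    typed (8 and "8" would hash differently in this sketch).
    """
    merged = {**base_config, **patch}
    # Sorted keys and fixed separators give one canonical serialization.
    canonical = json.dumps(merged, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_candidates(base_config, tested_patches, candidate_patches):
    """Split candidates into eligible and blocked by full-config signature."""
    tested = {full_config_signature(base_config, p) for p in tested_patches}
    eligible, blocked = [], []
    for patch in candidate_patches:
        if full_config_signature(base_config, patch) in tested:
            blocked.append({"patch": patch, "reason": BLOCKED_NOOP})
        else:
            eligible.append({"patch": patch})
    return {"eligible": eligible, "blocked": blocked}


# Bad-start base config from the validation run.
base = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "gpu-memory-utilization": 0.5,
    "max-num-seqs": 8,
}

snapshot = classify_candidates(
    base,
    tested_patches=[{}, {"tensor-parallel-size": 4}],  # baseline and TP4
    candidate_patches=[
        {"tensor-parallel-size": 8},  # no-op against the base config
        {"tensor-parallel-size": 2, "data-parallel-size": 4},  # hypothetical
    ],
)
print(json.dumps(snapshot, indent=2))
```

Under these assumptions the `tensor-parallel-size=8` patch lands in `blocked` with the no-op reason instead of being executed as trial-0003, while the hypothetical TP2/DP4 candidate stays eligible. Comparing merged configs rather than patches is the design choice that matters here: two different patches that resolve to the same full config share one signature.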