From 42f75553a6439f7561a0cecbe8dbc24baa69ab1b Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Fri, 26 Jun 2026 21:52:18 +0800
Subject: [PATCH] Document full config signature validation

---
 docs/aituner-roadmap.md                       |  5 +-
 .../bad-start-stop-counterexample-20260626.md | 89 +++++++++++++++++++
 2 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/docs/aituner-roadmap.md b/docs/aituner-roadmap.md
index de14030..0dd0f00 100644
--- a/docs/aituner-roadmap.md
+++ b/docs/aituner-roadmap.md
@@ -83,7 +83,7 @@ closed-form performance model of kernel, KV cache, communication, and queueing. The safer and stronger
 | C5. AITuner finds a near-optimal region, not just a single feasible config | Qwen30B shows an explanatory signal | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick 1-2 cases for a local grid or an expert-config comparison |
 | C6. AITuner moves to the appropriate frontier as SLO tightness changes | Done for Qwen30B | [Qwen30B SLO robustness](harness-ablation/qwen30b-slo-robustness-20260624.md) | Pick one more non-homogeneous case for an SLO sweep; also plot SLO tightness -> frontier/regime transition |
 | C7. The engine adapter makes the intervention grammar portable to other serving engines | Feasible by design; not yet a main-experiment claim | `EngineLaunchSpec` / launch recipe / tunable schema | After the vLLM mainline is done, build an SGLang adapter and one low-cost validation case |
-| C8. The harness can recover from bad starting points and does not rely solely on a trusted base config | Counterexample found; cannot claim | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring to grammar/operator + coverage-relative stop, run a random/adversarial start distribution |
+| C8. The harness can recover from bad starting points and does not rely solely on a trusted base config | A single adversarial bad-start has passed first repair; distribution-level robustness cannot be claimed | [Declarative intervention harness design](harness-ablation/declarative-intervention-harness-design-20260626.md), [Bad-start stop counterexample](harness-ablation/bad-start-stop-counterexample-20260626.md), [No-LLM harness mechanism](harness-ablation/no-llm-harness-mechanism-20260625.md) | After refactoring to grammar/operator + coverage-relative stop, run a random/adversarial start distribution |

 ## Highest-priority experiments

@@ -103,7 +103,8 @@ declarative intervention grammar + coverage-relative validator.
   `search.high=1.0` is still insufficient, `measurement_ceiling_insufficient` must be reported, pending human
   confirmation; windows must not be silently repeated and arrivals must not be synthesized;
 - normalized full-config signature: no-repeat must not look only at the patch signature; the base config and
-  a no-op patch must be recognized as the same full config;
+  a no-op patch must be recognized as the same full config; `48911b6` implements this and passed the dash1 bad-start
+  validation;
 - Failure invalidation has a conservative region predicate and retry/unblock conditions;
 - grammar/policy/capability all carry a version and anti-overfitting static checks;
 - LLM/BO may only choose legal candidates and cannot bypass the validator.
diff --git a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
index 861efa2..6183fb2 100644
--- a/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
+++ b/docs/harness-ablation/bad-start-stop-counterexample-20260626.md
@@ -281,3 +281,92 @@ TP4 raises best req/s/GPU from 1.0000 to 1.7375.
 P0 measurement/stop-order slice: passed.
 P0 full coverage-relative harness: not yet passed.
 ```
+
+## 2026-06-26 normalized full-config validation
+
+Commit `48911b6` fixes the new blocker exposed in the previous section: no-repeat no longer compares only the patch
+signature; it compares the normalized effective full config.
+
+Implementation semantics:
+
+```text
+effective_config =
+  normalize(base_envs + env_patch,
+            base_flags + flag_patch)
+
+no_repeat_signature = stable_json(effective_config)
+```
+
+The validator therefore treats the following two proposals as the same full config:
+
+```text
+baseline patch: {}
+noop patch: {"tensor-parallel-size": 8}
+```
+
+Local validation:
+
+```text
+PYTHONPATH=src python3 -m unittest discover -s tests
+Ran 145 tests OK
+```
+
+dash1 validation:
+
+```text
+run label = adversarial-badstart-fullsig-48911b6-20260626T133112Z
+git sha = 48911b658bbf052d70d952d1cdf55ad6b50ba7a5
+machine = dash1, 8x H20
+```
+
+The spec still uses the same adversarial bad-start:
+
+```text
+tensor-parallel-size = 8
+data-parallel-size = 1
+gpu-memory-utilization = 0.5
+max-num-seqs = 8
+search.auto_high.enabled = true
+LLM endpoint disabled
+```
+
+Results:
+
+| trial | proposal | best sampling_u | request_rate | req/s/GPU | pass |
+| --- | --- | ---: | ---: | ---: | ---: |
+| trial-0001 | baseline TP8, DP1, gmu0.5, mns8 | 0.935616858887 | 8.00 | 1.0000 | 1.0000 |
+| trial-0002 | `tensor-parallel-size=4` | 0.810867944369 | 6.95 | 1.7375 | 0.9832 |
+| trial-0003 | `tensor-parallel-size=4`, `gpu-memory-utilization=0.9` | 0.935616858887 | 8.00 | 2.0000 | 1.0000 |
+
+Key observation:
+
+```text
+old trial-0003:
+  {"tensor-parallel-size": 8}
+  -> equivalent to baseline, but still executed
+
+new trial-0003:
+  {"tensor-parallel-size": 4, "gpu-memory-utilization": 0.9}
+  -> continues testing KV/cache headroom on the already-validated TP4 topology
+```
+
+This shows that the normalized full-config signature now blocks patch-level no-op re-measurement.
+
+Mechanism:
+
+1. the baseline TP8 run saturating the search ceiling is recorded only as measurement evidence;
+2. because the objective is `req/s/GPU`, the topology/resource-efficiency contrast is still uncovered, so
+   the validator does not allow a stop;
+3. the harness first tests the adjacent lower-TP topology; TP4 raises `req/s/GPU` from `1.0` to `1.7375`;
+4. no-repeat uses the full config signature to block the equivalent TP8 patch;
+5. the harness continues testing runtime headroom on the settled TP4 topology; `gmu=0.9` raises
+   `req/s/GPU` to `2.0`.
+
+The current verdict is updated to:
+
+```text
+P0 measurement/stop-order slice: passed.
+P0 normalized full-config no-repeat slice: passed.
+P0 single adversarial bad-start recovery: passed for this case.
+P0 distribution-level bad-start robustness: not yet proven.
+```
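
The no-repeat semantics documented in this patch can be illustrated with a minimal Python sketch. This is not the code from `48911b6`: the function names `effective_config`, `no_repeat_signature`, and `first_unseen` are hypothetical, `normalize` is assumed to be a plain dict overlay in which patch values win, and `stable_json` is assumed to be key-sorted JSON serialization. The base flags are the adversarial bad-start values listed in the spec.

```python
import json

# Adversarial bad-start flags from the spec; envs are assumed empty here.
BASE_ENVS = {}
BASE_FLAGS = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "gpu-memory-utilization": 0.5,
    "max-num-seqs": 8,
}


def effective_config(base_envs, env_patch, base_flags, flag_patch):
    """Overlay each patch on its base (assumed meaning of `base + patch`)."""
    return {
        "envs": {**base_envs, **env_patch},
        "flags": {**base_flags, **flag_patch},
    }


def no_repeat_signature(base_envs, env_patch, base_flags, flag_patch):
    """Serialize the effective config with sorted keys so the signature is stable."""
    cfg = effective_config(base_envs, env_patch, base_flags, flag_patch)
    return json.dumps(cfg, sort_keys=True, separators=(",", ":"))


def first_unseen(flag_patches, seen):
    """Return the first patch whose full-config signature has not been run yet."""
    for flag_patch in flag_patches:
        sig = no_repeat_signature(BASE_ENVS, {}, BASE_FLAGS, flag_patch)
        if sig not in seen:
            seen.add(sig)
            return flag_patch
    return None


seen = set()
first_unseen([{}], seen)  # records the baseline signature
# The TP8 patch collapses onto the baseline signature, so TP4 is chosen instead.
chosen = first_unseen([{"tensor-parallel-size": 8}, {"tensor-parallel-size": 4}], seen)
print(chosen)
```

One limitation of this sketch: `json.dumps` serializes `8` and `8.0` differently, so a real `normalize` step would presumably also need to canonicalize value types before signing.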