gahow/aituner

Gahow Wang 384cb58f1f Add declarative harness prototype

2026-06-26 18:07:02 +08:00

Bad-start stop counterexample - 2026-06-26

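This note records a deliberately constructed adversarial bad-start test. Its purpose is not to show that the harness is already robust, but to attack the current implementation and check whether it recovers from an obviously unreasonable initial configuration.

Conclusion:

The current production/prototype harness cannot yet support a bad-start robustness claim.

Starting from a bad configuration with a high GPU count and high TP, it is stopped early by search_high_saturated_by_incumbent,
without ever testing a topology/resource-efficiency contrast.

This is not a problem that calls for a special-case TP=8 -> TP=4 rule. It exposes a more basic stop-authority problem: measurement saturation must not bypass the coverage-relative candidate set.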

Experiment setup

Machine: dash1, 8x H20.

Goal: start from a deliberately unreasonable initial configuration:

tensor-parallel-size = 8
data-parallel-size = 1
gpu-memory-utilization = 0.5
max-num-seqs = 8
LLM endpoint disabled

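Expected behavior:

the harness should not stop merely because the baseline is feasible;
it should at least generate a topology/resource-efficiency contrast candidate;
for a req/s/GPU objective, an 8-GPU incumbent must be validated by a lower-GPU or neighboring-topology probe.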

Run A: low search.high

The first run keeps the original search.high=0.125.

Result:

trial-0001 completed
harness-stop-0002
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best request_rate = 1.0333 total
best request_rate_per_gpu = 0.1292
pass_rate = 1.0

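Interpretation: the offered-load ceiling in this run is too low, so the baseline easily saturates search.high. The run therefore cannot distinguish "the configuration is genuinely good enough" from "the measurement ceiling is too low".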

Run B: corrected high search ceiling

The second run raises search.high to 1.0, keeps the same bad-start configuration, and sets max_trials=3.

Remote artifacts:

session = adv_badcase_corr_casea_20260626T095356Z
store = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z
spec = /home/admin/cpfs/wjh/aituner/aituner/.aituner-run-configs/adversarial-badcase-corrected-casea-20260626T095356Z/casea-combined-bad-highsearch.json
log = /home/admin/cpfs/wjh/aituner/aituner/.aituner/adversarial-badcase-corrected-casea-20260626T095356Z.log

The harness still stops right after the baseline:

trial-0001 completed
harness-stop-0002
no harness-proposal-0002.json
tuning_stop_reason = harness_stop
validator reason = search_high_saturated_by_incumbent
best sampling_u = 0.9375
best request_rate = 8.033333333333333
best request_rate_per_gpu = 1.0041666666666667
pass_rate = 1.0

Probe trace:

sampling_u	request_rate	feasible
0.5	4.6000	true
0.75	6.5167	true
0.875	7.5000	true
0.9375	8.0333	true
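
The sampling_u column matches what a plain bisection produces when every probe is feasible. The sketch below reproduces it under stated assumptions: low = 0.0, high = 1.0 and four probes are inferred from the trace rather than read from the harness code, and `bisect_probes` is a hypothetical helper, not a function in the repository.

```python
from typing import Callable, List


def bisect_probes(low: float, high: float, max_probes: int,
                  feasible: Callable[[float], bool]) -> List[float]:
    """Return the sampling_u values visited by a plain bisection."""
    visited = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        visited.append(mid)
        if feasible(mid):
            low = mid   # feasible: continue in the upper half
        else:
            high = mid  # infeasible: continue in the lower half
    return visited


# Assumption: every probe passes the SLO, as in the Run B trace.
trace = bisect_probes(0.0, 1.0, 4, lambda u: True)
print(trace)
```

Under these assumptions the last probe of an all-feasible run always lands exactly one probe step below search.high.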

The stop fires because the current guard computes:

binary_probe_resolution = max(tolerance, (high - low) / 2**max_probes)
                        = 0.0625
threshold_gap_to_high = 1.0 - 0.9375
                      = 0.0625

The current implementation therefore treats the incumbent as having saturated search.high.
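
The guard arithmetic can be checked with a few lines of Python. This is a sketch, not the harness code: low = 0.0 and max_probes = 4 are inferred from the numbers above, tolerance = 0.01 is a made-up value (anything at or below 0.0625 gives the same result), and the `<=` comparison in `saturates_search_high` is an assumed form of the guard.

```python
def binary_probe_resolution(low: float, high: float,
                            max_probes: int, tolerance: float) -> float:
    """Smallest step the bisection can resolve, floored at tolerance."""
    return max(tolerance, (high - low) / 2 ** max_probes)


def saturates_search_high(best_u: float, low: float, high: float,
                          max_probes: int, tolerance: float) -> bool:
    # Assumed comparison: the gap to the ceiling is within one probe step.
    gap = high - best_u
    return gap <= binary_probe_resolution(low, high, max_probes, tolerance)


resolution = binary_probe_resolution(0.0, 1.0, 4, 0.01)  # hypothetical tolerance
gap = 1.0 - 0.9375
saturated = saturates_search_high(0.9375, 0.0, 1.0, 4, 0.01)
print(resolution, gap, saturated)
```

If the guard does have this form, any configuration that passes all four probes triggers it, whatever its topology, because the gap after an all-feasible bisection equals the resolution by construction.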

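Why this is a counterexample

The current objective is SLO-constrained req/s/GPU, not total throughput at a fixed 8 GPUs. An 8-GPU incumbent saturating the offered-load ceiling does not prove that:

no lower-TP / lower-GPU configuration achieves higher req/s/GPU;
the current topology is optimal in resource efficiency;
the runtime knobs have entered a suitable trust region;
the no-LLM harness can recover from a bad start.

So this stop is unsound, at least relative to the bad-start robustness claim.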
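
A small calculation shows how low the bar is for an unmeasured candidate. The sketch below takes the Run B incumbent numbers and computes the total request rate a candidate on fewer GPUs would need in order to match it per GPU. The GPU counts 4, 2 and 1 are hypothetical candidates; none of them was measured in this experiment, and the helper names are illustrative.

```python
def per_gpu_rate(total_rate: float, gpus: int) -> float:
    """Objective value: sustained request rate divided by GPUs used."""
    return total_rate / gpus


def breakeven_total_rate(incumbent_per_gpu: float, gpus: int) -> float:
    """Total rate a candidate on `gpus` GPUs needs to match the incumbent."""
    return incumbent_per_gpu * gpus


# Run B incumbent: 8 GPUs, best request_rate from the result above.
incumbent = per_gpu_rate(8.033333333333333, 8)

# Hypothetical lower-GPU candidates (not measured in this run).
for gpus in (4, 2, 1):
    needed = breakeven_total_rate(incumbent, gpus)
    print(f"{gpus} GPU(s): needs > {needed:.4f} req/s total to win")
```

A 4-GPU candidate would only need to sustain slightly more than half of the incumbent's total rate to win on req/s/GPU, which is why saturating the ceiling on 8 GPUs says nothing about resource efficiency.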

More formally:

search_high_saturated_by_incumbent
  does not imply
incumbent_validated(topology/resource-efficiency)

When the objective includes resource efficiency and parallel-size/topology is still tunable, search_high_saturated_by_incumbent can serve only as measurement evidence, not as stop authority on its own.

Constraints on the new harness design

This counterexample directly constrains the declarative harness:

A complete CandidateSet must be generated and persisted before any stop.
The stop proof must reference candidate_set_hash.
If an uncovered high-priority topology/resource-efficiency candidate exists, the validator must return eligible_candidates_remain, even if the incumbent saturates search.high.
search.high saturation may only update measurement coverage; it cannot substitute for incumbent_validated.
For a req/s/GPU objective, required coverage must include at least one topology or resource-efficiency contrast, unless the StudySpec explicitly fixes the GPU budget and topology.
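
The constraints above can be sketched as a small validator. The names CandidateSet, CoverageUnit, candidate_set_hash, eligible_candidates_remain and incumbent_validated come from this note; every field, the hashing scheme and the `validate_stop` signature are assumptions made for illustration, not the prototype's actual interface. The StudySpec exception for a fixed GPU budget is not modeled.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoverageUnit:
    name: str            # e.g. "topology_contrast" (hypothetical label)
    high_priority: bool
    covered: bool        # True once measured or invalidated


@dataclass(frozen=True)
class CandidateSet:
    units: Tuple[CoverageUnit, ...]

    def hash(self) -> str:
        # Assumed scheme: SHA-256 over a canonical JSON dump of the units.
        payload = json.dumps(
            [[u.name, u.high_priority, u.covered] for u in self.units])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate_stop(candidates: CandidateSet,
                  search_high_saturated: bool) -> dict:
    """Decide whether a stop is allowed relative to the candidate set."""
    uncovered = [u.name for u in candidates.units
                 if u.high_priority and not u.covered]
    proof = {
        "candidate_set_hash": candidates.hash(),
        # Saturation is recorded as evidence only; it never decides the stop.
        "search_high_saturated": search_high_saturated,
        "uncovered": uncovered,
    }
    if uncovered:
        return {**proof, "stop": False, "reason": "eligible_candidates_remain"}
    return {**proof, "stop": True, "reason": "incumbent_validated"}


bad_start = CandidateSet((CoverageUnit("topology_contrast", True, False),))
print(validate_stop(bad_start, search_high_saturated=True)["reason"])
```

With the Run B situation (saturated ceiling, topology contrast not yet measured), this sketch refuses to stop, which is the behavior the counterexample asks for.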

It also means the repair must not take the form of:

if tp == 8 and gmu == 0.5: try tp = 4

The right direction is:

ordered topology lattice + resource-efficiency objective
  -> candidate set includes lower/redistributed topology contrast
  -> stop is blocked until that coverage unit is measured or invalidated
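
One way to read the first two arrows is as a generic enumeration instead of a hard-coded rule. The sketch below builds a lattice of (tensor-parallel-size, data-parallel-size) pairs and derives contrast candidates for any incumbent. Restricting sizes to powers of two and the labels "lower_gpu" and "redistributed" are assumptions; the sketch also does not check whether the model fits in memory at a lower TP.

```python
from typing import List, Tuple

Topology = Tuple[int, int]  # (tensor-parallel-size, data-parallel-size)


def topology_lattice(max_gpus: int) -> List[Topology]:
    """All (tp, dp) pairs with power-of-two sizes that fit in max_gpus."""
    sizes = []
    size = 1
    while size <= max_gpus:
        sizes.append(size)
        size *= 2
    return [(tp, dp) for tp in sizes for dp in sizes if tp * dp <= max_gpus]


def contrast_candidates(incumbent: Topology,
                        max_gpus: int) -> List[Tuple[Topology, str]]:
    """Label every other lattice point relative to the incumbent."""
    incumbent_gpus = incumbent[0] * incumbent[1]
    labeled = []
    for tp, dp in topology_lattice(max_gpus):
        if (tp, dp) == incumbent:
            continue
        if tp * dp < incumbent_gpus:
            labeled.append(((tp, dp), "lower_gpu"))
        elif tp * dp == incumbent_gpus:
            labeled.append(((tp, dp), "redistributed"))
    return labeled


# Bad-start incumbent from this note: TP=8, DP=1 on an 8-GPU machine.
for topology, kind in contrast_candidates((8, 1), max_gpus=8):
    print(topology, kind)
```

The TP=4 probe then appears as one lattice point among several, with no rule that mentions TP=8 or gpu-memory-utilization = 0.5.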

Current verdict

Current production harness:

prototype, not yet fundamental

New declarative prototype:

promising substrate, but not production-proven

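It already exercises the minimal interfaces for CandidateSet, CoverageUnit, failure region, and coverage-relative stop, but it is not yet wired into the real tuning loop and has not yet demonstrated convergence over a bad-start distribution.

The next P0 gate is therefore:

Implement coverage-relative stop authority first, then rerun the bad-start distribution.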
