gahow/aituner

Gahow Wang 6c84dc91d7 Document hardened topology feedback

2026-06-29 02:34:12 +08:00

Prefill Scheduler Interaction Harness: Design and Review

Date: 2026-06-29

Background

The case3 ablation results show that the gpt-5.5 no-harness run found a runtime/scheduler direction:

enable-chunked-prefill=true
max-num-batched-tokens low to moderate
max-num-seqs moderate
block-size=16

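At that time, the harness mainly took two kinds of actions:

turning on enable-chunked-prefill as a standalone toggle;
raising max-num-batched-tokens monotonically.

This gap cannot be closed by "adding the 8192/32 pair to the candidate grid". That would hard-code the case3 answer into a larger candidate table, which is still rule-based overfitting.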

Design principles

The new design is not a fixed value set but a set of normalized control dimensions:

prefill_quantum_ratio = max-num-batched-tokens / prompt_tokens_p95
admission_pressure    = max-num-seqs / trace.max_concurrency
scheduler_mode        = enable-chunked-prefill

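The candidate generator therefore does not say "try 8192" directly. Instead it says:

If the workload has long-tail prefill with a TTFT/prefill bottleneck and the current prefill_quantum_ratio is too large, lower the prefill quantum along the log-ratio direction;
If the prefill quantum is far below the prompt scale, over-fragmentation may be causing overhead, so raise the prefill quantum along the log-ratio direction;
If admission/queueing is the bottleneck, adjust admission pressure only by a relative step;
Every concrete flag value is produced in the last step by mapping the normalized target to an engine flag and rounding to the engine granularity.

The current implementation uses the geometric midpoint as the trust-region step: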

target_mbt = sqrt(current_mbt * prompt_tokens_p95)

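This corresponds to taking half a step in log space from current_mbt toward prompt_tokens_p95. It is closer to scale-invariant than multiplying by a fixed 0.5/1.5: when the prompt scale grows, the next MBT grows too.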
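
A minimal sketch of the geometric-midpoint step, to make the scale sensitivity concrete. The function name and the rounding granularity of 256 tokens are assumptions for illustration; the source does not state the engine granularity.

```python
import math


def geometric_midpoint_mbt(current_mbt: int, prompt_tokens_p95: int,
                           granularity: int = 256) -> int:
    """Half a step in log space from current_mbt toward prompt_tokens_p95.

    The granularity default is an assumption, not a value from the harness.
    """
    if current_mbt <= 0 or prompt_tokens_p95 <= 0:
        raise ValueError("both inputs must be positive")
    # sqrt(a * b) == exp((log a + log b) / 2), i.e. the midpoint in log space.
    target = math.sqrt(current_mbt * prompt_tokens_p95)
    # The concrete flag value is produced last: round to the engine granularity.
    rounded = round(target / granularity) * granularity
    return max(granularity, int(rounded))


if __name__ == "__main__":
    # Same starting MBT, two prompt scales: the larger scale gives a larger step.
    print(geometric_midpoint_mbt(32768, 8192))   # 16384
    print(geometric_midpoint_mbt(32768, 16384))  # 23296
```

With a fixed multiplier of 0.5, both calls would return 16384 regardless of the prompt scale, which is the behaviour the normalized step is meant to avoid.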

Agent Loop

The current harness loop can be formalized as:

trial result
  -> observation extractor
  -> bottleneck classifier
  -> candidate family selector
  -> normalized candidate generator
  -> scoring / coverage ranking
  -> validator / no-repeat / stop guard
  -> next trial

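Each layer has a distinct responsibility:

The observation extractor only turns a trial result into comparable facts, such as request_rate_per_gpu, pass_rate, failure reasons, and TTFT/TPOT percentiles.
The bottleneck classifier assigns those facts to mechanism-level bottlenecks such as ttft_prefill, decode_tpot, and admission_or_queueing; it does not output configuration values directly.
The candidate family selector decides which system hypothesis to test, for example topology frontier, prefill scheduler, admission pressure, or GPU memory headroom.
The normalized candidate generator is the only layer that maps mechanism variables to concrete engine flags.
Scoring / coverage ranking handles ordering: a dimension that is uncovered but mechanistically relevant should rank ahead of fine-tuning along a known direction.
The validator uses the normalized full-config signature to prevent repeated tests, and uses a stop guard to avoid stopping too early while high-value falsification candidates remain.

The core of the harness is therefore not "writing a good LLM prompt" but decomposing black-box search into a white-box falsification loop with causal direction. The LLM may help generate or explain candidates, but every candidate must pass the harness's family, signature, scoring, and validator constraints.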
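
The no-repeat constraint in the validator layer can be sketched as follows. This is an illustration under assumptions: the names `full_config_signature` and `NoRepeatValidator`, and the normalization rules (case-insensitive booleans, numeric strings compared as numbers), are hypothetical and not taken from the harness source.

```python
from typing import Any, Hashable, Mapping, Set, Tuple


def _normalize(value: Any) -> Hashable:
    """Tag and canonicalize a flag value so equivalent spellings compare equal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return ("bool", value)
    if isinstance(value, (int, float)):
        return ("num", float(value))
    text = str(value).strip().lower()
    if text in ("true", "false"):
        return ("bool", text == "true")
    try:
        return ("num", float(text))
    except ValueError:
        return ("str", text)


def full_config_signature(base: Mapping[str, Any],
                          patch: Mapping[str, Any]) -> Tuple:
    """Signature of the effective config (base overlaid with patch)."""
    effective = {**base, **patch}
    # Keys are unique, so sorting never has to compare the tagged values.
    return tuple(sorted((k, _normalize(v)) for k, v in effective.items()))


class NoRepeatValidator:
    """Rejects a candidate whose effective full config was already tested."""

    def __init__(self) -> None:
        self._seen: Set[Tuple] = set()

    def admit(self, base: Mapping[str, Any], patch: Mapping[str, Any]) -> bool:
        sig = full_config_signature(base, patch)
        if sig in self._seen:
            return False
        self._seen.add(sig)
        return True
```

Because the signature is computed on the effective full config rather than on the patch, two differently spelled patches that produce the same engine configuration count as one trial.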

Implementation mapping

Code entry point:

src/aituner/harness.py::_runtime_candidate_actions
- Calls the new _prefill_scheduler_candidate_actions once the topology frontier is settled.
- Keeps the topology-before-runtime guard: the runtime family does not preempt an uncovered topology frontier.

New logic:

_prefill_scheduler_workload_applies
- Activates only for workloads that are not decode-only, have long-tail or moderate-tail prefill, and do not have high prefix reuse.
_next_prefill_quantum_step
- Uses current_mbt / prompt_scale to decide the direction.
- Takes a relative step via the geometric midpoint.
_next_admission_pressure_step
- Uses max-num-seqs / trace.max_concurrency as the normalized admission pressure.
- Raises when admission/queueing is the limit and admission pressure is too low.
- Lowers when TTFT/prefill is the limit and admission pressure is clearly above the trace concurrency scale.
_prefill_scheduler_candidate_actions
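- Emits the prefill-scheduler-interaction family.
- score_factors explicitly records the current/target prefill_quantum_ratio so later experiments can be explained.
- score_factors also records the current/target admission pressure ratio, so the explanation is not limited to MBT.
- When the scheduler dimension has not yet been covered by a materialized config, adds uncovered_scheduler_dimension_bonus so that, once topology is settled, this family ranks ahead of resource micro-tuning such as gpu-memory-utilization.
When this family has produced a valid candidate, the old standalone raise_mbt, enable_chunked_prefill, and raise_mbt_and_max_num_seqs act only as fallbacks and do not compete in the ranking as peer prefill runtime candidates.
gpu-memory-utilization keeps its small-step hill-climb, but continuing to climb must be supported by a request_rate_per_gpu improvement on the same topology; merely launching successfully or tying the incumbent no longer counts as success.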
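
The admission-pressure step described above can be sketched as a pure function. This is an assumption-laden illustration: the function name, the `low_ratio`/`high_ratio` thresholds, and the use of a geometric midpoint for max-num-seqs are hypothetical; the source only states that the ratio is normalized by trace.max_concurrency and that the step is relative.

```python
import math
from typing import Optional, Tuple


def next_admission_pressure_step(
    max_num_seqs: int,
    trace_max_concurrency: int,
    bottleneck: str,
    low_ratio: float = 0.5,   # hypothetical "too low" threshold
    high_ratio: float = 1.5,  # hypothetical "clearly above" threshold
) -> Optional[Tuple[str, int]]:
    """Return (action_id, target max-num-seqs), or None if no step applies."""
    ratio = max_num_seqs / trace_max_concurrency
    # Relative step: midpoint in log space between the current cap and the
    # trace concurrency scale (an assumed choice, mirroring the MBT step).
    midpoint = math.sqrt(max_num_seqs * trace_max_concurrency)
    if bottleneck == "admission_or_queueing" and ratio < low_ratio:
        return ("raise_admission_pressure_with_chunked_prefill",
                math.ceil(midpoint))
    if bottleneck == "ttft_prefill" and ratio > high_ratio:
        return ("lower_admission_pressure_with_chunked_prefill",
                max(1, math.floor(midpoint)))
    return None
```

For example, with max-num-seqs=8 and a trace concurrency of 64 under an admission bottleneck, the sketch proposes raising to 23; with max-num-seqs=256 under a TTFT/prefill bottleneck it proposes lowering to 128.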

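Why this is not a rule-based hack

Forbidden implementation patterns:

No references to case3 or to specific trace names, model names, or machine names;
No rules of the form if TP=2 and gmu=0.7 and mns=8 then MBT=8192;
No expanding the case3 finding into a fixed grid such as {4096,8192,12288,16384} x {16,32,64};
No bypassing the normalized full-config signature.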

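The current implementation satisfies the following:

Triggers come from the L-C-A profile, the bottleneck classifier, the topology frontier, and the tunable flags;
A proposal is a direction relative to the current incumbent, not a fixed answer;
Concrete values change with the prompt scale and the current config;
validator/no-repeat still uses the normalized effective full-config signature;
The runtime gate and the formal topology frontier share the same higher-TP frontier patch construction, which prevents the scheduler family from running ahead when DP is not at its base value;
Short-prompt, decode-only, and high-prefix-reuse workloads do not activate this family.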

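This is not a completeness proof, however. What can be claimed today are stricter engineering properties:

No reference to a specific case identity;
No known winner written into a candidate table;
Every concrete proposal can be traced back to a normalized mechanism variable;
Every trial can be interpreted as a falsification of one system hypothesis;
A failure leaves behind an auditable candidate sequence and score factors.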
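
The applicability condition for the family (not decode-only, long-tail or moderate-tail prefill, no high prefix reuse) can be expressed as a predicate over workload features only, with no case identity involved. The `WorkloadProfile` type and its field names below are hypothetical stand-ins for whatever the harness profile actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical workload summary; field names are assumptions."""
    decode_only: bool
    prefill_tail: str        # one of "short", "moderate", "long"
    high_prefix_reuse: bool


def prefill_scheduler_applies(profile: WorkloadProfile) -> bool:
    """True only when the prefill scheduler family is mechanistically relevant."""
    if profile.decode_only or profile.high_prefix_reuse:
        return False
    return profile.prefill_tail in ("moderate", "long")


# A small negative-applicability matrix: only the first row should activate.
MATRIX = [
    (WorkloadProfile(False, "long", False), True),
    (WorkloadProfile(False, "short", False), False),
    (WorkloadProfile(True, "long", False), False),
    (WorkloadProfile(False, "long", True), False),
]
```

The predicate takes no trace name, model name, or machine name as input, which is what keeps it on the allowed side of the list of forbidden patterns.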

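Review conclusions

Problems with the previous implementation

enable-chunked-prefill was a standalone toggle and could not express scheduler interaction.
Under a TTFT/prefill bottleneck, MBT was mostly raised monotonically, so the harness could not discover "lower the prefill quantum to reduce head-of-line (HoL) blocking".
Old tests asserted fixed values such as 16384, which tends to pull the harness narrative back toward a heuristic table.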

Effect of the current changes

Introduces prefill-scheduler-interaction as a new mechanistic family.
The candidate action id expresses the direction:
- lower_prefill_quantum_with_chunked_prefill
- raise_prefill_quantum_with_chunked_prefill
- seed_chunked_prefill_quantum
- adjust_admission_pressure_with_chunked_prefill
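Tests now verify normalized direction and scale sensitivity rather than fixed absolute values.

Risks that still need attention in the current implementation

_PREFILL_QUANTUM_HEAD_OF_LINE_RATIO=1.0 and _PREFILL_QUANTUM_FRAGMENTATION_RATIO=0.5 are still mechanism thresholds, not theorems. They must be validated with scaled-prompt and negative-workload experiments, not with case3 alone.
uncovered_scheduler_dimension_bonus is a coverage ranking policy. Its justification is "cover mechanism dimensions that have not been materialized before fine-tuning GMU", but the candidate sequence must demonstrate that it does not run ahead while the topology frontier is still uncovered.
block-size=16 is currently not part of this family. It must not be added as a fixed case3 answer; if it needs handling later, a separate allocator/layout family should be designed, derived from engine capability and memory block waste observations, rather than forced into the prefill scheduler family.
The current implementation still keeps the old standalone enable-chunked-prefill and raise_mbt paths as fallbacks. They must not compete in the ranking as peer prefill runtime candidates once prefill-scheduler-interaction has produced a valid candidate.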
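
To show how the two thresholds could drive the direction decision, here is a minimal sketch. The constant values come from the text above; how the real code compares against them (strict or non-strict, and whether the raise branch is also gated on the bottleneck) is an assumption, as is the function name.

```python
# Values as stated in the document; treated here as tunable mechanism thresholds.
HEAD_OF_LINE_RATIO = 1.0
FRAGMENTATION_RATIO = 0.5


def prefill_quantum_direction(current_mbt: int, prompt_tokens_p95: int,
                              bottleneck: str) -> str:
    """Return "lower", "raise", or "hold" for the prefill quantum."""
    ratio = current_mbt / prompt_tokens_p95  # prefill_quantum_ratio
    # Quantum larger than the prompt scale under a TTFT/prefill bottleneck:
    # a long prompt can occupy a whole batch, so shrink the quantum.
    if bottleneck == "ttft_prefill" and ratio > HEAD_OF_LINE_RATIO:
        return "lower"
    # Quantum far below the prompt scale: prompts are split into many chunks,
    # so per-chunk overhead may dominate; grow the quantum.
    if ratio < FRAGMENTATION_RATIO:
        return "raise"
    return "hold"
```

Scaled-prompt experiments would exercise this by holding current_mbt fixed and varying prompt_tokens_p95, which moves the same config across both thresholds.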

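2026-06-29: Fixes after independent review

The independent review identified three generalization risks that needed to be tightened immediately:

The old standalone MBT/chunked candidates could still make the overall harness behave like a heuristic table.
Admission pressure only had a raise direction and did not handle TTFT/prefill interference caused by an excessively high max-num-seqs.
The runtime gate's topology-settled check and the formal topology frontier were not fully consistent when DP is not at its base value.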

Corresponding fixes:

When prefill-scheduler-interaction has a valid candidate, the old standalone MBT/chunked/joint prefill-runtime candidates are demoted to fallbacks.
Admission pressure is now a normalized ratio and supports both directions: raise_admission_pressure_with_chunked_prefill and lower_admission_pressure_with_chunked_prefill.
_higher_tp_frontier_patch was factored out so that the runtime gate and _topology_frontier_status use the same higher-TP signature.
The GMU hill-climb is now measurement-gated: when a same-topology GMU trial does not improve request_rate_per_gpu, further climbing to a higher GMU is blocked, which avoids wasting consecutive trials.
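
The measurement gate on the GMU climb can be sketched as below. The function name, the 0.02 step, and the 0.95 ceiling are assumptions for illustration; only the rule itself (a tie or regression blocks further climbing) comes from the text.

```python
from typing import Optional, Sequence, Tuple


def next_gmu_candidate(
    baseline_rate: float,
    gmu_trials: Sequence[Tuple[float, float]],
    step: float = 0.02,     # assumed step size
    ceiling: float = 0.95,  # assumed upper bound
) -> Optional[float]:
    """Propose the next gpu-memory-utilization value, or None if gated.

    gmu_trials holds (gmu, request_rate_per_gpu) pairs for the same topology,
    in trial order. Climbing continues only if the latest GMU trial strictly
    improved on the best rate seen before it.
    """
    if not gmu_trials:
        return None  # seeding the first GMU trial is out of scope here
    best = baseline_rate
    last_improved = False
    for _, rate in gmu_trials:
        last_improved = rate > best  # a tie does not count as success
        best = max(best, rate)
    if not last_improved:
        return None
    candidate = round(gmu_trials[-1][0] + step, 2)
    return candidate if candidate <= ceiling else None
```

Under this rule a successful launch alone is never enough: the gate looks only at measured request_rate_per_gpu.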

2026-06-29: Remote review feedback

After restarting the case3 bad-runtime run on dash1 with 36c301c, the candidate set for trial-0003 already contained prefill-scheduler-interaction:

action_id = seed_chunked_prefill_quantum
patch     = enable-chunked-prefill=true, max-num-batched-tokens=8192
ratio     = target prefill_quantum_ratio ~= 1.05

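However, the initial scoring still ranked raise_gpu_memory_utilization first. This shows that the family was wired in correctly but the ranking was still biased toward resource micro-tuning. The implementation then added uncovered_scheduler_dimension_bonus: when the topology frontier is covered and the prefill scheduler dimension has not yet been tested by a materialized config, the scheduler hypothesis is tested first, which avoids repeating the old harness's failure trajectory of climbing GMU first.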
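
A minimal sketch of how such a coverage bonus could change the ranking. The `Candidate` type, the `rank` function, the bonus magnitude, and the base scores are all illustrative assumptions; they are not the harness's actual scoring values.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

SCHEDULER_FAMILY = "prefill-scheduler-interaction"


@dataclass(frozen=True)
class Candidate:
    action_id: str
    family: str
    base_score: float  # illustrative number, not a real harness score


def rank(candidates: List[Candidate], topology_settled: bool,
         covered_families: FrozenSet[str],
         bonus: float = 0.10) -> List[Tuple[float, str]]:
    """Return (score, action_id) pairs sorted best-first."""
    scored = []
    for c in candidates:
        score = c.base_score
        # The bonus applies only after topology is settled and only while the
        # scheduler dimension has not been covered by a materialized config.
        if (topology_settled and c.family == SCHEDULER_FAMILY
                and c.family not in covered_families):
            score += bonus
        scored.append((score, c.action_id))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


CANDIDATES = [
    Candidate("seed_chunked_prefill_quantum", SCHEDULER_FAMILY, 0.60),
    Candidate("raise_gpu_memory_utilization", "gpu-memory-headroom", 0.64),
]
```

With topology settled and the scheduler family uncovered, the seed ranks first; once the family is in `covered_families`, or while topology is still open, the bonus disappears and the ordering reverts to the base scores.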

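Unit validation

New and updated tests cover:

Under long-tail TTFT, an excessively large prefill_quantum_ratio is lowered;
When the prompt length scale grows, the next MBT target grows too;
Once the topology frontier is covered, an uncovered scheduler dimension ranks ahead of GMU fine-tuning;
A short-prompt workload does not activate the prefill scheduler family;
The existing prefill stop guard still refuses to stop while a high-value candidate remains;
The normalized full-config no-repeat semantics are unchanged.

Full local test run: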

PYTHONPATH=src python3 -m unittest discover -s tests
156 tests OK

Focused local regression:

PYTHONPATH=src python3 -m unittest \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_coverage_precedes_gmu_microtune \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_admission_pressure_only_uses_normalized_seq_cap \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_lowers_excess_admission_pressure \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_negative_applicability_matrix \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_does_not_preempt_open_topology_frontier \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_lowers_quantum_by_normalized_ratio \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_quantum_step_scales_with_prompt_length \
  tests.test_core_flow.CoreFlowTests.test_prefill_scheduler_not_active_for_short_prompt_workload
8 tests OK

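Validation still needed on real hardware

The next experiments should not only check whether case3 reproduces; they should attack the boundaries of this family:

case3 bad runtime start:
- Goal: verify whether LLM+harness and no-LLM harness can find the chunked-prefill scheduler direction from a bad runtime start.
scaled prompt case:
- Goal: verify that the proposal is not pinned to one MBT and instead changes with prompt_tokens_p95.
short/decode negative case:
- Goal: verify that this family does not trigger falsely on workloads where it does not apply.
topology frontier case:
- Goal: verify that the runtime scheduler family does not run ahead while topology is uncovered.

Core metrics:

best request_rate_per_gpu;
time-to-best / trial-to-target;
candidate family sequence;
whether the direction of prefill_quantum_ratio_current -> target is consistent with the bottleneck evidence;
whether any normalized full-config signature is repeated.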

Current status on dash1 hardware

Commit bfd8579 is currently being validated:

run     = .aituner/badstart-prefill-scheduler-bfd8579-20260628T173102Z
case    = badstart-expanded-9accf25-20260626T184911Z-runtime_tp2_dp1_gmu070_mns8
session = aituner-prefill-scheduler-case3-bfd8579

As of about 2026-06-29 01:53 UTC+8:

baseline trial-0001 has completed, with best request_rate_per_gpu of about 2.025;
the trial-0002 TP4 topology frontier probe has completed, with best request_rate_per_gpu of about 2.000, which does not beat the baseline;
the top action of candidate-set-0002 is the topology frontier, consistent with topology-before-runtime;
the top action of candidate-set-0003 has changed to seed_chunked_prefill_quantum:

score    = 0.69
patch    = enable-chunked-prefill=true, max-num-batched-tokens=8192
ratio    = prefill_quantum_ratio_target ~= 1.0536
baseline = raise_gpu_memory_utilization score 0.64

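This shows that uncovered_scheduler_dimension_bonus achieved its design goal: once the topology frontier is covered, a scheduler dimension that has not been materialized is tested before GMU fine-tuning.

trial-0003 has completed with best request_rate_per_gpu of about 2.025, tying the baseline with no performance gain. This result does not support a claim that the scheduler seed is a winner, but it provides useful falsification evidence: coverage priority changed the exploration order, and the specific hypothesis (chunked prefill with MBT ~= p95) was tested and showed no improvement.

The system then moved on to candidate-set-0004 and started testing gpu-memory-utilization=0.9. trial-0004 also finished at about 2.025, not beating the baseline; trial-0005 with gpu-memory-utilization=0.92 still tied the baseline, and the old run then went on to queue gpu-memory-utilization=0.94. This exposes the GMU hill-climb problem in the old implementation: it treated a successful launch as a successful climb without requiring request_rate_per_gpu to improve. The latest local implementation has been fixed to use a measurement-gated GMU climb; the next round should rerun with the new commit to verify whether, after a GMU tie, the harness turns to admission pressure, topology/DP, or another family.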

Hardened Run Feedback

Restarted on dash1 with commit 6b25d56:

run     = .aituner/badstart-prefill-hardened-6b25d56-20260628T180104Z
case    = badstart-expanded-9accf25-20260626T184911Z-runtime_tp2_dp1_gmu070_mns8
session = aituner-prefill-hardened-6b25d56

As of about 2026-06-29 02:27 UTC+8, the trial sequence within this run is:

| trial | patch | request_rate_per_gpu | observation |
| --- | --- | --- | --- |
| 0001 | baseline bad-start | 2.983 | Incumbent within this run; clearly higher than the old run's baseline, which shows that numbers from different runs cannot be mixed directly |
| 0002 | `tensor-parallel-size=4` | 1.629 | topology TP4 falsified |
| 0003 | `enable-chunked-prefill=true, max-num-batched-tokens=8192` | 2.025 | standalone scheduler seed falsified |
| 0004 | `gpu-memory-utilization=0.9` | 3.258 | GMU=0.9 is the current best and reaches the known no-harness level |
| 0005 | GMU=0.9 + scheduler seed | 2.025 | combination of GMU and scheduler seed falsified |
| 0006 | `gpu-memory-utilization=0.92` | 3.258 | ties GMU=0.9, no further improvement |
| 0007 | `tensor-parallel-size=4, data-parallel-size=2` | 1.000 | DP/topology probe falsified |

candidate-set-0007 did not go on to propose gpu-memory-utilization=0.94; it turned to the tensor-parallel-size=4, data-parallel-size=2 topology probe instead. This validates the measurement-gated GMU climb: when GMU=0.92 only ties, the harness no longer climbs blindly to a higher GMU. After TP4/DP2 was falsified, candidate-set-0008 went on to test tensor-parallel-size=8.
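
Two of the core metrics (best request_rate_per_gpu and trial-to-best) can be read off the trial sequence above with a few lines of code. The helper name is hypothetical; the numbers are the ones reported for this run.

```python
from typing import List, Tuple

# (trial id, request_rate_per_gpu) as reported for the hardened run.
TRIALS: List[Tuple[str, float]] = [
    ("0001", 2.983),
    ("0002", 1.629),
    ("0003", 2.025),
    ("0004", 3.258),
    ("0005", 2.025),
    ("0006", 3.258),
    ("0007", 1.000),
]


def best_and_trial_to_best(trials: List[Tuple[str, float]]) -> Tuple[str, float, int]:
    """Return (trial id of first best, best rate, number of trials to reach it)."""
    best_id, best_rate, best_index = "", float("-inf"), 0
    for index, (trial_id, rate) in enumerate(trials, start=1):
        # Strict comparison: a later tie does not replace the incumbent.
        if rate > best_rate:
            best_id, best_rate, best_index = trial_id, rate, index
    return best_id, best_rate, best_index


if __name__ == "__main__":
    trial_id, rate, n = best_and_trial_to_best(TRIALS)
    baseline = TRIALS[0][1]
    print(trial_id, rate, n)                   # 0004 3.258 4
    print(f"{(rate / baseline - 1) * 100:.1f}%")  # gain over same-run baseline
```

The gain is computed against the baseline of the same run only, in line with the observation that numbers from different runs cannot be mixed directly.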

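The most important mechanism-level conclusions so far:

the priority and no-repeat behavior of the scheduler seed both work as designed;
the scheduler seed is not a standalone winner in this case and had to be falsified by measurement;
GMU=0.9 is the mechanism dimension that is actually effective at present;
subsequent GMU climbing has been changed from launch-gated to improvement-gated;
next, check whether topology/DP, MNS, or an allocator/layout family can push beyond 3.258.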
