aituner/docs/harness-ablation/no-llm-harness-mechanism-20260625.md

# No-LLM Harness Mechanism - 2026-06-25

Status note, 2026-06-26:

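This document records the no-LLM mechanism of the current rule-based prototype harness and
the experimental behavior observed so far. It shows that AITuner can run as a closed loop
without an LLM endpoint, but it does not establish the harness's completeness, general
robustness, or final system contribution. The final target design has since changed to a
declarative intervention grammar plus a coverage-relative validator; see
[`declarative-intervention-harness-design-20260626.md`](declarative-intervention-harness-design-20260626.md).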

This document answers one core question: if no LLM is called, why can the harness still find a configuration automatically?

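The conclusion up front: no-LLM mode does not mean "no planner". The current harness is
itself a deterministic planner. In AITuner the LLM is only a replaceable proposal backend;
when the harness can derive the next step from observations, bottleneck attribution,
candidate families, and the stop validator, the tuning loop uses the harness proposal
directly and never queries the LLM.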

## Where the LLM sits in the tune loop

`study tune` makes its decision each round in this order:

```text
state + study spec + workload/probe results
        |
        v
build_harness_context(...)
        |
        +--> build_harness_stop_proposal(context)
        |       if true: write harness-stop and exit
        |
        +--> build_harness_guided_proposal(context)
        |       if true: run this deterministic proposal
        |
        +--> call_llm_for_proposal(...)
                only if no harness stop/proposal exists
```

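Therefore, in a no-LLM run with `study.llm.endpoint = null`, as long as the harness can
produce a deterministic proposal or a deterministic stop every round, the whole experiment
runs without ever calling an LLM. If the harness can neither propose nor stop and there is
no LLM endpoint, AITuner raises an error instead of silently degrading to random search.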
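The decision order above can be sketched in a few lines of Python. This is a minimal illustration, not the AITuner implementation: `decide_next_step`, `NoProposalSourceError`, and the callback signatures are hypothetical names, and the three callbacks stand in for `build_harness_stop_proposal`, `build_harness_guided_proposal`, and `call_llm_for_proposal`.

```python
from typing import Callable, Optional


class NoProposalSourceError(RuntimeError):
    """Raised when neither the harness nor an LLM can supply a decision."""


def decide_next_step(
    context: dict,
    stop_fn: Callable[[dict], Optional[dict]],
    propose_fn: Callable[[dict], Optional[dict]],
    llm_fn: Optional[Callable[[dict], dict]] = None,
) -> tuple:
    """Return (source, payload) following the stop -> proposal -> LLM order."""
    stop = stop_fn(context)
    if stop is not None:
        return "harness-stop", stop
    proposal = propose_fn(context)
    if proposal is not None:
        return "harness-proposal", proposal
    if llm_fn is not None:
        # Only reached when the harness produced neither a stop nor a proposal.
        return "llm-proposal", llm_fn(context)
    # No endpoint configured: fail loudly rather than fall back to random search.
    raise NoProposalSourceError("no harness stop/proposal and no LLM endpoint")
```

With `llm_fn=None`, the function either returns a harness decision or raises, which mirrors the "fail loudly" behavior described above.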

The current Qwen30B stopfix run is exactly this kind of complete closed loop:

```text
.aituner/qwen30b-harness-only-medium-stopfix-dash1-20260624T144701Z/
```

It has no LLM endpoint, yet it completed 9 measured trials, and the validator finally wrote
`harness_stop`.

## What the harness does is not prompt engineering

What the harness does can be formalized as:

```text
H = (O, B, G, S, V)

O: Observation schema
   Converts workload, trial probes, SLO failures, launch failures, and topology
   constraints into structured state.

B: Bottleneck attribution
   Attributes SLO violations to a serving regime, e.g. ttft_prefill, decode_tpot,
   admission_or_queueing, launch_or_memory.

G: Intervention grammar
   Organizes raw knobs into semantically meaningful candidate families, e.g.
   topology, batching, sequence admission, KV memory headroom.

S: Scoring policy
   Scores candidate interventions and picks the next step that is most informative
   and most likely to improve SLO-constrained req/s/GPU.

V: Validator / stop policy
   Blocks illegal, duplicate, known-failing, or pointless proposals; allows stop
   only after the remaining high-value candidates have been measured.
```

An LLM can read this structured information and generate a proposal, but without an LLM
`H` can generate proposals on its own. In other words, the core idea is to turn:

```text
raw config vector search
```

into:

```text
mechanism-guided intervention search
```

This is why it works without an LLM.

## Agent loop flowchart

```mermaid
flowchart TD
  A[Baseline or latest measured trial] --> B[Load probe history and trial result]
  B --> C[Build workload L-C-A profile]
  B --> D[Build TrialProfile]
  C --> E[Rank bottleneck hypotheses]
  D --> E
  E --> F[Generate legal candidate actions]
  F --> G[Score candidates]
  G --> H{High-value candidate?}
  H -- yes --> I[Emit harness-proposal]
  I --> J[Run real vLLM trial over search range]
  J --> B
  H -- no --> K{Validator stop allowed?}
  K -- yes --> L[Emit harness-stop]
  K -- no --> M{LLM endpoint exists?}
  M -- yes --> N[Ask LLM backend]
  M -- no --> O[Fail loudly: no proposal source]
```

## Observation: what the harness sees

Each round, the harness does not guess from natural-language logs; it reads structured state:

- `StudySpec`
  - hardware: GPU count, GPU model;
  - engine: base flags/envs, tunable flags/envs, topology constraints;
  - trace: request mode, window id, input-length filter, output-length override;
  - SLO: TTFT/TPOT rule, target pass rate;
  - search: load range, tolerance, probe budget.
- `window_summary` / `WorkloadProfile`
  - L: request length distribution, tail ratio;
  - C: prefix/cache reuse;
  - A: arrival rate, burstiness, interarrival variation.
- Recent trials
  - config patch;
  - best feasible request rate;
  - request_rate_per_gpu;
  - pass rate;
  - probe history;
  - latency p50/p95/p99;
  - SLO failure reason counts;
  - launch/runtime failure stage.

This data is condensed into `recent_trial_diagnostics` and `trial_profiles`; later steps use
only these structured fields.

## Bottleneck classifier: how the direction is chosen

The harness maintains a set of ranked bottleneck hypotheses:

```text
ttft_prefill
decode_tpot
admission_or_queueing
launch_or_memory
```

Its input is not a single threshold but several kinds of evidence:

- workload default: a long prompt tail leans toward `ttft_prefill`;
- request mode: decode-only with a TPOT SLO leans toward `decode_tpot`;
- probe sequence: the active bottleneck of recent trials is weighted more heavily than that of older trials;
- failed reason counts:
  - `ttft_ms>...` supports `ttft_prefill`;
  - `tpot_ms>...` supports `decode_tpot`;
  - `arrival_lag_s>` / `probe_elapsed_s>` supports `admission_or_queueing`;
- launch failure / OOM: supports `launch_or_memory`.

In the code this is not a single hard-coded label but a ranked list with confidence. For
example, if a recent probe clearly shows a TPOT failure, the `decode_tpot` score goes up; if
the workload also has a long prompt tail, `ttft_prefill` is kept as a secondary hypothesis.
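One way to picture a ranked list with confidence is the sketch below. It is an illustration under stated assumptions, not the harness code: the function name `rank_bottlenecks`, the geometric `recency_decay` weighting, the trial dict layout, and the normalization to confidences that sum to 1 are all hypothetical. Only the reason-prefix to hypothesis mapping comes from the evidence list above.

```python
from collections import defaultdict

# Failure-reason prefixes and the hypothesis each one supports (from the list above).
REASON_TO_HYPOTHESIS = {
    "ttft_ms>": "ttft_prefill",
    "tpot_ms>": "decode_tpot",
    "arrival_lag_s>": "admission_or_queueing",
    "probe_elapsed_s>": "admission_or_queueing",
}


def rank_bottlenecks(trials, workload_prior, recency_decay=0.5):
    """Rank bottleneck hypotheses from trial evidence.

    trials: oldest-to-newest list of dicts with an optional 'failed_reasons'
    mapping {reason: count} and an optional 'launch_failed' flag (assumed layout).
    workload_prior: {hypothesis: weight} derived from the workload profile.
    Returns [(hypothesis, confidence)] sorted from most to least likely.
    """
    scores = defaultdict(float, workload_prior)
    for age, trial in enumerate(reversed(trials)):
        weight = recency_decay ** age  # the newest trial has age 0 and weight 1
        for reason, count in trial.get("failed_reasons", {}).items():
            for prefix, hypothesis in REASON_TO_HYPOTHESIS.items():
                if reason.startswith(prefix):
                    scores[hypothesis] += weight * count
        if trial.get("launch_failed"):
            scores["launch_or_memory"] += weight
    total = sum(scores.values()) or 1.0
    ranked = ((h, s / total) for h, s in scores.items())
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

A recent trial dominated by `tpot_ms>...` failures puts `decode_tpot` first, while a long-prompt workload prior keeps `ttft_prefill` on the list as a secondary hypothesis.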

## Candidate family: how raw knobs become interventions

The harness does not blindly sample across all tunable flags. It first groups knobs into
intervention families that have system-level meaning:

| Family | Representative knobs | Mechanism |
| --- | --- | --- |
| topology | `tensor-parallel-size`, `data-parallel-size`, EP knobs | Changes per-request parallelism, replica count, and the communication/efficiency tradeoff |
| batching | `max-num-batched-tokens`, `enable-chunked-prefill` | Changes prefill/decode batching and HoL blocking |
| admission | `max-num-seqs` | Changes concurrent admission and the TPOT/TTFT tail |
| KV memory | `gpu-memory-utilization` | Changes KV cache blocks and sustainable concurrency |
| failure memory | failed signatures | Blocks repeats of known launch/runtime failure directions |

The key point: candidates come from the tunable schema and topology constraints of the
current `StudySpec`. For example, topology candidates enumerate only legal TP/DP/EP
combinations; without direct evidence for EP, generic topology search does not introduce EP
on its own.
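A rough sketch of enumerating only legal topology candidates follows. The legality rule used here (`tp * dp <= num_gpus`) is an assumption chosen for illustration; the real topology constraints in `StudySpec` may be stricter. The function name, the `(tp, dp, ep_enabled)` tuple layout, and the `allow_ep` switch are hypothetical.

```python
from itertools import product


def legal_topologies(num_gpus, tp_options, dp_options, tested=frozenset(), allow_ep=False):
    """Enumerate untested (tp, dp, ep_enabled) tuples that fit on num_gpus.

    Assumption: a topology is legal when tp * dp <= num_gpus.
    EP is only considered when allow_ep is set, i.e. when there is direct
    evidence for it; generic search leaves EP off.
    """
    ep_choices = (False, True) if allow_ep else (False,)
    candidates = []
    for tp, dp, ep in product(tp_options, dp_options, ep_choices):
        if tp * dp > num_gpus:
            continue  # does not fit on the available GPUs
        if (tp, dp, ep) in tested:
            continue  # already-measured signatures are not proposed again
        candidates.append((tp, dp, ep))
    return candidates
```

For 8 GPUs with `tp_options=[1, 2, 4, 8]` and `dp_options=[1, 2]`, the combination `TP=8, DP=2` is dropped because it would need 16 GPUs.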

## Scoring: why topology comes first, then gmu

A candidate action is scored roughly as:

```text
score = expected_bottleneck_relief * bottleneck_confidence
      + information_gain
      + launch_safety
      - regression_risk
```

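`experiment_plan.next_action` then selects the highest-scoring candidate. If the score exceeds
the threshold, the harness generates a proposal directly; otherwise control passes to the
stop validator or the LLM fallback.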
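The scoring rule and the threshold check can be written out as a small sketch. The `Candidate` dataclass, the `next_action` helper, and the numeric values in the usage example are hypothetical; only the structure of the formula is taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    expected_bottleneck_relief: float
    bottleneck_confidence: float
    information_gain: float
    launch_safety: float
    regression_risk: float

    @property
    def score(self) -> float:
        # Same structure as the rough formula above.
        return (
            self.expected_bottleneck_relief * self.bottleneck_confidence
            + self.information_gain
            + self.launch_safety
            - self.regression_risk
        )


def next_action(candidates, threshold):
    """Return the best candidate if its score exceeds the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best if best.score > threshold else None


# Illustrative numbers only; they are not measured harness weights.
topology = Candidate("tp2", 0.8, 0.5, 0.3, 0.2, 0.1)
runtime = Candidate("gmu+0.02", 0.2, 0.5, 0.1, 0.2, 0.1)
chosen = next_action([topology, runtime], threshold=0.5)
```

Returning `None` is what hands control to the stop validator or the LLM fallback in this sketch.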

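This scoring reflects several system principles:

1. Topology is the first-order serving decision.
   While the TP frontier has not been fully measured, runtime fine-tuning such as
   `gpu-memory-utilization` or `max-num-seqs` does not jump ahead of topology.

2. Topology is not "bigger is better".
   Both scoring and the final winner use `request_rate_per_gpu`, not raw request rate. TP4
   may have higher total throughput, but if per-GPU efficiency drops once more GPUs are
   used, it does not become the incumbent.

3. Runtime tuning must be anchored on the incumbent topology.
   Once topology has been validated, runtime proposals preserve the current best topology
   and tune only `gpu-memory-utilization`, `max-num-seqs`, and `max-num-batched-tokens` on
   top of it.

4. Measurement decides the final answer.
   A candidate is only a hypothesis; whether it is accepted is decided by the
   SLO-constrained `request_rate_per_gpu` of a real trial.

5. Bad-start recovery must bracket first, then fine-tune.
   If a no-LLM run starts from a very high TP, and a higher TP frontier at the same DP
   either does not exist or has already been measured, the harness first validates the
   adjacent lower TP instead of treating the current high TP as a converged topology. This
   keeps a bad starting point such as `TP=8` from going straight into
   `gpu-memory-utilization` fine-tuning.

6. A pathological runtime starting point must jump back into the normal operating range.
   The usual strategy for `gpu-memory-utilization` is a small-step hill-climb on the settled
   topology; but if the initial value is clearly below the normal operating range, e.g.
   `0.5`, the harness first jumps to the nominal floor `0.9`, then validates toward the safe
   ceiling `0.97` in steps of `0.02`.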
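Principle 6 translates into a very small step function. The sketch below uses the floor, step, and ceiling values quoted above; the function name `next_gmu` and the convention of returning `None` once the ceiling has been reached are assumptions made for illustration.

```python
def next_gmu(current, floor=0.90, step=0.02, ceiling=0.97):
    """Return the next gpu-memory-utilization value to try, or None at the ceiling."""
    if current < floor:
        return floor  # pathological start: jump straight to the nominal floor
    if current >= ceiling:
        return None  # nothing left to climb
    # Regular small-step hill-climb, clipped at the safe ceiling.
    return round(min(current + step, ceiling), 2)


# Walk the whole path from a bad starting value.
path, value = [], 0.5
while (value := next_gmu(value)) is not None:
    path.append(value)
print(path)  # [0.9, 0.92, 0.94, 0.96, 0.97]
```

Starting from `0.5`, the path reaches the ceiling in five steps, whereas a plain `0.02` climb from `0.5` would begin at `0.52`.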

## Validator stop: why it does not stop too early

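A harness stop is not "stop once a decent config is found". The current stop validator
includes several conditions:

- `search_high_saturated_by_incumbent`
  - the incumbent's highest feasible probe is already close to the configured search high;
  - this means the current measurement range is saturated.
- `topology_frontier_requires_probe`
  - if the active bottleneck still calls for a higher TP frontier that has not been measured, stop is forbidden.
- `experiment_plan_has_high_value_candidate`
  - if high-scoring candidates remain, stop is forbidden.
- `post_incumbent_validation_exhausted`
  - a strong incumbent must be followed by at least some post-incumbent validation;
  - validation covers the topology/runtime families or reaches a sufficient count;
  - no validation trial beats the incumbent;
  - only then is a clean stop allowed.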

So the validator acts as a fail-safe:

```text
a wrong proposal wastes at most one trial;
a wrong stop ends the search, so it must be authorized by the deterministic validator.
```
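The stop conditions can be combined into a single gate, sketched below. How the real validator orders and combines these conditions is not specified here, so the ordering (blocking conditions first), the reading of "covers the topology/runtime families" as covering both, the `state` dict layout, and the fallback reason `post_incumbent_validation_pending` are all assumptions.

```python
def stop_allowed(state):
    """Return (allowed, reason) for a hypothetical dict of validator facts."""
    # Blocking conditions are checked first: either one forbids a stop.
    if state["untested_higher_tp_needed"]:
        return False, "topology_frontier_requires_probe"
    if state["has_high_value_candidate"]:
        return False, "experiment_plan_has_high_value_candidate"
    # Allowing conditions.
    if state["search_high_saturated"]:
        return True, "search_high_saturated_by_incumbent"
    covered = {"topology", "runtime"} <= set(state["validation_families"])
    enough = state["validation_count"] >= state["min_validations"]
    if (covered or enough) and not state["validation_beat_incumbent"]:
        return True, "post_incumbent_validation_exhausted"
    return False, "post_incumbent_validation_pending"
```

The asymmetry described above shows up in the structure: a stop is returned only when no blocking fact is present and at least one allowing condition holds.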

## What actually happened in the Qwen30B no-LLM run

Run:

```text
qwen30b-harness-only-medium-stopfix-dash1-20260624T144701Z
```

Settings:

- Model: `Qwen/Qwen3-30B-A3B`
- Engine: community vLLM 0.20
- Hardware: 8x H20, TP/DP/EP frontier allowed
- Trace: chat 0-8k, output 128, replay time scale 0.1
- SLO: target pass rate 0.95, TTFT step rule, TPOT 50ms
- LLM endpoint: `null`

Real trial path:

| Trial | Source | Config patch | req/s/GPU | pass rate | Harness explanation |
| --- | --- | --- | ---: | ---: | --- |
| 0001 | baseline | `{}` | 2.2000 | 1.0000 | Establishes the baseline and probe evidence |
| 0002 | harness | `TP=2` | 3.2583 | 1.0000 | Under latency/SLO pressure, measures the adjacent TP first |
| 0003 | harness | `TP=4` | 2.0917 | 1.0000 | Validates the higher TP frontier; raw total throughput is high but per-GPU is low |
| 0004 | harness | `TP=2, gmu=0.92` | 3.2583 | 1.0000 | Topology has settled; starts the KV headroom climb on the incumbent topology |
| 0005 | harness | `TP=2, gmu=0.94` | 3.2583 | 1.0000 | Continues the small-step hill-climb; no improvement, no failure |
| 0006 | harness | `TP=2, gmu=0.96` | 3.3333 | 1.0000 | KV headroom yields a higher feasible frontier |
| 0007 | harness | `TP=2, gmu=0.97` | 3.4333 | 1.0000 | Reaches the safe ceiling and becomes the incumbent |
| 0008 | harness | `TP=4, DP=2` | 1.0458 | 1.0000 | Post-incumbent topology validation; does not beat the incumbent |
| 0009 | harness | `TP=8` | 1.0458 | 1.0000 | Continues frontier validation; does not beat the incumbent |
| 0010 | harness stop | stop | - | - | validator: `post_incumbent_validation_exhausted` |

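No external LLM made any decision in this process. Every proposal came from the harness:

1. the baseline observes the current engine's feasible frontier under the SLO;
2. the bottleneck/mechanism model judges topology to be the first-order intervention;
3. TP2 is measured and accepted, because per-GPU rises from 2.2 to 3.2583;
4. TP4 is measured and rejected as incumbent, because per-GPU falls to 2.0917;
5. once the topology frontier settles, `gpu-memory-utilization` is raised in small steps on TP2;
6. `gmu=0.97` reaches 3.4333;
7. nearby topologies are measured again, confirming that none is better;
8. the validator authorizes stop.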

## Why this is not hard-coded for Qwen30B

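The path looks as if "the harness knows the answer is TP2+gmu0.97", but that is not how the code mechanism is written.

What is not hard-coded:

- the model name is not hard-coded;
- Qwen30B is not hard-coded;
- `TP=2` as the final answer is not hard-coded;
- `gmu=0.97` being best is not hard-coded;
- real measurement is not skipped;
- TP4/TP8 are not ruled out up front; they are actually run and compared.

The domain knowledge that is actually written into the harness is:

- TP/DP/EP form the topology family and must satisfy the topology constraints;
- topology is usually the first-order serving intervention and must be validated before runtime fine-tuning;
- raw throughput is not the objective; comparisons across topologies use `request_rate_per_gpu`;
- `gpu-memory-utilization` is KV memory headroom fine-tuning and should only be hill-climbed in small steps on the incumbent topology;
- launch failures and tested signatures are hard negative evidence;
- stop must be authorized by the validator; the proposer cannot stop on its own say-so.

These are system-mechanism constraints, not a case-specific prompt.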

## How it differs from BO / raw heuristics

The search space of ordinary BO or a raw heuristic is typically:

```text
config = {tp, dp, ep, gmu, max_num_seqs, max_num_batched_tokens, ...}
score = measured req/s/GPU
```

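This causes several problems:

- it does not know which knobs belong to the topology family and which to the runtime family;
- it may waste many trials tuning runtime before the TP frontier has been measured;
- it may repeat known launch failures;
- it may mistake a config with high raw throughput but poor GPU efficiency for a promising direction;
- it struggles to explain "which bottleneck hypothesis is this trial trying to falsify".

The harness-shaped search space is: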

```text
state -> bottleneck hypothesis -> legal intervention family -> measured verdict
```

BO, bandits, LLMs, and deterministic heuristics can therefore all be plugged in behind the
harness. What they optimize is not a raw knob vector but an intervention graph with serving
semantics.

This is also the core of our new framing:

```text
black-box optimization
  -> grey-box / mechanism-guided experimental optimization
```

## Evidence still to be added

The no-LLM Qwen30B run shows that a deterministic harness can close the loop end to end, but the paper still needs:

1. Planner-agnostic ablation
   - `raw BO` vs `harness-guided BO`;
   - `raw heuristic` vs `harness deterministic policy`;
   - show that the gain comes from the harness substrate, not from a particular LLM.

2. Mechanism ablation
   - no attribution;
   - shuffled attribution;
   - no topology-first;
   - no intervention grammar;
   - no validator/failure memory.

3. Near-optimum evidence
   - run a local grid on 1-2 cases;
   - show that the harness path finds a near-optimal region, not just a feasible config.

4. Cross-case robustness
   - add a decode-heavy or long-prefill case;
   - verify that the candidate family switches sensibly under different workloads/SLOs.

5. Bad-start recovery
   - start from an untrusted initial configuration, e.g. `TP=8, max-num-seqs=8, gmu=0.5`;
   - show that the harness does not only work from a base config that is "already fairly reasonable";
   - observe whether it first recovers topology, then runtime headroom, and finally returns
     to the same near-optimal region.

## Bad-start recovery audit - 2026-06-26

The question raised by the user is: what if we start not from a trusted base config but from
an adversarial or unreasonable one, for example:

```text
TP=8, DP=1, max-num-seqs=8, gpu-memory-utilization=0.5
```

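Can the no-LLM harness still find the right direction automatically?

The current conclusion has to be stated in parts:

1. **The old planner cannot claim recovery from an arbitrary bad start.**
   A local synthetic audit shows that the old logic mistakes `TP=8` for a converged topology
   frontier and sets the next proposal to `gpu-memory-utilization=0.52`. That is a very slow
   small-step climb on a bad topology and a bad runtime, and it cannot serve as robustness
   evidence.

2. **The planner mechanism has been patched.**
   The current harness adds two no-LLM deterministic recovery rules:
   - `bad_start_topology_bracket`: when the current anchor is at a high TP and no unmeasured
     higher TP frontier remains, measure the adjacent lower TP first, e.g. `TP=8 -> TP=4`;
   - `gmu_nominal_floor`: when `gpu-memory-utilization < 0.9` on the settled topology, jump
     to `0.9` first, then do the usual `0.92/0.94/.../0.97` hill-climb.

3. **Local regression tests have been added, but there is no real-hardware proof yet.**
   Planner tests that pass:
   - `test_harness_brackets_down_from_bad_high_tp_start_before_runtime_tuning`
   - `test_harness_jumps_low_gpu_mem_util_to_nominal_floor_after_topology_settles`
   - plus the existing topology-first / gmu-climb regression tests.

So the current status is: the planner side already gives the right direction; at paper
level, a real-hardware bad-start recovery run is still needed to confirm stable convergence
under real vLLM measurements.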
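The `bad_start_topology_bracket` rule can be approximated by the sketch below. It is a simplified reading of the rule, not the planner code: the function name and arguments are hypothetical, "high TP" is not given a separate threshold here, and the rule is treated as inapplicable once the adjacent lower TP has already been measured.

```python
def bracket_down_tp(current_tp, tp_options, tested_tps):
    """Return the adjacent lower TP to probe, or None if the rule does not apply."""
    options = sorted(tp_options)
    # If a higher TP is still unmeasured, the upward frontier takes priority.
    if any(tp > current_tp and tp not in tested_tps for tp in options):
        return None
    lower = [tp for tp in options if tp < current_tp]
    if not lower:
        return None  # already at the smallest legal TP
    adjacent = max(lower)
    # Do not re-propose a signature that has already been measured.
    return None if adjacent in tested_tps else adjacent
```

With `tp_options=[1, 2, 4, 8]` and only the `TP=8` baseline measured, the sketch proposes `TP=4`, matching the `TP=8 -> TP=4` example above.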

## Real-hardware experiments in preparation

The goal is not to show again that the default starting point works, but to show:

```text
same workload + same SLO + same no-LLM harness
different initial config
  -> does it converge to the same near-optimal region?
  -> does it keep an interpretable trial path?
```

The base spec uses the already validated Qwen30B community vLLM 0.20 harness setup:

```text
configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json
```

The run must set:

```json
{
  "llm": {
    "use_harness": true,
    "endpoint": null
  }
}
```

Suggested minimal matrix:

| Case | Base flag changes | Mechanism under test | Expected trial path |
| --- | --- | --- | --- |
| trusted-start-control | Keep the existing trusted base | Control against the existing stopfix run | `TP=2 -> TP=4 -> TP=2+gmu climb -> stop` |
| bad-topology | `TP=8, DP=1` | Whether a high-TP start brackets downward | `TP8 baseline -> TP4 -> TP2/or equivalent better topology -> runtime` |
| bad-runtime | `TP=2, DP=1, gmu=0.5, max-num-seqs=8` | Whether low KV headroom jumps back to the normal range | `gmu 0.5 baseline -> gmu 0.9 -> 0.92/...` |
| combined-bad | `TP=8, DP=1, gmu=0.5, max-num-seqs=8` | Whether topology recovery and runtime recovery chain together | `TP8 -> TP4 -> TP2/nearby -> gmu 0.9 -> climb -> stop` |

Success criteria:

- no LLM endpoint is configured; all proposals come from the harness;
- no config signature is repeated;
- a high-TP start must produce an adjacent lower-TP probe first, rather than `gmu=0.52`;
- a low-gmu start must jump to `0.9` first, rather than `0.52`;
- within 12 measured trials, reach `>=95%` of the reference stopfix best:

```text
reference best = 3.4333 req/s/GPU
95% threshold  = 3.2616 req/s/GPU
```

- the final stop must be authorized by the validator, e.g. `harness_stop`, rather than a
  failed exit because no proposal source exists.
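Two of the success criteria (no repeated signature, and reaching 95% of the reference best within 12 measured trials) are mechanical enough to check in code. The sketch below assumes a hypothetical trial record with `signature` and `req_per_gpu` fields; the reference value and ratio are the ones stated above, and the example rows reuse the first six trials of the stopfix run.

```python
def check_recovery_run(trials, reference_best=3.4333, ratio=0.95, max_trials=12):
    """Check a measured trial path against the numeric success criteria.

    trials: ordered list of dicts with 'signature' (hashable) and
    'req_per_gpu' (float) for each measured trial (assumed layout).
    """
    threshold = reference_best * ratio
    signatures = [t["signature"] for t in trials]
    # 1-based index of the first trial that clears the threshold, if any.
    first_hit = next(
        (i + 1 for i, t in enumerate(trials) if t["req_per_gpu"] >= threshold),
        None,
    )
    return {
        "threshold": round(threshold, 4),
        "no_duplicate_signatures": len(signatures) == len(set(signatures)),
        "reached_within_budget": first_hit is not None and first_hit <= max_trials,
        "first_hit_trial": first_hit,
    }


# First six measured trials of the stopfix run, taken from the trial table.
stopfix_prefix = [
    {"signature": "{}", "req_per_gpu": 2.2000},
    {"signature": "TP=2", "req_per_gpu": 3.2583},
    {"signature": "TP=4", "req_per_gpu": 2.0917},
    {"signature": "TP=2,gmu=0.92", "req_per_gpu": 3.2583},
    {"signature": "TP=2,gmu=0.94", "req_per_gpu": 3.2583},
    {"signature": "TP=2,gmu=0.96", "req_per_gpu": 3.3333},
]
report = check_recovery_run(stopfix_prefix)
```

On this prefix the threshold evaluates to `3.2616`, and `TP=2` alone (`3.2583`) falls just short of it, so the first trial to clear the bar is the `gmu=0.96` one.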

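If the real-hardware result fails, keep the failure path and analyze which kind of mechanism fell short:

- the topology bracket finds a low TP, but runtime still cannot recover;
- `max-num-seqs=8` makes admission too poor, requiring an admission recovery floor;
- the baseline itself is entirely infeasible, so the current harness has no completed incumbent and cannot enter the normal guided loop;
- vLLM launch/OOM failures cause failure memory to mask a recoverable path.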

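## One-sentence summary

The no-LLM harness can find a configuration automatically because it already implements an
experiment planner oriented around serving mechanisms: it first attributes trial
observations to a bottleneck, then maps the bottleneck to a legal intervention family,
updates the incumbent from real SLO-constrained req/s/GPU measurements, and finally lets the
validator decide whether to stop. The LLM is only a replaceable proposal backend for this
planner, not a necessary core of AITuner.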