AITuner optimization suggestions from large models:

# GPT

This prompt is already stronger than most "let the LLM tune parameters directly" approaches: it at least provides an objective, constraints, a baseline, a search space, tools, an output schema, recent state, and a ledger placeholder.

But if your goal is a genuinely **good and deep** agentic AITuner system, its biggest current weakness is not simply "too little information"; rather:

1. **The context hierarchy is not yet "causal"**: it reads more like an oversized task manual than a state representation that supports systematic reasoning.

2. **Several key pieces of context are missing**, making it hard for the agent to form truly bottleneck-driven tuning.

3. **The prompt contains multiple internal conflicts/inconsistencies** that will teach the agent bad behavior, or even directly produce illegal proposals.

4. **The structure of "experimental science" is not yet explicit**: hypotheses, evidence, attribution, confidence, re-test conditions, and transferable experience are not first-class citizens.

The analysis below proceeds layer by layer.
---

## 1. Overall judgment first: this prompt currently describes a "tool-using search agent", not yet a "research-grade tuning agent"

What it supports today:

* profile hardware / model / workload first
* then run benchmarks
* keep tuning based on recent runs
* finally emit a recommended config

That is already a working optimization loop.
But a deeper agentic AITuner is not just "can call tools + can turn a few knobs"; it should have four layers of capability:

* **State abstraction**: abstract the system into a reasoning-friendly bottleneck state instead of scattered metrics.
* **Hypothesis management**: every experiment carries an explicit hypothesis, and beliefs are updated afterwards.
* **Search control**: know when to explore, when to exploit, when to stop, and when to run an ablation / validation / re-test.
* **Transfer / memory**: carry experience from this trace, this model, and this hardware class into the next round of decisions.

The current prompt covers a small part of layer one and part of layer three; layers two and four are essentially not established yet.
---

# 2. Biggest problem #1: the prompt provides "configuration context" but lacks "system state context"

This is the most central issue.

The most information-dense parts of the prompt today are:

* baseline config
* tunable config
* search space docs
* tools
* constraints
* recent runs

But what a tuning agent really needs is not a "parameter manual"; it needs to know **what regime the system is currently in**.
In other words, the LLM is not short on knobs; it is short on **state**.
## 2.1 Missing end-to-end bottleneck decomposition

The bottleneck metrics list you currently provide is:

* queue ratio
* waiting / running
* kv cache usage
* prompt tokens/s
* generation tokens/s
* preemption rate
* prefill/decode ratio
* request latency

These help, but they are far from enough, because they never decompose the bottleneck into **which class of resource is limiting goodput**:

* **GPU compute bound**
* **HBM / KV capacity bound**
* **scheduler / batching bound**
* **MoE communication bound**
* **host-side launch / Python / runtime overhead bound**
* **network bound**
* **tail-latency / long-prompt dominated bound**
* **admission-control / queueing dominated bound**

In other words, the agent sees "metrics" but never a "structured bottleneck hypothesis".

A better approach is to put a post-processed `DerivedSystemState` directly into the prompt, for example:

```json
{
  "derived_state": {
    "service_regime": "prefill_dominated",
    "likely_primary_bottleneck": "prefill_compute_or_batching",
    "likely_secondary_bottleneck": "moe_comm_disabled_but_model_is_moe",
    "queueing_pressure": "high",
    "kv_capacity_pressure": "low",
    "slo_headroom": "moderate",
    "parallelism_headroom": {
      "tp_up_possible": false,
      "dp_up_possible": true,
      "ep_candidate": true
    },
    "confidence": 0.63
  }
}
```

This would dramatically improve the quality of the LLM's reasoning,
because LLMs excel at **semantic, mid-level abstractions**; they are poor at reliably inducing the system regime by themselves from dozens of raw metrics.
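The derivation step above can be sketched in code. This is a minimal, hypothetical sketch of how raw aggregated metrics might be collapsed into such a `DerivedSystemState`; the metric names and thresholds are illustrative assumptions, not the real schema.

```python
# Hypothetical sketch: collapse raw engine metrics into semantic regime labels,
# so the LLM reasons over a DerivedSystemState instead of dozens of raw numbers.
# All field names and thresholds here are illustrative assumptions.

def derive_system_state(m: dict) -> dict:
    """Map raw aggregated metrics to coarse, semantic regime labels."""
    total = m["prefill_tokens_per_s"] + m["generation_tokens_per_s"]
    prefill_share = m["prefill_tokens_per_s"] / max(total, 1e-9)
    return {
        "service_regime": "prefill_dominated" if prefill_share > 0.8 else "mixed",
        "queueing_pressure": "high" if m["queue_ratio"] > 0.5 else "low",
        "kv_capacity_pressure": "high" if m["kv_cache_usage"] > 0.9 else "low",
    }

state = derive_system_state({
    "prefill_tokens_per_s": 90_000.0,
    "generation_tokens_per_s": 50.0,
    "queue_ratio": 0.7,
    "kv_cache_usage": 0.35,
})
# state == {"service_regime": "prefill_dominated",
#           "queueing_pressure": "high", "kv_capacity_pressure": "low"}
```

The point of the sketch is that the thresholds live in system code, where they can be audited and tuned, instead of being re-invented by the LLM on every turn.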
---

## 2.2 Missing the key metrics for resource utilization and bottleneck localization

The prompt currently does not provide:

* GPU SM utilization / active cycles
* HBM bandwidth utilization
* NVLink / PCIe / NIC utilization
* MoE all2all latency / bytes / imbalance
* CPU utilization / launcher overhead
* scheduler idle gaps / step interval jitter
* CUDA graph hit rate / replay hit rate
* kernel-level top hotspots
* batch composition statistics (per-step seq count / token count / prefill share)

For a serious agentic AITuner, this information is critical.

For example, the workload is clearly:

* avg decode = 1
* p95 decode = 1
* prefill-only SLO
* avg prompt 1875, p95 7697

This strongly suggests a **prefill-heavy / long-prompt dominated** scenario.
Yet the agent still cannot distinguish between realities as different as:

1. GPU prefill matmul already close to saturated
2. batches never fill up, so the GPU often idles
3. TTFT dominated by queueing and scheduling delay
4. the long-prompt chunking policy adding extra splitting overhead
5. MoE expert dispatch being expensive but never explicitly observed
6. KV not being the bottleneck, but block fragmentation throttling admission

Without lower-level profiling state, the agent can usually only guess knobs from experience.
---

## 2.3 Missing the "config → mechanism → metric" causal bridge

The search space table says "what each knob controls", but it does not give the agent a sufficiently explicit **mechanism-level explanation**.
For example:

* `enable_expert_parallel` changes MoE layer sharding and the communication pattern
* `block_size` changes KV allocation granularity / fragmentation / paging overhead
* `max_num_batched_tokens` changes batching aggressiveness and the TTFT / throughput tradeoff
* `max_num_seqs` changes scheduler occupancy / fairness / tail latency
* `cuda_graph_sizes` / `compilation_config` change launch overhead / graph capture coverage

Today these are only weakly described in natural language.
A deeper agent needs the prompt to state explicit causal priors, such as:

```json
{
  "knob_priors": {
    "engine_args.enable_expert_parallel": {
      "helps_when": [
        "is_moe=true",
        "moe_comm_or_expert_sharding_is_bottleneck",
        "tp-only causes poor expert locality or scaling"
      ],
      "hurts_when": [
        "interconnect is weak",
        "batch is too small",
        "all2all cost dominates"
      ],
      "observable_signals": [
        "high prefill time with moe model",
        "poor scaling from tp increase",
        "queue pressure while GPU util not saturated"
      ]
    }
  }
}
```

With this, the agent does not just "try EP"; it knows **why EP is worth trying right now**.
---

# 3. Biggest problem #2: the prompt omits much of the context that truly matters for tuning decisions

The gaps below significantly limit the agent's depth.

## 3.1 The baseline lacks its "full experimental semantics"

You provide the baseline config, but not its origin or semantic label:

* Is it a hand-tuned expert optimum?
* The production default?
* Some historical best?
* A safe, conservative config?
* The best point for some particular workload class?

This matters, because how the agent should use the baseline depends on the baseline's identity.

For example:

* If the baseline is a "conservative steady-state production config", the agent should explore more aggressively.
* If the baseline was "expert-tuned for this same trace class", the agent should prefer small deltas and local ablations.
* If the baseline is "the optimum for a different workload class", the agent should down-weight its prior.

Suggested addition to the prompt:

```json
{
  "baseline_meta": {
    "source": "human_expert",
    "intended_workload_class": "prefill-heavy-long-context-moe",
    "trust_level": "high",
    "optimization_goal": "goodput_under_ttft_slo",
    "known_limitations": ["not tuned for H20 PCIe topology"]
  }
}
```
---

## 3.2 Workload shape statistics are too coarse

The workload profile currently gives only:

* qps
* avg/p95 prompt
* avg/p95 decode
* p99 total tokens

That is not enough.
The workload state the agent really needs should at least include:

* prompt length histogram / bucket distribution
* arrival burstiness (CV, p95 inter-arrival gap, burst windows)
* request size correlation (do long prompts arrive in clusters?)
* time-local nonstationarity (do the first and second halves of the trace differ?)
* percent of ultra-long prompts
* prefix reuse potential
* effective concurrency envelope under the time_scale search
* head vs. tail window differences

Many tuning strategies depend heavily on workload shape.
For example:

* If a few ultra-long prompts determine p95 TTFT, raising `max_num_batched_tokens` may worsen the tail.
* If the workload is bursty, `max_num_seqs` and the queueing policy matter a lot.
* If prefix reuse is high, prefix caching is valuable; with almost no reuse, it should not be treated as an important factor.

Suggested upgrade to the `profile_workload` output:

```json
{
  "prompt_len_buckets": {"0-512": 0.12, "512-2k": 0.44, "2k-8k": 0.36, "8k+": 0.08},
  "interarrival_cv": 1.87,
  "burstiness_score": 0.74,
  "prefix_reuse_estimate": 0.03,
  "long_prompt_tail_share": 0.11,
  "stationarity": {
    "head_vs_tail_qps_delta_pct": 22.0,
    "head_vs_tail_prompt_p95_delta_pct": 35.0
  }
}
```
---

## 3.3 Missing the model internals needed for model-serving co-design

You already have:

* is_moe
* num_experts
* num_experts_per_tok
* num_kv_heads
* layers
* heads

But much model-structure information critical to serving decisions is missing:

* per-layer expert hidden dims / FFN expansion
* estimated attention FLOPs vs. MoE FLOPs share
* parameters per layer / expert size
* activation memory estimate per token
* prefill vs. decode compute ratio estimate
* router imbalance risk / top-k gating characteristics
* KV cache bytes per token per TP config
* theoretical memory & comm models under TP / EP / DP

A research-grade agent should see more than "this is MoE"; it should see:

* how compute / comm / memory theoretically change for this model under each TP/EP combination
* which parallelism is most likely to improve goodput
* roughly which cost class each knob change affects

Suggested explicit `AnalyticalProfile`:

```json
{
  "analytical_profile": {
    "estimated_prefill_flops_per_token": ...,
    "estimated_decode_flops_per_token": ...,
    "estimated_kv_bytes_per_token_per_gpu": {
      "tp1": ...,
      "tp2": ...,
      "tp4": ...,
      "tp8": ...
    },
    "moe_dispatch_cost_risk": "high",
    "ep_candidate_strength": "strong",
    "tp_scaling_limit_hint": "pcie_interconnect"
  }
}
```

These analytical priors matter: they upgrade the agent from "pure empiricism" to "model-plus-experiment driven".
---

## 3.4 The action space lacks full semantics and local dependencies

The allowed config paths are quite limited, which is fine in itself; the problem is that the agent does not know:

* which knobs are "high-impact primary knobs"
* which are "conditional knobs"
* which combinations must move together
* which changes belong in one step and which deserve single-variable ablation
* what step size is recommended for each knob

For example:

* with `enable_expert_parallel=true`, you would normally co-tune `all2all_backend`
* yet the allowed config paths do **not include `engine_args.all2all_backend`**
* while the prompt text keeps mentioning it

This directly breaks the semantics of the agent's action space:
it knows the knob matters, but it cannot set it.

This is a textbook "prompt context incomplete / inconsistent" problem.
---

# 4. Biggest problem #3: the prompt contradicts itself in several places, which will badly pollute the agent's decisions

This point is crucial.
Worse than "giving a bit less information" is "giving conflicting information".

Your example contains at least the following hard flaws.

---

## 4.1 The `block_size` constraint directly conflicts with the baseline

Your global rules say:

* CUDA only supports `block_size <= 32`
* the search space enum is `{16, 32}`
* the knob guidance also says never use 8

Yet the baseline config contains:

```json
"block_size": 64
```

And the baseline run actually completed with 64.

This deeply confuses the agent:

* Is 64 an illegal value, or is the documentation stale?
* If the baseline succeeded with 64, is the rule wrong?
* If the rule is a hard constraint, why does the system itself allow it?
* Should the agent still treat 64 as a legal candidate afterwards?

This is not a small issue.
**Constraint inconsistency directly destroys the LLM's rule-following reliability.**

You must keep all three fully consistent:

* the documented constraints
* the allowed config paths / domains
* the actually running system

If the system really supports 64, the prompt must be updated;
if 64 is just a legacy artifact, the baseline needs to be normalized and rewritten.
---

## 4.2 O→I→A is mandatory, yet the first round has no bottleneck metrics

You define a strict requirement:

> Every run_trace_replayer proposal.selection_logic MUST follow exactly this structure
> Observation: cite at least 3 numeric metrics from recent bottleneck signals.

But before the first run, the state explicitly reads:

* recent runs: none
* bottleneck signals: none

So, under the rules, a legal proposal for the first baseline/seed run cannot be constructed at all.

And indeed your agent did not follow the requirement; it emitted:

* `selection_logic: "human_baseline"`

This shows the prompt's rule is unsatisfiable at cold start.

The fix is not "make the model smarter" but writing the protocol in two stages:

* **Stage 0 / bootstrap**: allow a profiling run / baseline run without OIA
* **Stage 1+**: require OIA built from recent numeric evidence

For example:

```text
Cold-start exception:
- If there is no successful benchmark run yet, you may execute one bootstrap run
  without OIA evidence.
- After at least one successful run with diagnostics, all future runs must use OIA.
```
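The two-stage protocol is easy to enforce on the system side rather than in prose. Below is a minimal, hypothetical sketch of a proposal validator implementing it; the field names (`selection_logic`, `success`, `diag_sample_count`) are illustrative assumptions, not the real run schema.

```python
# Hypothetical sketch of the two-stage protocol: the OIA requirement is waived
# only while no successful run with diagnostics exists (Stage 0 / bootstrap).
# Field names are illustrative assumptions about the run records.

def oia_required(completed_runs: list) -> bool:
    """OIA evidence becomes mandatory after the first successful run with diagnostics."""
    return any(r["success"] and r.get("diag_sample_count", 0) > 0
               for r in completed_runs)

def validate_proposal(proposal: dict, completed_runs: list) -> bool:
    if not oia_required(completed_runs):
        return True  # Stage 0: a bare baseline/bootstrap run is allowed
    logic = proposal.get("selection_logic", "")
    # Stage 1+: require the full Observation -> Inference -> Action structure
    return all(k in logic for k in ("Observation:", "Inference:", "Action:"))

# Cold start: "human_baseline" is acceptable
assert validate_proposal({"selection_logic": "human_baseline"}, [])
# After a diagnosed run: the same proposal is rejected
assert not validate_proposal(
    {"selection_logic": "human_baseline"},
    [{"success": True, "diag_sample_count": 12}],
)
```

Encoding the exception as a validator means the rule can never be "creatively reinterpreted" by the model: the executor simply rejects non-conforming proposals.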
---

## 4.3 `run_trace_replayer` requires hardware, model, and workload profiles, but never requires "diagnostics availability"

Your subsequent decisions depend heavily on bottleneck metrics, yet the baseline result shows:

* `diag_engine_sample_count = 0`
* `diag_resource_sample_count = 0`

In other words, the benchmark artifact actually contains no diagnostic samples.
The later OIA requirement therefore remains unsatisfiable.

A more robust prompt must explicitly handle:

* What should the agent do when diagnostics are missing?
* Is a fallback to aggregated metrics allowed?
* Should a dedicated diagnostics-mode run be triggered?
* Should conclusion confidence be lowered?

Suggested addition:

```text
If diagnostic metrics are unavailable or sample count is 0:
- mark bottleneck confidence as low;
- do not claim a specific bottleneck unless supported by aggregate evidence;
- prioritize runs that improve observability or isolate hypotheses.
```
---

## 4.4 `all2all_backend` is mentioned repeatedly but is not in the allowed config paths

The text repeatedly stresses:

* EP should be paired with all2all_backend
* the backend choices include deepep_high_throughput / low_latency, etc.

But the allowed config paths do not include:

* `engine_args.all2all_backend`

This leaves the agent half-crippled: it knows the knob should be tuned, but it cannot tune it.

This directly weakens MoE/EP tuning capability, especially since this model is MoE and the prompt explicitly emphasizes it.
---

## 4.5 The search space lists TP/PP/DP/PCP/DCP, but the allowed paths only expose TP/DP

Your exploration policy says:

* early on, vary major knobs by large factors (`tp/pp/dp/pcp/dcp`, ...)

But the allowed config paths do not include:

* pipeline_parallel_size
* prefill_context_parallel_size
* decode_context_parallel_size

This creates a "phantom action space":

* the system prompt tells the agent these are major knobs
* the actual action schema does not allow setting them

This inconsistency is very costly.
The LLM will burn a large token budget reasoning about dimensions it cannot control at all.
---

## 4.6 The `kind` specification does not match the actual call

You define:

* `kind: seed | iter (optional)`

But the agent's actual call used:

* `"kind": "baseline"`

If the real executor accepts baseline, the schema is wrong;
if the schema is right, the agent is in violation.

This kind of schema drift prevents the agent from learning a stable protocol.
---

## 4.7 "keep trace_minutes fixed" vs. "before stopping, validate top configs on full configured trace_minutes" is expressed ambiguously

You say:

* Keep trace_minutes and max_requests fixed to configured values; do not override
* Before stopping, validate top configs on the full configured trace_minutes

But the configured trace_minutes is already 10.0, while profile_workload reports duration_minutes of about 5.0, and the run includes a binary search on time_scale.

This leaves the agent unsure:

* Has the workload profile it sees already been altered by filters/time_scale?
* Is the baseline goodput measured on the full trace, or at some effective scale?
* How do "full configured trace_minutes" and the "binary-searched time_scale" relate?

Suggested explicit distinction:

* **Trace window**: a fixed 10-minute head window
* **Arrival scaling**: load varied by the system-controlled binary search
* **Evaluation protocol**: all configs are compared on the same window; goodput is the output of the system's time_scale search

Otherwise the agent can easily misjudge the baseline's comparability.
---

# 5. Biggest problem #4: the prompt does not yet explicitly support a "scientific experimentation workflow"

A sufficiently deep agentic AITuner should not just "run a config → see if it's good".
It should maintain experimental logic like a systems researcher.

The following key objects are currently missing.

---

## 5.1 No "hypothesis object"

Each run currently carries only:

* selection_logic
* why_better
* expected_improvements

That is not enough.
Every experiment should carry an explicit hypothesis, for example:

```json
{
  "hypothesis": {
    "id": "H-ep-prefill-moe-001",
    "statement": "For this prefill-heavy MoE workload on 8xH20 PCIe, enabling EP with a high-throughput all2all backend will improve goodput more than increasing TP, because MoE sharding and communication pattern dominate prefill cost.",
    "depends_on": ["model.is_moe", "decode_tokens≈1", "queue_pressure_high"],
    "falsifiable_by": [
      "goodput gain < 3%",
      "ttft p95 worsens > 5%",
      "queue ratio unchanged while engine prompt throughput decreases"
    ]
  }
}
```

Only then can the agent do real belief updates instead of pure trial-and-error.
---

## 5.2 No "experiment design types"

Not every run is the same kind of run.
You need to distinguish:

* bootstrap baseline
* broad exploration
* local refinement
* ablation
* confirmation / re-test
* transfer test
* robustness validation

If the prompt does not distinguish these, the agent tends to:

* exploit too early
* repeatedly test similar configs
* mistake noisy improvements for real ones
* not know when a re-test is due

Suggestion: give every run an explicit `experiment_type`.
---

## 5.3 No "result confidence" or "noise model"

The goodput benchmark is not a deterministic oracle.
The prompt currently never tells the agent:

* roughly how large run-to-run variance is for the same config
* whether cache hits / warmup affect results
* how large an improvement counts as significant
* when a re-run is needed for confirmation

So the agent will take a 1-2% gain at face value.

Suggested addition to the prompt:

```json
{
  "evaluation_policy": {
    "min_effect_size_pct_for_claim": 3.0,
    "rerun_if_gain_between_pct": [1.0, 5.0],
    "rerun_if_slo_margin_small": true,
    "prefer_ablation_for_multi_knob_gain": true
  }
}
```
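The evaluation policy above can also live as executable logic so the agent never has to interpret it. A minimal sketch, with thresholds mirroring the JSON (the function and label names are illustrative assumptions):

```python
# Hypothetical sketch of the evaluation_policy as code: decide whether a
# measured goodput gain is claimable, needs a confirming rerun, or is noise.
# Thresholds mirror the JSON above; names are illustrative.

def classify_gain(gain_pct: float,
                  min_effect: float = 3.0,
                  rerun_band: tuple = (1.0, 5.0)) -> str:
    lo, hi = rerun_band
    if lo <= gain_pct <= hi:
        return "rerun"   # ambiguous zone: confirm before believing it
    if gain_pct >= min_effect:
        return "claim"   # large enough to treat as a real improvement
    return "noise"       # below the minimum effect size

assert classify_gain(0.8) == "noise"
assert classify_gain(2.0) == "rerun"
assert classify_gain(8.0) == "claim"
```

Note the design choice: the rerun band takes priority over the effect-size threshold, so a 4% gain still triggers a confirming rerun before it may be claimed.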
---

## 5.4 No structured "derived memory / exploration ledger"

You already noticed this point earlier yourself; it maps directly onto this.

The current prompt has only:

* failed config ledger: none
* recent runs: ...
* current baseline values ...

That is far from enough.
A truly effective ledger is not just a "list of failed configs"; it needs at least:

* tried configs
* outcome
* dominant observed signals
* inferred cause
* confidence
* neighborhood summary
* equivalence / redundancy relations
* what has been falsified

For example:

```json
{
  "exploration_ledger": {
    "tested_regions": [
      {
        "region_id": "R1",
        "predicate": "tp=4, ep=false, block_size in {32,64}, max_num_batched_tokens in [4k,12k]",
        "best_run": "run_07",
        "finding": "batching beyond 8k increases tail TTFT faster than throughput",
        "confidence": 0.71
      }
    ],
    "falsified_hypotheses": [
      {
        "hypothesis_id": "H-tp-up-001",
        "reason": "tp increase reduced effective goodput under PCIe topology"
      }
    ],
    "best_known_config": {...},
    "redundant_proposals_to_avoid": [...]
  }
}
```

This turns the agent from "short-term-memory search" into an "explicit research process".
---

# 6. What this concrete example already reveals about the prompt

The transcript you provided is itself quite telling.

---

## 6.1 The agent's first round only replayed the baseline, without really using the workload/model/hardware profiles to form initial hypotheses

After the profiles came back, the system already knew many strong signals:

* 8x H20, PCIe topology
* Qwen3-235B-A22B, MoE, 128 experts, top-8
* num_kv_heads = 4
* prefill-heavy, decode ≈ 1
* very long prompts, p95 = 7697
* the objective is TTFT-based goodput

A stronger agent should have immediately formed several high-value hypotheses here:

* this is a prefill-dominated, long-prompt MoE scenario, so `max_num_batched_tokens` / `max_num_seqs` / EP deserve priority
* decode-side knobs barely matter
* under a PCIe topology, raising TP is not necessarily worthwhile, and EP's payoff depends on the comm backend
* block size may relate to KV fragmentation, but with decode ≈ 1 the payoff may not be large
* chunked prefill behavior should be very important

But the current prompt never asks the agent to write these hypotheses out explicitly.
So the agent just mechanically benchmarked the baseline.

This is not the agent being dumb; it is the prompt never requiring "form a problem representation first".
---

## 6.2 The agent copied many non-allowed / unnecessary fields, showing the action schema is not clear enough

The `run_trace_replayer` config patch contains many fields that are not actually in the allowed config paths:

* swap_space
* enable_prefix_caching
* disable_hybrid_kv_cache_manager
* kv_transfer_config
* hf_overrides
* kv_cache_dtype
* enforce_eager
* quantization
* a pile of envs

This shows the prompt's action contract was never compressed into a **minimal, clear, controllable proposal language**.
The agent is effectively replicating the entire baseline config instead of operating only on the allowed knobs.

A better system should split `config` explicitly into:

* immutable baseline context
* mutable tunable patch

and let the agent output only the patch.
Otherwise the LLM easily copies the whole baseline, which leads to:

* verbose actions
* unnecessary drift being introduced easily
* worse ledger deduplication and neighborhood reasoning
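The patch-only contract can also be enforced after the fact by diffing the agent's output against the baseline. A minimal, hypothetical sketch; the allowed-path set here is illustrative, not the real whitelist:

```python
# Hypothetical sketch: derive a minimal tunable patch instead of echoing the
# full baseline config. ALLOWED_PATHS is an illustrative stand-in whitelist.

ALLOWED_PATHS = {"engine_args.block_size", "engine_args.max_num_seqs"}

def minimal_patch(baseline: dict, proposed: dict) -> dict:
    """Keep only allowed keys whose value actually differs from the baseline."""
    return {
        k: v for k, v in proposed.items()
        if k in ALLOWED_PATHS and baseline.get(k) != v
    }

baseline = {"engine_args.block_size": 64, "engine_args.max_num_seqs": 256,
            "engine_args.swap_space": 16}
proposed = dict(baseline, **{"engine_args.max_num_seqs": 512,
                             "engine_args.swap_space": 32})
# The non-allowed swap_space change and the unchanged block_size are dropped:
assert minimal_patch(baseline, proposed) == {"engine_args.max_num_seqs": 512}
```

Diffing like this also makes ledger deduplication trivial: two proposals are "the same experiment" exactly when their patches are equal.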
---

## 6.3 The baseline result lacks diagnostics, and the prompt has no fallback strategy

This is one of the biggest operational flaws in the example.

The baseline returned:

* `diag_engine_sample_count = 0`
* `diag_resource_sample_count = 0`

which means the agent can never obtain the numeric bottleneck evidence you require.
If the system does not handle this in the prompt, the agent can afterwards only:

* break the rules and guess
* or stall
* or keep doing ad hoc search

This shows the prompt lacks "control flow for observation failure / missing diagnostics".
---

# 7. How the prompt should be upgraded for a more mature agentic AITuner

The suggestions below are systematic, not just "add a few fields".

---

## 7.1 Upgrade the prompt from "task description" to "state machine input"

The current prompt is a large block of text plus the current state.
A better approach is to organize it into a few stable JSON sections:

```json
{
  "task_spec": {...},
  "system_state": {...},
  "search_space": {...},
  "observability_state": {...},
  "memory_state": {...},
  "decision_policy": {...},
  "output_contract": {...}
}
```

This makes consistent parsing easier for the LLM and also helps later distillation/training.
---

## 7.2 Explicitly add a `DerivedSystemState`

This is one of the most important enhancements.
Do not provide only the raw profile; additionally provide derived state:

* workload regime
* bottleneck hypothesis candidates
* resource pressure summary
* SLO headroom
* exploration stage
* confidence

This amounts to giving the LLM an "intermediate representation".
---

## 7.3 Make every run carry a hypothesis and an experiment type

Upgrade the output schema from the current:

* selection_logic
* why_better
* expected_improvements

to:

```json
{
  "experiment_type": "bootstrap|explore|ablation|refine|confirm|validate",
  "hypothesis": {...},
  "proposal": {...},
  "expected_outcomes": {...},
  "stop_condition_update": {...}
}
```

Only then does the whole process resemble scientific experimentation rather than blind search.
---

## 7.4 Turn the memory / ledger from a "log" into a "reasoning object"

Add at least these pieces:

* `best_known_config`
* `tested_regions`
* `falsified_hypotheses`
* `pending_hypotheses`
* `config_equivalence_or_redundancy`
* `recent_effect_estimates`

This will significantly reduce repeated experiments.
---

## 7.5 Add "knob interaction priors"

Priors like these are well worth putting directly into the prompt:

* prefill-heavy + long prompts → prioritize `max_num_batched_tokens`, chunked prefill, EP for MoE
* decode ≈ 1 → de-emphasize decode-oriented tuning
* PCIe topology → be cautious with communication-heavy scaling
* MoE + EP enabled → backend choice is a first-order knob
* high queue ratio + low running mean → scheduling/admission/batching issue
* high KV usage + preemption / waiting → capacity / fragmentation issue

LLMs thrive on this kind of high-level prior.
---

## 7.6 Add an "illegal / stale / conflicting rule checker"

Before handing the prompt to the agent, the system side should perform rule sanitization:

* does the baseline violate the current constraints?
* is the search space consistent with the allowed paths?
* are the docs stale?
* does the output schema align with the tool schema?

Otherwise the prompt transmits the system's own inconsistencies straight to the LLM.
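Such a sanitizer is a few lines of set logic. A minimal, hypothetical sketch that catches exactly the two inconsistencies discussed earlier (a baseline violating a hard constraint, and documented knobs missing from the allowed paths); all names here are illustrative:

```python
# Hypothetical sketch of a pre-prompt rule sanitizer. It flags a baseline value
# outside its constrained domain and any documented search-space knob that is
# not actually settable. Names and structures are illustrative assumptions.

def sanitize(baseline: dict, constraints: dict,
             search_space: set, allowed_paths: set) -> list:
    issues = []
    for path, domain in constraints.items():
        if path in baseline and baseline[path] not in domain:
            issues.append(f"baseline violates constraint: {path}={baseline[path]}")
    for knob in sorted(search_space - allowed_paths):
        issues.append(f"documented knob not settable: {knob}")
    return issues

issues = sanitize(
    baseline={"engine_args.block_size": 64},
    constraints={"engine_args.block_size": {16, 32}},
    search_space={"engine_args.block_size", "engine_args.all2all_backend"},
    allowed_paths={"engine_args.block_size"},
)
# Flags both block_size=64 and the missing all2all_backend path.
```

Running this before every prompt render would have caught both of this transcript's hard flaws before the agent ever saw them.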
---

# 8. The current prompt's shortcomings, condensed into 8 core gaps

For later use in a paper / internal memo, here is a more "research-style" summary.

## Gap 1: no mid-level state abstraction

Mostly raw config + raw metrics; no derived bottleneck state.

## Gap 2: no mechanism-level causal priors

Knob effects are described too shallowly; no config→mechanism→metric causal bridge.

## Gap 3: missing high-value observability

No GPU / comm / scheduler / graph / kernel diagnostics.

## Gap 4: no structured hypothesis management

No explicit hypotheses, falsification conditions, or belief updates.

## Gap 5: no experiment typing or statistical confidence modeling

No explore/refine/confirm distinction, and no effect size / rerun policy.

## Gap 6: no long-term memory or region-level exploration summary

Only recent runs; no tested regions / falsified directions.

## Gap 7: action space inconsistent with search space

The docs say a knob is tunable; the actual allowed paths say it is not.

## Gap 8: self-contradictory constraints / protocol

block_size, the OIA requirement, the kind schema, and diagnostics availability all conflict.
---

# 9. If you can only prioritize 5 fixes, do them in this order

## First priority: fix the prompt's internal consistency

Align these first:

* block_size constraint vs. baseline
* allowed paths vs. search space
* whether all2all_backend is allowed
* the kind schema
* a cold-start OIA exception

This is the foundation.

## Second priority: add derived bottleneck state

Do not give only raw metrics; give:

* queueing pressure
* kv pressure
* comm pressure
* likely bottleneck candidates
* confidence

## Third priority: structure the ledger explicitly

At minimum:

* best known
* falsified hypotheses
* tested regions
* repeated configs to avoid

## Fourth priority: make every run hypothesis-driven

Introduce:

* experiment_type
* hypothesis
* falsifiable_by
* confidence update

## Fifth priority: expand observability

At minimum add:

* GPU util
* comm bytes/latency
* scheduler idle / batch fill
* graph hit rate
* batch composition
---

# 10. What a better prompt skeleton should look like

Not a full rewrite, just the structural skeleton:

```json
{
  "task_spec": {
    "objective": "maximize goodput_qps_per_gpu",
    "slo": {...},
    "evaluation_protocol": {...}
  },
  "immutable_context": {
    "hardware_profile": {...},
    "model_profile": {...},
    "workload_profile": {...},
    "baseline_config": {...},
    "baseline_meta": {...}
  },
  "derived_state": {
    "workload_regime": "prefill-heavy-long-context",
    "system_regime": "queueing-sensitive",
    "bottleneck_candidates": [...],
    "confidence": ...
  },
  "search_space": {
    "allowed_paths": [...],
    "interaction_priors": {...},
    "hard_constraints": [...],
    "soft_guidance": [...]
  },
  "memory_state": {
    "best_known_config": {...},
    "tested_regions": [...],
    "falsified_hypotheses": [...],
    "recent_runs": [...]
  },
  "decision_policy": {
    "cold_start_policy": {...},
    "exploration_policy": {...},
    "refinement_policy": {...},
    "validation_policy": {...},
    "noise_policy": {...}
  },
  "output_contract": {
    "must_include": [
      "experiment_type",
      "hypothesis",
      "observation",
      "inference",
      "action"
    ]
  }
}
```
---

# 11. The most condensed conclusion

In one sentence:

**The biggest problem with this prompt is not "too little information for the LLM", but "lots of configuration and rule information that is never organized into a state representation capable of supporting systematic attribution and experimental decision-making".**

More concretely:

* it is currently good at making the agent **do config search**
* but not yet at making the agent **do research-style system tuning**

A truly deep agentic AITuner should not show the LLM:

* "here is the baseline, here are the knobs, go try things"

but rather:

* "what regime the system is currently in"
* "what the most likely bottleneck is"
* "which hypotheses have been falsified"
* "what this experiment is meant to verify"
* "how large an improvement counts as a real improvement"
* "why this result transfers to the next decision"

That is the key step from "an LLM that can tune parameters" to "agentic native system tuning".

If you like, I can next do one of two things for you:

1. restructure this prompt into a **research-grade prompt schema v2**;
2. turn the analysis above into a Chinese memo suitable for your internal report or a paper's related discussion.
# Gemini

This is a very refined, highly engineered agentic tuning system: combining a binary search to measure goodput with an explicit `O->I->A` (Observation -> Inference -> Action) reasoning chain shows deep systems-level thinking.

That said, from the standpoint of letting the LLM reason as deeply as a top systems researcher, the injected context and the runtime state machine still have logical gaps and missing system-level context. Detailed shortcomings and suggested improvements follow:

### 1. A fatal logical deadlock: missing diagnostic metrics vs. the mandatory O-I-A rule

This is the point in the current prompt most likely to push the agent into hallucination or deadlock.

- **Shortcoming**: the prompt's `Bottleneck analysis requirement (strict)` forces the agent to cite at least **3** `diag_engine_*` metrics from recent bottleneck signals. Yet in the baseline's `tool_results`, the `aggregated` section shows `diag_resource_sample_count: 0.0` and `diag_engine_sample_count: 0.0`, meaning the underlying trace_replayer never collected or exposed the key profiling metrics (such as `diag_engine_kv_cache_usage_perc_mean`).

- **Consequence**: with no data returned from below while the system prompt holds a gun to the agent's head demanding 3 metrics, the agent can only fabricate (hallucinate) 3 plausible-looking numbers to satisfy the JSON schema, which completely destroys the scientific validity of all subsequent optimization.

- **Improvements**:

    1. **Fix the underlying profiler**: ensure `run_trace_replayer` actually polls the vLLM engine metrics periodically in an async background task and aggregates them into the result.

    2. **Relax the cold-start rule**: add a conditional branch to the prompt: "If diagnostic metrics are missing or count is 0.0 (e.g., during initial baseline), you may skip citing them in the Observation and rely on latency/throughput metrics instead, but you MUST issue a config change to enable metrics profiling if available."
### 2. A direct contradiction between a core hard constraint and the baseline config

- **Shortcoming**: the `Global rules (constraints)` and `Knob guidance` explicitly state "On **CUDA devices**, only `block_size <= 32` is supported" and "`block_size` must be one of {16, 32}; never use 8". Yet the provided `Human baseline config` plainly contains `"block_size": 64`.

- **Consequence**: such self-contradictory context deeply confuses the model. If it keeps 64 in later actions, it violates your prompt rules; if it changes to 32, it may depart from the intent of the baseline you provided.

- **Improvement**: unify the facts. If certain newer vLLM versions on Hopper/Ada architectures already support block sizes of 64 and above (vLLM does in fact support larger block sizes, at the cost of memory fragmentation), update the hard constraint in the prompt and remove the stale `<= 32` limit.
### 3. Missing critical hardware topology and MoE communication context (Systems Context)

The agent is optimizing a **235B-parameter MoE model** (128 experts, 8 active) on **8x NVIDIA H20 with nvlink_topology: "pcie"**. This is an extreme configuration.

- **Shortcoming**: under a PCIe topology, H20 inter-GPU P2P bandwidth is far below NVLink (usually only a few tens of GB/s bidirectional). For MoE, if Expert Parallelism (`--enable-expert-parallel`) is enabled, the All-to-All communication cost over PCIe can be catastrophic. The agent currently has no way to know that `nvlink_topology: "pcie"` means "extremely weak cross-GPU communication bandwidth".

- **Improvements**:

    1. Add topology-aware performance heuristics to the hardware description or knob guidance. For example: "**Topology Guidance**: The current nvlink_topology is `pcie`. Be highly cautious with Expert Parallelism (EP) or large Tensor Parallelism (TP) crossing PCIe switches, as All-to-All communication will be severely bottlenecked. Prefer maximizing Data Parallelism (DP) or Context Parallelism (PCP) if memory permits."

    2. Document when DeepEP applies: the environment variables expose several All2All backends (such as `deepep_low_latency`); tell the agent which backends suit a low-bandwidth PCIe environment.
### 4. Black-boxing "Goodput" and the binary search makes the reasoning shallow

- **Shortcoming**: the system's ultimate objective is to maximize `goodput_qps_per_gpu`. The logs show the underlying system binary-searches `time_scale` (currently landing on `0.5859375`) to apply different QPS pressure until the TTFT SLO ($TTFT_{p95} \le 0.001 \times L + 1.0$) breaks. But the agent has no intuitive model of "how the time-scale factor affects goodput".

- **Improvement**: state the underlying queueing-theory logic in one crisp sentence in the `Optimization policy`: "**Goodput Physics**: The tuner binary-searches the `time_scale` to find the maximum request arrival rate before the system queues saturate and break the TTFT SLO. To increase goodput, you must reduce per-request compute time (improving max capacity) or improve scheduling overlap (e.g., chunked prefill) to prevent queue buildup." This steers the agent toward reasoning about queueing delay in its Inference step, not just single-request latency.
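The search loop described above can be demystified with a toy sketch. Here is a minimal, hypothetical version of the time_scale binary search; `run_benchmark` is a stand-in for the real replayer, and the assumption (consistent with the log's `0.5859375`) is that a smaller scale compresses inter-arrival times and thus raises load.

```python
# Hypothetical sketch of the goodput measurement loop: binary-search the
# arrival time_scale for the most aggressive (smallest) scale whose run
# still meets the TTFT SLO. run_benchmark stands in for the real replayer.

def max_passing_scale(run_benchmark, lo=0.0, hi=1.0, iters=8):
    """Assumption: smaller time_scale = faster arrivals = higher load.
    Returns the smallest scale found whose run still met the SLO."""
    best = hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if run_benchmark(mid):      # True = TTFT SLO held at this load
            best, hi = mid, mid     # SLO held: push load higher (scale down)
        else:
            lo = mid                # SLO broke: back off
    return best

# Toy SLO model: the SLO holds whenever time_scale >= 0.6
scale = max_passing_scale(lambda s: s >= 0.6)
assert abs(scale - 0.6) < 0.01
```

Seeing the loop makes the "Goodput Physics" sentence concrete: the knobs the agent tunes only matter insofar as they move the SLO-breaking point of this search.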
### 5. Missing memory-accounting (Memory Math) capability

- **Shortcoming**: OOM is the hardest red line in systems tuning. For a 235B-parameter MoE model (FP8 weights alone are on the order of 160GB or more; even if 8x 96GB cards have ample total memory, a TP=4, DP=2 sharding still leaves each card carrying a huge KV cache), the agent has no tool to estimate the available KV cache budget (number of KV cache blocks) under a given config.

- **Improvement**: provide a pseudo-tool `estimate_memory_footprint(config)`, or expose `gpu_kv_cache_budget_gb` or `num_gpu_blocks` in the baseline metrics. Then, when enlarging `max_num_seqs` or `max_num_batched_tokens`, the agent can judge OOM risk on evidence instead of guessing.
### 6. Asymmetric exposure of the search space

- **Shortcoming**: the `Allowed config paths` list is badly lopsided. The baseline config contains very many deep vLLM environment variables (such as `ACCL_LOW_LATENCY_OPTIMIZE`, `VLLM_MOE_USE_DEEPEP`, and various FP8 kernel options), yet the whitelist the agent may modify exposes only a handful (e.g., 4 envs including `envs.VLLM_MOE_USE_DEEPEP`). For a deep tuning system, the current whitelist is too conservative and caps the agent's real "systems hacker" capability.

- **Improvement**: add more of the key network/communication parameters (especially the NVSHMEM- and ACCL-related ones, which are critical on H20 PCIe) to the allowed paths; or state explicitly in the prompt: "For environment variables not listed in Allowed config paths, you may request overriding them in the rationale, but the system will mock them for safety" so that manual rules can be collected later.
---

Would you like a concrete rewrite of one specific point (for example, redesigning the `Bottleneck analysis requirement` prompt section so it balances enforcement with cold-start tolerance)?
# Sonnet
|
||
|
||
This is a rich, complex prompt with several meaningful gaps. Here's a deep, structured critique across the major dimensions:
|
||
|
||
---
|
||
|
||
## 1. Hard Constraint Contradiction: `block_size` 64 vs. ≤32 Rule
|
||
|
||
The search space spec states unambiguously: **"Hard constraint (CUDA): `block_size <= 32`. Always choose from allowed enum `{16, 32}`."** The knob guidance reinforces: _"block_size must be one of {16, 32}; never use 8."_
|
||
|
||
Yet the human baseline uses `block_size: 64`, and the system runs it successfully. The agent never flags this. This creates a damaging inconsistency: the constraint table the agent is supposed to enforce says one thing, the baseline config the agent is told to inherit from says something different, and the actual benchmark silently accepts it. The agent is now in an epistemically broken position — it cannot know whether `block_size=64` is a special platform override, a documentation lag, or a bug. It has no mechanism to resolve this, no tool to query valid block sizes at runtime, and no instruction to challenge the baseline when it violates a stated hard constraint.
|
||
|
||
**Fix:** Either update the constraint table to reflect the actual allowed values for this specific platform, or add a `validate_config` tool (or response from `get_constraints`) that resolves the ground-truth allowed enum dynamically. The agent should also receive explicit instruction: "If the baseline config violates a stated constraint, flag it and query for resolution before proceeding."
---
## 2. Bottleneck Signal Infrastructure is Broken, Making O→I→A Structurally Unfulfillable
The prompt mandates a strict Observation→Inference→Action chain for every `run_trace_replayer` call, requiring citation of **at least 3 numeric metrics from `diag_engine_*` signals**. But the baseline result shows:
```
diag_engine_sample_count: 0.0
diag_resource_sample_count: 0.0
diag_metric_collection_error_count: 0.0
```
Every single diagnostic metric is zero. There is no data in any `diag_engine_*` field. The policy requirement is now permanently unfulfillable in honest form — the agent is being asked to cite real bottleneck evidence that doesn't exist. This forces the agent to either fabricate the O→I→A chain (a quality and reliability failure) or deadlock waiting for data that never arrives.
The prompt provides no explanation of _why_ diagnostics might be empty (metric collection agent not running? sampling interval mismatch? scraping disabled?), no fallback policy when diagnostic data is unavailable, and no tool to diagnose the diagnostic failure itself.
**Fix:** Add a fallback policy: "If `diag_engine_sample_count == 0`, use workload profile statistics (avg/p95 prompt tokens, qps, ttft distribution) and run results (latency percentiles, queue ratio proxies) as the Observation basis, and note that engine diagnostics are unavailable." Also add explicit guidance on what causes missing diagnostics and how to request metric collection be enabled.
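A minimal sketch of that fallback policy, assuming flat result/profile dicts; every field name here is illustrative:

```python
# Hypothetical Observation-source selector for the O->I->A chain.

def observation_basis(run_result: dict, workload_profile: dict) -> dict:
    """Choose the Observation basis; fall back when diagnostics are empty."""
    if run_result.get("diag_engine_sample_count", 0) > 0:
        # Engine diagnostics exist: cite diag_engine_* metrics directly.
        return {"source": "diag_engine",
                "metrics": {k: v for k, v in run_result.items()
                            if k.startswith("diag_engine_")}}
    # Fallback: diagnostics empty; build the basis from workload profile
    # plus run results, and record that engine diagnostics were unavailable.
    return {"source": "fallback",
            "note": "engine diagnostics unavailable (sample_count == 0)",
            "metrics": {"p95_prompt_tokens": workload_profile.get("p95_prompt_tokens"),
                        "overall_qps": workload_profile.get("overall_qps"),
                        "ttft_p95": run_result.get("ttft_p95")}}
```

The point is that the prompt should spell out this branch explicitly, so an empty `diag_engine_*` block routes the agent to a documented degraded mode instead of fabrication.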
---
## 3. Search Space Table vs. Allowed Config Paths: Massive Coverage Gap
The search space table documents 15 tunable parameters across parallelism, context-parallel, MoE, overlap/dual-batch, KV cache, batching, and prefill categories. The "Allowed config paths" section permits only 14 paths in total. The following search space parameters are **completely absent** from allowed config paths:
- `engine_args.pipeline_parallel_size` (PP)
- `engine_args.data_parallel_size` (DP)
- `engine_args.decode_context_parallel_size` (dcp)
- `engine_args.prefill_context_parallel_size` (pcp)
- `engine_args.enable_dbo` (dual batch overlap)
- `engine_args.ubatch_size`
- `engine_args.dbo_decode_token_threshold`
- `engine_args.dbo_prefill_token_threshold`
- `engine_args.all2all_backend`
For a Qwen3-235B MoE model on 8x H20 where **DP=2 with TP=4 is an obvious and important candidate**, excluding `data_parallel_size` from the allowed config paths is a severe search space truncation. The exploration policy instructs the agent to "vary major knobs by large factors early (tp/pp/dp/pcp/dcp)" — but the agent literally cannot set `dp` or `pcp` or `dcp` without violating the "use only these paths" rule.
**Fix:** Either expand the allowed config paths to cover all search space entries, or explicitly explain why certain axes are excluded (e.g., "PP is not tunable in this deployment because the orchestration layer controls it"), so the agent doesn't waste reasoning cycles on knobs it cannot actually set.
---
## 4. Contradictory Exploration Policies
The exploration policy contains two directly opposing directives:
> "Start with aggressive exploration: vary major knobs by large factors early..."
> "If a human baseline exists, start with small deltas from the baseline config before large jumps."
A human baseline exists. Both rules therefore apply simultaneously with opposite prescriptions. The agent must choose one implicitly, with no tiebreaker. In practice the agent picked the conservative path (running the baseline first) — but it gets no credit or guidance for this choice, and a different instantiation might go wide immediately. This produces non-deterministic, non-auditable behavior.
**Fix:** Add a clear precedence rule: "When a human baseline exists, always run it first to establish a score, then proceed with large-factor exploration on the most impactful axes. Small-delta refinement is reserved for the final convergence phase after at least N diverse runs."
---
## 5. `proposed_runs_successful` Counter Logic Is Semantically Broken
After the baseline run completes successfully, the updated state shows `proposed_runs_successful: 0`. This is because the baseline is apparently not counted as a "proposed" run (it gets `kind: baseline` and `proposal_eval.status: "no_baseline"`). But the agent is never told this. The `run_trace_replayer` requirement says "run_trace_replayer requires hardware, model, and workload profiles" — all of which are now satisfied — yet the counter that represents the agent's sense of "how much has been accomplished" is stuck at zero.
This corrupts the agent's self-model. When it reads `proposed_runs_successful: 0`, it may incorrectly infer it hasn't yet run anything meaningful, affecting its decisions about when to stop, when to validate on full trace, and when to be confident in its analysis.
**Fix:** Clearly define what increments `proposed_runs_successful` (only non-baseline runs? only runs that beat the current best?), and provide a separate `baseline_runs_completed` counter. Alternatively, rename it to `agent_proposed_runs_successful` and document explicitly that the baseline doesn't count.
---
## 6. SLO Formula Has an Ambiguous and Apparently Inconsistent Variable `L`
The SLO is stated as: `TTFT p95 <= 0.001 * L + 1.0 s`. The variable `L` is never defined in the prompt. From context it seems to be prompt length, but:
- Is `L` per-request (computed dynamically per request), or the workload-level p95 prompt token count?
- The workload profile shows `p95_prompt_tokens = 7697`. If `L = 7697`, then `SLO = 0.001 * 7697 + 1.0 = 8.697 s`.
- But the baseline result shows `goodput_search.ttft_slo: 60.0`, implying L ≈ 59,000 tokens — nowhere near the p95 prompt length.
This is a critical inconsistency. The agent has no way to derive the correct SLO threshold, verify whether the 60-second figure is right, or reason about how close specific requests are to the SLO boundary. The `ttft_slo_pass_rate: 0.9604` in the baseline is close to the 0.95 threshold — small errors in SLO interpretation could lead the agent to wrongly classify configs as passing or failing.
**Fix:** Define `L` explicitly (e.g., "L is the input prompt token count for each individual request, evaluated per-request"), show a worked example with the trace's actual distribution, and explain how the binary search selects the `time_scale` that maximizes load while keeping pass_rate ≥ 0.95.
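A worked example of the per-request interpretation suggested here: `SLO(r) = 0.001 * L_r + 1.0 s` with `L_r` the prompt token count of request `r`, and the 0.95 gate evaluated over all requests. This reading is an assumption, and all numbers are illustrative:

```python
# Per-request TTFT SLO under the assumed interpretation of L.

def ttft_slo_seconds(prompt_tokens: int) -> float:
    return 0.001 * prompt_tokens + 1.0

def ttft_pass_rate(requests) -> float:
    """requests: iterable of (prompt_tokens, observed_ttft_seconds) pairs."""
    requests = list(requests)
    passed = sum(1 for L, ttft in requests if ttft <= ttft_slo_seconds(L))
    return passed / len(requests)

# The trace's p95 prompt (7697 tokens) gets a 0.001*7697 + 1.0 = 8.697 s
# budget under this reading, nowhere near the reported ttft_slo of 60 s.
```

Spelling this out in the prompt would let the agent cross-check the 60-second figure against the trace instead of taking it on faith.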
---
## 7. `time_scale` Semantics Are Inverted and Underdocumented
The user task says `Default time_scale=2.0`. The agent calls `profile_workload` with `time_scale=2.0`. The exploration policy says "time_scale is controlled by the tuner's binary search; do not set benchmark_overrides.time_scale." Yet the baseline result shows `time_scale: 0.5859375` and `goodput_search.selected_scale: 0.5859375` — far from 2.0 and going in the opposite direction.
The relationship between the task-level `time_scale` parameter (which appears to compress timestamps to speed up replay) and the goodput-search `selected_scale` (which appears to be the binary search variable that scales arrival rate to find the saturation point) is never explained. The agent is left with two `time_scale` values with no explanation of how they relate. This makes it impossible for the agent to reason correctly about workload intensity, request rate scaling, and what the goodput binary search is actually doing.
**Fix:** Add a dedicated section explaining the two-level time scaling architecture: (1) the fixed trace preprocessing `time_scale` (set once, compresses the original trace timeline), and (2) the binary-search `selected_scale` (runtime scaling factor applied during goodput search to find the maximum sustainable load). Show the math connecting `goodput_qps = overall_qps * selected_scale` and how they compound.
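A sketch of the two-level reading proposed here. This is an assumption about the tuner's semantics, not documented behavior: a fixed preprocessing `time_scale` compresses the trace once, then goodput search binary-searches a runtime `selected_scale` for the largest sustainable load:

```python
# Hypothetical model of the goodput search loop; pass_rate_at is a stand-in
# for "replay the trace at this scale and measure the TTFT-SLO pass rate".

def goodput_qps(overall_qps: float, selected_scale: float) -> float:
    return overall_qps * selected_scale  # the compounding step to document

def goodput_binary_search(pass_rate_at, lo=0.0, hi=4.0, iters=20, target=0.95):
    """Return the largest scale whose TTFT-SLO pass rate still meets target."""
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2
        if pass_rate_at(mid) >= target:
            best, lo = mid, mid   # load sustainable at mid; push higher
        else:
            hi = mid              # SLO violated; back off
    return best
```

Under this model the baseline's `selected_scale: 0.5859375` would simply mean the search backed off below 1.0 to stay under the pass-rate gate; the prompt should confirm or correct this reading.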
---
## 8. No Guidance for the Most Obvious High-Value Candidate: DP=2
The system has 8 GPUs. The baseline uses TP=4. This means 4 GPUs are idle from the perspective of model parallelism — the natural candidate is DP=2 (two replicas of TP=4), which doubles throughput capacity at the cost of zero per-GPU memory change. For a prefill-only workload with average 1875 tokens and p95 of 7697 tokens, DP=2 is likely the single highest-impact knob.
Neither the system prompt, exploration policy, knob guidance, nor bottleneck metrics hint at this. The agent has to derive it entirely from first principles — and may not, especially since `data_parallel_size` is absent from the allowed config paths (see issue 3).
**Fix:** Add DP to the allowed config paths. Add a note in the knob guidance: "For this system with `num_gpus=8` and `tensor_parallel_size=4`, `data_parallel_size=2` is feasible and typically the first candidate to explore for throughput maximization."
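The layout arithmetic behind this claim is trivial to check mechanically (illustrative helper, not a real tool): `dp * tp * pp` must exactly fill the GPU count.

```python
# Sanity check for the parallelism layout: how many replicas fit?

def feasible_dp(num_gpus: int, tp: int, pp: int = 1) -> int:
    """Largest data_parallel_size such that dp * tp * pp == num_gpus."""
    assert num_gpus % (tp * pp) == 0, "tp * pp must divide num_gpus"
    return num_gpus // (tp * pp)

# 8x H20 with the baseline's tp=4 leaves room for exactly two replicas.
```

Embedding even this one-liner of guidance in the knob section would save the agent from having to rediscover the dp=2 layout from first principles.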
---
## 9. Non-Tunable Fields in Baseline `config` Arg Create Schema Confusion
The agent submits a `run_trace_replayer` call with a `config` that includes fields outside the allowed paths: `swap_space`, `max_model_len`, `kv_transfer_config`, `hf_overrides`, `kv_cache_dtype`, `enforce_eager`, `quantization`, `disable_hybrid_kv_cache_manager`, `enable_prefix_caching`, and dozens of non-tunable env vars. The rules say "Only set allowed config paths in tool args.config" — but for the baseline, it's natural to send the full config.
The rules don't clarify whether the baseline run is exempt from the allowed-paths restriction, whether extra fields are silently ignored or cause errors, or how to handle the baseline's non-tunable fields. This creates ambiguity that could cause later runs to fail if the agent continues sending non-tunable fields.
**Fix:** Add explicit instruction: "For the baseline run only, you may submit the complete baseline config. For all subsequent runs, use only the allowed config paths as a partial patch; the system will inherit unspecified fields from the previous successful run."
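The partial-patch semantics could be made concrete along these lines. The path set and merge helper below are hypothetical, meant only to show the shape of the rule:

```python
# Illustrative partial-patch merge: subsequent runs set only allowed paths;
# everything else inherits from the previous successful config.

ALLOWED_PATHS = {"engine_args.block_size", "envs.VLLM_MOE_USE_DEEPEP"}

def apply_patch(previous_config: dict, patch: dict) -> dict:
    merged = {section: dict(body) if isinstance(body, dict) else body
              for section, body in previous_config.items()}
    for path, value in patch.items():
        if path not in ALLOWED_PATHS:
            raise ValueError(path + " is not an allowed config path")
        section, key = path.split(".", 1)
        merged.setdefault(section, {})[key] = value  # unspecified fields inherit
    return merged
```

With the rule stated this way, the baseline exemption becomes explicit: the full config is submitted exactly once, and every later `config` arg is a validated patch.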
---
## 10. Workload Duration Mismatch Is Not Flagged or Explained
The task specifies `trace_minutes=10.0` as the default. The `profile_workload` call requested `trace_minutes=10.0`. But the result shows `duration_minutes: 4.999375` — the trace only contains about 5 minutes of data. This means the trace is shorter than the requested window, and the agent is silently running on a truncated workload. The constraint `max_trace_minutes=10.0` is nominally satisfied (5 < 10), but the agent has no way to know whether 5 minutes of data is sufficient for stable goodput estimates, whether the tail of the trace is being dropped, or whether using `sample_policy=head` vs `tail` would matter given the actual 5-minute duration.
**Fix:** When `profile_workload` returns a `duration_minutes` less than the requested `trace_minutes`, the system should emit an explicit warning and guidance: "Trace is shorter than requested window. Effective duration is X minutes. Consider whether sample policies are still meaningful and whether min_requests constraint is satisfied."
---
## 11. No Explicit Stopping Criterion or Run Budget
The prompt has no defined run budget (e.g., `max_runs=20`), no convergence threshold (e.g., "stop if no improvement > 2% in last 5 runs"), and no explicit timeout. The instruction "Before stopping, validate top configs on the full configured trace_minutes" is a stopping prerequisite but not a stopping trigger. The agent has no principled termination condition, which leads to either premature termination (being conservative about costs) or open-ended exploration.
**Fix:** Add explicit budget parameters: `max_proposed_runs`, `convergence_patience` (stop if best score unchanged for N runs), and `validation_runs_required` (how many top-K configs to validate before issuing `action=final`).
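A minimal sketch of how those parameters could combine into a termination rule. The parameter names come from the fix above; the way they interact is my assumption:

```python
# Hypothetical stopping policy: stop on budget exhaustion or on convergence
# (no meaningful gain over the recent patience window).

def should_stop(best_scores, max_proposed_runs=20,
                convergence_patience=5, min_rel_gain=0.02):
    """best_scores: best-so-far score after each proposed run (higher is better)."""
    if len(best_scores) >= max_proposed_runs:
        return True  # run budget exhausted
    if len(best_scores) > convergence_patience:
        ref = best_scores[-convergence_patience - 1]
        recent = max(best_scores[-convergence_patience:])
        return recent < ref * (1 + min_rel_gain)  # no >2% gain in last N runs
    return False
```

The "validate top configs on the full trace" step would then hang off this trigger: once `should_stop` fires, the agent spends its remaining validation budget before issuing `action=final`.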
---
## Summary Table
|#|Category|Severity|Impact|
|---|---|---|---|
|1|block_size constraint vs. baseline contradiction|Critical|Agent cannot trust constraint table|
|2|Diagnostic metrics all zero, O→I→A unfulfillable|Critical|Core reasoning chain broken|
|3|Search space vs. allowed paths gap (DP, PP, dcp, pcp, DBO missing)|High|Major axes of optimization unreachable|
|4|Contradictory exploration policies|High|Non-deterministic agent behavior|
|5|`proposed_runs_successful` semantics broken|Medium|Corrupted agent self-state|
|6|SLO formula: L undefined, ttft_slo=60s inconsistency|High|SLO pass/fail classification unreliable|
|7|time_scale two-level semantics undocumented|High|Goodput reasoning is opaque|
|8|No DP=2 guidance for 8-GPU system|High|Most impactful knob unexplored|
|9|Non-tunable fields in config arg, no baseline exemption rule|Medium|Schema confusion in future runs|
|10|Workload duration 5min vs 10min request, silent truncation|Medium|Unstable goodput estimates possible|
|11|No stopping criterion or run budget|Medium|Open-ended or premature termination|
The deepest structural problem is the combination of issues 2 and 3: the agent is required to build its reasoning chain on bottleneck signals that are never populated, using configuration knobs that are not in the allowed paths. Everything else is fixable with documentation clarifications, but these two require either infrastructure changes (enabling diagnostic collection, expanding allowed paths) or explicit fallback policies. |