
AITuner optimization suggestions from the large models:

GPT

This prompt is already much stronger than most "let the LLM tune parameters directly" approaches: it at least provides the objective, constraints, baseline, search space, tools, output schema, recent state, and a ledger placeholder. But if the goal is a genuinely good, genuinely deep agentic AITuner system, its biggest shortcoming is not simply "too little information", but rather:

  1. The context hierarchy is not "causal" enough; it reads like an oversized task description rather than a state representation that supports systematic reasoning.
  2. Several key pieces of context are missing, so the agent can hardly do truly bottleneck-driven tuning.
  3. The prompt contains multiple internal conflicts/inconsistencies that will teach the agent bad behavior, or even directly produce illegal proposals.
  4. The structure of "experimental science" is not made explicit: hypotheses, evidence, attribution, confidence, re-validation conditions, and transferable experience are not first-class citizens.

The analysis below proceeds layer by layer.


1. An overall judgment first: this prompt is currently a "tool-using search agent", not yet a "research-grade tuning agent"

What it supports today:

  • profile hardware / model / workload first
  • then run benchmarks
  • keep tuning based on recent runs
  • finally emit a recommended config

This is already a workable optimization loop. But a deeper agentic AITuner is not just "can call tools + can turn a few knobs"; it should have four layers of capability:

  • State abstraction: abstract the system state into a bottleneck state that can be reasoned about, rather than scattered metrics.
  • Hypothesis management: every experiment carries an explicit hypothesis, and beliefs are updated afterwards.
  • Search control: knowing when to explore, when to exploit, when to stop, and when to run an ablation / validation / re-test.
  • Transfer / memory: carrying experience from this trace, this model, and this class of hardware into the next round of decisions.

Your prompt currently covers a small slice of the first layer and part of the third; the second and fourth layers have essentially not been built.


2. Biggest problem #1: the prompt provides "configuration context" but lacks "system state context"

This is the most central problem.

The most information-dense parts of the prompt today are:

  • baseline config
  • tunable config
  • search space docs
  • tools
  • constraints
  • recent runs

But what a tuning agent really needs is not a "parameter manual"; it is what regime the system is currently in. Put differently: the LLM is not short on knobs, it is short on state.

2.1 Missing an end-to-end bottleneck decomposition

The bottleneck metrics list currently provided is:

  • queue ratio
  • waiting / running
  • kv cache usage
  • prompt tokens/s
  • generation tokens/s
  • preemption rate
  • prefill/decode ratio
  • request latency

These help, but they are far from enough, because they do not decompose the bottleneck into which class of resource is limiting goodput:

  • GPU compute bound
  • HBM / KV capacity bound
  • scheduler / batching bound
  • MoE communication bound
  • host-side launch / Python / runtime overhead bound
  • network bound
  • tail-latency / long-prompt dominated bound
  • admission-control / queueing dominated bound

In other words, the agent sees "metrics", not a "structured bottleneck hypothesis".

A better approach is to provide a post-processed DerivedSystemState directly in the prompt, for example:

{
  "derived_state": {
    "service_regime": "prefill_dominated",
    "likely_primary_bottleneck": "prefill_compute_or_batching",
    "likely_secondary_bottleneck": "moe_comm_disabled_but_model_is_moe",
    "queueing_pressure": "high",
    "kv_capacity_pressure": "low",
    "slo_headroom": "moderate",
    "parallelism_headroom": {
      "tp_up_possible": false,
      "dp_up_possible": true,
      "ep_candidate": true
    },
    "confidence": 0.63
  }
}

This greatly improves the LLM's reasoning quality, because LLMs excel at semantic, mid-level abstractions and are bad at reliably inducing the system regime from dozens of raw metrics on their own. A sketch of such a post-processing step follows.
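
As an illustration, a minimal post-processing sketch that maps raw aggregates to the derived state above. All metric key names, thresholds, and the confidence rule are assumptions for illustration, not part of the original prompt:

```python
# Hypothetical sketch: derive a mid-level system state from raw aggregate
# metrics. Thresholds and metric key names are illustrative assumptions.

def derive_system_state(m: dict) -> dict:
    total_tps = m["prompt_tokens_per_s"] + m["generation_tokens_per_s"]
    prefill_share = m["prompt_tokens_per_s"] / max(total_tps, 1e-9)

    return {
        "derived_state": {
            "service_regime": "prefill_dominated" if prefill_share > 0.8 else "mixed",
            "queueing_pressure": "high" if m["queue_ratio"] > 0.5 else "low",
            "kv_capacity_pressure": "high" if m["kv_cache_usage"] > 0.9 else "low",
            # Confidence should collapse when diagnostic samples are missing.
            "confidence": 0.6 if m.get("diag_engine_sample_count", 0) > 0 else 0.2,
        }
    }
```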


2.2 Missing the key metrics for resource utilization and bottleneck localization

Currently absent:

  • GPU SM utilization / active cycles
  • HBM bandwidth utilization
  • NVLink / PCIe / NIC utilization
  • MoE all2all latency / bytes / imbalance
  • CPU utilization / launcher overhead
  • scheduler idle gaps / step interval jitter
  • CUDA graph hit rate / replay hit rate
  • kernel-level top hotspots
  • batch composition statistics (per-step seq count / token count / prefill share)

For a serious agentic AITuner, this information is crucial.

For example, the workload is clearly:

  • avg decode = 1
  • p95 decode = 1
  • prefill-only SLO
  • avg prompt 1875, p95 7697

This already strongly suggests a prefill-heavy / long-prompt-dominated scenario. Yet the agent still cannot distinguish among the following, completely different realities:

  1. GPU prefill matmuls are already close to saturation
  2. batches cannot be filled, so the GPU often sits idle
  3. TTFT is dominated by queueing and scheduling delay
  4. the long-prompt chunking strategy adds extra splitting overhead
  5. MoE expert dispatch is expensive but never explicitly observed
  6. KV is not the bottleneck, but block fragmentation affects admission

Without lower-level profiling state, the agent can often only guess knobs from experience.


2.3 Missing the "config → mechanism → metric" causal bridge

The search space table says "what this knob controls", but it gives the agent no sufficiently explicit mechanism-level explanation. For example:

  • enable_expert_parallel changes MoE layer sharding and the communication pattern
  • block_size changes KV allocation granularity / fragmentation / paging overhead
  • max_num_batched_tokens changes batching aggressiveness and the TTFT / throughput tradeoff
  • max_num_seqs changes scheduler occupancy / fairness / tail latency
  • cuda_graph_sizes / compilation_config change launch overhead / graph capture coverage

These are currently only weakly described in natural language. A deeper agent needs the prompt to state explicit causal priors, for example:

{
  "knob_priors": {
    "engine_args.enable_expert_parallel": {
      "helps_when": [
        "is_moe=true",
        "moe_comm_or_expert_sharding_is_bottleneck",
        "tp-only causes poor expert locality or scaling"
      ],
      "hurts_when": [
        "interconnect is weak",
        "batch is too small",
        "all2all cost dominates"
      ],
      "observable_signals": [
        "high prefill time with moe model",
        "poor scaling from tp increase",
        "queue pressure while GPU util not saturated"
      ]
    }
  }
}

This lets the agent do more than "try EP": it knows why EP is the right thing to try now.


3. Biggest problem #2: the prompt is missing much of the context that actually drives tuning decisions

The following gaps significantly limit the agent's depth.

3.1 Missing the baseline's full "experimental semantics"

You provide the baseline config, but not the baseline's provenance or semantic label:

  • Is it a hand-tuned expert optimum?
  • The production default?
  • Some historical best?
  • A safe, conservative config?
  • The best point for some particular workload class?

This matters, because how the agent should use the baseline depends on the baseline's identity.

For example:

  • If the baseline is a conservative steady-state production config, the agent should explore more aggressively.
  • If the baseline was expert-tuned for this class of trace, the agent should prefer small deltas and local ablations.
  • If the baseline is the optimum for a different workload class, the agent should down-weight its prior.

Suggested addition to the prompt:

{
  "baseline_meta": {
    "source": "human_expert",
    "intended_workload_class": "prefill-heavy-long-context-moe",
    "trust_level": "high",
    "optimization_goal": "goodput_under_ttft_slo",
    "known_limitations": ["not tuned for H20 PCIe topology"]
  }
}

3.2 Missing finer-grained workload shape statistics

The current workload profile is too coarse; it gives only:

  • qps
  • avg/p95 prompt
  • avg/p95 decode
  • p99 total tokens

Not enough. The workload state the agent really needs should at least include:

  • prompt length histogram / bucket distribution
  • arrival burstiness (CV, p95 inter-arrival gap, burst windows)
  • request size correlation (do long prompts arrive in clusters?)
  • time-local nonstationarity (do the first and second halves of the trace differ?)
  • percent of ultra-long prompts
  • prefix reuse potential
  • effective concurrency envelope under the time_scale search
  • head/tail window differences

Many tuning strategies depend heavily on workload shape. For example:

  • If a handful of ultra-long prompts determine p95 TTFT, raising max_num_batched_tokens may worsen the tail.
  • If the workload is bursty, max_num_seqs and the queueing policy matter a lot.
  • If prefix reuse is high, prefix caching is valuable; with almost no reuse, it should not be treated as an important factor.

Suggested upgrade of the profile_workload output:

{
  "prompt_len_buckets": {"0-512": 0.12, "512-2k": 0.44, "2k-8k": 0.36, "8k+": 0.08},
  "interarrival_cv": 1.87,
  "burstiness_score": 0.74,
  "prefix_reuse_estimate": 0.03,
  "long_prompt_tail_share": 0.11,
  "stationarity": {
    "head_vs_tail_qps_delta_pct": 22.0,
    "head_vs_tail_prompt_p95_delta_pct": 35.0
  }
}
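
A minimal sketch of how a few of these statistics could be computed from raw arrival timestamps and prompt lengths. The bucket edges and the tail-share definition are assumptions; the field names mirror the JSON above:

```python
import statistics

# Hypothetical sketch: compute workload-shape statistics from raw trace data.

def workload_shape(arrivals: list[float], prompt_lens: list[int]) -> dict:
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_gap = statistics.mean(gaps) if gaps else 0.0
    cv = statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0

    buckets = {"0-512": 0, "512-2k": 0, "2k-8k": 0, "8k+": 0}
    for n in prompt_lens:
        key = ("0-512" if n < 512 else
               "512-2k" if n < 2048 else
               "2k-8k" if n < 8192 else "8k+")
        buckets[key] += 1

    return {
        "prompt_len_buckets": {k: v / len(prompt_lens) for k, v in buckets.items()},
        "interarrival_cv": cv,
        # Assumed definition: share of all tokens contributed by 8k+ prompts.
        "long_prompt_tail_share": sum(n for n in prompt_lens if n >= 8192)
                                  / max(sum(prompt_lens), 1),
    }
```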

3.3 Missing the model internals needed for model-serving co-design

Already present:

  • is_moe
  • num_experts
  • num_experts_per_tok
  • num_kv_heads
  • layers
  • heads

But much of the model structure that matters for serving decisions is still missing:

  • per-layer expert hidden dims / FFN expansion
  • estimated split between attention FLOPs and MoE FLOPs
  • parameters per layer / expert size
  • activation memory estimate per token
  • prefill vs decode compute ratio estimate
  • router imbalance risk / top-k gating characteristics
  • KV cache bytes per token per TP config
  • theoretical memory & comm model under TP / EP / DP

A research-grade agent should see more than "this is MoE"; it should see:

  • how compute / comm / memory theoretically shift for this model under a given TP/EP combination
  • which parallelism is most likely to improve goodput
  • which class of cost each knob change roughly touches

Suggested explicit addition, an AnalyticalProfile:

{
  "analytical_profile": {
    "estimated_prefill_flops_per_token": ...,
    "estimated_decode_flops_per_token": ...,
    "estimated_kv_bytes_per_token_per_gpu": {
      "tp1": ...,
      "tp2": ...,
      "tp4": ...,
      "tp8": ...
    },
    "moe_dispatch_cost_risk": "high",
    "ep_candidate_strength": "strong",
    "tp_scaling_limit_hint": "pcie_interconnect"
  }
}

Analytical priors like this matter: they lift the agent from "pure empiricism" to a "model + experiment hybrid". A sketch of one such derivation follows.
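
For instance, the KV-bytes-per-token entries could come from the standard closed-form estimate below; head_dim and the fp16 cache dtype are assumptions, and KV heads replicate rather than shard once tp exceeds num_kv_heads:

```python
# Hypothetical sketch: analytical KV-cache footprint per token per GPU,
# using the standard 2 (K and V) * layers * kv_heads * head_dim * bytes formula.

def kv_bytes_per_token_per_gpu(layers: int, num_kv_heads: int, head_dim: int,
                               tp: int, dtype_bytes: int = 2) -> float:
    kv_heads_per_gpu = max(num_kv_heads / tp, 1)  # replicated when tp > kv heads
    return 2 * layers * kv_heads_per_gpu * head_dim * dtype_bytes

# Example with assumed shapes: 94 layers, 4 KV heads, head_dim 128, fp16 cache.
print({f"tp{tp}": kv_bytes_per_token_per_gpu(94, 4, 128, tp) for tp in (1, 2, 4, 8)})
```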


3.4 Missing the full semantics and local dependencies of the action space

The allowed config paths are quite limited, which is fine in itself; the problem is that the agent does not know:

  • which knobs are high-impact primary knobs
  • which are conditional knobs
  • which combinations must move together
  • which changes should be made jointly, and which deserve single-variable ablation
  • what the recommended step size for a given knob is

For example:

  • when enable_expert_parallel=true, all2all_backend should normally be set alongside it
  • yet the allowed config paths do not include engine_args.all2all_backend
  • while the prompt text keeps mentioning it

This directly breaks the semantics of the agent's action space: it knows the knob matters, yet cannot set it.

This is a textbook case of "prompt context incomplete / inconsistent".


4. Biggest problem #3: the prompt contradicts itself in several places, which will seriously pollute the agent's decisions

This point is critical. Worse than "giving a bit less information" is "giving conflicting information".

Your example contains at least the following hard flaws.


4.1 The block_size constraint directly conflicts with the baseline

The global rules say:

  • CUDA only supports block_size <= 32
  • the search space enum is {16, 32}
  • the knob guidance also says never use 8

Yet the baseline config contains:

"block_size": 64

And the baseline run actually completed with 64.

This creates enormous confusion for the agent:

  • Is 64 an illegal value, or is the documentation stale?
  • If the baseline succeeded with 64, is the rule wrong?
  • If the rule is a hard constraint, why does the system itself allow it?
  • Should the agent keep treating 64 as a legal candidate afterwards?

This is not a minor issue. Constraint inconsistency directly destroys the LLM's rule-following reliability.

All three of these must be kept fully consistent:

  • the documented constraints
  • the allowed config paths / domains
  • the actual running system

If the system really supports 64, the prompt must be updated; if 64 is just a legacy artifact, the baseline needs to be normalized and rewritten.


4.2 O→I→A is mandatory, yet the first round has no bottleneck metrics

You define a strict requirement:

Every run_trace_replayer proposal.selection_logic MUST follow exactly this structure: Observation: cite at least 3 numeric metrics from recent bottleneck signals.

But before the first run, the state explicitly says:

  • recent runs: none
  • bottleneck signals: none

This means the first baseline/seed run cannot, by the rules, construct a legal proposal at all.

And indeed the agent did not follow the requirement; it emitted:

  • selection_logic: "human_baseline"

So the prompt's rule is unsatisfiable at cold start.

The fix is not "make the model smarter"; the protocol must be written in two stages:

  • Stage 0 / bootstrap: profiling runs / a baseline run without OIA are allowed
  • Stage 1+: OIA must be constructed from recent numeric evidence

For example:

Cold-start exception:
- If there is no successful benchmark run yet, you may execute one bootstrap run
  without OIA evidence.
- After at least one successful run with diagnostics, all future runs must use OIA.

4.3 run_trace_replayer requires hardware, model, and workload profiles, but never requires "diagnostics availability"

Subsequent decisions depend heavily on bottleneck metrics, yet the baseline result shows:

  • diag_engine_sample_count = 0
  • diag_resource_sample_count = 0

In other words, the benchmark artifact contains no diagnostic samples at all, which keeps the OIA requirement unsatisfiable downstream.

A more robust prompt must handle explicitly:

  • What should the agent do when diagnostics are missing?
  • Is a fallback to aggregated metrics allowed?
  • Should a diagnostics-mode run be triggered?
  • Should the confidence of conclusions be lowered?

Suggested addition:

If diagnostic metrics are unavailable or sample count is 0:
- mark bottleneck confidence as low;
- do not claim a specific bottleneck unless supported by aggregate evidence;
- prioritize runs that improve observability or isolate hypotheses.

4.4 all2all_backend is mentioned repeatedly yet absent from the allowed config paths

The text repeatedly stresses:

  • EP should be paired with all2all_backend
  • the backend choices include deepep_high_throughput / low_latency, etc.

But the allowed config paths do not include:

  • engine_args.all2all_backend

This leaves the agent half-crippled: it knows the knob should be tuned but cannot actually tune it.

That directly weakens MoE/EP tuning capability, especially since this model is MoE and the prompt explicitly emphasizes that fact.


4.5 The search space lists TP/PP/DP/PCP/DCP, but the allowed paths only expose TP/DP

The exploration policy says:

  • vary major knobs by large factors early (tp/pp/dp/pcp/dcp, ...)

But the allowed config paths do not include:

  • pipeline_parallel_size
  • prefill_context_parallel_size
  • decode_context_parallel_size

This manufactures a "fake action space":

  • the system prompt tells the agent these are major knobs
  • the actual action schema forbids setting them

This kind of inconsistency hurts badly: the LLM will burn token budget reasoning about a dimension it cannot control at all.


4.6 The kind documentation and the actual call disagree

You define:

  • kind: seed | iter (optional)

But the agent's actual call used:

  • "kind": "baseline"

If the real executor accepts baseline, the schema is wrong; if the schema is right, the agent is in violation.

Schema drift like this prevents the agent from learning a stable protocol.


4.7 "keep trace_minutes fixed" vs "before stopping, validate top configs on full configured trace_minutes" is ambiguous

You say:

  • Keep trace_minutes and max_requests fixed to configured values; do not override
  • Before stopping, validate top configs on the full configured trace_minutes

But the configured trace_minutes is already 10.0, while the profile_workload result reports duration_minutes of roughly 5.0, and the run also performs a binary search on time_scale.

This leaves the agent unsure about:

  • whether the workload profile it sees has already been transformed by filters/time_scale
  • whether the baseline's goodput was measured on the full trace or at some effective scale
  • how "full configured trace_minutes" relates to the "binary-searched time_scale"

Recommended explicit distinctions:

  • Trace window: a fixed 10-minute head window
  • Arrival scaling: load is varied by a system-controlled binary search
  • Evaluation protocol: all configs are compared on the same window; goodput comes from the system's time_scale search

Otherwise the agent easily misjudges the baseline's comparability.


5. Biggest problem #4: the prompt does not yet explicitly support a "scientific experimental process"

A sufficiently deep agentic AITuner should not just "run a config → see if it's good". It should maintain experimental logic the way a systems researcher does.

The following key objects are missing.


5.1 缺少“假设对象”

每次 run 目前只有:

  • selection_logic
  • why_better
  • expected_improvements

这还不够。 应该让每次实验都携带一个显式 hypothesis例如

{
  "hypothesis": {
    "id": "H-ep-prefill-moe-001",
    "statement": "For this prefill-heavy MoE workload on 8xH20 PCIe, enabling EP with a high-throughput all2all backend will improve goodput more than increasing TP, because MoE sharding and communication pattern dominate prefill cost.",
    "depends_on": ["model.is_moe", "decode_tokens≈1", "queue_pressure_high"],
    "falsifiable_by": [
      "goodput gain < 3%",
      "ttft p95 worsens > 5%",
      "queue ratio unchanged while engine prompt throughput decreases"
    ]
  }
}

Only then can the agent do genuine belief updates instead of trial-and-error.


5.2 缺少“实验设计类型”

不是所有 run 都是同一种 run。 你需要区分:

  • bootstrap baseline
  • broad exploration
  • local refinement
  • ablation
  • confirmation / re-test
  • transfer test
  • robustness validation

If the prompt does not distinguish these, the agent tends to:

  • exploit too early
  • test similar configs repeatedly
  • mistake noisy improvements for real ones
  • not know when re-validation is due

Recommendation: give every run an explicit experiment_type.


5.3 Missing "result confidence" and a "noise model"

A goodput benchmark is not a deterministic oracle. The agent is currently never told:

  • the typical run-to-run variance for the same config
  • whether cache hits / warmup affect results
  • what magnitude of improvement counts as significant
  • when a re-run is needed for confirmation

So the agent will take a 1-2% gain at face value.

Suggested prompt addition:

{
  "evaluation_policy": {
    "min_effect_size_pct_for_claim": 3.0,
    "rerun_if_gain_between_pct": [1.0, 5.0],
    "rerun_if_slo_margin_small": true,
    "prefer_ablation_for_multi_knob_gain": true
  }
}
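
A minimal sketch of how such a policy could gate claims; the thresholds mirror the JSON above, and the function shape is an assumption:

```python
# Hypothetical sketch: apply the evaluation policy above to one measured gain.

def assess_gain(gain_pct: float, slo_margin_small: bool, policy: dict) -> str:
    lo, hi = policy["rerun_if_gain_between_pct"]
    if policy["rerun_if_slo_margin_small"] and slo_margin_small:
        return "rerun"  # near the SLO boundary: confirm before claiming anything
    if lo <= gain_pct <= hi:
        return "rerun"  # inside the noise band: needs confirmation
    if gain_pct >= policy["min_effect_size_pct_for_claim"]:
        return "claim_improvement"
    return "no_claim"

policy = {
    "min_effect_size_pct_for_claim": 3.0,
    "rerun_if_gain_between_pct": [1.0, 5.0],
    "rerun_if_slo_margin_small": True,
}
print(assess_gain(4.2, slo_margin_small=False, policy=policy))  # -> rerun
```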

5.4 Missing a structured representation of the "derived memory / exploration ledger"

You already noticed this yourself earlier; it maps directly onto this point.

The current prompt has only:

  • failed config ledger: none
  • recent runs: ...
  • current baseline values ...

Far from enough. A genuinely useful ledger is not just a "failed config list"; at minimum it needs:

  • tried configs
  • outcome
  • dominant observed signals
  • inferred cause
  • confidence
  • neighborhood summary
  • equivalence / redundancy relation
  • what has been falsified

For example:

{
  "exploration_ledger": {
    "tested_regions": [
      {
        "region_id": "R1",
        "predicate": "tp=4, ep=false, block_size in {32,64}, max_num_batched_tokens in [4k,12k]",
        "best_run": "run_07",
        "finding": "batching beyond 8k increases tail TTFT faster than throughput",
        "confidence": 0.71
      }
    ],
    "falsified_hypotheses": [
      {
        "hypothesis_id": "H-tp-up-001",
        "reason": "tp increase reduced effective goodput under PCIe topology"
      }
    ],
    "best_known_config": {...},
    "redundant_proposals_to_avoid": [...]
  }
}

This turns the agent from "short-term-memory search" into an "explicit research process". A redundancy-check sketch built on such a ledger follows.
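
As an illustration of the payoff, a minimal redundancy check against tested regions; encoding regions as per-knob domains instead of the free-text "predicate" string above is an assumption:

```python
# Hypothetical sketch: reject proposals that land in an already-tested region.

def in_region(proposal: dict, region: dict) -> bool:
    for knob, domain in region.items():
        value = proposal.get(knob)
        if isinstance(domain, tuple):              # numeric interval (lo, hi)
            lo, hi = domain
            if value is None or not lo <= value <= hi:
                return False
        elif value not in domain:                  # enumerated set of values
            return False
    return True

tested_regions = [
    {"tp": {4}, "ep": {False}, "block_size": {32, 64},
     "max_num_batched_tokens": (4096, 12288)},
]

proposal = {"tp": 4, "ep": False, "block_size": 32, "max_num_batched_tokens": 8192}
if any(in_region(proposal, r) for r in tested_regions):
    print("redundant: region already explored, prefer a new hypothesis")
```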


6. What this concrete example already exposes about the prompt

The transcript you provided is itself quite telling.


6.1 In round one the agent only replayed the baseline, without using the workload/model/hardware profiles to form initial hypotheses

After the profiles came back, the system already held many strong signals:

  • 8x H20, PCIe topology
  • Qwen3-235B-A22B, MoE, 128 experts, top-8
  • num_kv_heads = 4
  • prefill-heavy, decode ≈ 1
  • very long prompts, p95 7697
  • the objective is TTFT-based goodput

A stronger agent should immediately form several high-value hypotheses here:

  • this is a prefill-dominated, long-prompt MoE scenario, so max_num_batched_tokens / max_num_seqs / EP deserve priority
  • decode-side knobs barely matter
  • under a PCIe topology, scaling TP up may not pay off, and EP's benefit depends on the comm backend
  • block size may interact with KV fragmentation, but with decode ≈ 1 the payoff is probably limited
  • chunked-prefill behavior should be critical

But the prompt never asks the agent to write these hypotheses out explicitly, so the agent just mechanically benchmarked the baseline.

That is not the agent being dumb; the prompt never asked it to "form a problem representation first".


6.2 The agent copied many non-allowed / unnecessary fields, showing the action schema is not clear enough

The run_trace_replayer config patch contained many fields that are not in the allowed config paths at all:

  • swap_space
  • enable_prefix_caching
  • disable_hybrid_kv_cache_manager
  • kv_transfer_config
  • hf_overrides
  • kv_cache_dtype
  • enforce_eager
  • quantization
  • a pile of env vars

This shows the prompt's action contract was never compressed into a minimal, clear, controllable proposal language; the agent is effectively replicating the full baseline config rather than operating only on the allowed knobs.

A better system should split the config explicitly into:

  • immutable baseline context
  • mutable tunable patch

and let the agent output only the patch (a merging sketch follows this list). Otherwise the LLM will happily copy the whole baseline, which leads to:

  • verbose actions
  • unnecessary drift creeping in
  • harder ledger deduplication and neighborhood reasoning
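
A minimal sketch of the immutable-baseline-plus-patch split, assuming dotted config paths; the whitelist contents are illustrative:

```python
from copy import deepcopy

# Hypothetical sketch: the agent emits only a whitelisted patch; the system
# merges it onto a frozen baseline. Dotted-path syntax is an assumption.

ALLOWED_PATHS = {
    "engine_args.tensor_parallel_size",
    "engine_args.max_num_batched_tokens",
    "engine_args.max_num_seqs",
    "engine_args.block_size",
}

def apply_patch(baseline: dict, patch: dict) -> dict:
    illegal = set(patch) - ALLOWED_PATHS
    if illegal:
        raise ValueError(f"patch touches non-allowed paths: {sorted(illegal)}")
    cfg = deepcopy(baseline)
    for path, value in patch.items():
        node = cfg
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return cfg

baseline = {"engine_args": {"tensor_parallel_size": 4, "block_size": 64}}
print(apply_patch(baseline, {"engine_args.block_size": 32}))
```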

6.3 The baseline result lacks diagnostics, and the prompt has no fallback strategy

This is one of the biggest operational flaws in the example.

The baseline returned:

  • diag_engine_sample_count = 0
  • diag_resource_sample_count = 0

meaning the agent can never obtain the numeric bottleneck evidence the prompt demands. If the system does not address this in the prompt, the agent can only:

  • break the rules and guess
  • or stall
  • or keep doing seat-of-the-pants search

The prompt lacks control flow for "observation failure / missing diagnostics".


7. How the prompt should be upgraded for a more mature agentic AITuner

The suggestions below are systematic, not just "add a few fields".


7.1 Upgrade the prompt from "task description" to "state machine input"

The current prompt is one big block of text plus the current state. Better: organize it into a few stable JSON sections:

{
  "task_spec": {...},
  "system_state": {...},
  "search_space": {...},
  "observability_state": {...},
  "memory_state": {...},
  "decision_policy": {...},
  "output_contract": {...}
}

This makes consistent parsing easier for the LLM and also helps later distillation/training.


7.2 Explicitly add a DerivedSystemState

This is one of the most important enhancements. Do not hand over only the raw profile; add derived state:

  • workload regime
  • bottleneck hypothesis candidates
  • resource pressure summary
  • SLO headroom
  • exploration stage
  • confidence

This effectively gives the LLM an "intermediate representation".


7.3 Require every run to carry a hypothesis and an experiment type

Suggested schema upgrade, from the current:

  • selection_logic
  • why_better
  • expected_improvements

to:

{
  "experiment_type": "bootstrap|explore|ablation|refine|confirm|validate",
  "hypothesis": {...},
  "proposal": {...},
  "expected_outcomes": {...},
  "stop_condition_update": {...}
}

Only then does the process resemble scientific experimentation rather than blind search.


7.4 Turn the memory / ledger from a "log" into a "reasoning object"

At minimum add these:

  • best_known_config
  • tested_regions
  • falsified_hypotheses
  • pending_hypotheses
  • config_equivalence_or_redundancy
  • recent_effect_estimates

This significantly reduces repeated experiments.


7.5 Add "knob interaction priors"

Priors like the following are well worth putting straight into the prompt:

  • prefill-heavy + long prompts → prioritize max_num_batched_tokens, chunked prefill, EP for MoE
  • decode≈1 → de-emphasize decode-oriented tuning
  • PCIe topology → be cautious with communication-heavy scaling
  • MoE + EP enabled → backend choice is first-order knob
  • high queue ratio + low running mean → scheduling/admission/batching issue
  • high KV usage + preemption / waiting → capacity / fragmentation issue

LLMs consume high-level priors like these very well.


7.6 Add a checker for "illegal / stale / conflicting rules"

Before the prompt reaches the agent, the system side should run rule sanitization:

  • does the baseline violate the current constraints?
  • is the search space consistent with the allowed paths?
  • are the docs stale?
  • is the output schema aligned with the tool schema?

Otherwise the prompt propagates the system's own inconsistencies straight into the LLM. A sanitizer sketch follows.
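
A minimal sanitizer sketch that catches exactly the classes of inconsistency discussed above (e.g., block_size=64 against an enum of {16, 32}); the constraint encoding and flattened path keys are assumptions:

```python
# Hypothetical sketch: pre-flight rule sanitization before prompt assembly.

def sanitize(baseline: dict, constraints: dict, allowed_paths: set,
             search_space: set) -> list[str]:
    issues = []
    for path, domain in constraints.items():
        value = baseline.get(path)
        if value is not None and value not in domain:
            issues.append(f"baseline violates constraint: {path}={value}, "
                          f"allowed {sorted(domain)}")
    for path in search_space - allowed_paths:
        issues.append(f"search-space knob not settable via allowed paths: {path}")
    return issues

issues = sanitize(
    baseline={"engine_args.block_size": 64},
    constraints={"engine_args.block_size": {16, 32}},
    allowed_paths={"engine_args.block_size"},
    search_space={"engine_args.block_size", "engine_args.all2all_backend"},
)
for issue in issues:
    print(issue)
```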


8. The current prompt's shortcomings, condensed into 8 core gaps

To make later paper / internal-memo writing easier, here is a more "research-style" summary.

Gap 1: no mid-level state abstraction

Mostly raw config + raw metrics today, with no derived bottleneck state.

Gap 2: no mechanism-level causal priors

Knob effects are described too shallowly; there is no config→mechanism→metric causal bridge.

Gap 3: no high-value observability

No GPU / comm / scheduler / graph / kernel diagnostics.

Gap 4: no structured hypothesis management

No explicit hypotheses, falsification conditions, or belief updates.

Gap 5: no experiment types or statistical confidence modeling

No explore/refine/confirm distinction, and no effect-size / rerun policy.

Gap 6: no long-term memory or region-level exploration summaries

Only recent runs; no tested regions / falsified directions.

Gap 7: action space inconsistent with the search space

The docs say a knob is tunable; the allowed paths say it is not.

Gap 8: self-contradictory constraints / protocol

block_size, the OIA requirement, the kind schema, and diagnostics availability all conflict.


9. If only 5 fixes are allowed, I suggest this order

Priority 1: fix the prompt's internal consistency

Align these first:

  • the block_size constraint vs the baseline
  • the allowed paths vs the search space
  • whether all2all_backend is allowed
  • the kind schema
  • the cold-start OIA exception

This is the foundation.

Priority 2: add derived bottleneck state

Do not stop at raw metrics; provide:

  • queueing pressure
  • kv pressure
  • comm pressure
  • likely bottleneck candidates
  • confidence

Priority 3: structure the ledger explicitly

At minimum:

  • best known
  • falsified hypotheses
  • tested regions
  • repeated configs to avoid

Priority 4: make every run hypothesis-driven

Introduce:

  • experiment_type
  • hypothesis
  • falsifiable_by
  • confidence update

Priority 5: expand observability

At minimum add:

  • GPU util
  • comm bytes/latency
  • scheduler idle / batch fill
  • graph hit rate
  • batch composition

10. What a better prompt skeleton should look like

Not a full rewrite, just the structural skeleton:

{
  "task_spec": {
    "objective": "maximize goodput_qps_per_gpu",
    "slo": {...},
    "evaluation_protocol": {...}
  },
  "immutable_context": {
    "hardware_profile": {...},
    "model_profile": {...},
    "workload_profile": {...},
    "baseline_config": {...},
    "baseline_meta": {...}
  },
  "derived_state": {
    "workload_regime": "prefill-heavy-long-context",
    "system_regime": "queueing-sensitive",
    "bottleneck_candidates": [...],
    "confidence": ...
  },
  "search_space": {
    "allowed_paths": [...],
    "interaction_priors": {...},
    "hard_constraints": [...],
    "soft_guidance": [...]
  },
  "memory_state": {
    "best_known_config": {...},
    "tested_regions": [...],
    "falsified_hypotheses": [...],
    "recent_runs": [...]
  },
  "decision_policy": {
    "cold_start_policy": {...},
    "exploration_policy": {...},
    "refinement_policy": {...},
    "validation_policy": {...},
    "noise_policy": {...}
  },
  "output_contract": {
    "must_include": [
      "experiment_type",
      "hypothesis",
      "observation",
      "inference",
      "action"
    ]
  }
}

11. Finally, the most condensed conclusion

In one sentence:

The biggest problem with this prompt is not "too little information for the LLM", but that "it supplies plenty of config and rule information without organizing it into a state representation that supports systematic attribution and experimental decision-making".

More concretely:

  • today it is good at making the agent do config search
  • it is not yet good at making the agent do systems-research-style tuning

A truly deep agentic AITuner should let the LLM see not:

  • "here is the baseline, here are the knobs, go try"

but rather:

  • "what regime is the system currently in"
  • "what is the most likely bottleneck"
  • "which hypotheses have been falsified"
  • "what is this experiment trying to validate"
  • "how large an improvement counts as a real improvement"
  • "why this result transfers to the next decision"

That is the key step from "an LLM that can tune parameters" to "agentic native system tuning".

If you like, I can next do one of two things:

  1. refactor this prompt into a research-grade prompt schema v2
  2. organize the analysis above into a Chinese memo suitable for your internal reporting / a paper's related-discussion section.

Gemini

This is a very refined, highly engineered agentic tuning system; the binary search for measuring goodput combined with the explicit O->I->A (Observation -> Inference -> Action) reasoning chain shows deep systems-level thinking.

That said, judged by whether the LLM can truly reason deeply like a top systems researcher, the injected context and the runtime state machine still have logical gaps and missing system-level context. Detailed analysis and improvement suggestions follow:

1. A fatal logical deadlock: missing diagnostic metrics vs. the mandatory O-I-A rule

This is the point in the current prompt most likely to push the agent into hallucination or deadlock.

  • Deficiency: the prompt's Bottleneck analysis requirement (strict) forces the agent to cite at least 3 diag_engine_* metrics from recent bottleneck signals. However, the baseline tool_results return diag_resource_sample_count: 0.0 and diag_engine_sample_count: 0.0 in aggregated, meaning the underlying trace_replayer never successfully collected or exposed the key profiling metrics (such as diag_engine_kv_cache_usage_perc_mean).

  • Consequence: with no data coming back from below while the system prompt holds a gun to the agent's head demanding 3 metrics, the agent can only fabricate (hallucinate) 3 plausible-looking numbers to satisfy the JSON schema, which destroys the scientific validity of everything downstream.

  • Improvements:

    1. Fix the underlying profiler: make sure run_trace_replayer really polls vLLM's engine metrics periodically in an async background task and aggregates them into the results.

    2. Relax the cold-start rule: add a conditional branch to the prompt: "If diagnostic metrics are missing or count is 0.0 (e.g., during initial baseline), you may skip citing them in the Observation and rely on latency/throughput metrics instead, but you MUST issue a config change to enable metrics profiling if available."

2. A direct contradiction between a core hard constraint and the baseline config

  • Deficiency: the Global rules (constraints) and knob guidance state explicitly that "On CUDA devices, only block_size <= 32 is supported" and "block_size must be one of {16, 32}; never use 8". Yet the given human baseline config plainly contains "block_size": 64.

  • Consequence: such self-contradictory context deeply confuses the model. Keeping 64 in later actions violates the prompt's rules; switching to 32 may deviate from the intent of the provided baseline.

  • Improvement: unify the facts. If some newer vLLM builds on Hopper/Ada-class hardware already support block sizes of 64 and above (vLLM does in fact support larger block sizes; the main effect is on memory fragmentation), update the hard constraint in the prompt and drop the stale <= 32 limit.

3. Missing key hardware topology and MoE communication context (systems context)

The agent is optimizing a 235B-parameter MoE model (128 experts, 8 activated) on 8x NVIDIA H20 with nvlink_topology: "pcie". That is an extremely demanding combination.

  • Deficiency: under a PCIe topology, H20 inter-GPU P2P bandwidth is far below NVLink (typically only tens of GB/s bidirectional). For MoE, enabling Expert Parallelism (--enable-expert-parallel) makes the All-to-All communication cost over PCIe disastrous. The agent currently has no way to know that nvlink_topology: "pcie" means "extremely weak cross-GPU communication bandwidth".

  • Improvements:

    1. Add topology-aware performance heuristics to the hardware description or knob guidance. For example: "Topology Guidance: The current nvlink_topology is pcie. Be highly cautious with Expert Parallelism (EP) or large Tensor Parallelism (TP) crossing PCIe switches, as All-to-All communication will be severely bottlenecked. Prefer maximizing Data Parallelism (DP) or Context Parallelism (PCP) if memory permits."

    2. Document DeepEP applicability: the environment variables expose several All2All backends (e.g., deepep_low_latency); tell the agent which backends suit a low-bandwidth PCIe environment.

4. "Goodput" and the binary search are black boxes, which keeps the reasoning shallow

  • Deficiency: the system's final objective is maximizing goodput_qps_per_gpu. The logs show the underlying system binary-searches time_scale (currently landing on 0.5859375) to apply different QPS pressure until the TTFT SLO (TTFT p95 <= 0.001 * L + 1.0) breaks. But the agent has no intuitive model of how the time-scale factor affects goodput.

  • Improvement: state the underlying queueing-theory logic in one crisp sentence in the Optimization policy: "Goodput Physics: The tuner binary-searches the time_scale to find the maximum request arrival rate before the system queues saturate and break the TTFT SLO. To increase goodput, you must reduce per-request compute time (improving max capacity) or improve scheduling overlap (e.g., chunked prefill) to prevent queue buildup." This nudges the agent to reason about queueing delay during Inference, not just single-request latency.

5. Missing memory-math capability

  • Deficiency: the biggest red line in system tuning is OOM. For this 235B-parameter MoE model, the FP8 weights alone run to a couple hundred GB; even if the total memory of 8x 96GB cards is ample, a TP=4, DP=2 sharding still leaves each card carrying a large KV cache. The agent has no tool to estimate the available KV-cache quota (number of KV cache blocks) under a given config.

  • Improvement: provide an estimate_memory_footprint(config) pseudo-tool, or expose gpu_kv_cache_budget_gb or num_gpu_blocks in the baseline metrics. Then, when growing max_num_seqs or max_num_batched_tokens, the agent can judge OOM risk on evidence rather than guesswork. A sketch of such a pseudo-tool follows.
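
A minimal sketch of what such a pseudo-tool could look like; the first-order formulas, the 0.9 utilization ratio, and the FP8 weight size are all assumptions:

```python
# Hypothetical sketch of the estimate_memory_footprint(config) pseudo-tool.
# Ignores activation memory and runtime buffers; first-order estimate only.

def estimate_memory_footprint(total_params_b: float, tp: int, gpu_mem_gb: float,
                              weight_bytes: float = 1.0,  # FP8 assumption
                              util: float = 0.9) -> dict:
    weights_per_gpu_gb = total_params_b * weight_bytes / tp
    kv_budget_gb = max(gpu_mem_gb * util - weights_per_gpu_gb, 0.0)
    return {
        "weights_per_gpu_gb": round(weights_per_gpu_gb, 1),
        "gpu_kv_cache_budget_gb": round(kv_budget_gb, 1),
    }

# Example: 235B total params sharded TP=4 across 96 GB cards.
print(estimate_memory_footprint(total_params_b=235, tp=4, gpu_mem_gb=96))
```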

6. Asymmetric exposure of the search space

  • Deficiency: the Allowed config paths list is badly lopsided. The baseline config contains many deep vLLM environment variables (e.g., ACCL_LOW_LATENCY_OPTIMIZE, VLLM_MOE_USE_DEEPEP, plus assorted FP8 kernel options), yet the whitelist the agent may modify opens only a handful (4 envs such as envs.VLLM_MOE_USE_DEEPEP). For a deep tuning system, the current whitelist is too conservative and blocks the agent from exercising real "systems hacker" capability.

  • Improvement: add more of the key network/communication-layer parameters (especially the NVSHMEM- and ACCL-related ones, which matter enormously for H20 over PCIe) to the Allowed paths, or state explicitly in the prompt: "For environment variables not listed in Allowed config paths, you may request overriding them in the rationale, but the system will mock them for safety", so human-rule candidates can be collected later.


Would you like a concrete rewrite for one specific point (for example, redesigning the Bottleneck analysis requirement section so it keeps its force while tolerating cold start)?

Sonnet

This is a rich, complex prompt with several meaningful gaps. Here's a deep, structured critique across the major dimensions:


1. Hard Constraint Contradiction: block_size 64 vs. ≤32 Rule

The search space spec states unambiguously: "Hard constraint (CUDA): block_size <= 32. Always choose from allowed enum {16, 32}." The knob guidance reinforces: "block_size must be one of {16, 32}; never use 8."

Yet the human baseline uses block_size: 64, and the system runs it successfully. The agent never flags this. This creates a damaging inconsistency: the constraint table the agent is supposed to enforce says one thing, the baseline config the agent is told to inherit from says something different, and the actual benchmark silently accepts it. The agent is now in an epistemically broken position — it cannot know whether block_size=64 is a special platform override, a documentation lag, or a bug. It has no mechanism to resolve this, no tool to query valid block sizes at runtime, and no instruction to challenge the baseline when it violates a stated hard constraint.

Fix: Either update the constraint table to reflect the actual allowed values for this specific platform, or add a validate_config tool (or response from get_constraints) that resolves the ground-truth allowed enum dynamically. The agent should also receive explicit instruction: "If the baseline config violates a stated constraint, flag it and query for resolution before proceeding."


2. Bottleneck Signal Infrastructure is Broken, Making O→I→A Structurally Unfulfillable

The prompt mandates a strict Observation→Inference→Action chain for every run_trace_replayer call, requiring citation of at least 3 numeric metrics from diag_engine_* signals. But the baseline result shows:

diag_engine_sample_count: 0.0
diag_resource_sample_count: 0.0
diag_metric_collection_error_count: 0.0

Every single diagnostic metric is zero. There is no data in any diag_engine_* field. The policy requirement is now permanently unfulfillable in honest form — the agent is being asked to cite real bottleneck evidence that doesn't exist. This forces the agent to either fabricate the O→I→A chain (a quality and reliability failure) or deadlock waiting for data that never arrives.

The prompt provides no explanation of why diagnostics might be empty (metric collection agent not running? sampling interval mismatch? scraping disabled?), no fallback policy when diagnostic data is unavailable, and no tool to diagnose the diagnostic failure itself.

Fix: Add a fallback policy: "If diag_engine_sample_count == 0, use workload profile statistics (avg/p95 prompt tokens, qps, ttft distribution) and run results (latency percentiles, queue ratio proxies) as the Observation basis, and note that engine diagnostics are unavailable." Also add explicit guidance on what causes missing diagnostics and how to request metric collection be enabled.


3. Search Space Table vs. Allowed Config Paths: Massive Coverage Gap

The search space table documents 15 tunable parameters across parallelism, context-parallel, MoE, overlap/dual-batch, KV cache, batching, and prefill categories. The "Allowed config paths" section permits only 14 paths in total. The following search space parameters are completely absent from allowed config paths:

  • engine_args.pipeline_parallel_size (PP)
  • engine_args.data_parallel_size (DP)
  • engine_args.decode_context_parallel_size (dcp)
  • engine_args.prefill_context_parallel_size (pcp)
  • engine_args.enable_dbo (dual batch overlap)
  • engine_args.ubatch_size
  • engine_args.dbo_decode_token_threshold
  • engine_args.dbo_prefill_token_threshold
  • engine_args.all2all_backend

For a Qwen3-235B MoE model on 8x H20 where DP=2 with TP=4 is an obvious and important candidate, excluding data_parallel_size from the allowed config paths is a severe search space truncation. The exploration policy instructs the agent to "vary major knobs by large factors early (tp/pp/dp/pcp/dcp)" — but the agent literally cannot set dp or pcp or dcp without violating the "use only these paths" rule.

Fix: Either expand the allowed config paths to cover all search space entries, or explicitly explain why certain axes are excluded (e.g., "PP is not tunable in this deployment because the orchestration layer controls it"), so the agent doesn't waste reasoning cycles on knobs it cannot actually set.


4. Contradictory Exploration Policies

The exploration policy contains two directly opposing directives:

"Start with aggressive exploration: vary major knobs by large factors early..."

"If a human baseline exists, start with small deltas from the baseline config before large jumps."

A human baseline exists. Both rules therefore apply simultaneously with opposite prescriptions. The agent must choose one implicitly, with no tiebreaker. In practice the agent picked the conservative path (running the baseline first) — but it gets no credit or guidance for this choice, and a different instantiation might go wide immediately. This produces non-deterministic, non-auditable behavior.

Fix: Add a clear precedence rule: "When a human baseline exists, always run it first to establish a score, then proceed with large-factor exploration on the most impactful axes. Small-delta refinement is reserved for the final convergence phase after at least N diverse runs."


5. proposed_runs_successful Counter Logic Is Semantically Broken

After the baseline run completes successfully, the updated state shows proposed_runs_successful: 0. This is because the baseline is apparently not counted as a "proposed" run (it gets kind: baseline and proposal_eval.status: "no_baseline"). But the agent is never told this. The run_trace_replayer requirement says "run_trace_replayer requires hardware, model, and workload profiles" — all of which are now satisfied — yet the counter that represents the agent's sense of "how much has been accomplished" is stuck at zero.

This corrupts the agent's self-model. When it reads proposed_runs_successful: 0, it may incorrectly infer it hasn't yet run anything meaningful, affecting its decisions about when to stop, when to validate on full trace, and when to be confident in its analysis.

Fix: Clearly define what increments proposed_runs_successful (only non-baseline runs? only runs that beat the current best?), and provide a separate baseline_runs_completed counter. Alternatively, rename it to agent_proposed_runs_successful and document explicitly that the baseline doesn't count.


6. SLO Formula Has an Ambiguous and Apparently Inconsistent Variable L

The SLO is stated as: TTFT p95 <= 0.001 * L + 1.0 s. The variable L is never defined in the prompt. From context it seems to be prompt length, but:

  • Is L per-request (computed dynamically per request), or the workload-level p95 prompt token count?
  • The workload profile shows p95_prompt_tokens = 7697. If L = 7697, then SLO = 0.001 * 7697 + 1.0 = 8.697 s.
  • But the baseline result shows goodput_search.ttft_slo: 60.0, implying L ≈ 59,000 tokens — nowhere near the p95 prompt length.

This is a critical inconsistency. The agent has no way to derive the correct SLO threshold, verify whether the 60-second figure is right, or reason about how close specific requests are to the SLO boundary. The ttft_slo_pass_rate: 0.9604 in the baseline is close to the 0.95 threshold — small errors in SLO interpretation could lead the agent to wrongly classify configs as passing or failing.

Fix: Define L explicitly (e.g., "L is the input prompt token count for each individual request, evaluated per-request"), show a worked example with the trace's actual distribution, and explain how the binary search selects the time_scale that maximizes load while keeping pass_rate ≥ 0.95.
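
To make the ambiguity concrete, a worked check of the two competing readings of L, using only numbers quoted above:

```python
# Worked example: the two readings of L in "TTFT p95 <= 0.001 * L + 1.0 s".

def ttft_slo_seconds(prompt_tokens: float) -> float:
    return 0.001 * prompt_tokens + 1.0

print(ttft_slo_seconds(7697))   # workload p95 reading        -> 8.697 s
print(ttft_slo_seconds(59000))  # what ttft_slo: 60.0 implies -> 60.0 s
```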


7. time_scale Semantics Are Inverted and Underdocumented

The user task says Default time_scale=2.0. The agent calls profile_workload with time_scale=2.0. The exploration policy says "time_scale is controlled by the tuner's binary search; do not set benchmark_overrides.time_scale." Yet the baseline result shows time_scale: 0.5859375 and goodput_search.selected_scale: 0.5859375 — far from 2.0 and going in the opposite direction.

The relationship between the task-level time_scale parameter (which appears to compress timestamps to speed up replay) and the goodput-search selected_scale (which appears to be the binary search variable that scales arrival rate to find the saturation point) is never explained. The agent is left with two time_scale values with no explanation of how they relate. This makes it impossible for the agent to reason correctly about workload intensity, request rate scaling, and what the goodput binary search is actually doing.

Fix: Add a dedicated section explaining the two-level time scaling architecture: (1) the fixed trace preprocessing time_scale (set once, compresses the original trace timeline), and (2) the binary-search selected_scale (runtime scaling factor applied during goodput search to find the maximum sustainable load). Show the math connecting goodput_qps = overall_qps * selected_scale and how they compound.
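
A numeric illustration of the compounding relation this fix proposes documenting; overall_qps is a made-up placeholder, and the formula is the suggested documentation, not verified system behavior:

```python
# Illustration of the proposed two-level time scaling, with transcript values.

preprocess_time_scale = 2.0   # task-level trace compression, set once
selected_scale = 0.5859375    # runtime binary-search result from the baseline run

overall_qps = 10.0            # hypothetical replayed QPS at scale 1.0
goodput_qps = overall_qps * selected_scale
print(f"goodput_qps = {overall_qps} * {selected_scale} = {goodput_qps:.3f}")
```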


8. No Guidance for the Most Obvious High-Value Candidate: DP=2

The system has 8 GPUs. The baseline uses TP=4. This means 4 GPUs are idle from the perspective of model parallelism — the natural candidate is DP=2 (two replicas of TP=4), which doubles throughput capacity at the cost of zero per-GPU memory change. For a prefill-only workload with average 1875 tokens and p95 of 7697 tokens, DP=2 is likely the single highest-impact knob.

Neither the system prompt, exploration policy, knob guidance, nor bottleneck metrics hint at this. The agent has to derive it entirely from first principles — and may not, especially since data_parallel_size is absent from the allowed config paths (see issue 3).

Fix: Add DP to the allowed config paths. Add a note in the knob guidance: "For this system with num_gpus=8 and tensor_parallel_size=4, data_parallel_size=2 is feasible and typically the first candidate to explore for throughput maximization."


9. Non-Tunable Fields in Baseline config Arg Create Schema Confusion

The agent submits a run_trace_replayer call with a config that includes fields outside the allowed paths: swap_space, max_model_len, kv_transfer_config, hf_overrides, kv_cache_dtype, enforce_eager, quantization, disable_hybrid_kv_cache_manager, enable_prefix_caching, and dozens of non-tunable env vars. The rules say "Only set allowed config paths in tool args.config" — but for the baseline, it's natural to send the full config.

The rules don't clarify whether the baseline run is exempt from the allowed-paths restriction, whether extra fields are silently ignored or cause errors, or how to handle the baseline's non-tunable fields. This creates ambiguity that could cause later runs to fail if the agent continues sending non-tunable fields.

Fix: Add explicit instruction: "For the baseline run only, you may submit the complete baseline config. For all subsequent runs, use only the allowed config paths as a partial patch; the system will inherit unspecified fields from the previous successful run."


10. Workload Duration Mismatch Is Not Flagged or Explained

The task specifies trace_minutes=10.0 as the default. The profile_workload call requested trace_minutes=10.0. But the result shows duration_minutes: 4.999375 — the trace only contains about 5 minutes of data. This means the trace is shorter than the requested window, and the agent is silently running on a truncated workload. The constraint max_trace_minutes=10.0 is nominally satisfied (5 < 10), but the agent has no way to know whether 5 minutes of data is sufficient for stable goodput estimates, whether the tail of the trace is being dropped, or whether using sample_policy=head vs tail would matter given the actual 5-minute duration.

Fix: When profile_workload returns a duration_minutes less than the requested trace_minutes, the system should emit an explicit warning and guidance: "Trace is shorter than requested window. Effective duration is X minutes. Consider whether sample policies are still meaningful and whether min_requests constraint is satisfied."


11. No Explicit Stopping Criterion or Run Budget

The prompt has no defined run budget (e.g., max_runs=20), no convergence threshold (e.g., "stop if no improvement > 2% in last 5 runs"), and no explicit timeout. The instruction "Before stopping, validate top configs on the full configured trace_minutes" is a stopping prerequisite but not a stopping trigger. The agent has no principled termination condition, which leads to either premature termination (being conservative about costs) or open-ended exploration.

Fix: Add explicit budget parameters: max_proposed_runs, convergence_patience (stop if best score unchanged for N runs), and validation_runs_required (how many top-K configs to validate before issuing action=final).
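
A minimal sketch of the suggested budget parameters as a termination check; the default values are illustrative assumptions:

```python
# Hypothetical sketch: principled stopping rule from the suggested budgets.

def should_stop(scores: list[float], max_proposed_runs: int = 20,
                convergence_patience: int = 5, min_gain_pct: float = 2.0) -> bool:
    if len(scores) >= max_proposed_runs:
        return True                                # run budget exhausted
    if len(scores) > convergence_patience:
        best_before = max(scores[:-convergence_patience])
        recent_best = max(scores[-convergence_patience:])
        if 100 * (recent_best - best_before) / best_before < min_gain_pct:
            return True                            # converged: no recent gain
    return False

print(should_stop([1.00, 1.05, 1.06, 1.06, 1.06, 1.06, 1.06, 1.06]))  # -> True
```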


Summary Table

| # | Category | Severity | Impact |
|---|----------|----------|--------|
| 1 | block_size constraint vs. baseline contradiction | Critical | Agent cannot trust constraint table |
| 2 | Diagnostic metrics all zero, O→I→A unfulfillable | Critical | Core reasoning chain broken |
| 3 | Search space vs. allowed paths gap (DP, PP, dcp, pcp, DBO missing) | High | Major axes of optimization unreachable |
| 4 | Contradictory exploration policies | High | Non-deterministic agent behavior |
| 5 | proposed_runs_successful semantics broken | Medium | Corrupted agent self-state |
| 6 | SLO formula: L undefined, ttft_slo=60s inconsistency | High | SLO pass/fail classification unreliable |
| 7 | time_scale two-level semantics undocumented | High | Goodput reasoning is opaque |
| 8 | No DP=2 guidance for 8-GPU system | High | Most impactful knob unexplored |
| 9 | Non-tunable fields in config arg, no baseline exemption rule | Medium | Schema confusion in future runs |
| 10 | Workload duration 5 min vs 10 min request, silent truncation | Medium | Unstable goodput estimates possible |
| 11 | No stopping criterion or run budget | Medium | Open-ended or premature termination |

The deepest structural problem is the combination of issues 2 and 3: the agent is required to build its reasoning chain on bottleneck signals that are never populated, using configuration knobs that are not in the allowed paths. Everything else is fixable with documentation clarifications, but these two require either infrastructure changes (enabling diagnostic collection, expanding allowed paths) or explicit fallback policies.