## 10.20
#### 1. Study vLLM's DBO implementation details; confirm the feasibility of automatically applying a given execution flow on vLLM
A config describes how the different modules are partitioned and overlapped
![[projects/auto-tuner/sync2.figs/260410-105227.png]]
There is even pipelining inside dispatch and combine
![[projects/auto-tuner/sync2.figs/260410-105227-1.png]]
```python
# on hash 55392bc8

# vllm/forward_context.py:233
def create_forward_context(
    attn_metadata: Any,
    vllm_config: VllmConfig,
    virtual_engine: int = 0,
    dp_metadata: Optional[DPMetadata] = None,
    cudagraph_runtime_mode: CUDAGraphMode = CUDAGraphMode.NONE,
    batch_descriptor: Optional[BatchDescriptor] = None,
    ubatch_slices: Optional[UBatchSlices] = None,
):
    return ForwardContext(
        no_compile_layers=vllm_config.compilation_config.static_forward_context,
        virtual_engine=virtual_engine,
        attn_metadata=attn_metadata,
        dp_metadata=dp_metadata,
        cudagraph_runtime_mode=cudagraph_runtime_mode,
        batch_descriptor=batch_descriptor,
        ubatch_slices=ubatch_slices,
    )

# vllm/v1/worker/gpu_model_runner.py:1197
ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
    num_tokens_unpadded=num_tokens_unpadded,
    parallel_config=self.parallel_config,
    allow_microbatching=True,
    allow_dp_padding=allow_dp_padding,
    num_tokens_padded=num_tokens_padded,
    uniform_decode=uniform_decode,
    num_scheduled_tokens_per_request=num_scheduled_tokens,
)

# vllm/model_executor/layers/fused_moe/modular_kernel.py:658
# class FusedMoEModularKernel
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()
receiver()

# vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py:96
def _do_dispatch(
    self,
    tokens: torch.Tensor,
    token_scales: Optional[torch.Tensor],
    rank_topk_ids: torch.Tensor,
    rank_topk_weights: torch.Tensor,
    num_experts: int,
    a1_scale: Optional[torch.Tensor],
    quant_config: FusedMoEQuantConfig,
) -> Callable:
    has_scales = token_scales is not None
    # We yield before launching the dispatch kernel since the dispatch
    # kernel will block the CPU so we want to queue up all the compute
    # for the other ubatch before the dispatch kernel starts.
    dbo_yield_and_switch_from_compute_to_comm()
```
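To make the overlap mechanism concrete, here is a toy sketch (my own illustration, not vLLM code) of the two-ubatch ping-pong: each microbatch yields to its sibling right before a CPU-blocking comm call, so the sibling's compute gets queued first.
```python
# Toy illustration of the DBO ubatch ping-pong (NOT vLLM's implementation).
from typing import Generator

def ubatch(name: str) -> Generator[str, None, None]:
    # Each yield marks a point where the real system would call
    # dbo_yield_and_switch_from_compute_to_comm() / dbo_yield().
    yield f"{name}: launch attention + MoE gate (compute)"
    yield f"{name}: launch dispatch (comm, blocks CPU)"
    yield f"{name}: launch expert GEMMs (compute)"
    yield f"{name}: launch combine (comm, blocks CPU)"

def run_dbo(a, b) -> None:
    # Round-robin scheduling approximates the yield/switch behavior:
    # ubatch0's compute is queued while ubatch1 sits in a comm call.
    gens, done = [a, b], [False, False]
    while not all(done):
        for i, g in enumerate(gens):
            if done[i]:
                continue
            try:
                print(next(g))
            except StopIteration:
                done[i] = True

run_dbo(ubatch("ubatch0"), ubatch("ubatch1"))
```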
#### 2. Survey the production status of Qwen models
> Su Li
From Tao He:
> Getting a model into production requires:
> - Accuracy alignment (with transformers as the ground truth). It matches the naive intuition that swapping in a higher-precision kernel should always work; the hard part is deciding whether a discrepancy is a bug in the kernel itself or a genuine precision issue.
> - Performance tuning: mostly case by case; much of it is first-principles tuning, plus new optimizations implemented for new model features (GDNAttention, Qwen-Next).
> - Stability testing: stability across long/short token lengths and long-running inference.
#### TODO
Survey Qwen's targeted production optimizations (a list of feature -> optimization); distill the common parts that can be abstracted; confirm why Qwen and other models do not use DBO.
Confirm why automatic parallelism search and DBO are not used on Qwen: are they unimportant? Does manual Qwen tuning cover them?
~~Test NanoFlow~~
## 10.27
#### 1. Survey the current state of Qwen model optimizations
[[ali-optimization]]
#### 2. Survey how Ali manually tunes parallelism configs today
- For traditional TP-only models, the empirical rule of picking the smallest TP that fits (e.g. 70B-TP4, 235B-TP8) is usually best
- Manual tuning concentrates on models that introduce EP; 5-6 models are currently hand-tuned, each costing about 2 days of testing and tuning
- Under EPxDP the scheduler matters more (with the same parallelism config, different scheduling modes swing performance by 5-10%); online parallelism adjustment may matter less than scheduler adjustments that keep the request pattern each rank sees relatively stable
- The core pain point is that stress tests do not reflect real production traffic. Q: replay 1-2h of production traffic for testing? A: that could greatly increase tuning time; the main influencing factors are currently believed to be the length pattern and QPS
#### TODO
Use the current DashGen parameter conventions as the baseline
Tao Qian: modeling model deployment on heterogeneous hardware
## 11.03
#### 1. Survey the current state of DashGen
Goal: automatically generate a config file for a given hardware platform from model features (MoE or not, VL or not, quantized or not, ...)
Status: a few commonly used models (recent Qwen3 models) have a hand-written config file for H20; there are no config files for other models or other hardware platforms
The parameters most often adjusted per model are:
- `block_size`
- `max_model_len`
- `max_num_batched_tokens`
- `max_num_seqs`
- `rope_scaling.factor`
- `rope_scaling.original_max_position_embeddings`
- `speculative_config.num_speculative_tokens`
- `speculative_config.rope_scaling.factor`
- `speculative_config.rope_scaling.original_max_position_embeddings`
#### 2. System prototype design & build-out
The hardware prober and workload profiler are done
![[projects/auto-tuner/sync2.figs/260410-105227-2.png]]
#### TODO
Treat DashGen as a training dataset
Further catalogue the optimizations reachable via config changes
Meta-analysis to determine the optimization headroom of config tuning
Tao Qian: modeling model deployment on heterogeneous hardware
## 11.12
#### 1. System prototype: implement the config generator
Brute-force grid search yields too many configs: a simple 8-GPU machine already gives 20+ configs, and benchmarking each against an open-source trace for 1h costs about a day of GPU time, so the time cost is no better than manual tuning
Challenge: how to inject basic domain knowledge to prune the search space
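A rough illustration of the blow-up (the scheduler-knob values are hypothetical): enumerating just the (tp, pp, dp) factorizations of an 8-GPU machine, then multiplying by two scheduler knobs, already lands well past the 20+ configs mentioned above.
```python
# Sketch: count candidate configs on one 8-GPU machine (illustrative grid).
from itertools import product

NUM_GPUS = 8
# All (tp, pp, dp) factorizations of 8.
parallel = [(tp, pp, dp)
            for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
            if tp * pp * dp == NUM_GPUS]
# A couple of scheduler knobs (hypothetical values) multiply the space.
max_num_batched_tokens = [4096, 8192, 16384, 32768]
max_num_seqs = [16, 32, 64]

configs = [(p, t, s) for p in parallel
           for t in max_num_batched_tokens
           for s in max_num_seqs]
print(len(parallel), "parallel splits,", len(configs), "configs total")
# -> 10 parallel splits; 120 configs with two knobs (~5 days at 1h each)
```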
#### 2. Confirm how much headroom config tuning has
- Data from a colleague:
> For the internally deployed Qwen3-Max (fp4), switching from TP8DP1EP8 to TP1DP8EP8 improves performance (throughput while meeting SLO) by about 40%. We are trying to reproduce this result.
- Our micro-benchmark: Qwen3-30B-A3B
| Config | ttft_s_mean | tpot_s_mean | ttft_s_p90 | tpot_s_p90 |
| --------- | ----------- | ----------- | ---------- | ---------- |
| TP8DP1EP8 | 7.9351 | 0.1022 | 34.8006 | 0.1573 |
| TP1DP8EP8 | 0.1765 | 0.0591 | 0.3776 | 0.0628 |
| TP8DP1EP1 | 3.2645 | 0.0619 | 13.0335 | 0.0809 |
| TP1DP8EP1 | 0.1647 | 0.0544 | 0.3553 | 0.0578 |
#### 3. On the company's heterogeneous hardware status
Consider services $S_1, S_2, \cdots, S_u$, models $M_1, M_2, \cdots, M_v$, and resource pools $R_1, R_2, \cdots, R_w$.
Per-machine throughput is a matrix $T \in \mathbb{R}^{v \times w}$, where $T_{i,j}$ is the throughput model $i$ can provide on resource pool $j$.
We mainly consider adjustments to resources:
1. No fixed ratio between models
    Each model's required throughput is a single value; e.g. (M1, M2) demand moves from (5, 10) to (8, 8). Switch dynamically: rob Peter to pay Paul, plus allocate/release resources.
2. Strict ratio requirements between model throughputs
    E.g. M1:M2 = 2:8 (context: A/B tests, spreading deployments for robustness).
    My personal take: agents may create strong demand for this, since within one workflow the traffic ratio across models is strictly constrained.
    Computation: virtual throughput = allocated resources / demand ratio. The higher a model's virtual throughput, the more over-allocated it is, so resources can be switched away from it.
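A minimal sketch of the virtual-throughput rule above (all names and numbers are mine):
```python
# Sketch of the virtual-throughput heuristic described above.
# virtual_throughput = allocated capacity / demand ratio; the model with the
# highest value is the most over-provisioned and can donate resources.

def virtual_throughput(allocated: dict, ratio: dict) -> dict:
    return {m: allocated[m] / ratio[m] for m in allocated}

allocated = {"M1": 6.0, "M2": 10.0}   # sustained QPS each model can serve
ratio = {"M1": 2.0, "M2": 8.0}        # required traffic ratio M1:M2 = 2:8

vt = virtual_throughput(allocated, ratio)
donor = max(vt, key=vt.get)           # most over-allocated model
print(vt, "-> shift resources away from", donor)
# {'M1': 3.0, 'M2': 1.25} -> M1 is over-allocated relative to the 2:8 ratio
```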
#### TODO
Find a case demonstrating that search can beat a manually tuned config
## 11.26
#### 1. Attempt to test Qwen3-max-fp4
Conclusion: production runs on the 141G variant of H20; 8x 96G cards can barely load the fp4 weights, and then an error prevents prefill from running (production is already PD-disaggregated):
> AssertionError: deepep fp4 only support decode, please run fp8 for prefill reqs
#### 2. Compare against the production Qwen-30B-A3B
> Production runs on a single GPU, with no parallelism at all

Because of the _fun fact_ below, TP1DP8 with EP disabled is not the DP8 we usually mean; we need to compare traditional DP8 against vLLM's DP8.
- Hardware and workload both shift the trend and the inflection point of performance as the parallelism mode changes
- At low QPS, larger TP reduces latency; at high QPS, larger DP reduces latency
- The inflection points differ between 5090 and H20
![[projects/auto-tuner/sync2.figs/260410-105227-3.png]]
#### 3. Workload generation and testing
We need a spec for describing workloads that abstracts the traffic rising edge, plateau, and falling edge; within each phase, arrivals are generated from an exponential distribution.
![[projects/auto-tuner/sync2.figs/260410-105227-4.png]]
![[projects/auto-tuner/sync2.figs/260410-105227-5.png]]
When overall load is low, the exact arrival timestamps of requests matter relatively little for performance; when overall load is high, they matter a lot.
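A sketch of how such a spec could be materialized (phase names and rates are illustrative): within each phase, inter-arrival gaps are drawn from an exponential distribution at the phase's target QPS.
```python
# Sketch: generate arrival timestamps from a phase-based workload spec.
# Each phase has a duration and a target QPS; inter-arrival gaps within a
# phase are exponential (Poisson arrivals).
import random

spec = [  # (phase, duration_s, qps) -- illustrative values
    ("ramp_up", 60, 2.0),
    ("steady", 300, 8.0),
    ("ramp_down", 60, 3.0),
]

def gen_arrivals(spec, seed=0):
    rng = random.Random(seed)
    base, out = 0.0, []
    for phase, duration, qps in spec:
        t = base
        while True:
            t += rng.expovariate(qps)   # exponential inter-arrival gap
            if t >= base + duration:
                break
            out.append((t, phase))
        base += duration
    return out

arrivals = gen_arrivals(spec)
print(len(arrivals), "requests, first:", arrivals[0])
```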
#### Fun facts
MoE's `enable-expert-parallel=True/False` does not behave as we imagined: even with `TP=1, DP=2, enable-expert-parallel=False`, the MoE experts are still **TP-sharded**. With EP enabled, sharding is along the expert dimension instead, so different GPUs hold different experts
> model_executor/layers/fused_moe/config.py:FusedMoEParallelConfig:make
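A toy illustration of the two sharding modes (my own sketch, not vLLM code): with EP off every GPU holds a slice of every expert's weights; with EP on each GPU holds whole experts.
```python
# Toy illustration (NOT vLLM code): 4 experts of shape (hidden, ffn), 2 GPUs.
import numpy as np

num_experts, hidden, ffn, num_gpus = 4, 8, 16, 2
experts = [np.random.randn(hidden, ffn) for _ in range(num_experts)]

# enable-expert-parallel=False: every GPU holds a 1/num_gpus slice of EVERY
# expert (TP-style sharding along the ffn dimension).
tp_shards = [[w[:, g * ffn // num_gpus:(g + 1) * ffn // num_gpus]
              for w in experts] for g in range(num_gpus)]

# enable-expert-parallel=True: each GPU holds a SUBSET of whole experts.
ep_shards = [experts[g * num_experts // num_gpus:(g + 1) * num_experts // num_gpus]
             for g in range(num_gpus)]

print([len(s) for s in tp_shards], tp_shards[0][0].shape)  # 4 slices, (8, 8)
print([len(s) for s in ep_shards], ep_shards[0][0].shape)  # 2 full experts, (8, 16)
```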
## 12.03
#### 1. Update the abstract workload spec
It now detects each workload's QPS rising, falling, and plateau phases; within each timespan it can generate the matching input_length, output_length, and KV-cache hit ratio
![[projects/auto-tuner/sync2.figs/260410-105227-6.png]]
![[projects/auto-tuner/sync2.figs/260410-105227-7.png]]
#### 2. Survey sys intelligence
https://github.com/sys-intelligence/system-intelligence-benchmark
https://www.sigops.org/2025/glia-a-human-inspired-ai-for-systems-design-and-optimization/
Conclusions:
- Gives a few examples (cache policy, AE automation)
- No iterative process; how the system evolves is not shown
#### 3. Completed the first version of the AI tuner system
```json
{"tag": "seed_result", "run_id": "seed-0", "config": {"tensor_parallel_size": 1, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 8, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 13, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 12176.020989130817, "latency_ms_p50": 9204.860296100378, "latency_ms_p95": 35188.60311061144, "latency_ms_p99": 45363.17807715386, "ttft_ms_mean": 5719.615500097759, "ttft_ms_p50": 1781.3231199979782, "ttft_ms_p95": 24910.35631671548, "ttft_ms_p99": 35153.06737739593, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.1380589240322, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-1", "config": {"tensor_parallel_size": 1, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 8, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 13, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 12345.290610685028, "latency_ms_p50": 9480.132822878659, "latency_ms_p95": 34781.647122465074, "latency_ms_p99": 43191.735486499965, "ttft_ms_mean": 5892.288800230696, "ttft_ms_p50": 2097.73010853678, "ttft_ms_p95": 24136.49755809456, "ttft_ms_p99": 33699.99936595559, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 628.866941701046, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-2", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 6, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 13475.863611711437, "latency_ms_p50": 10794.865541160107, "latency_ms_p95": 36951.33968722075, "latency_ms_p99": 51772.90129289031, "ttft_ms_mean": 10182.089378618242, "ttft_ms_p50": 7472.564217634499, "ttft_ms_p95": 32121.336066164076, "ttft_ms_p99": 48282.36513398588, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.5132308205193, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-3", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 6, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 39164.60290491039, "latency_ms_p50": 31036.402412690222, "latency_ms_p95": 105731.39869049191, "latency_ms_p99": 124533.90708565712, "ttft_ms_mean": 35547.060281754864, "ttft_ms_p50": 27486.676642671227, "ttft_ms_p95": 100853.97887509316, "ttft_ms_p99": 121570.28573192656, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.0048339375755, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 0, "analysis": "The workload uses Qwen3-30B-A3B, a MoE model with 128 experts. The current runs show that enabling expert parallelism (ep_enabled=true) under TP=2 leads to severe latency degradation (p95 >100s vs ~35s when disabled), likely due to inefficient expert routing or communication overhead across TP groups. In contrast, with TP=1 and DP=8, enabling EP slightly improves latency (~400ms better p95). This suggests that EP works better when all experts fit within a single device or when inter-device expert communication is minimized. All runs use block_size=32, max_tokens=8642 (≈8192+450), max_seqs=13 or 6—both non-standard values violating domain rules. The scheduler may be suboptimal due to arbitrary max_num_seqs and max_num_batched_tokens not aligned to powers of two. GPU memory utilization is stable (~28GB per GPU on H20 with 95GB HBM), so memory pressure isn't the issue. High GPU util (>97%) indicates compute saturation, but poor latency suggests scheduling or batching inefficiencies rather than raw FLOPs shortage. The strong correlation between sequence length and latency (corr=0.48) implies batching isnt effectively overlapping long prompts with decodes.", "num_candidates": 2, "should_stop": false}
{"tag": "iter_result", "run_id": "iter-0-0", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 8, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 4628.36767861444, "latency_ms_p50": 3765.7501455396414, "latency_ms_p95": 11333.310923539102, "latency_ms_p99": 17040.048834867775, "ttft_ms_mean": 1065.5285831100412, "ttft_ms_p50": 221.42069786787033, "ttft_ms_p95": 4601.52723826468, "ttft_ms_p99": 8305.568953976035, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.1981914405968, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-0-1", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 8, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3760.064374331547, "latency_ms_p50": 3151.7164893448353, "latency_ms_p95": 9186.337321065366, "latency_ms_p99": 15386.448743753135, "ttft_ms_mean": 540.9742441223084, "ttft_ms_p50": 59.69636607915163, "ttft_ms_p95": 2516.9906625524163, "ttft_ms_p99": 4393.859929405153, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.4289286160771, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 1, "analysis": "The workload uses a MoE model (128 experts) with moderate QPS (1.7) and long prompts (avg 1875 tokens). Recent runs show that enabling expert parallelism (ep_enabled=true) with tp=2 leads to severe latency degradation (p95 >100s in seed-3), but when max_tokens is reduced to 8192 and max_seqs increased to 8, latency improves dramatically (p95 ~11s in iter-0-0). Disabling EP further reduces p95 to ~9s (iter-0-1). Both iter-0-* runs achieve high GPU utilization (>97%) and similar token/s, indicating compute saturation is not the issue. The dramatic latency difference between seed-3 and iter-0-0 (same TP/EP config) is due to max_tokens=8642 vs 8192 and max_seqs=6 vs 8 — the former likely causes KV cache fragmentation or inefficient batching. The strong correlation between sequence length and latency (~0.470.48) suggests scheduling inefficiency under long sequences. Since memory utilization is stable at 0.9 and HBM usage is far below 95% (28GB used on 95GB GPUs), memory is not the bottleneck. The primary issue appears to be poor batch scheduling under large max_tokens and low max_seqs, exacerbated by expert routing overhead when EP is enabled.", "num_candidates": 3, "should_stop": false}
{"tag": "iter_result", "run_id": "iter-1-0", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 4096, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3492.389837023313, "latency_ms_p50": 2872.1146723255515, "latency_ms_p95": 9010.221988894045, "latency_ms_p99": 15249.355898238719, "ttft_ms_mean": 82.92471907821788, "ttft_ms_p50": 41.2894356995821, "ttft_ms_p95": 343.0500505492091, "ttft_ms_p99": 709.19309835881, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.5222736175356, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-1-1", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3495.5934863037082, "latency_ms_p50": 2866.79108440876, "latency_ms_p95": 8992.029128596187, "latency_ms_p99": 15476.51156410575, "ttft_ms_mean": 80.40708293078234, "ttft_ms_p50": 42.12937131524086, "ttft_ms_p95": 317.7981022745371, "ttft_ms_p99": 683.7136130779982, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.5214401800225, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-1-2", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 4096, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3860.5031585243487, "latency_ms_p50": 3192.2978619113564, "latency_ms_p95": 9960.870377719402, "latency_ms_p99": 16528.3727562055, "ttft_ms_mean": 95.49012954670319, "ttft_ms_p50": 45.08622735738754, "ttft_ms_p95": 396.9422159716487, "ttft_ms_p99": 798.2364166527987, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.4522648664417, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 2, "analysis": "The workload uses Qwen3-30B-A3B, a large MoE model with 128 experts. Recent runs show that enabling expert parallelism (ep_enabled=true) consistently degrades latency significantly (e.g., p95 latency jumps from ~9s to >100s in seed-3 vs seed-2). However, in later iterations with smaller max_tokens and higher max_seqs (iter-1-1 vs iter-1-2), the gap narrows but ep still adds ~1s to p95 latency. GPU utilization is already very high (~98%), indicating compute saturation. Memory usage is stable (~28GB per GPU on H20 with 95GB HBM), so memory pressure is not the issue. The strong negative impact of expert parallelism suggests inefficient routing or communication overhead among experts across data-parallel ranks when ep is enabled. Given the objective of low latency, disabling expert parallelism is strongly preferred. Among ep-disabled runs, increasing max_num_seqs from 8 to 16 (iter-0-1 → iter-1-1) dramatically improves TTFT (from 2.5s to ~300ms) with minimal latency increase, indicating better scheduling efficiency. Further increasing max_num_batched_tokens beyond 8192 is unnecessary as prompt+decode tokens per request rarely exceed ~2250 on average, and 8192 already allows batching multiple long sequences. Block size of 32 is standard and shows no signs of inefficiency.", "num_candidates": 1, "should_stop": true}
```
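The records above come out of a seed-then-iterate loop; a minimal sketch of its shape (function names are placeholders, not the actual implementation):
```python
# Sketch of the seed/iterate loop behind the JSONL records above
# (llm_propose / run_benchmark are placeholders).
import json

def tune(seed_configs, llm_propose, run_benchmark, max_steps=8, log="log.jsonl"):
    history = []
    with open(log, "a") as f:
        # Seed phase: benchmark a handful of hand-picked configs.
        for i, cfg in enumerate(seed_configs):
            rec = {"tag": "seed_result", "run_id": f"seed-{i}",
                   "config": cfg, "aggregated": run_benchmark(cfg)}
            history.append(rec)
            f.write(json.dumps(rec) + "\n")
        # Iterate phase: the LLM reads all history, writes an analysis,
        # and proposes new candidates until it decides to stop.
        for step in range(max_steps):
            analysis, candidates, stop = llm_propose(history)
            f.write(json.dumps({"tag": "llm_response", "step": step,
                                "analysis": analysis,
                                "num_candidates": len(candidates),
                                "should_stop": stop}) + "\n")
            for j, cfg in enumerate(candidates):
                rec = {"tag": "iter_result", "run_id": f"iter-{step}-{j}",
                       "config": cfg, "aggregated": run_benchmark(cfg)}
                history.append(rec)
                f.write(json.dumps(rec) + "\n")
            if stop:
                break
```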
#### TODO
- Benchmark the AI tuner against the expert-tuned configs
- Survey GPU over-provisioning data in Ali's internal heterogeneous resource pools
- For those pools: hardware/model ratios, per-model traffic ratios, and the production configs
- Industrial constraints: the compute capability that different real-world hardware can actually provide, ...
## 12.24
Input-length buckets (0, 4k, 32k):
| Workload | Config | Max QPS | TTFT p95 (ms) | TPOT p95 (ms) | TTFT mean (ms) | TPOT mean (ms) |
| ------------------ | ------------- | ----------- | -------- | -------- | --------- | --------- |
| input>=0,<4096 | tp=1,dp=8 | 14.66 | 370.6 | 47.3 | 158.7 | 35.1 |
| input>=0,<4096 | ==tp=2,dp=4== | ==21.48== | 232.0 | 48.4 | 115.4 | 39.5 |
| input>=0,<4096 | tp=4,dp=2 | 19.26 | 154.8 | 34.9 | 86.3 | 28.2 |
| input>=0,<4096 | tp=8,dp=1 | 17.48 | 995.6 | 21.2 | 166.2 | 18.2 |
| input>=4096,<32768 | tp=1,dp=8 | 1.30 | 1859.5 | 48.1 | 922.8 | 27.2 |
| input>=4096,<32768 | tp=2,dp=4 | 2.19 | 1062.9 | 37.1 | 500.7 | 24.5 |
| input>=4096,<32768 | tp=4,dp=2 | 3.67 | 1014.7 | 48.1 | 419.1 | 34.8 |
| input>=4096,<32768 | ==tp=8,dp=1== | ==3.67== | 846.4 | 41.1 | 334.6 | 34.1 |
| input>=32768 | tp=1,dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=2,dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=4,dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=8,dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
Input-length buckets (0, 8k, 32k):
| Workload | Config | Max QPS | TTFT p95 (ms) | TPOT p95 (ms) | TTFT mean (ms) | TPOT mean (ms) |
| ------------------ | ------------- | ----------- | -------- | -------- | --------- | --------- |
| input>=0,<8192 | tp=1,dp=8 | 8.42 | 720.3 | 48.3 | 259.0 | 33.6 |
| input>=0,<8192 | tp=2,dp=4 | 13.60 | 440.5 | 48.4 | 166.0 | 37.4 |
| input>=0,<8192 | ==tp=4,dp=2== | ==15.28== | 295.2 | 39.6 | 117.6 | 31.6 |
| input>=0,<8192 | tp=8,dp=1 | 13.45 | 280.7 | 24.7 | 96.7 | 20.1 |
| input>=8192,<32768 | tp=1,dp=8 | 0.65 | 2397.1 | 32.4 | 1433.3 | 21.4 |
| input>=8192,<32768 | tp=2,dp=4 | 1.41 | 1438.8 | 39.4 | 803.9 | 27.0 |
| input>=8192,<32768 | ==tp=4,dp=2== | ==2.33== | 1100.0 | 47.0 | 556.6 | 35.6 |
| input>=8192,<32768 | tp=8,dp=1 | 2.18 | 1148.9 | 40.2 | 468.1 | 30.5 |
| input>=32768 | tp=1,dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=2,dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=4,dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=8,dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
10min:
| **Workload (Input Tokens)** | **Config** | **Max QPS** | **TTFT p95 (ms)** | **TPOT p95 (ms)** | **TTFT mean (ms)** | **TPOT mean (ms)** |
| --------------------------- | -------------- | ----------- | ----------------- | ----------------- | ------------------ | ------------------ |
| **input>=0,<4096** | tp=1, dp=8 | 13.77 | 370.8 | 47.9 | 163.7 | 35.9 |
| **input>=0,<4096** | ==tp=2, dp=4== | ==17.92== | 226.5 | 48.8 | 118.3 | 36.1 |
| **input>=0,<4096** | tp=4, dp=2 | 17.03 | 206.4 | 39.4 | 126.1 | 29.5 |
| **input>=0,<4096** | tp=8, dp=1 | 14.51 | 367.8 | 24.5 | 102.1 | 19.8 |
| **input>=4096,<32768** | tp=1, dp=8 | 1.30 | 1756.4 | 44.2 | 885.7 | 27.0 |
| **input>=4096,<32768** | tp=2, dp=4 | 2.19 | 1097.6 | 43.8 | 515.1 | 28.3 |
| **input>=4096,<32768** | tp=4, dp=2 | 3.52 | 873.4 | 49.4 | 385.8 | 35.1 |
| **input>=4096,<32768** | ==tp=8, dp=1== | ==3.67== | 919.7 | 48.1 | 349.0 | 34.0 |
| **input>=32768** | tp=1, dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=2, dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=4, dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=8, dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
## 01.07
#### Conclusions on AITuner
Observations:
- Different objectives (p95 TTFT / E2E) yield different best configs
- The AI tuner keeps similar performance across many iters; its exploration is confined to a limited space
- Even though the prompt explicitly asks the LLM to judge whether iteration can stop (performance already stably good), the LLM never made that call and always ran to max_step
With structured input defined as:
`HardwareProfile + WorkloadProfile + BenchmarkHistory + SampledRequestBenchmark + HardwareUtilization`
the LLM can produce a series of logically sound analyses:
- the performance impact of enabling EP
- performance trends as TP varies
- analysis of workload-profile results (long prompts, p95)
- runtime GPU utilization, for judging compute- vs memory-bound
```
Recent runs show that configurations with higher tensor parallelism (tp=4) and no expert/pipeline parallelism yield significantly lower latency (iter-0-0: 30.8ms p95). However, max_num_batched_tokens=16384 may be insufficient for long prompts (p95=17465 tokens), risking truncation. Enabling expert parallel consistently degrades performance, while pipeline parallelism (iter-0-1) causes severe underutilization (58.3% GPU util) and high latency. The best configuration (iter-0-0) achieves competitive throughput but requires adjustments to handle long prompts safely.
The workload has long prompts (p95=17.4k tokens) and moderate decode lengths. The best run (iter-1-1) achieved low latency with tp=8, pp=1, dp=1 config but showed only 72.7% avg GPU utilization, indicating underutilization. High GPU utilization in iter-1-2 (85.8% avg) suggests compute-bound bottlenecks in some configurations. Expert parallel consistently underperforms, and pipeline parallelism introduces significant latency. The primary opportunity is to optimize batch scheduling for long prompts while maintaining low latency.
```
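A sketch of what the structured input above might look like as code (field names are illustrative, not the actual schema):
```python
# Sketch of the structured LLM context (field names illustrative).
from dataclasses import dataclass, field

@dataclass
class HardwareProfile:
    gpu_model: str          # e.g. "H20"
    num_gpus: int
    hbm_gb: float
    interconnect: str       # e.g. "NVLink"

@dataclass
class WorkloadProfile:
    qps: float
    input_len_p50: int
    input_len_p95: int
    output_len_p50: int

@dataclass
class HardwareUtilization:
    gpu_util_avg: float     # for compute- vs memory-bound judgement
    hbm_used_gb: float

@dataclass
class TunerContext:
    hardware: HardwareProfile
    workload: WorkloadProfile
    utilization: HardwareUtilization
    benchmark_history: list = field(default_factory=list)
    sampled_request_benchmarks: list = field(default_factory=list)
```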
- Trace A
![[projects/auto-tuner/sync2.figs/260410-105227-8.png]]
- Trace A | 0-8k input_length
```
P95 TTFT: 70.37ms / 791.39ms = 0.089
TP8 + PP1 + DP1 + BLKSZ 128 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 64
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 6943 + MAX_SEQS 38
Mean Lat: 2979.62ms vs 7244.47ms = 0.411
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 64
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 6943 + MAX_SEQS 38
```
![[projects/auto-tuner/sync2.figs/260410-105227-9.png]]
- Trace A | 8k+ input_length
```
P95 TTFT: 912.25ms / 431602.90ms = 0.002
TP4 + PP1 + DP2 + BLKSZ 64 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 18905 + MAX_SEQS 11
Mean Lat: 6680.46ms / 264119.64ms = 0.025
TP8 + PP1 + DP1 + BLKSZ 64 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 18905 + MAX_SEQS 11
```
![[260410-105227-10.png]]
- Thinking
```
P95 TTFT: 465.83ms / 1159.39ms = 0.402
TP2 + PP1 + DP4 + BLKSZ 256 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 256 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 32
Mean Lat: 26883.16ms / 30561.40ms = 0.880
TP2 + PP1 + DP4 + BLKSZ 128 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 8
v.s.
TP1 + PP1 + DP8 + BLKSZ 64 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
```
![[260410-105227-11.png]]
- Trace A | 10min
```
P95 TTFT: 232.67ms / 1348.00ms = 0.173
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 32
Mean Lat: 2738.62ms / 6406.26ms = 0.427
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 32
```
#### TODO
- [x] add input_tokens_total and calculate `overall_tps` and `decode_tps`
- [x] If a `client.jsonl` record is a timeout or other error, skip it
```
{
"duration_ms": "313069.882",
"e2e_mean_ms": "8970.499",
"e2e_p50_ms": "9073.024",
"e2e_p90_ms": "13968.607",
"e2e_p95_ms": "14507.930",
"e2e_p99_ms": "14859.424",
"output_tokens_total": "22566",
"requests_success": "155",
"requests_total": "155",
"throughput_rps": "0.495",
"throughput_tps": "72.080",
"tpot_mean_ms": "131.278",
"tpot_p50_ms": "64.768",
"tpot_p90_ms": "284.904",
"tpot_p95_ms": "583.770",
"tpot_p99_ms": "983.728",
"ttft_mean_ms": "2294.045",
"ttft_p50_ms": "2423.531",
"ttft_p90_ms": "3885.958",
"ttft_p95_ms": "4358.413",
"ttft_p99_ms": "5140.597"
}
```
- [x] Update vllm to main to fix the DP performance problem
- [x] Add a performance-monitor role to AITuner: when several iters yield similar performance, this role tells the LLM to explore more aggressively
- [x] Support early stop: when a config's performance has clearly blown up, quit the test early instead of running the full 1h+
- [x] Make the LLM explicitly state, for each iter, its data-derived rationale and expected optimization effect, then compare against the actual test results
- [x] Compare 10min vs 60min runs: can test time be shortened?
- [x] How to shorten time: number of iters / time per iter / cold start (naive: hack vllm; fundamentally a scalability problem)
- [ ] Quantify how often the config needs to change
- [ ] Consult AI papers: how is context engineering defined?
- [ ] Add bucketing to the AITuner search space
- [x] Compare the capabilities of different models
- [ ] Support automated multi-machine task dispatch and result collection
- [ ] Support indexing historical benchmarks as long-term memory
misc:
- [x] Test https://github.com/blitz-serving/trace-replayer/tree/feat/stream and give feedback
- [x] Organize the thinking/coder traces for open-sourcing
## 01.15
#### 1. Agentic AI Tuner
1. Major system refactor: the test scripts now look more like an Agentic AI Tuner
![[260410-105227-12.png]]
2. Test results and analysis
- Tuning time dropped (~30h -> 6h); estimated cost: 6h of single-machine GPU time + ~0.5 CNY of LLM API calls
- Different agent models (deepseek-r1 vs. qwen3-max) reach consensus, converging to essentially the same best config
- qwen3-max can even one-shot to a near-optimal config
- Define the reward clearly. From `minimize the p95 E2E latency while keep TTFT acceptable` to a calculable score: `-ttft_ms_p95 - 10 * tpot_ms_mean + 10 * request_throughput` (a scoring sketch follows after this list). Otherwise, when trading off between metrics (TTFT/TPOT/E2E latency), the LLM does not know how to define optimal (minimize E2E latency, s.t. TTFT < xxx)
- [bug] TPOT is almost untunable: different configs mostly just shift TTFT. The cause is too few config fields; vLLM supports some newer flags (`--prefill-context-parallel-size`, `--decode-context-parallel-size`, `--all2all-backend`), and adding them is under test
- Some takeaways
    - Thinking mode produces extremely long outputs. By queueing theory (Little's Law, L = λW): λ (QPS) ≈ 1, W (mean latency) ≈ 100+ s, so the number of in-flight requests L ≈ 100+, which requires raising `max_num_seqs`. For thinking, raising `max_num_seqs` lowers TTFT dramatically (an order-of-magnitude drop), and moving TP/DP from 8/1 -> 4/2, although TPOT goes from 14ms -> 24ms, nearly doubles throughput and clearly lowers TTFT
    - For chat/coder, both favor raising TP to cut TPOT, which significantly improves system performance
- Chat (Trace A) + DeepSeek-R1
![[260410-105227-13.png]]
- Chat (Trace A) + Qwen3-Max
![[260410-105227-14.png]]
- Thinking + DeepSeek-R1
![[260410-105227-15.png]]
- Coder + DeepSeek-R1
![[260410-105228.png]]
- Coder + Qwen3-Max
![[260410-105228-1.png]]
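The scoring sketch referenced in the reward bullet above (weights as written there; the example numbers are made up):
```python
# Sketch of the calculable reward defined above; weights as in the note.
def score(ttft_ms_p95: float, tpot_ms_mean: float,
          request_throughput: float) -> float:
    return -ttft_ms_p95 - 10.0 * tpot_ms_mean + 10.0 * request_throughput

# Example: compare two runs; the LLM optimizes one scalar instead of
# trading off TTFT/TPOT/E2E informally.
print(score(300.0, 20.0, 15.0))   # -350.0
print(score(900.0, 14.0, 16.0))   # -880.0 -> the first config wins
```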
3. Related works
[StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems](https://arxiv.org/pdf/2510.25017)
> An LLM-agent-driven auto-tuning framework for multiple storage engines: it splits "run experiments -> extract signals -> search configs -> accumulate experience" into four cooperating agents (Executor/Extractor/Searcher/Reflector) forming a closed loop
[ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training](https://arxiv.org/pdf/2511.03844)
> An agentic optimization system for LLM training (Coordinator/Analyzer/Proposal): it combines profiling-tool signals, roofline analysis, and a knowledge base of expert best practices / historical successes to automate "bottleneck diagnosis -> explained sharding/parallelism recommendations"
[STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems](https://dl.acm.org/doi/epdf/10.1145/3712285.3759887)
> An LLM autonomous-reasoning tuning system for HPC parallel file systems; its evaluation shows near-optimal configurations selected within the first few attempts
[Let the Barbarians In: How AI Can Accelerate Systems Performance Research](https://arxiv.org/pdf/2512.14806)
#### 2. Workload / Trace Replayer open-sourcing
- Tested the trace replayer; synced with Kaixi to implement needed features and fix some issues; the open-source trace replayer is now integrated into the Agentic AI Tuner
- Analyzed and processed the thinking/coder traces and prepared an anonymized dataset for open-sourcing
Some observations:
- In production, over ~10% of requests trigger a tool call and carry the tool-response content as input:
```
tool_call_cnt=1212, ratio=0.1077
<|im_start|>system\n{user_sys}{bailian_sys_prompt}<|im_end|>
<|im_start|>user\n{user_content}<|im_end|>
<|im_start|>assistant\n<think><|im_end|>
<|im_start|>user\n<tool_response>\n{tool_content}</tool_response><|im_end|>
```
- Under thinking, about 20% of requests have input_length < 8k but input_length + output_length > 8k (vs ~1% for coder), which leads to few multi-turn conversations (we only sample from one cluster; follow-up turns may be routed to different clusters)
- Each turn's thinking content is not added back into the context
- `token_ids[0]` may be `'truncated'`
#### TODO
- [x] Update the agent to support `all2all_backend`, `prefill_context_parallel_size`, `decode_context_parallel_size`, etc. in the search space; *fix the bug*
- [x] Verify cross-session prefix hits in the anonymized traces
- [ ] Add human feedback to the context
- [ ] Build a global knowledge base for the agent so knowledge transfers across tests, cutting cold-start time
- [x] Summarize related works. Currently: 1. agents serving other domains (training/storage/...); 2. BO for tuning inference configs. Summarize their status and state the differences
- [ ] Test and reproduce https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning as the BO baseline
## 01.20
#### 1. AITuner
Finished a basically usable agent tuning system; now considering next-stage goals
https://x8csr71rzs.feishu.cn/docx/FyChdSYeKoiqzAxQTqScsiqjnvd
#### 2. Trace open-sourcing
https://code.alibaba-inc.com/shensijie.ssj/qwen-bailian-usagetraces-anon/tree/main
#### TODO
- [ ] Confirm the runtime configs of different models inside Ali, as baselines for comparison
- [x] Build the global knowledge base
- [x] During ready-check polling, if the vllm process exits (non-zero return code), fail fast immediately and parse the stderr/log tail into a structured `invalid_config`
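A sketch of that fail-fast ready check (names are placeholders; assumes the engine was launched with stderr piped):
```python
# Sketch of the fail-fast ready check (names are placeholders).
import subprocess
import time
import urllib.request

def wait_ready(proc: subprocess.Popen, url: str, timeout_s: int = 3000) -> dict:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        rc = proc.poll()
        if rc is not None and rc != 0:
            # Engine died during startup: fail fast, structure the log tail.
            tail = proc.stderr.read().decode(errors="replace")[-4000:]
            return {"tag": "invalid_config", "return_code": rc,
                    "stderr_tail": tail}
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return {"tag": "ready"}
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(5)
    return {"tag": "timeout"}
```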
## 01.28
#### Agentic AITuner
- Confirmed the config actually used in production with Tao; starting to compare against the real production setup
Outstanding problems:
- vLLM's DP is not true DP: it carries synchronization and similar overheads, and its performance differs noticeably from independent replicas
- Earlier measurements were direct latency comparisons, but production splits a single 8-GPU machine into 8 instances, each serving independently; tuning by latency comparison is therefore infeasible (it would need a global scheduler), and we must switch to comparing Goodput / GPU
- Measuring goodput is slow (binary search over load; see the sketch after this list): is there an equivalent transformation? With a single test taking time $T$, measuring goodput costs $O(T \cdot \log(QPS))$; can it be cut to $O(T)$?
- In tests so far the AI Tuner has not found a better config, and its tuning direction always differs from what I expect
- Finding: the long-term memory built earlier is now actively misleading. It was accumulated while optimizing latency, and some of its insights make the LLM always prefer multi-GPU parallelism; even when told it may use replicas, it barely responds
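The goodput binary search mentioned above, sketched (`run_trial` is a placeholder that replays the trace at a given load scale and reports whether the SLO pass rate held):
```python
# Sketch of goodput measurement via binary search over the load scale.
def max_goodput(run_trial, lo: float = 0.25, hi: float = 8.0,
                eps: float = 0.05) -> float:
    # Each run_trial call costs a full test of length T, so the total cost
    # is O(T * log((hi - lo) / eps)) -- the problem noted above.
    best = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if run_trial(mid):      # SLO satisfied at this load
            best, lo = mid, mid
        else:
            hi = mid
    return best
```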
#### TODO
- [x] Fix the workload sampler issue
- [x] Fix the goodput binary search
- [x] Support disabling long-term memory
- [ ] When one replica need not occupy the whole machine, support multi-GPU parallelism
## 02.04
![[260410-105228-2.png]]
![[260410-105228-3.png]]
![[260410-105228-4.png]]
![[260410-105228-5.png]]
- SLO 5s TTFT + 0.1s TPOT | 5min
- qwen3-30b-a3b | trace A: finds a config better than baseline, but only by 5%, mainly from scheduling-batch changes that enlarge the batch size (3.36 vs 3.21)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260128-195623.jsonl
- qwen3-30b-a3b | thinking: uses tp2 + replica4 (1.51 vs 1.27)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-103543.jsonl
- qwen3-30b-a3b | coder: uses tp2 + replica4 (1.54 vs 1.24)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-101657.jsonl
- SLO (0.001L + 1s) TTFT + 0.1s TPOT | 5min
- qwen3-30b-a3b | trace A: uses tp2 + replica4 (3.31 vs 3.06)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260130-114633.jsonl
- qwen3-30b-a3b | thinking: uses "max_num_batched_tokens": 13312, "max_num_seqs": 104 (1.51 vs 1.24)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260130-115007.jsonl
- qwen3-30b-a3b | coder: uses tp2 + replica4 (1.54 vs 1.22)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-101657.jsonl
#### TODO
- [x] [Bug] After switching to Goodput-per-GPU comparison, the agent quits early, because the first 3 proposed configs all used 8 GPUs and failed
- [x] Test and analyze the coder 5min results
- [x] Test different SLOs; chose TTFT=5s, TPOT=40ms (rationale: under TP2 goodput is somewhat worse but perf is clearly better)
- [x] Support a linear TTFT SLO
- [x] Goodput should be monotonic: once the baseline is measured, all later runs should start at least from the baseline load and abort immediately if worse than baseline
- [x] Test the tuned config on a trace from a different time window
## 03.04
Experiment results: https://x8csr71rzs.feishu.cn/docx/HpkWdzRKroMq9VxWT9qcPpqEnke?from=from_copylink
Tuning-logic retrospective: https://x8csr71rzs.feishu.cn/docx/OTg7di8sdoBX0PxOhAEc4VosnCp?from=from_copylink
#### TODO
- [ ] Confirm the environment Bailian uses in production (vllm version / hardware) and run the comparison there
- [ ] See whether the agent can do more rigorous bottleneck analysis
## Ongoing
### Qwen3-30B
- SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs
- qwen3-30b-a3b | trace A: uses tp2 (2.634 vs 2.137)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-095345.jsonl
![[260410-105228-6.png]]
- qwen3-30b-a3b | thinking: uses tp4 (0.298 vs N/A; thinking uses 1200 reqs because a 3600-req test would take unacceptably long)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-194508.jsonl
![[260410-105228-7.png]]
- qwen3-30b-a3b | coder: uses tp4 (0.866 vs 0.612)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-095632.jsonl
![[260410-105228-8.png]]
- SLO (0.001L + 1s) TTFT + 0.05s TPOT | 7200 reqs
- qwen3-30b-a3b | trace A: uses tp2 (2.078 vs 1.784)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260204-011648.jsonl
![[260410-105228-9.png]]
- qwen3-30b-a3b | coder: uses tp4 (0.884 vs 0.569)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260204-144911.jsonl
### Qwen3-30B with post-compare added
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003450.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003526.jsonl
v2: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260219-144823.jsonl (b=10s)
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003543.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-234740.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-154224.jsonl
v2: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260220-104923.jsonl (b=10s)
b=1s: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104251.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-234705.jsonl
### Qwen3-235B [FP8]
- qwen3-max as the agent
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.486 / 0.386
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222655.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
0.175 / 0.142
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260212-002234.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.347 / 0.313
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222825.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
0.491 / 0.415
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260211-133931.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.180 / 0.144
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222757.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
0.366 / 0.290
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260211-192141.jsonl
- deepseek-r1 as the agent
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.440 / 0.386
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-110457.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
0.177 / 0.148
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260212-183021.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-163437.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-110941.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.182 / 0.144
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-110427.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-164917.jsonl
### Comparison under different SLOs
- Qwen3-30-A3B | qwen3-max | SLO (0.001L + 1s) TTFT + 0.02s TPOT
- Trace A | | 1200 reqs | tail 1200 reqs test
1.173 / 0.771
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-112827.jsonl
- Thinking | 600 reqs | tail 600 reqs test
0.258 / 0.078
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260226-193932.jsonl
- Coder | 1200 reqs | tail 1200 reqs test
0.644 / 0.450
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-224109.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.0005L + 0.5s) TTFT + 0.05s TPOT
- Trace A | | 1200 reqs | tail 1200 reqs test
0.258 / 0.201
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-112420.jsonl
- Thinking | 600 reqs | tail 600 reqs test
0.098 / 0.074
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260221-111155.jsonl
- Coder | 1200 reqs | tail 1200 reqs test
0.219 / 0.163
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260220-180145.jsonl
### Full performance trends
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Trace A | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104637.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Thinking | 600 reqs | tail 600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260228-115937.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Coder | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260225-104917.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Trace A | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260227-104815.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Coder | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260227-104732.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Thinking | 1200 reqs | tail 1200 reqs test
==waiting on dash2==
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --time-scale 4.0 --objective 'maxmize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only --disable-goodput-best-scale-compare
```
### Latency Compare
Qwen3-235B-A22B | Trace A | 60min | TP4
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-121540/compare_summary.json
Qwen3-235B-A22B | Thinking | 60min | TP4
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-162510/compare_summary.json
Qwen3-235B-A22B | Coder | 60min
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-162616/compare_summary.json (TP4 is bad, match the AITuner)
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-212148/compare_summary.json (TP8)
https://nas.gahow.org/webdav/trace_compare/dash/compare_260302-001737/compare_summary.json
Qwen3-235B-A22B | Coder | 30min | TP8
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-225747/compare_summary.json
Qwen3-30B-A3B | Trace A | 30min | TP2
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-225950/compare_summary.json
```bash
./run_latency_cmp.sh --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --min-requests 3600
./run_latency_cmp.sh --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --trace-minutes 60 --candidate-config baseline/qwen3-235b-a22b_prefill_ep_tp8.json --baseline-time-scale 1.0 --candidate-time-scale 2.0
```
### Reasoning summary
- [ ] Summarize the agentic AITuner's analyses: how configs are derived from the context
Trace A `ai_tuner_260210-222655.llm.parsed.jsonl`
The run first tries changing TP and sees no gain; it then enables EP + high_throughput and sees a clear gain (1.07 vs 0.88); it then tweaks batching-related parameters, finds that larger batching causes SLO violations, and rolls back to the default batching parameters. Interestingly, at that point the model re-runs the best config found so far to confirm its performance before making the recommendation.
The LLM notices that changing the EP backend brings little benefit and quickly gives up trying other EP backends.
Problems:
- TP=2 had already failed once, yet the LLM recommended it again
- Much time is spent re-proposing the same config just to test it at different time_scales
> After extensive exploration of 16 successful configurations, the optimal setup for maximizing goodput per GPU while meeting SLOs is to enable expert parallelism with the deepep_high_throughput all2all backend. This configuration consistently achieves ~1.09 goodput QPS per GPU at time_scale=1.33, which is a 23% improvement over the baseline (0.88). Key findings: 1) Expert parallelism is crucial for this MoE model, 2) Conservative batching parameters (8192 tokens, 64 sequences) are necessary to maintain SLO compliance, 3) Block size of 32 performs slightly better than 16, 4) Other all2all backends (low_latency, flashinfer, pplx) don't provide significant benefits over high_throughput, 5) Pipeline parallelism and smaller tensor parallelism configurations either failed or didn't improve performance.
- Forcing extra steps after the LLM has already decided to finalize is useless; it just keeps making pointless attempts.
- The model does not do rigorous logical deduction or bottleneck analysis of the form "since I observe xxx, and based on learned yyy, I conclude that enabling EP mitigates the problem". What it actually does is: "changing TP didn't help much; this is a MoE model, so let me try enabling EP"
Trace A `ai_tuner_260210-222825.llm.parsed.jsonl`
### Ali Deployment
[[Ali Deployment]]
Qwen3-235B prod configs:
1. 【Prod】【P】【0-8K&SSE】【bj-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.5a39103cAE5kRw
2. 【Prod】【Chat】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-chat/deployments/qwen3-235b-a22b-chat-1b9d
3. 【Prod】【bj-96-d】【fp4】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-model/deployments/qwen3-235b-a22b-model-8b5d?spm=43a6e6f6.5d389483.0.0.7eb6103cee6UDx
#### Trace A | 10min
- Ali Baseline
```
"ttft_mean_ms": "855.756",
"ttft_p50_ms": "394.927",
"ttft_p90_ms": "2053.206",
"ttft_p95_ms": "2908.505",
"ttft_p99_ms": "4271.465"
```
- Community version
```
"ttft_mean_ms": "516.033",
"ttft_p50_ms": "187.580",
"ttft_p90_ms": "1301.450",
"ttft_p95_ms": "1727.253",
"ttft_p99_ms": "3665.331"
```
```bash
python3 scripts/check_prod_dashllm_flow.py --model-path /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 --model-name pre-engine-config-tuning-research --endpoint https://poc-dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation --engine-config-path /usr/local/lib/python3.12/dist-packages/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20.config --authorization-bearer-token-env sk-1fdaf28f959c442996634bd38f6e07b4
```
`/usr/local/bin/dashllm_cmd serving`
`/usr/local/lib/python3.12/dist-packages/vllmgen` contains the configs actually in use
the configs under `/usr/local/lib/python3.12/dist-packages/dashgen` are stale
`DS_LLM_MULTI_ENGINE_NUM=2`
`export DS_SERVING_PROTOCOL=openai`
For EP
```
VLLM_MOE_USE_DEEPEP
VLLM_USE_DEEP_GEMM
"enable_expert_parallel": true,
```
```
AcclEP: checkhang=0, hangtimeout=0
AcclEP: checkhang=0, hangtimeout=0
AcclEP: checkhang=0, hangtimeout=0
[0/4,TP0/EP0][pid=126019] AcclEP: Buffer init with: nRanks 4, num_nvl_bytes: 104.940416MB, num_rdma_bytes: 270.009472MB, low_latency_mode: True, num_qps_per_rank: 24, allow_nvlink_for_low_latency_mode: True, allow_mnnvl: False
AcclEP version: 1.1.0.9.4+384c657-pai
AcclEP: ACCL_LOW_LATENCY_OPTIMIZE = 2
AcclEP: NVSHMEM_IBGDA_NUM_RC_PER_PE = 4
AcclEP: checkhang=0, hangtimeout=0
AcclEP: Initialize low latency buffer:
rank = 0
num_ranks = 4
num_nvl_bytes = 104940416
num_rdma_bytes = 270 MB
low_latency_mode = 1
dispatch_use_fp8 = 1
combine_use_fp8 = 0
low_latency_optimize = 2
use_single_buffer = 0
buffer_opt_fp8 = 0
buffer_max_tokens = 0
buffer_max_topk = 0
```
Fixing the trace-replayer build on the dash machine:
```bash
# install ssl-related packages
sudo apt update
sudo apt install -y pkg-config libssl-dev
# work around the build problem on nfs by building locally
export CARGO_HOME=/tmp/cargo
export CARGO_TARGET_DIR=/tmp/target
```
`DASH_GEN_ENABLE=1` is required to make sure dashgen is used
`export MODEL_PATH=/root/dashllm/serving/AnyModel/model`
TODO:
- [x] Launch dashllm and send HTTP requests locally
> Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
> [WARNING] 2026-03-06 00:17:44,517037: {"filename":"vllm.py","lineno":784,"message":"DeepEP (expert parallel) doesn't support full cudagraphs for prefill/mixed batches. Overriding c
### Comparing the capabilities of different agent models
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --time-scale 2.0 --objective 'maxmize goodput per GPU' --llm-provider prism --llm-model 'claude-sonnet-4-6' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only
```
Coder | 1200 reqs
- Sonnet-4.6
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260308-174846.jsonl
- GPT-5.2
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260309-105131.jsonl
- Gemini-3.1-pro
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260310-084234.jsonl
### Testing codex doing the tuning directly
[[codex-problems]]
`codex -a never -s workspace-write`
- v3: refined from v2
```
You are an autonomous systems tuning agent.
Your task is to discover the best configuration for a vLLM serving system
by running real experiments.
Your objective is to maximize throughput per GPU while satisfying strict
SLO constraints.
You must design the tuning strategy, run experiments, analyze results,
and iteratively improve the configuration.
You are starting from an EMPTY PROJECT DIRECTORY.
You must build the minimal experimental framework needed to complete
this task.
Your goal is NOT simply to find a good configuration quickly.
Your goal is to explore the configuration space sufficiently and provide
evidence that the final configuration is close to optimal for this workload.
Prematurely declaring a configuration as "best" without sufficient
exploration is considered an incomplete solution.
--------------------------------------------------
Workspace and File Access Rules (STRICT)
--------------------------------------------------
Your primary workspace is the CURRENT PROJECT DIRECTORY.
You may freely read and modify any file located inside this directory.
This includes:
- scripts you create
- experiment logs
- helper utilities
- the Python virtual environment located at:
.venv/
You may also inspect the vLLM source code located at:
.venv/lib/python3.12/site-packages/vllm
Reading this code is allowed so you can understand:
- available configuration parameters
- scheduler behavior
- KV cache implementation
- MoE / expert parallel configuration
- batching and token scheduling logic
--------------------------------------------------
Allowed External Read-Only Resources
--------------------------------------------------
You are allowed to READ (but not modify) the following paths.
Model directory:
~/models/Qwen/Qwen3-235B-A22B-FP8
You may inspect the model files to understand:
- model architecture
- tensor shapes
- layer structure
- MoE configuration
- hidden sizes
- attention configuration
Workload trace:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
You may read this file to load requests.
--------------------------------------------------
Forbidden Paths
--------------------------------------------------
You MUST NOT read or use code from any other locations.
Do NOT access:
- parent directories (..)
- other repositories
- external tuning frameworks
- any existing ai_tuner implementations
- system-wide Python installations
- unrelated directories
All tuning logic must be implemented by you
inside the current project directory.
--------------------------------------------------
Environment Setup
--------------------------------------------------
Before running any experiment activate the environment:
source .venv/bin/activate
Use the vLLM installation provided in this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use ONLY the FIRST 1200 requests in the trace.
Evaluation mode:
prefill-only
--------------------------------------------------
Serving Requirement
--------------------------------------------------
You must run experiments using the vLLM SERVING ENGINE.
Launch the system using the server interface
(e.g. `vllm serve` or equivalent server mode).
Do NOT use the offline LLM inference API.
The system must behave as a real serving system with
request batching and scheduling.
--------------------------------------------------
Experimental Framework
--------------------------------------------------
You must implement a minimal experiment framework in this directory.
You may create scripts such as:
trace_loader.py
workload_runner.py
launch_vllm.py
metrics.py
experiment_runner.py
tuner.py
These scripts should allow you to:
1. Load the first 1200 requests from the workload trace
2. Launch vLLM with a specified configuration
3. Send requests to the serving system
4. Run evaluation in prefill-only mode
5. Measure latency and throughput metrics
6. Compute SLO pass rate
7. Log experiment results
All experiments must be reproducible.
--------------------------------------------------
Optimization Objective
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to the following SLO constraints.
1. TTFT constraint
TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length of the request.
2. TPOT constraint
TPOT <= 0.1 seconds
3. SLO pass rate
>= 95%
Only configurations satisfying ALL constraints
are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
You may tune any relevant vLLM configuration parameters
supported by the installed version.
Examples include (but are not limited to):
- tensor_parallel_size
- pipeline_parallel_size
- data_parallel_size
- expert_parallel_size
- block_size
- max_num_batched_tokens
- max_num_seqs
- gpu_memory_utilization
- kv cache parameters
- scheduler parameters
- CUDA graph settings
- kernel optimization flags
You should inspect the vLLM source code if necessary
to discover available configuration options.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
Every experiment MUST append a JSON object to:
tuning_results.jsonl
Example format:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This log file must allow reconstruction of the full tuning trajectory.
--------------------------------------------------
Exploration Memory
--------------------------------------------------
You must maintain an exploration memory file:
exploration_ledger.json
This file should track:
- tested configurations
- configurations that violate SLO
- current best configuration
- hypotheses about parameter effects
- unexplored promising regions
Future experiments must use this memory to guide search.
--------------------------------------------------
Search Strategy
--------------------------------------------------
Before running experiments you must propose a search plan.
The plan should describe:
- which parameters will be explored first
- which parameters are likely to impact throughput
- how initial exploration will cover diverse regions
- how later experiments will refine promising areas
The search should typically include the following phases:
1. Initial exploration
Evaluate diverse configurations to understand
the performance landscape.
2. Frontier discovery
Identify promising regions of the configuration space.
3. Local refinement
Perform targeted experiments near the best configurations.
4. Verification
Confirm the best configuration through repeated runs
and nearby configurations.
--------------------------------------------------
Search Termination Criteria
--------------------------------------------------
You must continue exploring the configuration space
until there is evidence that performance has converged.
Search may stop when ALL of the following conditions hold:
1. The best throughput_per_GPU has not improved by more
than 2% over a significant number of recent experiments.
2. Several neighboring configurations of the current best
configuration have been tested.
3. Additional exploration appears unlikely to produce
significant improvements.
You must explain why the search can reasonably stop.
--------------------------------------------------
Local Optimality Verification (MANDATORY)
--------------------------------------------------
Before declaring the final configuration you MUST test
multiple configurations that are small variations of
the current best configuration.
Examples include:
- changing block_size
- adjusting max_num_batched_tokens
- adjusting max_num_seqs
- small changes in parallelism configuration
If a neighboring configuration performs better while
satisfying SLO constraints, the search must continue.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
All configurations must be evaluated using the SAME dataset:
first 1200 requests from
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Evaluation mode:
prefill-only
For each configuration measure:
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
All runs must use the same measurement procedure.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
When tuning finishes output the following.
1. Best configuration
Provide as JSON or YAML.
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Total number of experiments executed
6. Table of the TOP 10 configurations ranked by throughput_per_GPU
7. Path to the full experiment log
tuning_results.jsonl
8. Explanation of why the chosen configuration is likely
close to optimal within the explored search space.
--------------------------------------------------
Goal
--------------------------------------------------
Discover the best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints.
The final configuration must be supported by real
experimental evidence and maximize:
throughput_per_GPU
```
- v2: fully independent tuning
```
You are an autonomous systems tuning agent.
Your task is to discover the best configuration for a vLLM serving system
by running real experiments.
Your objective is to maximize throughput per GPU while satisfying strict
SLO constraints.
You must design the tuning strategy, run experiments, analyze results,
and iteratively improve the configuration.
You are starting from an EMPTY PROJECT DIRECTORY.
You must build the minimal experimental framework needed to complete
this task.
--------------------------------------------------
Workspace and File Access Rules (STRICT)
--------------------------------------------------
Your primary workspace is the CURRENT PROJECT DIRECTORY.
You may freely read and modify any file located inside this directory.
This includes:
- scripts you create
- experiment logs
- helper utilities
- the Python virtual environment located at:
.venv/
You may also inspect the vLLM source code located at:
.venv/lib/python3.12/site-packages/vllm
Reading this code is allowed so you can understand:
- available configuration parameters
- scheduler behavior
- KV cache implementation
- MoE / expert parallel configuration
- batching and token scheduling logic
--------------------------------------------------
Allowed External Read-Only Resources
--------------------------------------------------
You are allowed to READ (but not modify) the following paths:
Model directory:
~/models/Qwen/Qwen3-235B-A22B-FP8
You may inspect the model files to understand:
- model architecture
- tensor shapes
- layer structure
- MoE configuration
- hidden sizes
- attention configuration
Workload trace:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
You may read this file to load requests.
--------------------------------------------------
Forbidden Paths
--------------------------------------------------
You MUST NOT read or use code from any other locations.
Do NOT access:
- parent directories (..)
- other repositories
- external tuning frameworks
- any existing ai_tuner implementations
- system-wide Python installations
- unrelated directories
All tuning logic must be implemented by you
inside the current project directory.
--------------------------------------------------
Environment Setup
--------------------------------------------------
Before running any experiment, activate the environment:
source .venv/bin/activate
Use the vLLM installation provided in this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use ONLY the FIRST 1200 requests in the trace.
Evaluation mode:
prefill-only
--------------------------------------------------
Serving Requirement
--------------------------------------------------
You must run experiments using the vLLM SERVING ENGINE.
Launch the system using the server interface
(e.g. `vllm serve` or equivalent server mode).
Do NOT use the offline LLM inference API.
The system must behave as a real serving system with
request batching and scheduling.
--------------------------------------------------
Experimental Framework
--------------------------------------------------
You must implement a minimal experiment framework in this directory.
You may create scripts such as:
trace_loader.py
workload_runner.py
launch_vllm.py
metrics.py
experiment_runner.py
tuner.py
These scripts should allow you to:
1. Load the first 1200 requests from the workload trace
2. Launch vLLM with a specified configuration
3. Send requests to the serving system
4. Run evaluation in prefill-only mode
5. Measure latency and throughput metrics
6. Compute SLO pass rate
7. Log experiment results
All experiments must be reproducible.
--------------------------------------------------
Optimization Objective
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to the following SLO constraints.
1. TTFT constraint
TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length of the request.
2. TPOT constraint
TPOT <= 0.1 seconds
3. SLO pass rate
>= 95%
Only configurations satisfying ALL constraints
are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
You may tune any relevant vLLM configuration parameters
supported by the installed version.
Examples include (but are not limited to):
- tensor_parallel_size
- pipeline_parallel_size
- data_parallel_size
- expert_parallel_size
- block_size
- max_num_batched_tokens
- max_num_seqs
- gpu_memory_utilization
- kv cache parameters
- scheduler parameters
- CUDA graph settings
- kernel optimization flags
You should inspect the vLLM source code if necessary
to discover available configuration options.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
Every experiment MUST append a JSON object to:
tuning_results.jsonl
Example format:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This log file must allow reconstruction of the full tuning trajectory.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
All configurations must be evaluated using the SAME dataset:
first 1200 requests from
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Evaluation mode:
prefill-only
For each configuration measure:
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
All runs must use the same measurement procedure.
--------------------------------------------------
Search Strategy
--------------------------------------------------
Use a structured exploration strategy.
Recommended phases:
1. Initial exploration
Run diverse configurations to understand
the performance landscape.
2. Frontier discovery
Identify promising regions of the configuration space.
3. Local refinement
Fine-tune parameters near the best configurations.
4. Confirmation
Re-run the best configuration to verify stability.
Avoid wasting time on configurations that clearly violate
SLO constraints.
Avoid re-running identical configurations unless verifying
the best result.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
When tuning finishes, output the following.
1. Best configuration
Provide as JSON or YAML.
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Summary table of key tested configurations
6. Path to the full experiment log
tuning_results.jsonl
7. Short explanation of why the chosen configuration
is optimal.
--------------------------------------------------
Goal
--------------------------------------------------
Discover the best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints.
The final configuration must be supported by real
experimental evidence and maximize:
throughput_per_GPU
```
prompt:
```
You are an autonomous systems tuning agent.
Your goal is to tune the configuration of a vLLM serving system to maximize throughput per GPU for a specific workload, while satisfying strict SLO constraints.
You must run real experiments, analyze results, and iteratively improve the configuration.
--------------------------------------------------
Environment Setup
--------------------------------------------------
First activate the Python environment:
source .venv/bin/activate
Use the vLLM installation inside this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use only the FIRST 1200 requests in the trace as the evaluation dataset.
Evaluation mode:
prefill-only
--------------------------------------------------
Important Restrictions
--------------------------------------------------
1. DO NOT use my ai_tuner code or its search logic.
2. You MAY use simple helper utilities such as workload samplers or trace loaders.
3. All tuning logic must be designed and executed by yourself.
4. Every configuration must be evaluated by actually running the system.
5. Avoid repeating identical configurations unless confirming the best result.
--------------------------------------------------
Optimization Goal
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to SLO constraints:
1. TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length.
2. TPOT <= 0.1 seconds
3. SLO pass rate >= 95%
Only configurations satisfying all SLO constraints are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
Tune relevant vLLM serving parameters, including (but not limited to):
- tensor_parallel_size
- pipeline_parallel_size (if applicable)
- data_parallel_size (if applicable)
- expert_parallel_size / MoE related options
- block_size
- max_num_batched_tokens
- max_num_seqs
- scheduling / batching parameters
- gpu_memory_utilization
- kv-cache settings
- cuda graph settings
- compilation / kernel optimization knobs
Only tune parameters supported by the installed vLLM version.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
You MUST record every experiment result in a file:
tuning_results.jsonl
Each experiment must append one JSON object to this file.
Example schema:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This file must allow someone to reconstruct the entire tuning trajectory and observe how performance evolves during the search.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
Use the first 1200 requests from:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Run the workload in:
prefill-only mode
For each configuration measure:
- throughput
- throughput per GPU
- TTFT statistics
- TPOT statistics
- SLO pass rate
All runs must use the same dataset and measurement procedure.
--------------------------------------------------
Tuning Strategy
--------------------------------------------------
Use a structured exploration strategy.
Suggested approach:
1. Initial exploration
- test a diverse set of configurations to understand the performance landscape
2. Frontier search
- identify promising regions
3. Local refinement
- fine-tune parameters near the best configs
4. Confirmation
- rerun the final best configuration once more
Avoid wasting time on configurations that clearly violate SLO constraints.
--------------------------------------------------
Execution Policy
--------------------------------------------------
You are allowed to:
- inspect repository files
- write helper scripts
- write workload runners
- write measurement scripts
If necessary, create scripts to:
- load the first 1200 requests
- launch vLLM with a given config
- run prefill-only evaluation
- compute metrics
- append results to tuning_results.jsonl
Ensure experiments are reproducible.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
At the end of the tuning process output:
1. Best configuration (JSON or YAML)
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Summary table of key tested configurations
6. Path to the full experiment log:
tuning_results.jsonl
7. Brief explanation of why the chosen configuration is optimal.
--------------------------------------------------
Goal
--------------------------------------------------
Find the single best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints, based on real experimental evidence.
```
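All three prompts share the same SLO definition and logging contract. A minimal sketch of both, following the example schema above; the per-request inputs (`input_length`, TTFT/TPOT in seconds) are assumptions about what the workload runner records:
```python
import json

def request_slo_ok(input_length, ttft_s, tpot_s):
    # Per-request SLO from the prompts:
    # TTFT <= 0.001 * L + 1.0 seconds, TPOT <= 0.1 seconds.
    return ttft_s <= 0.001 * input_length + 1.0 and tpot_s <= 0.1

def log_experiment(experiment_id, config, metrics,
                   path="tuning_results.jsonl"):
    # One JSON object per line, appended, so the file preserves
    # the full tuning trajectory (required by all three prompts).
    record = {
        "experiment_id": experiment_id,
        "config": config,
        "metrics": metrics,
        "slo_satisfied": metrics["slo_pass_rate"] >= 0.95,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```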
### Observations
- The long-term memory produced by tuning is too specific to generalize: it looks like the agent directly summarized which configs in the current run were good or bad and what effect changing individual flags had (e.g., a raw insight like "TP=4 is good"). Such memory may actually hurt transfer; the model never distills the higher-level, transferable (model, hardware) -> config knowledge.
- An interesting point: on the thinking trace, a 50 ms TPOT SLO is violated already at the baseline, even with a sufficiently small time_scale (0.0625), so the characteristics of the trace/model/hardware determine which SLOs are feasible in the first place.
- During the trend adjustments, a single core config knob can decisively dominate performance, while the other knobs are largely minor touch-ups (see https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104637.jsonl)
- For Qwen3-30B on the thinking trace, the SLO "TTFT (0.001L + 1s) + 0.02 s TPOT" makes a large share of configs illegal outright; runs then drag on for a long time, eventually accumulating a context longer than the model's maximum context length and throwing an error.
### Long-term TODO
- [x] Support testing the tuned config on a trace from a different time period after tuning
- [x] Re-test the tuned config on a trace from another time period and compare performance
- [ ] A time-scaled trace is regenerated for every benchmark, which is redundant; fixable [low priority]
- [x] Confirm the Qwen3-30B results for the thinking trace with 1200 requests
- [ ] SLO pass rate / TTFT distribution bucketed by L_in (0-1k, 1k-2k, ..., 4k-8k)
- [x] Implement a script that compares two configs directly at the same time_scale, without binary search
- [ ] Compare performance against the internal vllm
- [x] Use `--disable-goodput-best-scale-compare` to get the trends plots
- [ ] Implement explain_tradeoff(a, b) → compare two configs
- [ ] Compare against bailian's vllm
```json
"best": {"run_id": "agent-3-goodput", "success": true, "aggregated": {"latency_ms_mean": 20224.639, "latency_ms_p50": 14211.797, "latency_ms_p90": 45184.48, "latency_ms_p95": 56841.231, "latency_ms_p99": 86661.03, "ttft_ms_mean": 263.386, "ttft_ms_p50": 180.564, "ttft_ms_p90": 542.365, "ttft_ms_p95": 752.534, "ttft_ms_p99": 1173.739, "tpot_ms_mean": 53.722, "tpot_ms_p50": 51.458, "tpot_ms_p90": 93.986, "tpot_ms_p95": 98.752, "tpot_ms_p99": 107.601, "overall_qps": 6.117, "overall_tokens_per_second": 13851.267873746021, "decode_tokens_throughput": 2285.326, "total_requests": 2114.0, "decode_tokens_total": 789757.0, "tokens_per_second_mean": 26.38564782306047, "tokens_per_second_p50": 19.931698614673515, "tokens_per_second_p95": 58.066780649606464, "tokens_per_second_p99": 85.59372084463882, "request_throughput": 6.117, "notes_count": 1.0, "failed_request_count": 0.0, "ttft_slo_pass_rate": 0.9981078524124882, "goodput_qps": 26.534677249842353, "goodput_qps_per_gpu": 3.316834656230294, "goodput_tps": 55405.071494984084, "goodput_tps_per_gpu": 6925.6339368730105, "goodput_slo_ok": 1.0}, "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "data_parallel_size": 1, "prefill_context_parallel_size": null, "decode_context_parallel_size": null, "block_size": 32, "max_num_batched_tokens": 131072, "max_num_seqs": 1024, "scheduling_policy": null, "gpu_memory_utilization": null, "enable_expert_parallel": null, "all2all_backend": null, "max_model_len": 65536, "enable_prefix_caching": true, "extra": null}}
```
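Sanity check on the record above: goodput_qps_per_gpu = goodput_qps / 8 = 26.534 / 8 ≈ 3.317, i.e., the per-GPU figure is normalized by all 8 GPUs of the replica-scaled node, not by the TP=2 of a single instance.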
Test TODO
- [x] qwen3-30b-a3b | traceA | 60min
- [x] qwen3-30b-a3b | thinking | 60min
- [x] qwen3-30b-a3b | coder | 60min
- [x] qwen3-235b-a22b | traceA | 60min
- [x] qwen3-235b-a22b | thinking | 60min
- [x] qwen3-235b-a22b | coder | 60min
#### Misc
Comparison of vllm latest vs. the bailian vllm: vllm latest performs better (on the same workload: TTFT mean 458 ms vs. 826 ms, TPOT mean 35.8 ms vs. 58.5 ms)
- vllm latest
```json
{
"latency_ms_mean": 13343.749,
"latency_ms_p50": 10129.267,
"latency_ms_p90": 28361.724,
"latency_ms_p95": 36335.392,
"latency_ms_p99": 62868.229,
"ttft_ms_mean": 458.187,
"ttft_ms_p50": 154.414,
"ttft_ms_p90": 1132.518,
"ttft_ms_p95": 2089.66,
"ttft_ms_p99": 3884.956,
"tpot_ms_mean": 35.788,
"tpot_ms_p50": 32.182,
"tpot_ms_p90": 59.733,
"tpot_ms_p95": 64.583,
"tpot_ms_p99": 72.093,
"overall_qps": 2.735,
"overall_tokens_per_second": 5624.112937416796,
"decode_tokens_throughput": 982.428,
"total_requests": 926.0,
"decode_tokens_total": 332623.0,
"tokens_per_second_mean": 35.821798532876514,
"tokens_per_second_p50": 32.8715120938958,
"tokens_per_second_p95": 61.36676048962492,
"tokens_per_second_p99": 120.0,
"request_throughput": 2.735,
"notes_count": 0.0,
"failed_request_count": 0.0,
"ttft_slo_pass_rate": 0.9503239740820735,
"goodput_qps": 24.5244744278846,
"goodput_qps_per_gpu": 3.065559303485575,
"goodput_tps": 44992.90349933437,
"goodput_tps_per_gpu": 5624.112937416796,
"goodput_slo_ok": 1.0
}
```
- bailian
```json
{
"duration_ms": "361609.694",
"e2e_mean_ms": "22361.979",
"e2e_p90_ms": "44006.550",
"e2e_p95_ms": "51007.482",
"e2e_p99_ms": "87502.531",
"output_tokens_total": "332623",
"requests_success": "579",
"requests_total": "926",
"throughput_rps": "2.561",
"throughput_tps": "919.840",
"tpot_mean_ms": "58.537",
"tpot_p90_ms": "84.703",
"tpot_p95_ms": "104.625",
"tpot_p99_ms": "132.758",
"ttft_mean_ms": "826.176",
"ttft_p90_ms": "2040.190",
"ttft_p95_ms": "5247.990",
"ttft_p99_ms": "6321.622"
}
```
```json
{
"duration_ms": "357189.716",
"e2e_mean_ms": "24407.132",
"e2e_p90_ms": "46008.208",
"e2e_p95_ms": "57420.132",
"e2e_p99_ms": "91012.090",
"output_tokens_total": "789757",
"requests_success": "969",
"requests_total": "2114",
"throughput_rps": "5.918",
"throughput_tps": "2211.030",
"tpot_mean_ms": "69.447",
"tpot_p90_ms": "119.914",
"tpot_p95_ms": "135.565",
"tpot_p99_ms": "179.668",
"ttft_mean_ms": "316.557",
"ttft_p90_ms": "790.313",
"ttft_p95_ms": "970.607",
"ttft_p99_ms": "1743.996"
}
```
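A side note on the comparison above: the two summaries use different field names (`ttft_ms_mean` vs. `ttft_mean_ms`) and bailian serializes numbers as strings, so a direct diff needs a small normalization shim. A minimal sketch; the alias map covers only the fields shown above, and the file paths are hypothetical:
```python
import json

# Map bailian's field names onto the vllm-latest schema.
ALIASES = {
    "ttft_mean_ms": "ttft_ms_mean",
    "ttft_p95_ms": "ttft_ms_p95",
    "tpot_mean_ms": "tpot_ms_mean",
    "tpot_p95_ms": "tpot_ms_p95",
    "throughput_rps": "overall_qps",
}

def load_summary(path):
    with open(path) as f:
        raw = json.load(f)
    # Normalize names and coerce bailian's string-encoded numbers.
    return {ALIASES.get(k, k): float(v) for k, v in raw.items()}

def compare(path_a, path_b,
            keys=("ttft_ms_mean", "ttft_ms_p95", "tpot_ms_mean", "overall_qps")):
    a, b = load_summary(path_a), load_summary(path_b)
    for k in keys:
        print(f"{k}: {a[k]:.3f} vs {b[k]:.3f} ({a[k] / b[k]:.2f}x)")
```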
---
## Test Commands
```bash
python3 scripts/print_vllm_launch_cmd.py --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --config-file test_config.json
```
```bash
vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-235B-A22B --port 10086 --tensor-parallel-size 4 --block-size 64 --max-num-batched-tokens 8192 --max-num-seqs 64 --gpu-memory-utilization 0.8 --max-model-len 131072 --enable-prefix-caching --cudagraph-capture-sizes 16 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608 640 672 704 736 768 800 832 864 896 928 960 992 1024 --disable-hybrid-kv-cache-manager --enable-chunked-prefill --kv-cache-dtype fp8 --quantization fp8 --swap-space 0
```
- cmd template
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only
```
```bash
./run_ai.sh --trace-minutes 60 --max-requests 3600 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-30b-a3b.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-trace-minutes 60 --compare-max-requests 3600
```
```sh
./run_ai.sh --trace-minutes 5 --max-llm-calls 100 --max-steps 100 --ttft-slo 5 --tpot-slo 0.1 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen_traceA_blksz_16.jsonl --time-scale 8.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-30b-a3b.json --min-proposed-runs 16 --disable-memory-context --disable-cache
```
```sh
./run_ai.sh --trace-minutes 60 --max-llm-calls 30 --ttft-slo 5 --tpot-slo 0.1 --trace-path qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'deepseek-r1' --replica-eval-mode replica_scaling
```
```json
{
"run_config": {
"run_id": "agent-1-probe-0",
"env": {
"hardware_sig": "NVIDIA H20-47121c5fff",
"model_sig": "Qwen3-30B-A3B-f1f030d7e6",
"workload_sig": "qwen_traceA_blksz_16-f006af7ea7"
},
"engine": "trace-replayer",
"vllm_config": {
"tensor_parallel_size": 8,
"pipeline_parallel_size": null,
"data_parallel_size": 1,
"prefill_context_parallel_size": null,
"decode_context_parallel_size": null,
"block_size": 32,
"max_num_batched_tokens": 65536,
"max_num_seqs": 512,
"scheduling_policy": null,
"gpu_memory_utilization": null,
"enable_expert_parallel": true,
"all2all_backend": "allgather_reducescatter",
"max_model_len": 65536,
"enable_prefix_caching": null,
"extra": null
},
"workload": {
"workload_name": "qwen_traceA_blksz_16",
"qps": 40.65387538500513,
"avg_prompt_tokens": 2146.006614190445,
"p95_prompt_tokens": 8515.0,
"avg_decode_tokens": 413.138952662075,
"p95_decode_tokens": 997.0,
"request_type": "trace_replay",
"source_file": "qwen_traceA_blksz_16.jsonl",
"duration_minutes": 7.4999,
"time_scale": "8.0",
"p99_total_tokens": 14357.0
},
"objective": "maxmize goodput per GPU",
"extra": {
"trace_path": "qwen_traceA_blksz_16.jsonl",
"trace_minutes": 5.0,
"time_scale": 8.0,
"min_input_length": null,
"max_input_length": null,
"min_requests": 100,
"show_progress": true,
"model_path": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"trace_replayer_bin": "./trace-replayer/target/release/client",
"trace_replayer_endpoint": "http://localhost:10086/v1/chat/completions",
"trace_replayer_tokenizer": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B//tokenizer.json",
"trace_replayer_tokenizer_config": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B//tokenizer_config.json",
"trace_replayer_api": "openai",
"trace_replayer_dataset": "bailian",
"trace_replayer_dataset_path": "qwen_traceA_blksz_16.jsonl",
"trace_replayer_scale_factor": 8.0,
"trace_replayer_time_secs": 300.0,
"trace_replayer_num_producer": "32",
"trace_replayer_channel_capacity": "40960",
"trace_replayer_metric_percentile": "50,90,95,99",
"trace_replayer_early_stop_error_threshold": 3,
"trace_replayer_model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/",
"trace_replayer_engine_launch_cmd": "vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/ --port 10086",
"trace_replayer_engine_launch_cwd": null,
"trace_replayer_engine_ready_timeout_s": 600.0,
"trace_replayer_engine_ready_url": "http://localhost:10086/v1/models",
"trace_replayer_ttft_slo": 5.0,
"trace_replayer_tpot_slo": 0.1,
"early_stop_ttft_ms": null,
"early_stop_tpot_ms": null,
"early_stop_window_s": 60.0,
"early_stop_factor": 10.0,
"early_stop_min_requests": 10,
"goodput_slo_percentile": "p95",
"replica_eval_mode": "replica_scaling",
"workload_qps": 3.2163245151324062,
"workload_cache_key": {
"trace_source": "qwen_traceA_blksz_16.jsonl",
"trace_minutes": 5.0,
"time_scale": 8.0,
"min_input_length": null,
"max_input_length": null,
"sample_policy": "head",
"sample_seed": null,
"max_requests": null,
"trace_replayer_dataset": "bailian",
"trace_replayer_api": "openai"
},
"engine_cache_key": {
"model_path": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"trace_replayer_model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/",
"trace_replayer_engine_launch_cmd": "vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/ --port 10086"
},
"evaluation_group_key": "f2831aa3c793",
"replica_scale_factor": 1,
"goodput_total_gpus": 8
},
"use_cache": false
},
"hardware": {
"gpu_type": "NVIDIA H20",
"num_gpus": 8,
"hbm_gb": 95.5771484375,
"nvlink_topology": "nvlink",
"cpu_cores": 160,
"system_memory_gb": 927.5078773498535,
"world_size": 8
},
"model": {
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"param_count": 20767571968,
"hidden_size": 2048,
"num_layers": 48,
"num_heads": 32,
"is_moe": true,
"num_experts": 128,
"is_mla": false,
"num_kv_heads": 4,
"num_experts_per_tok": 8,
"vocab_size": 151936,
"max_position_embeddings": 65536
},
"workload": {
"workload_name": "qwen_traceA_blksz_16",
"qps": 40.65387538500513,
"avg_prompt_tokens": 2146.006614190445,
"p95_prompt_tokens": 8515.0,
"avg_decode_tokens": 413.138952662075,
"p95_decode_tokens": 997.0,
"request_type": "trace_replay",
"source_file": "qwen_traceA_blksz_16.jsonl",
"duration_minutes": 7.4999,
"time_scale": "8.0",
"p99_total_tokens": 14357.0
},
"environment": {
"hardware_sig": "NVIDIA H20-47121c5fff",
"model_sig": "Qwen3-30B-A3B-f1f030d7e6",
"workload_sig": "qwen_traceA_blksz_16-f006af7ea7"
}
}
```
---
# Misc
- GPU over-provisioning data from Ali's internal heterogeneous resource pool
- Survey academic work on the trade-offs between slicing the resource pool horizontally (load balancing) and vertically (vacating whole machines where possible)
Challenges:
- On heterogeneous hardware, how to quickly find a suitable QPS (naive: binary search)
- Heterogeneous hardware differs in both compute and bandwidth, so it may need different parallelism; how to find it quickly?
- Different traffic / request patterns need different parallelism; how to cope?
- Benchmarks differ from online traffic; how to guarantee the fidelity of the performance measured at benchmark time?
# Test Plan
1. workload timestamps: timestamp, sim_timestamp_v2, exponential_timestamp
- [x] H20 40min 1.0 scale, est. 40min * 3 = 2h+
- [ ] 5090 40min 0.5 scale, est. 4h+
- [x] 5090 120min 0.7 scale
- [x] H20 120min 0.8 scale
- [x] 5090 120min 0.6 scale
- [x] H20 120min 0.7 scale
2. configs: TP: 1, 2, 4, 8, EP: True or False
- [x] H20 30min 1.0 scale, est. 4h+
- [x] H20 10min 2.0 scale
- [ ] update dash H20 60min 1.0 scale
- [x] 5090 10min 0.5 scale
- [x] H20 5min 0.5 scale
- [x] H20 10min 1 scale
Workload modification: so far only the timestamps are changed. Would it matter if, within each timespan, we additionally used the real workload's avg input length and avg output length? (A time-scaling sketch follows below.)
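A sketch of the timestamp-only rewrite mentioned above; the flat jsonl layout with a per-request `timestamp` field is an assumption about the trace format, and the divide-by-`time_scale` convention (larger scale = faster replay) mirrors the replayer's `--scale-factor`:
```python
import json

def scale_trace(in_path, out_path, time_scale):
    # time_scale > 1 replays the trace faster: inter-arrival gaps
    # shrink by 1/time_scale, raising the effective QPS.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            req = json.loads(line)
            req["timestamp"] = req["timestamp"] / time_scale
            fout.write(json.dumps(req) + "\n")
```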
Under heavy load, disabling EP beats enabling EP.
> gpu_monitor pid: 1315891
### Test command backup
```sh
./run_ai.sh --trace-minutes 60 --max-llm-calls 30 --ttft-slo 12000 --trace-path qwen_traceA_blksz_16.jsonl --time-scale 1.0 --objective 'minimize the latency mean and p95 and keep the throughput as high as possible' --objective-score-expr '-ttft_ms_p95 - 10 * tpot_ms_mean + 10 * request_throughput' --llm-model 'qwen3-max'
```
```json
{"latency_ms_mean": 40470.771, "latency_ms_p50": 21644.147, "latency_ms_p90": 95732.058, "latency_ms_p95": 144859.163, "latency_ms_p99": 266144.603, "ttft_ms_mean": 313.929, "ttft_ms_p50": 200.65, "ttft_ms_p90": 482.333, "ttft_ms_p95": 781.285, "ttft_ms_p99": 3014.18, "tpot_ms_mean": 49.839, "tpot_ms_p50": 54.346, "tpot_ms_p90": 68.521, "tpot_ms_p95": 69.732, "tpot_ms_p99": 75.753, "overall_qps": 3.623, "overall_tokens_per_second": 23075.872520269895, "decode_tokens_throughput": 2907.208, "total_requests": 2615.0, "decode_tokens_total": 2098488.0, "tokens_per_second_mean": 26.997015256930624, "tokens_per_second_p50": 18.874719191241816, "tokens_per_second_p95": 73.02218276995958, "tokens_per_second_p99": 120.0, "request_throughput": 3.623, "notes_count": 0}
```
---
Below is a page that can be pasted directly into the paper/slides (**1 comparison table + conclusion bullets**), plus a paragraph written in an academic register for the *Methodology / Discussion / Related Work (LLM-as-agent)* section.
---
- Table 1: comparison summary of the three logs (paste-ready for paper/slides)
> Note: all metrics come from the provided `final_summary.best` / `best_run` / `iter_result.aggregated`. Units: latency/ttft/tpot in ms; QPS in req/s.
- Markdown version (suitable for README/slides)
| Case | Workload / Trace | Agent LLM | Best config (TP/DP/PP, block, mbt, mseq, mem, EP) | Key metrics (mean / p95) | Key system-level observations | Key agent-behavior observations |
| ---- | ---------------- | --------- | ------------------------------------------------- | ------------------------ | ----------------------------- | ------------------------------- |
| A | **TraceA** (60min, time_scale=1.0) `qwen_traceA_blksz_16.jsonl` | **deepseek-r1-0528** | **TP=8, DP=1, PP=1**, block=32, mbt=32768, mseq=128, mem=0.95, EP=on | latency **6722 / 17210**; TTFT p95 **241**; TPOT p95 **23.29**; QPS **5.073** | DP-heavy points show **catastrophic queueing/throughput collapse** (QPS 2~3, latency/TTFT on the 10^6 ms scale); some DP>1 runs **run_failed (total_time=None)** | Exploration is more "global coverage" (TP/DP extremes + wide spans of block/mbt/mseq), but **inconsistent metric names → proposal_eval missing_metric**; the baseline was also once polluted by an "unstable run" |
| B | **Thinking trace** (30min, time_scale=0.5) `anony-qwen-max-thinking-…blsz16.jsonl` | **deepseek-r1-0528** | **TP=4, DP=2, PP=1**, block=64, mbt=65536, mseq=128, mem=0.95, EP=on | latency **105986 / 385999**; TTFT p95 **738**; TPOT p95 **35.01**; QPS **1.016** | Extremely long outputs (≈3.7k decode tokens/req) → **concurrency demand L≈λW≈108**; a too-small mseq causes **minute-scale TTFT/latency blow-ups**; DP+mseq is a structural necessity | Gradually pulls the system out of the "queue blow-up region" into the "stable region", then fine-tunes block/mbt, but with a **proposal vs. executed config mismatch** risk (some run fields are self-contradictory) |
| C | **TraceA** (60min, time_scale=1.0) `qwen_traceA_blksz_16.jsonl` | **qwen3-max** | **TP=8, DP=1, PP=1**, block=16, mbt=32768, mseq=256, mem=0.9, EP=on | latency **6590 / 16925**; TTFT p95 **242**; TPOT p95 **23.08**; QPS **5.073** | Under TraceA, **throughput is essentially bounded by the arrival rate** (QPS≈5.073 everywhere, weak capacity-tuning gradient) → most parameter perturbations land on a **plateau** | The baseline already uses fairly high mseq/mbt → **near-optimal from the start**; exploration is more "local perturbation" (mostly varying block/mbt/mseq around TP=8), but a **config persistence/execution mismatch** was found (tested block=8 never actually took effect) |
---
# Ali
- qwen3-30b-a3b, long requests
Single node, 8× H20 96G; 2 model instances
```json
{
"gpu_memory_utilization": 0.6,
"tensor_parallel_size": 4,
"enable_think": 0,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": true,
"enable_chunked_prefix_caching": true,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 131072,
"max_model_len": 1048576,
"max_seq_len_to_capture": 1048576,
"max_num_seqs": 1,
"compilation_config": {
"cudagraph_capture_sizes": [
1
]
},
"hf_overrides": {
"architectures": [
"Qwen3MoeForCausalLM"
]
},
"kv_cache_dtype": "fp8_e4m3",
"quantization": "fp8",
"rope_scaling": {
"type": "yarn",
"factor": 8,
"original_max_position_embeddings": 131072,
"semi_dynamic": false,
"dynamic": true
}
}
```
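Sanity check on the rope_scaling above: YaRN factor 8 on original_max_position_embeddings 131072 gives 8 × 131072 = 1048576, matching max_model_len and max_seq_len_to_capture.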
- qwen3-30b-a3b, short requests (30k and below)
Single node, 8× H20 96G; 8 model instances
```json
{
"gpu_memory_utilization": 0.8,
"tensor_parallel_size": 1,
"enable_think": 1,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": true,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 4096,
"max_model_len": 262144,
"max_seq_len_to_capture": 262144
}
```
```json
{
"compilation_config": {
"use_inductor": false,
"full_cuda_graph": false
},
"gpu_memory_utilization": 0.9,
"tensor_parallel_size": 1,
"enable_think": 0,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": false,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 4096,
"max_model_len": 278784,
"rope_scaling": {
"type": "yarn",
"factor": 4,
"original_max_position_embeddings": 262144,
"semi_dynamic": false,
"dynamic": true
},
"speculative_config": {
"method": "eagle3",
"model": "/home/admin/resource/model/464482ce.qwen3-coder-flash-eagle/state_4/",
"num_speculative_tokens": 2,
"draft_tensor_parallel_size": 1,
"rope_scaling": {
"type": "yarn",
"factor": 64,
"original_max_position_embeddings": 4096,
"semi_dynamic": false,
"dynamic": true
}
},
"max_seq_len_to_capture": 131072,
"hf_overrides": {
"architectures": [
"Qwen3MoeForCausalLM"
],
"model_type": "qwen3_moe"
},
"quantization": "fp8"
}
```
# Backup
```
cmd=['trace-replayer/target/release/client', '--stream', '--tokenizer', '/mnt/debugger/wjh/models/Qwen3-30B-A3B//tokenizer.json', '--tokenizer-config', '/mnt/debugger/wjh/models/Qwen3-30B-A3B//tokenizer_config.json', '--endpoint', 'http://localhost:10086/v1/chat/completions', '--api', 'openai', '--dataset', 'bailian', '--dataset-path', './qwen_traceA_blksz_16.jsonl', '--scale-factor', '2.0', '--time-in-secs', '600', '--model-name', '/mnt/debugger/wjh/models/Qwen3-30B-A3B/', '--metric-percentile', '50,90,95,99', '--output-path', 'runs/ai_tuner_store/nvidia-h20-47121c5fff__mnt-debugger-wjh-models-qwen3-30b-a3b-10c4a49394__qwen_tracea_blksz_16-d451afd9af/260108-172937/client.jsonl', '--summary-path', 'runs/ai_tuner_store/nvidia-h20-47121c5fff__mnt-debugger-wjh-models-qwen3-30b-a3b-10c4a49394__qwen_tracea_blksz_16-d451afd9af/260108-172937/summary.json', '--num-producer', '32', '--channel-capacity', '40960']
```
Could a small number of tested (HW, model, workload) combinations serve as prior knowledge that guides the AI to quickly find the optimal config for a new (HW, model, workload), greatly accelerating the table-filling process?
Set an optimization objective: instead of letting the AI try things at random, have it analyze in a targeted way (memory/compute/comm/etc. bound) and modify the config starting from that objective.
Is performance tuning of models on a heterogeneous hardware resource pool a non-problem?
- Head models take the overwhelming majority of traffic and are basically served only on the largest fixed GPU type
- As for long-tail models: although there are hundreds of them and they may run on whatever GPUs happen to be available, they account for less than 1.5% of total traffic, so optimizing them yields almost no benefit (the models themselves are small, so meeting the SLO is easy; the cost of tuning may not be worth the online traffic it serves)
---
### LaTeX version (paste-ready for the paper)
> Adjust `\small`/column widths to fit the layout; this is a `table*`-style table suitable for a two-column paper.
```latex
\begin{table*}[t]
\centering
\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{p{1.1cm} p{3.2cm} p{2.0cm} p{5.4cm} p{4.7cm} p{4.7cm}}
\hline
Case & Workload / Trace & Agent LLM &
Best config (TP/DP/PP, block, mbt, mseq, mem, EP) &
Key metrics (mean / p95) &
Notable observations (system \& agent) \\
\hline
A &
TraceA (60min, scale=1.0) \texttt{qwen\_traceA\_blksz\_16} &
DeepSeek-R1 &
TP=8, DP=1, PP=1; block=32; mbt=32768; mseq=128; mem=0.95; EP=on &
lat 6722 / 17210 ms; TTFT p95 241 ms; TPOT p95 23.29 ms; QPS 5.073 &
DP-heavy points show instability (QPS drops, queue blow-up); some DP>1 runs fail (missing total\_time). Agent explores globally but suffers metric-schema mismatch and unstable baselines. \\
\hline
B &
Thinking trace (30min, scale=0.5) \texttt{anony-qwen-max-thinking-...} &
DeepSeek-R1 &
TP=4, DP=2, PP=1; block=64; mbt=65536; mseq=128; mem=0.95; EP=on &
lat 105986 / 385999 ms; TTFT p95 738 ms; TPOT p95 35.01 ms; QPS 1.016 &
Long outputs imply high concurrency demand ($L \approx \lambda W \approx 108$); small mseq triggers minute-level TTFT/latency blow-ups. Agent converges via DP+mseq, but config provenance inconsistency is observed. \\
\hline
C &
TraceA (60min, scale=1.0) \texttt{qwen\_traceA\_blksz\_16} &
Qwen3-Max &
TP=8, DP=1, PP=1; block=16; mbt=32768; mseq=256; mem=0.9; EP=on &
lat 6590 / 16925 ms; TTFT p95 242 ms; TPOT p95 23.08 ms; QPS 5.073 &
Arrival-rate-limited regime yields a flat performance plateau; agent starts near-optimal and performs local perturbations. Execution-time config mismatch observed (intended block change not applied). \\
\hline
\end{tabular}
\caption{Comparison across three tuning logs. TraceA exhibits an arrival-rate-limited plateau where throughput provides little gradient; thinking trace is concurrency-limited and requires DP+mseq to avoid queue blow-up. Agent backbones differ in schema compliance and exploration style; system-level guardrails are necessary for reliable learning.}
\label{tab:agent-compare}
\end{table*}
```
---
- One page of conclusion bullets (for slides / the paper's "Findings")
Workload side (determines *what to search*)
* **TraceA (QPS≈5) is arrival-rate-limited**: most stable points sit at `overall_qps≈5.07`, so throughput provides no optimization gradient; the main optimization signal comes from the **TTFT tail / TPOT / queueing structure**.
* **The thinking trace is concurrency-limited**: long outputs make `L=λW` large (about 100+ in this log), and a too-small `max_num_seqs` triggers **minute-scale TTFT/latency blow-ups**; **DP + mseq** must be core search dimensions (see the worked estimate after this list).
* **On the same model/hardware, the optimal parallelism shape changes with the workload**: TraceA favors **TP=8, DP=1**, Thinking favors **TP=4, DP=2** (first satisfy the concurrency capacity, then protect TPOT).
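A worked version of that `L=λW` estimate, using the Case B numbers from Table 1 (QPS 1.016 req/s, mean latency 105,986 ms ≈ 106 s):
```latex
% Little's law: in-flight requests = arrival rate x mean residence time
\[
L \;=\; \lambda W \;\approx\; 1.016\ \mathrm{req/s} \times 106\ \mathrm{s} \;\approx\; 108
\]
% Hence max_num_seqs must stay comfortably above ~108, otherwise
% requests queue and TTFT/latency blow up to the minute scale.
```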
Agent side (determines *how to search*)
* **DeepSeek-R1 leans toward global-coverage exploration**: it readily covers dangerous regions (instability/failures) and thereby reveals boundaries, but it is more sensitive to the metric schema / stability gate (otherwise the feedback loop is distorted).
* **Qwen3-Max leans toward a strong baseline + local perturbations**: it converges faster in plateau scenarios (TraceA), but if moved to a concurrency-limited scenario while still searching locally, it may linger at the edge of the unstable region for a long time.
* **Schema compliance determines whether auto-tuning works at all**: inconsistent metric field names directly distort `proposal_eval`; this affects system behavior before "how smart the model is" does.
System/toolchain side (determines reproducibility and learnability)
* **A stability gate is needed**: runs with divergent queues (µ<λ) must be excluded from the comparison/learning set, otherwise the baseline/reward gets polluted.
* **Plateau detection and early-stop are needed**: the TraceA workload easily produces plateau regions where many points have nearly identical metrics; continuing to explore there wastes budget.
* **Config provenance (proposed = executed = logged) must be guaranteed**: both logs show "proposed parameters not actually executed / inconsistently persisted", which destroys ablations and interpretable attribution.
---
- Paper-ready paragraph (Methodology / Discussion)
> The paragraph below is written in a "methodology + experimental observations" style and does not depend on external citations; it can go straight into the paper (model/trace names can be replaced with more abstract symbols to avoid excessive engineering detail).
**Academic phrasing (translated from the Chinese draft):**
In this work we use a large language model (LLM) as a tuning agent: through a closed loop of proposing candidate configurations, running real replay benchmarks, and iterating on metric feedback, it automatically searches the inference engine's parallelism strategies and batching parameters. We contrast the behavior of different agent backbones: a DeepSeek-R1-style agent leans toward global-coverage exploration (e.g., spanning TP/DP extremes and wide batch/seq settings), quickly exposing unstable regions and system boundaries, whereas a Qwen3-Max-style agent prefers local perturbations from a strong baseline and rapidly reaches the performance plateau on arrival-rate-limited workloads. We further observe that agent capability is not the only deciding factor: an inconsistent metric schema (e.g., metric names used in proposals differing from the actual output fields) or inconsistently persisted/executed configurations distort the feedback signal and significantly weaken the effectiveness and interpretability of the automated search. We therefore argue that three classes of guardrails must be introduced at the system level: first, a stability gate that removes queue-divergent runs to avoid baseline/reward pollution; second, plateau detection with adaptive early stopping to reduce wasted exploration in arrival-rate-limited scenarios; and third, a configuration provenance mechanism that keeps proposal, execution, and logging consistent, enabling rigorous ablations and reproducible conclusions.
**English phrasing (optional, for an international venue):**
We treat an LLM as an optimization agent that proposes serving configurations, executes trace-replay benchmarks, and iterates based on measured latency/throughput feedback. Across agent backbones, we observe distinct search styles: a DeepSeek-R1-based agent tends to perform global exploration over extreme TP/DP and batching settings, which quickly surfaces instability boundaries but is sensitive to metric-schema and stability handling; a Qwen3-Max-based agent starts from a strong baseline and applies local perturbations, reaching the performance plateau rapidly under arrival-rate-limited workloads. Importantly, agent capability alone does not guarantee effective tuning: metric-schema mismatches (proposal fields vs. logged metrics) and configuration provenance inconsistencies (proposed vs. executed settings) can corrupt the feedback loop and hinder explainability. We therefore introduce three system-level guardrails for reliable LLM-driven tuning: (1) a stability gate to exclude queue-divergent runs from comparison/learning, (2) plateau detection with adaptive early stopping to avoid wasting budget in flat regions, and (3) configuration provenance enforcement to ensure proposal/execution/logging consistency and enable rigorous ablations.
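A minimal sketch of the first guardrail (the stability gate); treating a run as queue-divergent when its completed throughput stays well below the offered arrival rate is one possible µ<λ test, and the field name follows the summaries logged above:
```python
def queue_divergent(summary, offered_qps, tol=0.9):
    """Stability gate: flag runs whose service rate cannot keep up
    with the arrival rate (mu < lambda) so they are excluded from
    the comparison/learning set before polluting baseline/reward."""
    return summary["request_throughput"] < tol * offered_qps

# Usage: stable = [r for r in runs
#                  if not queue_divergent(r["aggregated"], offered_qps)]
```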