## 10.20
#### 1. Study vLLM's DBO implementation details; confirm the feasibility of automatically applying a given execution flow on vLLM
A config describes how the different modules are partitioned and overlapped
![[projects/auto-tuner/sync2.figs/260410-105227.png]]
There is even pipelining inside dispatch and combine
![[projects/auto-tuner/sync2.figs/260410-105227-1.png]]
```python
# on hash 55392bc8

# vllm/forward_context.py:233
def create_forward_context(
    attn_metadata: Any,
    vllm_config: VllmConfig,
    virtual_engine: int = 0,
    dp_metadata: Optional[DPMetadata] = None,
    cudagraph_runtime_mode: CUDAGraphMode = CUDAGraphMode.NONE,
    batch_descriptor: Optional[BatchDescriptor] = None,
    ubatch_slices: Optional[UBatchSlices] = None,
):
    return ForwardContext(
        no_compile_layers=vllm_config.compilation_config.static_forward_context,
        virtual_engine=virtual_engine,
        attn_metadata=attn_metadata,
        dp_metadata=dp_metadata,
        cudagraph_runtime_mode=cudagraph_runtime_mode,
        batch_descriptor=batch_descriptor,
        ubatch_slices=ubatch_slices,
    )

# vllm/v1/worker/gpu_model_runner.py:1197
ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
    num_tokens_unpadded=num_tokens_unpadded,
    parallel_config=self.parallel_config,
    allow_microbatching=True,
    allow_dp_padding=allow_dp_padding,
    num_tokens_padded=num_tokens_padded,
    uniform_decode=uniform_decode,
    num_scheduled_tokens_per_request=num_scheduled_tokens,
)

# vllm/model_executor/layers/fused_moe/modular_kernel.py:658
# class FusedMoEModularKernel
if hook is not None:
    if dbo_enabled():
        # If DBO is being used, register the hook with the ubatch
        # context and call it in dbo_maybe_run_recv_hook instead of
        # passing it to the receiver.
        dbo_register_recv_hook(hook)
        dbo_yield()
    else:
        hook()
receiver()

# vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py:96
def _do_dispatch(
    self,
    tokens: torch.Tensor,
    token_scales: Optional[torch.Tensor],
    rank_topk_ids: torch.Tensor,
    rank_topk_weights: torch.Tensor,
    num_experts: int,
    a1_scale: Optional[torch.Tensor],
    quant_config: FusedMoEQuantConfig,
) -> Callable:
    has_scales = token_scales is not None
    # We yield before launching the dispatch kernel since the dispatch
    # kernel will block the CPU so we want to queue up all the compute
    # for the other ubatch before the dispatch kernel starts.
    dbo_yield_and_switch_from_compute_to_comm()
```
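To make the overlap mechanism concrete, here is a toy sketch (my own illustration, not vLLM code) of the two-ubatch ping-pong: each microbatch yields to its sibling right before a CPU-blocking comm call, so the sibling's compute gets queued first.
```python
# Toy illustration of the DBO ubatch ping-pong (NOT vLLM's implementation).
from typing import Generator

def ubatch(name: str) -> Generator[str, None, None]:
    # Each yield marks a point where the real system would call
    # dbo_yield_and_switch_from_compute_to_comm() / dbo_yield().
    yield f"{name}: launch attention + MoE gate (compute)"
    yield f"{name}: launch dispatch (comm, blocks CPU)"
    yield f"{name}: launch expert GEMMs (compute)"
    yield f"{name}: launch combine (comm, blocks CPU)"

def run_dbo(a, b) -> None:
    # Round-robin scheduling approximates the yield/switch behavior:
    # ubatch0's compute is queued while ubatch1 sits in a comm call.
    gens, done = [a, b], [False, False]
    while not all(done):
        for i, g in enumerate(gens):
            if done[i]:
                continue
            try:
                print(next(g))
            except StopIteration:
                done[i] = True

run_dbo(ubatch("ubatch0"), ubatch("ubatch1"))
```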
#### 2. Survey the production status of Qwen models
> Su Li
From Tao He:
> Getting a model into production requires:
> - Accuracy alignment (with transformers as the ground truth). It matches the naive intuition that swapping in a higher-precision kernel should always work; the hard part is deciding whether a discrepancy is a bug in the kernel itself or a genuine precision issue.
> - Performance tuning: mostly case by case; much of it is first-principles tuning, plus new optimizations implemented for new model features (GDNAttention, Qwen-Next).
> - Stability testing: stability across long/short token lengths and long-running inference.
#### TODO
Survey Qwen's targeted production optimizations (a list of feature -> optimization); distill the common parts that can be abstracted; confirm why Qwen and other models do not use DBO.
Confirm why automatic parallelism search and DBO are not used on Qwen: are they unimportant? Does manual Qwen tuning cover them?
~~Test NanoFlow~~
## 10.27
#### 1. Survey the current state of Qwen model optimizations
[[ali-optimization]]
#### 2. Survey how Ali manually tunes parallelism configs today
- For traditional TP-only models, the empirical rule of picking the smallest TP that fits (e.g. 70B-TP4, 235B-TP8) is usually best
- Manual tuning concentrates on models that introduce EP; 5-6 models are currently hand-tuned, each costing about 2 days of testing and tuning
- Under EPxDP the scheduler matters more (with the same parallelism config, different scheduling modes swing performance by 5-10%); online parallelism adjustment may matter less than scheduler adjustments that keep the request pattern each rank sees relatively stable
- The core pain point is that stress tests do not reflect real production traffic. Q: replay 1-2h of production traffic for testing? A: that could greatly increase tuning time; the main influencing factors are currently believed to be the length pattern and QPS
#### TODO
Use the current DashGen parameter conventions as the baseline
Tao Qian: modeling model deployment on heterogeneous hardware
## 11.03
#### 1. Survey the current state of DashGen
Goal: automatically generate a config file for a given hardware platform from model features (MoE or not, VL or not, quantized or not, ...)
Status: a few commonly used models (recent Qwen3 models) have a hand-written config file for H20; there are no config files for other models or other hardware platforms
The parameters most often adjusted per model are:
- `block_size`
- `max_model_len`
- `max_num_batched_tokens`
- `max_num_seqs`
- `rope_scaling.factor`
- `rope_scaling.original_max_position_embeddings`
- `speculative_config.num_speculative_tokens`
- `speculative_config.rope_scaling.factor`
- `speculative_config.rope_scaling.original_max_position_embeddings`
#### 2. System prototype design & build-out
The hardware prober and workload profiler are done
![[projects/auto-tuner/sync2.figs/260410-105227-2.png]]
#### TODO
Treat DashGen as a training dataset
Further catalogue the optimizations reachable via config changes
Meta-analysis to determine the optimization headroom of config tuning
Tao Qian: modeling model deployment on heterogeneous hardware
## 11.12
#### 1. System prototype: implement the config generator
Brute-force grid search yields too many configs: a simple 8-GPU machine already gives 20+ configs, and benchmarking each against an open-source trace for 1h costs about a day of GPU time, so the time cost is no better than manual tuning
Challenge: how to inject basic domain knowledge to prune the search space
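A rough illustration of the blow-up (the scheduler-knob values are hypothetical): enumerating just the (tp, pp, dp) factorizations of an 8-GPU machine, then multiplying by two scheduler knobs, already lands well past the 20+ configs mentioned above.
```python
# Sketch: count candidate configs on one 8-GPU machine (illustrative grid).
from itertools import product

NUM_GPUS = 8
# All (tp, pp, dp) factorizations of 8.
parallel = [(tp, pp, dp)
            for tp, pp, dp in product([1, 2, 4, 8], repeat=3)
            if tp * pp * dp == NUM_GPUS]
# A couple of scheduler knobs (hypothetical values) multiply the space.
max_num_batched_tokens = [4096, 8192, 16384, 32768]
max_num_seqs = [16, 32, 64]

configs = [(p, t, s) for p in parallel
           for t in max_num_batched_tokens
           for s in max_num_seqs]
print(len(parallel), "parallel splits,", len(configs), "configs total")
# -> 10 parallel splits; 120 configs with two knobs (~5 days at 1h each)
```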
#### 2. Confirm how much headroom config tuning has
- Data from a colleague:
> For the internally deployed Qwen3-Max (fp4), switching from TP8DP1EP8 to TP1DP8EP8 improves performance (throughput while meeting SLO) by about 40%. We are trying to reproduce this result.
- Our micro-benchmark: Qwen3-30B-A3B
| Config | ttft_s_mean | tpot_s_mean | ttft_s_p90 | tpot_s_p90 |
| --------- | ----------- | ----------- | ---------- | ---------- |
| TP8DP1EP8 | 7.9351 | 0.1022 | 34.8006 | 0.1573 |
| TP1DP8EP8 | 0.1765 | 0.0591 | 0.3776 | 0.0628 |
| TP8DP1EP1 | 3.2645 | 0.0619 | 13.0335 | 0.0809 |
| TP1DP8EP1 | 0.1647 | 0.0544 | 0.3553 | 0.0578 |
#### 3. On the company's heterogeneous hardware status
Consider services $S_1, S_2, \cdots, S_u$, models $M_1, M_2, \cdots, M_v$, and resource pools $R_1, R_2, \cdots, R_w$.
Per-machine throughput is a matrix $T \in \mathbb{R}^{v \times w}$, where $T_{i,j}$ is the throughput model $i$ can provide on resource pool $j$.
We mainly consider adjustments to resources:
1. No fixed ratio between models
    Each model's required throughput is a single value; e.g. (M1, M2) demand moves from (5, 10) to (8, 8). Switch dynamically: rob Peter to pay Paul, plus allocate/release resources.
2. Strict ratio requirements between model throughputs
    E.g. M1:M2 = 2:8 (context: A/B tests, spreading deployments for robustness).
    My personal take: agents may create strong demand for this, since within one workflow the traffic ratio across models is strictly constrained.
    Computation: virtual throughput = allocated resources / demand ratio. The higher a model's virtual throughput, the more over-allocated it is, so resources can be switched away from it.
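A minimal sketch of the virtual-throughput rule above (all names and numbers are mine):
```python
# Sketch of the virtual-throughput heuristic described above.
# virtual_throughput = allocated capacity / demand ratio; the model with the
# highest value is the most over-provisioned and can donate resources.

def virtual_throughput(allocated: dict, ratio: dict) -> dict:
    return {m: allocated[m] / ratio[m] for m in allocated}

allocated = {"M1": 6.0, "M2": 10.0}   # sustained QPS each model can serve
ratio = {"M1": 2.0, "M2": 8.0}        # required traffic ratio M1:M2 = 2:8

vt = virtual_throughput(allocated, ratio)
donor = max(vt, key=vt.get)           # most over-allocated model
print(vt, "-> shift resources away from", donor)
# {'M1': 3.0, 'M2': 1.25} -> M1 is over-allocated relative to the 2:8 ratio
```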
#### TODO
Find a case demonstrating that search can beat a manually tuned config
## 11.26
#### 1. Attempt to test Qwen3-max-fp4
Conclusion: production runs on the 141G variant of H20; 8x 96G cards can barely load the fp4 weights, and then an error prevents prefill from running (production is already PD-disaggregated):
> AssertionError: deepep fp4 only support decode, please run fp8 for prefill reqs
#### 2. Compare against the production Qwen-30B-A3B
> Production runs on a single GPU, with no parallelism at all

Because of the _fun fact_ below, TP1DP8 with EP disabled is not the DP8 we usually mean; we need to compare traditional DP8 against vLLM's DP8.
- Hardware and workload both shift the trend and the inflection point of performance as the parallelism mode changes
- At low QPS, larger TP reduces latency; at high QPS, larger DP reduces latency
- The inflection points differ between 5090 and H20
![[projects/auto-tuner/sync2.figs/260410-105227-3.png]]
#### 3. Workload generation and testing
We need a spec for describing workloads that abstracts the traffic rising edge, plateau, and falling edge; within each phase, arrivals are generated from an exponential distribution.
![[projects/auto-tuner/sync2.figs/260410-105227-4.png]]
![[projects/auto-tuner/sync2.figs/260410-105227-5.png]]
When overall load is low, the exact arrival timestamps of requests matter relatively little for performance; when overall load is high, they matter a lot.
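A sketch of how such a spec could be materialized (phase names and rates are illustrative): within each phase, inter-arrival gaps are drawn from an exponential distribution at the phase's target QPS.
```python
# Sketch: generate arrival timestamps from a phase-based workload spec.
# Each phase has a duration and a target QPS; inter-arrival gaps within a
# phase are exponential (Poisson arrivals).
import random

spec = [  # (phase, duration_s, qps) -- illustrative values
    ("ramp_up", 60, 2.0),
    ("steady", 300, 8.0),
    ("ramp_down", 60, 3.0),
]

def gen_arrivals(spec, seed=0):
    rng = random.Random(seed)
    base, out = 0.0, []
    for phase, duration, qps in spec:
        t = base
        while True:
            t += rng.expovariate(qps)   # exponential inter-arrival gap
            if t >= base + duration:
                break
            out.append((t, phase))
        base += duration
    return out

arrivals = gen_arrivals(spec)
print(len(arrivals), "requests, first:", arrivals[0])
```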
#### Fun facts
MoE's `enable-expert-parallel=True/False` does not behave as we imagined: even with `TP=1, DP=2, enable-expert-parallel=False`, the MoE experts are still **TP-sharded**. With EP enabled, sharding is along the expert dimension instead, so different GPUs hold different experts
> model_executor/layers/fused_moe/config.py:FusedMoEParallelConfig:make
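A toy illustration of the two sharding modes (my own sketch, not vLLM code): with EP off every GPU holds a slice of every expert's weights; with EP on each GPU holds whole experts.
```python
# Toy illustration (NOT vLLM code): 4 experts of shape (hidden, ffn), 2 GPUs.
import numpy as np

num_experts, hidden, ffn, num_gpus = 4, 8, 16, 2
experts = [np.random.randn(hidden, ffn) for _ in range(num_experts)]

# enable-expert-parallel=False: every GPU holds a 1/num_gpus slice of EVERY
# expert (TP-style sharding along the ffn dimension).
tp_shards = [[w[:, g * ffn // num_gpus:(g + 1) * ffn // num_gpus]
              for w in experts] for g in range(num_gpus)]

# enable-expert-parallel=True: each GPU holds a SUBSET of whole experts.
ep_shards = [experts[g * num_experts // num_gpus:(g + 1) * num_experts // num_gpus]
             for g in range(num_gpus)]

print([len(s) for s in tp_shards], tp_shards[0][0].shape)  # 4 slices, (8, 8)
print([len(s) for s in ep_shards], ep_shards[0][0].shape)  # 2 full experts, (8, 16)
```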
## 12.03
#### 1. Update the abstract workload spec
It now detects each workload's QPS rising, falling, and plateau phases; within each timespan it can generate the matching input_length, output_length, and KV-cache hit ratio
![[projects/auto-tuner/sync2.figs/260410-105227-6.png]]
![[projects/auto-tuner/sync2.figs/260410-105227-7.png]]
#### 2. Survey sys intelligence
https://github.com/sys-intelligence/system-intelligence-benchmark
https://www.sigops.org/2025/glia-a-human-inspired-ai-for-systems-design-and-optimization/
Conclusions:
- Gives a few examples (cache policy, AE automation)
- No iterative process; how the system evolves is not shown
#### 3. Completed the first version of the AI tuner system
```json
{"tag": "seed_result", "run_id": "seed-0", "config": {"tensor_parallel_size": 1, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 8, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 13, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 12176.020989130817, "latency_ms_p50": 9204.860296100378, "latency_ms_p95": 35188.60311061144, "latency_ms_p99": 45363.17807715386, "ttft_ms_mean": 5719.615500097759, "ttft_ms_p50": 1781.3231199979782, "ttft_ms_p95": 24910.35631671548, "ttft_ms_p99": 35153.06737739593, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.1380589240322, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-1", "config": {"tensor_parallel_size": 1, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 8, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 13, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 12345.290610685028, "latency_ms_p50": 9480.132822878659, "latency_ms_p95": 34781.647122465074, "latency_ms_p99": 43191.735486499965, "ttft_ms_mean": 5892.288800230696, "ttft_ms_p50": 2097.73010853678, "ttft_ms_p95": 24136.49755809456, "ttft_ms_p99": 33699.99936595559, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 628.866941701046, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-2", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 6, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 13475.863611711437, "latency_ms_p50": 10794.865541160107, "latency_ms_p95": 36951.33968722075, "latency_ms_p99": 51772.90129289031, "ttft_ms_mean": 10182.089378618242, "ttft_ms_p50": 7472.564217634499, "ttft_ms_p95": 32121.336066164076, "ttft_ms_p99": 48282.36513398588, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.5132308205193, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "seed_result", "run_id": "seed-3", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8642, "max_num_seqs": 6, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 39164.60290491039, "latency_ms_p50": 31036.402412690222, "latency_ms_p95": 105731.39869049191, "latency_ms_p99": 124533.90708565712, "ttft_ms_mean": 35547.060281754864, "ttft_ms_p50": 27486.676642671227, "ttft_ms_p95": 100853.97887509316, "ttft_ms_p99": 121570.28573192656, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.0048339375755, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 0, "analysis": "The workload uses Qwen3-30B-A3B, a MoE model with 128 experts. The current runs show that enabling expert parallelism (ep_enabled=true) under TP=2 leads to severe latency degradation (p95 >100s vs ~35s when disabled), likely due to inefficient expert routing or communication overhead across TP groups. In contrast, with TP=1 and DP=8, enabling EP slightly improves latency (~400ms better p95). This suggests that EP works better when all experts fit within a single device or when inter-device expert communication is minimized. All runs use block_size=32, max_tokens=8642 (≈8192+450), max_seqs=13 or 6—both non-standard values violating domain rules. The scheduler may be suboptimal due to arbitrary max_num_seqs and max_num_batched_tokens not aligned to powers of two. GPU memory utilization is stable (~28GB per GPU on H20 with 95GB HBM), so memory pressure isn't the issue. High GPU util (>97%) indicates compute saturation, but poor latency suggests scheduling or batching inefficiencies rather than raw FLOPs shortage. The strong correlation between sequence length and latency (corr=0.48) implies batching isnt effectively overlapping long prompts with decodes.", "num_candidates": 2, "should_stop": false}
{"tag": "iter_result", "run_id": "iter-0-0", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 8, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 4628.36767861444, "latency_ms_p50": 3765.7501455396414, "latency_ms_p95": 11333.310923539102, "latency_ms_p99": 17040.048834867775, "ttft_ms_mean": 1065.5285831100412, "ttft_ms_p50": 221.42069786787033, "ttft_ms_p95": 4601.52723826468, "ttft_ms_p99": 8305.568953976035, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 632.1981914405968, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-0-1", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 8, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3760.064374331547, "latency_ms_p50": 3151.7164893448353, "latency_ms_p95": 9186.337321065366, "latency_ms_p99": 15386.448743753135, "ttft_ms_mean": 540.9742441223084, "ttft_ms_p50": 59.69636607915163, "ttft_ms_p95": 2516.9906625524163, "ttft_ms_p99": 4393.859929405153, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.4289286160771, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 1, "analysis": "The workload uses a MoE model (128 experts) with moderate QPS (1.7) and long prompts (avg 1875 tokens). Recent runs show that enabling expert parallelism (ep_enabled=true) with tp=2 leads to severe latency degradation (p95 >100s in seed-3), but when max_tokens is reduced to 8192 and max_seqs increased to 8, latency improves dramatically (p95 ~11s in iter-0-0). Disabling EP further reduces p95 to ~9s (iter-0-1). Both iter-0-* runs achieve high GPU utilization (>97%) and similar token/s, indicating compute saturation is not the issue. The dramatic latency difference between seed-3 and iter-0-0 (same TP/EP config) is due to max_tokens=8642 vs 8192 and max_seqs=6 vs 8 — the former likely causes KV cache fragmentation or inefficient batching. The strong correlation between sequence length and latency (~0.470.48) suggests scheduling inefficiency under long sequences. Since memory utilization is stable at 0.9 and HBM usage is far below 95% (28GB used on 95GB GPUs), memory is not the bottleneck. The primary issue appears to be poor batch scheduling under large max_tokens and low max_seqs, exacerbated by expert routing overhead when EP is enabled.", "num_candidates": 3, "should_stop": false}
{"tag": "iter_result", "run_id": "iter-1-0", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 4096, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3492.389837023313, "latency_ms_p50": 2872.1146723255515, "latency_ms_p95": 9010.221988894045, "latency_ms_p99": 15249.355898238719, "ttft_ms_mean": 82.92471907821788, "ttft_ms_p50": 41.2894356995821, "ttft_ms_p95": 343.0500505492091, "ttft_ms_p99": 709.19309835881, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.5222736175356, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-1-1", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 8192, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": false, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3495.5934863037082, "latency_ms_p50": 2866.79108440876, "latency_ms_p95": 8992.029128596187, "latency_ms_p99": 15476.51156410575, "ttft_ms_mean": 80.40708293078234, "ttft_ms_p50": 42.12937131524086, "ttft_ms_p95": 317.7981022745371, "ttft_ms_p99": 683.7136130779982, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.5214401800225, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "iter_result", "run_id": "iter-1-2", "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "expert_parallel_size": 1, "data_parallel_size": 4, "block_size": 32, "max_num_batched_tokens": 4096, "max_num_seqs": 16, "scheduling_policy": "fcfs", "gpu_memory_utilization": 0.9, "enable_expert_parallel": true, "max_model_len": 40960, "enable_prefix_caching": null, "admission_policy": null, "extra": null}, "aggregated": {"latency_ms_mean": 3860.5031585243487, "latency_ms_p50": 3192.2978619113564, "latency_ms_p95": 9960.870377719402, "latency_ms_p99": 16528.3727562055, "ttft_ms_mean": 95.49012954670319, "ttft_ms_p50": 45.08622735738754, "ttft_ms_p95": 396.9422159716487, "ttft_ms_p99": 798.2364166527987, "overall_qps": 1.706046589156978, "overall_tokens_per_second": 631.4522648664417, "total_requests": 2047.0, "notes_count": 3.0}, "success": true}
{"tag": "llm_response", "step": 2, "analysis": "The workload uses Qwen3-30B-A3B, a large MoE model with 128 experts. Recent runs show that enabling expert parallelism (ep_enabled=true) consistently degrades latency significantly (e.g., p95 latency jumps from ~9s to >100s in seed-3 vs seed-2). However, in later iterations with smaller max_tokens and higher max_seqs (iter-1-1 vs iter-1-2), the gap narrows but ep still adds ~1s to p95 latency. GPU utilization is already very high (~98%), indicating compute saturation. Memory usage is stable (~28GB per GPU on H20 with 95GB HBM), so memory pressure is not the issue. The strong negative impact of expert parallelism suggests inefficient routing or communication overhead among experts across data-parallel ranks when ep is enabled. Given the objective of low latency, disabling expert parallelism is strongly preferred. Among ep-disabled runs, increasing max_num_seqs from 8 to 16 (iter-0-1 → iter-1-1) dramatically improves TTFT (from 2.5s to ~300ms) with minimal latency increase, indicating better scheduling efficiency. Further increasing max_num_batched_tokens beyond 8192 is unnecessary as prompt+decode tokens per request rarely exceed ~2250 on average, and 8192 already allows batching multiple long sequences. Block size of 32 is standard and shows no signs of inefficiency.", "num_candidates": 1, "should_stop": true}
```
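The records above come out of a seed-then-iterate loop; a minimal sketch of its shape (function names are placeholders, not the actual implementation):
```python
# Sketch of the seed/iterate loop behind the JSONL records above
# (llm_propose / run_benchmark are placeholders).
import json

def tune(seed_configs, llm_propose, run_benchmark, max_steps=8, log="log.jsonl"):
    history = []
    with open(log, "a") as f:
        # Seed phase: benchmark a handful of hand-picked configs.
        for i, cfg in enumerate(seed_configs):
            rec = {"tag": "seed_result", "run_id": f"seed-{i}",
                   "config": cfg, "aggregated": run_benchmark(cfg)}
            history.append(rec)
            f.write(json.dumps(rec) + "\n")
        # Iterate phase: the LLM reads all history, writes an analysis,
        # and proposes new candidates until it decides to stop.
        for step in range(max_steps):
            analysis, candidates, stop = llm_propose(history)
            f.write(json.dumps({"tag": "llm_response", "step": step,
                                "analysis": analysis,
                                "num_candidates": len(candidates),
                                "should_stop": stop}) + "\n")
            for j, cfg in enumerate(candidates):
                rec = {"tag": "iter_result", "run_id": f"iter-{step}-{j}",
                       "config": cfg, "aggregated": run_benchmark(cfg)}
                history.append(rec)
                f.write(json.dumps(rec) + "\n")
            if stop:
                break
```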
#### TODO
- Benchmark the AI tuner against the expert-tuned configs
- Survey GPU over-provisioning data in Ali's internal heterogeneous resource pools
- For those pools: hardware/model ratios, per-model traffic ratios, and the production configs
- Industrial constraints: the compute capability that different real-world hardware can actually provide, ...
## 12.24
Input-length buckets (0, 4k, 32k):
| Workload | Config | Max QPS | TTFT p95 (ms) | TPOT p95 (ms) | TTFT mean (ms) | TPOT mean (ms) |
| ------------------ | ------------- | ----------- | -------- | -------- | --------- | --------- |
| input>=0,<4096 | tp=1,dp=8 | 14.66 | 370.6 | 47.3 | 158.7 | 35.1 |
| input>=0,<4096 | ==tp=2,dp=4== | ==21.48== | 232.0 | 48.4 | 115.4 | 39.5 |
| input>=0,<4096 | tp=4,dp=2 | 19.26 | 154.8 | 34.9 | 86.3 | 28.2 |
| input>=0,<4096 | tp=8,dp=1 | 17.48 | 995.6 | 21.2 | 166.2 | 18.2 |
| input>=4096,<32768 | tp=1,dp=8 | 1.30 | 1859.5 | 48.1 | 922.8 | 27.2 |
| input>=4096,<32768 | tp=2,dp=4 | 2.19 | 1062.9 | 37.1 | 500.7 | 24.5 |
| input>=4096,<32768 | tp=4,dp=2 | 3.67 | 1014.7 | 48.1 | 419.1 | 34.8 |
| input>=4096,<32768 | ==tp=8,dp=1== | ==3.67== | 846.4 | 41.1 | 334.6 | 34.1 |
| input>=32768 | tp=1,dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=2,dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=4,dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=8,dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
Input-length buckets (0, 8k, 32k):
| Workload | Config | Max QPS | TTFT p95 (ms) | TPOT p95 (ms) | TTFT mean (ms) | TPOT mean (ms) |
| ------------------ | ------------- | ----------- | -------- | -------- | --------- | --------- |
| input>=0,<8192 | tp=1,dp=8 | 8.42 | 720.3 | 48.3 | 259.0 | 33.6 |
| input>=0,<8192 | tp=2,dp=4 | 13.60 | 440.5 | 48.4 | 166.0 | 37.4 |
| input>=0,<8192 | ==tp=4,dp=2== | ==15.28== | 295.2 | 39.6 | 117.6 | 31.6 |
| input>=0,<8192 | tp=8,dp=1 | 13.45 | 280.7 | 24.7 | 96.7 | 20.1 |
| input>=8192,<32768 | tp=1,dp=8 | 0.65 | 2397.1 | 32.4 | 1433.3 | 21.4 |
| input>=8192,<32768 | tp=2,dp=4 | 1.41 | 1438.8 | 39.4 | 803.9 | 27.0 |
| input>=8192,<32768 | ==tp=4,dp=2== | ==2.33== | 1100.0 | 47.0 | 556.6 | 35.6 |
| input>=8192,<32768 | tp=8,dp=1 | 2.18 | 1148.9 | 40.2 | 468.1 | 30.5 |
| input>=32768 | tp=1,dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=2,dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=4,dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| input>=32768 | tp=8,dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
10min:
| **Workload (Input Tokens)** | **Config** | **Max QPS** | **TTFT p95 (ms)** | **TPOT p95 (ms)** | **TTFT mean (ms)** | **TPOT mean (ms)** |
| --------------------------- | -------------- | ----------- | ----------------- | ----------------- | ------------------ | ------------------ |
| **input>=0,<4096** | tp=1, dp=8 | 13.77 | 370.8 | 47.9 | 163.7 | 35.9 |
| **input>=0,<4096** | ==tp=2, dp=4== | ==17.92== | 226.5 | 48.8 | 118.3 | 36.1 |
| **input>=0,<4096** | tp=4, dp=2 | 17.03 | 206.4 | 39.4 | 126.1 | 29.5 |
| **input>=0,<4096** | tp=8, dp=1 | 14.51 | 367.8 | 24.5 | 102.1 | 19.8 |
| **input>=4096,<32768** | tp=1, dp=8 | 1.30 | 1756.4 | 44.2 | 885.7 | 27.0 |
| **input>=4096,<32768** | tp=2, dp=4 | 2.19 | 1097.6 | 43.8 | 515.1 | 28.3 |
| **input>=4096,<32768** | tp=4, dp=2 | 3.52 | 873.4 | 49.4 | 385.8 | 35.1 |
| **input>=4096,<32768** | ==tp=8, dp=1== | ==3.67== | 919.7 | 48.1 | 349.0 | 34.0 |
| **input>=32768** | tp=1, dp=8 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=2, dp=4 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=4, dp=2 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
| **input>=32768** | tp=8, dp=1 | no-slo-pass | 0.0 | 0.0 | 0.0 | 0.0 |
## 01.07
#### Conclusions on AITuner
Observations:
- Different objectives (p95 TTFT / E2E) yield different best configs
- The AI tuner keeps similar performance across many iters; its exploration is confined to a limited space
- Even though the prompt explicitly asks the LLM to judge whether iteration can stop (performance already stably good), the LLM never made that call and always ran to max_step
With structured input defined as:
`HardwareProfile + WorkloadProfile + BenchmarkHistory + SampledRequestBenchmark + HardwareUtilization`
the LLM can produce a series of logically sound analyses:
- the performance impact of enabling EP
- performance trends as TP varies
- analysis of workload-profile results (long prompts, p95)
- runtime GPU utilization, for judging compute- vs memory-bound
```
Recent runs show that configurations with higher tensor parallelism (tp=4) and no expert/pipeline parallelism yield significantly lower latency (iter-0-0: 30.8ms p95). However, max_num_batched_tokens=16384 may be insufficient for long prompts (p95=17465 tokens), risking truncation. Enabling expert parallel consistently degrades performance, while pipeline parallelism (iter-0-1) causes severe underutilization (58.3% GPU util) and high latency. The best configuration (iter-0-0) achieves competitive throughput but requires adjustments to handle long prompts safely.
The workload has long prompts (p95=17.4k tokens) and moderate decode lengths. The best run (iter-1-1) achieved low latency with tp=8, pp=1, dp=1 config but showed only 72.7% avg GPU utilization, indicating underutilization. High GPU utilization in iter-1-2 (85.8% avg) suggests compute-bound bottlenecks in some configurations. Expert parallel consistently underperforms, and pipeline parallelism introduces significant latency. The primary opportunity is to optimize batch scheduling for long prompts while maintaining low latency.
```
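A sketch of what the structured input above might look like as code (field names are illustrative, not the actual schema):
```python
# Sketch of the structured LLM context (field names illustrative).
from dataclasses import dataclass, field

@dataclass
class HardwareProfile:
    gpu_model: str          # e.g. "H20"
    num_gpus: int
    hbm_gb: float
    interconnect: str       # e.g. "NVLink"

@dataclass
class WorkloadProfile:
    qps: float
    input_len_p50: int
    input_len_p95: int
    output_len_p50: int

@dataclass
class HardwareUtilization:
    gpu_util_avg: float     # for compute- vs memory-bound judgement
    hbm_used_gb: float

@dataclass
class TunerContext:
    hardware: HardwareProfile
    workload: WorkloadProfile
    utilization: HardwareUtilization
    benchmark_history: list = field(default_factory=list)
    sampled_request_benchmarks: list = field(default_factory=list)
```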
- Trace A
![[projects/auto-tuner/sync2.figs/260410-105227-8.png]]
- Trace A | 0-8k input_length
```
P95 TTFT: 70.37ms / 791.39ms = 0.089
TP8 + PP1 + DP1 + BLKSZ 128 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 64
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 6943 + MAX_SEQS 38
Mean Lat: 2979.62ms vs 7244.47ms = 0.411
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 64
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 6943 + MAX_SEQS 38
```
![[projects/auto-tuner/sync2.figs/260410-105227-9.png]]
- Trace A | 8k+ input_length
```
P95 TTFT: 912.25ms / 431602.90ms = 0.002
TP4 + PP1 + DP2 + BLKSZ 64 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 18905 + MAX_SEQS 11
Mean Lat: 6680.46ms / 264119.64ms = 0.025
TP8 + PP1 + DP1 + BLKSZ 64 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 18905 + MAX_SEQS 11
```
![[260410-105227-10.png]]
- Thinking
```
P95 TTFT: 465.83ms / 1159.39ms = 0.402
TP2 + PP1 + DP4 + BLKSZ 256 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 256 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 32
Mean Lat: 26883.16ms / 30561.40ms = 0.880
TP2 + PP1 + DP4 + BLKSZ 128 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 8
v.s.
TP1 + PP1 + DP8 + BLKSZ 64 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
```
![[260410-105227-11.png]]
- Trace A | 10min
```
P95 TTFT: 232.67ms / 1348.00ms = 0.173
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 32
Mean Lat: 2738.62ms / 6406.26ms = 0.427
TP8 + PP1 + DP1 + BLKSZ 32 + MAX_BATCH_TOKENS 32768 + MAX_SEQS 16
v.s.
TP1 + PP1 + DP8 + BLKSZ 32 + MAX_BATCH_TOKENS 16384 + MAX_SEQS 32
```
#### TODO
- [x] add input_tokens_total and calculate `overall_tps` and `decode_tps`
- [x] If a `client.jsonl` record is a timeout or other error, skip it
```
{
"duration_ms": "313069.882",
"e2e_mean_ms": "8970.499",
"e2e_p50_ms": "9073.024",
"e2e_p90_ms": "13968.607",
"e2e_p95_ms": "14507.930",
"e2e_p99_ms": "14859.424",
"output_tokens_total": "22566",
"requests_success": "155",
"requests_total": "155",
"throughput_rps": "0.495",
"throughput_tps": "72.080",
"tpot_mean_ms": "131.278",
"tpot_p50_ms": "64.768",
"tpot_p90_ms": "284.904",
"tpot_p95_ms": "583.770",
"tpot_p99_ms": "983.728",
"ttft_mean_ms": "2294.045",
"ttft_p50_ms": "2423.531",
"ttft_p90_ms": "3885.958",
"ttft_p95_ms": "4358.413",
"ttft_p99_ms": "5140.597"
}
```
- [x] Update vllm to main to fix the DP performance problem
- [x] Add a performance-monitor role to AITuner: when several iters yield similar performance, this role tells the LLM to explore more aggressively
- [x] Support early stop: when a config's performance has clearly blown up, quit the test early instead of running the full 1h+
- [x] Make the LLM explicitly state, for each iter, its data-derived rationale and expected optimization effect, then compare against the actual test results
- [x] Compare 10min vs 60min runs: can test time be shortened?
- [x] How to shorten time: number of iters / time per iter / cold start (naive: hack vllm; fundamentally a scalability problem)
- [ ] Quantify how often the config needs to change
- [ ] Consult AI papers: how is context engineering defined?
- [ ] Add bucketing to the AITuner search space
- [x] Compare the capabilities of different models
- [ ] Support automated multi-machine task dispatch and result collection
- [ ] Support indexing historical benchmarks as long-term memory
misc:
- [x] Test https://github.com/blitz-serving/trace-replayer/tree/feat/stream and give feedback
- [x] Organize the thinking/coder traces for open-sourcing
## 01.15
#### 1. Agentic AI Tuner
1. Major system refactor: the test scripts now look more like an Agentic AI Tuner
![[260410-105227-12.png]]
2. Test results and analysis
- Tuning time dropped (~30h -> 6h); estimated cost: 6h of single-machine GPU time + ~0.5 CNY of LLM API calls
- Different agent models (deepseek-r1 vs. qwen3-max) reach consensus, converging to essentially the same best config
- qwen3-max can even one-shot to a near-optimal config
- Define the reward clearly. From `minimize the p95 E2E latency while keep TTFT acceptable` to a calculable score: `-ttft_ms_p95 - 10 * tpot_ms_mean + 10 * request_throughput` (a scoring sketch follows after this list). Otherwise, when trading off between metrics (TTFT/TPOT/E2E latency), the LLM does not know how to define optimal (minimize E2E latency, s.t. TTFT < xxx)
- [bug] TPOT is almost untunable: different configs mostly just shift TTFT. The cause is too few config fields; vLLM supports some newer flags (`--prefill-context-parallel-size`, `--decode-context-parallel-size`, `--all2all-backend`), and adding them is under test
- Some takeaways
    - Thinking mode produces extremely long outputs. By queueing theory (Little's Law, L = λW): λ (QPS) ≈ 1, W (mean latency) ≈ 100+ s, so the number of in-flight requests L ≈ 100+, which requires raising `max_num_seqs`. For thinking, raising `max_num_seqs` lowers TTFT dramatically (an order-of-magnitude drop), and moving TP/DP from 8/1 -> 4/2, although TPOT goes from 14ms -> 24ms, nearly doubles throughput and clearly lowers TTFT
    - For chat/coder, both favor raising TP to cut TPOT, which significantly improves system performance
- Chat (Trace A) + DeepSeek-R1
![[260410-105227-13.png]]
- Chat (Trace A) + Qwen3-Max
![[260410-105227-14.png]]
- Thinking + DeepSeek-R1
![[260410-105227-15.png]]
- Coder + DeepSeek-R1
![[260410-105228.png]]
- Coder + Qwen3-Max
![[260410-105228-1.png]]
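The scoring sketch referenced in the reward bullet above (weights as written there; the example numbers are made up):
```python
# Sketch of the calculable reward defined above; weights as in the note.
def score(ttft_ms_p95: float, tpot_ms_mean: float,
          request_throughput: float) -> float:
    return -ttft_ms_p95 - 10.0 * tpot_ms_mean + 10.0 * request_throughput

# Example: compare two runs; the LLM optimizes one scalar instead of
# trading off TTFT/TPOT/E2E informally.
print(score(300.0, 20.0, 15.0))   # -350.0
print(score(900.0, 14.0, 16.0))   # -880.0 -> the first config wins
```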
3. Related works
[StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems](https://arxiv.org/pdf/2510.25017)
> An LLM-agent-driven auto-tuning framework for multiple storage engines: it splits "run experiments -> extract signals -> search configs -> accumulate experience" into four cooperating agents (Executor/Extractor/Searcher/Reflector) forming a closed loop
[ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training](https://arxiv.org/pdf/2511.03844)
> An agentic optimization system for LLM training (Coordinator/Analyzer/Proposal): it combines profiling-tool signals, roofline analysis, and a knowledge base of expert best practices / historical successes to automate "bottleneck diagnosis -> explained sharding/parallelism recommendations"
[STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems](https://dl.acm.org/doi/epdf/10.1145/3712285.3759887)
> An LLM autonomous-reasoning tuning system for HPC parallel file systems; its evaluation shows near-optimal configurations selected within the first few attempts
[Let the Barbarians In: How AI Can Accelerate Systems Performance Research](https://arxiv.org/pdf/2512.14806)
#### 2. Workload / Trace Replayer open-sourcing
- Tested the trace replayer; synced with Kaixi to implement needed features and fix some issues; the open-source trace replayer is now integrated into the Agentic AI Tuner
- Analyzed and processed the thinking/coder traces and prepared an anonymized dataset for open-sourcing
Some observations:
- In production, over ~10% of requests trigger a tool call and carry the tool-response content as input:
```
tool_call_cnt=1212, ratio=0.1077
<|im_start|>system\n{user_sys}{bailian_sys_prompt}<|im_end|>
<|im_start|>user\n{user_content}<|im_end|>
<|im_start|>assistant\n<think><|im_end|>
<|im_start|>user\n<tool_response>\n{tool_content}</tool_response><|im_end|>
```
- Under thinking, about 20% of requests have input_length < 8k but input_length + output_length > 8k (vs ~1% for coder), which leads to few multi-turn conversations (we only sample from one cluster; follow-up turns may be routed to different clusters)
- Each turn's thinking content is not added back into the context
- `token_ids[0]` may be `'truncated'`
#### TODO
- [x] Update the agent to support `all2all_backend`, `prefill_context_parallel_size`, `decode_context_parallel_size`, etc. in the search space; *fix the bug*
- [x] Verify cross-session prefix hits in the anonymized traces
- [ ] Add human feedback to the context
- [ ] Build a global knowledge base for the agent so knowledge transfers across tests, cutting cold-start time
- [x] Summarize related works. Currently: 1. agents serving other domains (training/storage/...); 2. BO for tuning inference configs. Summarize their status and state the differences
- [ ] Test and reproduce https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning as the BO baseline
## 01.20
#### 1. AITuner
Finished a basically usable agent tuning system; now considering next-stage goals
https://x8csr71rzs.feishu.cn/docx/FyChdSYeKoiqzAxQTqScsiqjnvd
#### 2. Trace open-sourcing
https://code.alibaba-inc.com/shensijie.ssj/qwen-bailian-usagetraces-anon/tree/main
#### TODO
- [ ] Confirm the runtime configs of different models inside Ali, as baselines for comparison
- [x] Build the global knowledge base
- [x] During ready-check polling, if the vllm process exits (non-zero return code), fail fast immediately and parse the stderr/log tail into a structured `invalid_config`
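A sketch of that fail-fast ready check (names are placeholders; assumes the engine was launched with stderr piped):
```python
# Sketch of the fail-fast ready check (names are placeholders).
import subprocess
import time
import urllib.request

def wait_ready(proc: subprocess.Popen, url: str, timeout_s: int = 3000) -> dict:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        rc = proc.poll()
        if rc is not None and rc != 0:
            # Engine died during startup: fail fast, structure the log tail.
            tail = proc.stderr.read().decode(errors="replace")[-4000:]
            return {"tag": "invalid_config", "return_code": rc,
                    "stderr_tail": tail}
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return {"tag": "ready"}
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(5)
    return {"tag": "timeout"}
```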
## 01.28
#### Agentic AITuner
- Confirmed the config actually used in production with Tao; starting to compare against the real production setup
Outstanding problems:
- vLLM's DP is not true DP: it carries synchronization and similar overheads, and its performance differs noticeably from independent replicas
- Earlier measurements were direct latency comparisons, but production splits a single 8-GPU machine into 8 instances, each serving independently; tuning by latency comparison is therefore infeasible (it would need a global scheduler), and we must switch to comparing Goodput / GPU
- Measuring goodput is slow (binary search over load; see the sketch after this list): is there an equivalent transformation? With a single test taking time $T$, measuring goodput costs $O(T \cdot \log(QPS))$; can it be cut to $O(T)$?
- In tests so far the AI Tuner has not found a better config, and its tuning direction always differs from what I expect
- Finding: the long-term memory built earlier is now actively misleading. It was accumulated while optimizing latency, and some of its insights make the LLM always prefer multi-GPU parallelism; even when told it may use replicas, it barely responds
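The goodput binary search mentioned above, sketched (`run_trial` is a placeholder that replays the trace at a given load scale and reports whether the SLO pass rate held):
```python
# Sketch of goodput measurement via binary search over the load scale.
def max_goodput(run_trial, lo: float = 0.25, hi: float = 8.0,
                eps: float = 0.05) -> float:
    # Each run_trial call costs a full test of length T, so the total cost
    # is O(T * log((hi - lo) / eps)) -- the problem noted above.
    best = 0.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if run_trial(mid):      # SLO satisfied at this load
            best, lo = mid, mid
        else:
            hi = mid
    return best
```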
#### TODO
- [x] Fix the workload sampler issue
- [x] Fix the goodput binary search
- [x] Support disabling long-term memory
- [ ] When one replica need not occupy the whole machine, support multi-GPU parallelism
## 02.04
![[260410-105228-2.png]]
![[260410-105228-3.png]]
![[260410-105228-4.png]]
![[260410-105228-5.png]]
- SLO 5s TTFT + 0.1s TPOT | 5min
- qwen3-30b-a3b | trace A: finds a config better than baseline, but only by 5%, mainly from scheduling-batch changes that enlarge the batch size (3.36 vs 3.21)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260128-195623.jsonl
- qwen3-30b-a3b | thinking: uses tp2 + replica4 (1.51 vs 1.27)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-103543.jsonl
- qwen3-30b-a3b | coder: uses tp2 + replica4 (1.54 vs 1.24)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-101657.jsonl
- SLO (0.001L + 1s) TTFT + 0.1s TPOT | 5min
- qwen3-30b-a3b | trace A: uses tp2 + replica4 (3.31 vs 3.06)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260130-114633.jsonl
- qwen3-30b-a3b | thinking: uses "max_num_batched_tokens": 13312, "max_num_seqs": 104 (1.51 vs 1.24)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260130-115007.jsonl
- qwen3-30b-a3b | coder: uses tp2 + replica4 (1.54 vs 1.22)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260129-101657.jsonl
#### TODO
- [x] [Bug] After switching to Goodput-per-GPU comparison, the agent quits early, because the first 3 proposed configs all used 8 GPUs and failed
- [x] Test and analyze the coder 5min results
- [x] Test different SLOs; chose TTFT=5s, TPOT=40ms (rationale: under TP2 goodput is somewhat worse but perf is clearly better)
- [x] Support a linear TTFT SLO
- [x] Goodput should be monotonic: once the baseline is measured, all later runs should start at least from the baseline load and abort immediately if worse than baseline
- [x] Test the tuned config on a trace from a different time window
## 03.04
Experiment results: https://x8csr71rzs.feishu.cn/docx/HpkWdzRKroMq9VxWT9qcPpqEnke?from=from_copylink
Tuning-logic retrospective: https://x8csr71rzs.feishu.cn/docx/OTg7di8sdoBX0PxOhAEc4VosnCp?from=from_copylink
#### TODO
- [ ] Confirm the environment Bailian uses in production (vllm version / hardware) and run the comparison there
- [ ] See whether the agent can do more rigorous bottleneck analysis
## Ongoing
### Qwen3-30B
- SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs
- qwen3-30b-a3b | trace A: uses tp2 (2.634 vs 2.137)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-095345.jsonl
![[260410-105228-6.png]]
- qwen3-30b-a3b | thinking: uses tp4 (0.298 vs N/A; thinking uses 1200 reqs because a 3600-req test would take unacceptably long)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-194508.jsonl
![[260410-105228-7.png]]
- qwen3-30b-a3b | coder: uses tp4 (0.866 vs 0.612)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260203-095632.jsonl
![[260410-105228-8.png]]
- SLO (0.001L + 1s) TTFT + 0.05s TPOT | 7200 reqs
- qwen3-30b-a3b | trace A: uses tp2 (2.078 vs 1.784)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260204-011648.jsonl
![[260410-105228-9.png]]
- qwen3-30b-a3b | coder: uses tp4 (0.884 vs 0.569)
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260204-144911.jsonl
### Qwen3-30B with post-compare added
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003450.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003526.jsonl
v2: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260219-144823.jsonl (b=10s)
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-003543.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-234740.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-154224.jsonl
v2: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260220-104923.jsonl (b=10s)
b=1s: https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104251.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260208-234705.jsonl
### Qwen3-235B [FP8]
- qwen3-max as the agent
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.486 / 0.386
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222655.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
0.175 / 0.142
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260212-002234.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.347 / 0.313
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222825.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
0.491 / 0.415
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260211-133931.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.180 / 0.144
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260210-222757.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
0.366 / 0.290
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260211-192141.jsonl
- deepseek-r1 as the agent
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.440 / 0.386
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-110457.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 600 reqs | tail 600 reqs test
0.177 / 0.148
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260212-183021.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-163437.jsonl
- Trace A | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-110941.jsonl
- Thinking | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 1200 reqs | tail 1200 reqs test
0.182 / 0.144
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260213-110427.jsonl
- Coder | SLO (0.001L + 1s) TTFT + 0.05s TPOT | 3600 reqs | tail 3600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260214-164917.jsonl
### Comparison under different SLOs
- Qwen3-30-A3B | qwen3-max | SLO (0.001L + 1s) TTFT + 0.02s TPOT
- Trace A | | 1200 reqs | tail 1200 reqs test
1.173 / 0.771
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-112827.jsonl
- Thinking | 600 reqs | tail 600 reqs test
0.258 / 0.078
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260226-193932.jsonl
- Coder | 1200 reqs | tail 1200 reqs test
0.644 / 0.450
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-224109.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.0005L + 0.5s) TTFT + 0.05s TPOT
- Trace A | | 1200 reqs | tail 1200 reqs test
0.258 / 0.201
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260222-112420.jsonl
- Thinking | 600 reqs | tail 600 reqs test
0.098 / 0.074
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260221-111155.jsonl
- Coder | 1200 reqs | tail 1200 reqs test
0.219 / 0.163
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260220-180145.jsonl
### Full performance trends
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Trace A | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104637.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Thinking | 600 reqs | tail 600 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260228-115937.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.1s TPOT | Coder | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260225-104917.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Trace A | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260227-104815.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Coder | 1200 reqs | tail 1200 reqs test
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260227-104732.jsonl
- Qwen3-235B-A22B-FP8 | qwen3-max | SLO (0.001L + 1s) TTFT + 0.05s TPOT | Thinking | 1200 reqs | tail 1200 reqs test
==waiting on dash2==
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --time-scale 4.0 --objective 'maxmize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only --disable-goodput-best-scale-compare
```
### Latency Compare
Qwen3-235B-A22B | Trace A | 60min | TP4
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-121540/compare_summary.json
Qwen3-235B-A22B | Thinking | 60min | TP4
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-162510/compare_summary.json
Qwen3-235B-A22B | Coder | 60min
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-162616/compare_summary.json (TP4 is bad, match the AITuner)
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-212148/compare_summary.json (TP8)
https://nas.gahow.org/webdav/trace_compare/dash/compare_260302-001737/compare_summary.json
Qwen3-235B-A22B | Coder | 30min | TP8
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-225747/compare_summary.json
Qwen3-30B-A3B | Trace A | 30min | TP2
https://nas.gahow.org/webdav/trace_compare/dash/compare_260301-225950/compare_summary.json
```bash
./run_latency_cmp.sh --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --min-requests 3600
./run_latency_cmp.sh --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --trace-minutes 60 --candidate-config baseline/qwen3-235b-a22b_prefill_ep_tp8.json --baseline-time-scale 1.0 --candidate-time-scale 2.0
```
### Reasoning summary
- [ ] Summarize the agentic AITuner's analyses: how configs are derived from the context
Trace A `ai_tuner_260210-222655.llm.parsed.jsonl`
The run first tries changing TP and sees no gain; it then enables EP + high_throughput and sees a clear gain (1.07 vs 0.88); it then tweaks batching-related parameters, finds that larger batching causes SLO violations, and rolls back to the default batching parameters. Interestingly, at that point the model re-runs the best config found so far to confirm its performance before making the recommendation.
The LLM notices that changing the EP backend brings little benefit and quickly gives up trying other EP backends.
Problems:
- TP=2 had already failed once, yet the LLM recommended it again
- Much time is spent re-proposing the same config just to test it at different time_scales
> After extensive exploration of 16 successful configurations, the optimal setup for maximizing goodput per GPU while meeting SLOs is to enable expert parallelism with the deepep_high_throughput all2all backend. This configuration consistently achieves ~1.09 goodput QPS per GPU at time_scale=1.33, which is a 23% improvement over the baseline (0.88). Key findings: 1) Expert parallelism is crucial for this MoE model, 2) Conservative batching parameters (8192 tokens, 64 sequences) are necessary to maintain SLO compliance, 3) Block size of 32 performs slightly better than 16, 4) Other all2all backends (low_latency, flashinfer, pplx) don't provide significant benefits over high_throughput, 5) Pipeline parallelism and smaller tensor parallelism configurations either failed or didn't improve performance.
- Forcing extra steps after the LLM has already decided to finalize is useless; it just keeps making pointless attempts.
- The model does not do rigorous logical deduction or bottleneck analysis of the form "since I observe xxx, and based on learned yyy, I conclude that enabling EP mitigates the problem". What it actually does is: "changing TP didn't help much; this is a MoE model, so let me try enabling EP"
Trace A `ai_tuner_260210-222825.llm.parsed.jsonl`
### Ali Deployment
[[Ali Deployment]]
Qwen3-235B prod configs:
1. 【Prod】【P】【0-8K&SSE】【bj-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.5a39103cAE5kRw
2. 【Prod】【Chat】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-chat/deployments/qwen3-235b-a22b-chat-1b9d
3. 【Prod】【bj-96-d】【fp4】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-model/deployments/qwen3-235b-a22b-model-8b5d?spm=43a6e6f6.5d389483.0.0.7eb6103cee6UDx
#### Trace A | 10min
- Ali Baseline
```
"ttft_mean_ms": "855.756",
"ttft_p50_ms": "394.927",
"ttft_p90_ms": "2053.206",
"ttft_p95_ms": "2908.505",
"ttft_p99_ms": "4271.465"
```
- Community version
```
"ttft_mean_ms": "516.033",
"ttft_p50_ms": "187.580",
"ttft_p90_ms": "1301.450",
"ttft_p95_ms": "1727.253",
"ttft_p99_ms": "3665.331"
```
```bash
python3 scripts/check_prod_dashllm_flow.py --model-path /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 --model-name pre-engine-config-tuning-research --endpoint https://poc-dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation --engine-config-path /usr/local/lib/python3.12/dist-packages/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20.config --authorization-bearer-token-env sk-1fdaf28f959c442996634bd38f6e07b4
```
`/usr/local/bin/dashllm_cmd serving`
`/usr/local/lib/python3.12/dist-packages/vllmgen` contains the configs actually in use
the configs under `/usr/local/lib/python3.12/dist-packages/dashgen` are stale
`DS_LLM_MULTI_ENGINE_NUM=2`
`export DS_SERVING_PROTOCOL=openai`
For EP
```
VLLM_MOE_USE_DEEPEP
VLLM_USE_DEEP_GEMM
"enable_expert_parallel": true,
```
```
AcclEP: checkhang=0, hangtimeout=0
AcclEP: checkhang=0, hangtimeout=0
AcclEP: checkhang=0, hangtimeout=0
[0/4,TP0/EP0][pid=126019] AcclEP: Buffer init with: nRanks 4, num_nvl_bytes: 104.940416MB, num_rdma_bytes: 270.009472MB, low_latency_mode: True, num_qps_per_rank: 24, allow_nvlink_for_low_latency_mode: True, allow_mnnvl: False
AcclEP version: 1.1.0.9.4+384c657-pai
AcclEP: ACCL_LOW_LATENCY_OPTIMIZE = 2
AcclEP: NVSHMEM_IBGDA_NUM_RC_PER_PE = 4
AcclEP: checkhang=0, hangtimeout=0
AcclEP: Initialize low latency buffer:
rank = 0
num_ranks = 4
num_nvl_bytes = 104940416
num_rdma_bytes = 270 MB
low_latency_mode = 1
dispatch_use_fp8 = 1
combine_use_fp8 = 0
low_latency_optimize = 2
use_single_buffer = 0
buffer_opt_fp8 = 0
buffer_max_tokens = 0
buffer_max_topk = 0
```
Fixing the trace-replayer build on the dash machine:
```bash
# install ssl-related packages
sudo apt update
sudo apt install -y pkg-config libssl-dev
# work around the build problem on nfs by building locally
export CARGO_HOME=/tmp/cargo
export CARGO_TARGET_DIR=/tmp/target
```
`DASH_GEN_ENABLE=1` is required to make sure dashgen is used
`export MODEL_PATH=/root/dashllm/serving/AnyModel/model`
TODO:
- [x] Launch dashllm and send HTTP requests locally
> Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
> [WARNING] 2026-03-06 00:17:44,517037: {"filename":"vllm.py","lineno":784,"message":"DeepEP (expert parallel) doesn't support full cudagraphs for prefill/mixed batches. Overriding c
### Comparing the capabilities of different agent models
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl --time-scale 2.0 --objective 'maxmize goodput per GPU' --llm-provider prism --llm-model 'claude-sonnet-4-6' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only
```
Coder | 1200 reqs
- Sonnet-4.6
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260308-174846.jsonl
- GPT-5.2
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260309-105131.jsonl
- Gemini-3.1-pro
https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260310-084234.jsonl
### Testing codex doing the tuning directly
[[codex-problems]]
`codex -a never -s workspace-write`
- v3: refined from v2
```
You are an autonomous systems tuning agent.
Your task is to discover the best configuration for a vLLM serving system
by running real experiments.
Your objective is to maximize throughput per GPU while satisfying strict
SLO constraints.
You must design the tuning strategy, run experiments, analyze results,
and iteratively improve the configuration.
You are starting from an EMPTY PROJECT DIRECTORY.
You must build the minimal experimental framework needed to complete
this task.
Your goal is NOT simply to find a good configuration quickly.
Your goal is to explore the configuration space sufficiently and provide
evidence that the final configuration is close to optimal for this workload.
Prematurely declaring a configuration as "best" without sufficient
exploration is considered an incomplete solution.
--------------------------------------------------
Workspace and File Access Rules (STRICT)
--------------------------------------------------
Your primary workspace is the CURRENT PROJECT DIRECTORY.
You may freely read and modify any file located inside this directory.
This includes:
- scripts you create
- experiment logs
- helper utilities
- the Python virtual environment located at:
.venv/
You may also inspect the vLLM source code located at:
.venv/lib/python3.12/site-packages/vllm
Reading this code is allowed so you can understand:
- available configuration parameters
- scheduler behavior
- KV cache implementation
- MoE / expert parallel configuration
- batching and token scheduling logic
--------------------------------------------------
Allowed External Read-Only Resources
--------------------------------------------------
You are allowed to READ (but not modify) the following paths.
Model directory:
~/models/Qwen/Qwen3-235B-A22B-FP8
You may inspect the model files to understand:
- model architecture
- tensor shapes
- layer structure
- MoE configuration
- hidden sizes
- attention configuration
Workload trace:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
You may read this file to load requests.
--------------------------------------------------
Forbidden Paths
--------------------------------------------------
You MUST NOT read or use code from any other locations.
Do NOT access:
- parent directories (..)
- other repositories
- external tuning frameworks
- any existing ai_tuner implementations
- system-wide Python installations
- unrelated directories
All tuning logic must be implemented by you
inside the current project directory.
--------------------------------------------------
Environment Setup
--------------------------------------------------
Before running any experiment activate the environment:
source .venv/bin/activate
Use the vLLM installation provided in this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use ONLY the FIRST 1200 requests in the trace.
Evaluation mode:
prefill-only
--------------------------------------------------
Serving Requirement
--------------------------------------------------
You must run experiments using the vLLM SERVING ENGINE.
Launch the system using the server interface
(e.g. `vllm serve` or equivalent server mode).
Do NOT use the offline LLM inference API.
The system must behave as a real serving system with
request batching and scheduling.
--------------------------------------------------
Experimental Framework
--------------------------------------------------
You must implement a minimal experiment framework in this directory.
You may create scripts such as:
trace_loader.py
workload_runner.py
launch_vllm.py
metrics.py
experiment_runner.py
tuner.py
These scripts should allow you to:
1. Load the first 1200 requests from the workload trace
2. Launch vLLM with a specified configuration
3. Send requests to the serving system
4. Run evaluation in prefill-only mode
5. Measure latency and throughput metrics
6. Compute SLO pass rate
7. Log experiment results
All experiments must be reproducible.
--------------------------------------------------
Optimization Objective
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to the following SLO constraints.
1. TTFT constraint
TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length of the request.
2. TPOT constraint
TPOT <= 0.1 seconds
3. SLO pass rate
>= 95%
Only configurations satisfying ALL constraints
are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
You may tune any relevant vLLM configuration parameters
supported by the installed version.
Examples include (but are not limited to):
- tensor_parallel_size
- pipeline_parallel_size
- data_parallel_size
- expert_parallel_size
- block_size
- max_num_batched_tokens
- max_num_seqs
- gpu_memory_utilization
- kv cache parameters
- scheduler parameters
- CUDA graph settings
- kernel optimization flags
You should inspect the vLLM source code if necessary
to discover available configuration options.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
Every experiment MUST append a JSON object to:
tuning_results.jsonl
Example format:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This log file must allow reconstruction of the full tuning trajectory.
--------------------------------------------------
Exploration Memory
--------------------------------------------------
You must maintain an exploration memory file:
exploration_ledger.json
This file should track:
- tested configurations
- configurations that violate SLO
- current best configuration
- hypotheses about parameter effects
- unexplored promising regions
Future experiments must use this memory to guide search.
--------------------------------------------------
Search Strategy
--------------------------------------------------
Before running experiments you must propose a search plan.
The plan should describe:
- which parameters will be explored first
- which parameters are likely to impact throughput
- how initial exploration will cover diverse regions
- how later experiments will refine promising areas
The search should typically include the following phases:
1. Initial exploration
Evaluate diverse configurations to understand
the performance landscape.
2. Frontier discovery
Identify promising regions of the configuration space.
3. Local refinement
Perform targeted experiments near the best configurations.
4. Verification
Confirm the best configuration through repeated runs
and nearby configurations.
--------------------------------------------------
Search Termination Criteria
--------------------------------------------------
You must continue exploring the configuration space
until there is evidence that performance has converged.
Search may stop when ALL of the following conditions hold:
1. The best throughput_per_GPU has not improved by more
than 2% over a significant number of recent experiments.
2. Several neighboring configurations of the current best
configuration have been tested.
3. Additional exploration appears unlikely to produce
significant improvements.
You must explain why the search can reasonably stop.
--------------------------------------------------
Local Optimality Verification (MANDATORY)
--------------------------------------------------
Before declaring the final configuration you MUST test
multiple configurations that are small variations of
the current best configuration.
Examples include:
- changing block_size
- adjusting max_num_batched_tokens
- adjusting max_num_seqs
- small changes in parallelism configuration
If a neighboring configuration performs better while
satisfying SLO constraints, the search must continue.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
All configurations must be evaluated using the SAME dataset:
first 1200 requests from
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Evaluation mode:
prefill-only
For each configuration measure:
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
All runs must use the same measurement procedure.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
When tuning finishes output the following.
1. Best configuration
Provide as JSON or YAML.
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Total number of experiments executed
6. Table of the TOP 10 configurations ranked by throughput_per_GPU
7. Path to the full experiment log
tuning_results.jsonl
8. Explanation of why the chosen configuration is likely
close to optimal within the explored search space.
--------------------------------------------------
Goal
--------------------------------------------------
Discover the best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints.
The final configuration must be supported by real
experimental evidence and maximize:
throughput_per_GPU
```
- v2: fully independent tuning
```
You are an autonomous systems tuning agent.
Your task is to discover the best configuration for a vLLM serving system
by running real experiments.
Your objective is to maximize throughput per GPU while satisfying strict
SLO constraints.
You must design the tuning strategy, run experiments, analyze results,
and iteratively improve the configuration.
You are starting from an EMPTY PROJECT DIRECTORY.
You must build the minimal experimental framework needed to complete
this task.
--------------------------------------------------
Workspace and File Access Rules (STRICT)
--------------------------------------------------
Your primary workspace is the CURRENT PROJECT DIRECTORY.
You may freely read and modify any file located inside this directory.
This includes:
- scripts you create
- experiment logs
- helper utilities
- the Python virtual environment located at:
.venv/
You may also inspect the vLLM source code located at:
.venv/lib/python3.12/site-packages/vllm
Reading this code is allowed so you can understand:
- available configuration parameters
- scheduler behavior
- KV cache implementation
- MoE / expert parallel configuration
- batching and token scheduling logic
--------------------------------------------------
Allowed External Read-Only Resources
--------------------------------------------------
You are allowed to READ (but not modify) the following paths:
Model directory:
~/models/Qwen/Qwen3-235B-A22B-FP8
You may inspect the model files to understand:
- model architecture
- tensor shapes
- layer structure
- MoE configuration
- hidden sizes
- attention configuration
Workload trace:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
You may read this file to load requests.
--------------------------------------------------
Forbidden Paths
--------------------------------------------------
You MUST NOT read or use code from any other locations.
Do NOT access:
- parent directories (..)
- other repositories
- external tuning frameworks
- any existing ai_tuner implementations
- system-wide Python installations
- unrelated directories
All tuning logic must be implemented by you
inside the current project directory.
--------------------------------------------------
Environment Setup
--------------------------------------------------
Before running any experiment, activate the environment:
source .venv/bin/activate
Use the vLLM installation provided in this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use ONLY the FIRST 1200 requests in the trace.
Evaluation mode:
prefill-only
--------------------------------------------------
Serving Requirement
--------------------------------------------------
You must run experiments using the vLLM SERVING ENGINE.
Launch the system using the server interface
(e.g. `vllm serve` or equivalent server mode).
Do NOT use the offline LLM inference API.
The system must behave as a real serving system with
request batching and scheduling.
--------------------------------------------------
Experimental Framework
--------------------------------------------------
You must implement a minimal experiment framework in this directory.
You may create scripts such as:
trace_loader.py
workload_runner.py
launch_vllm.py
metrics.py
experiment_runner.py
tuner.py
These scripts should allow you to:
1. Load the first 1200 requests from the workload trace
2. Launch vLLM with a specified configuration
3. Send requests to the serving system
4. Run evaluation in prefill-only mode
5. Measure latency and throughput metrics
6. Compute SLO pass rate
7. Log experiment results
All experiments must be reproducible.
--------------------------------------------------
Optimization Objective
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to the following SLO constraints.
1. TTFT constraint
TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length of the request.
2. TPOT constraint
TPOT <= 0.1 seconds
3. SLO pass rate
>= 95%
Only configurations satisfying ALL constraints
are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
You may tune any relevant vLLM configuration parameters
supported by the installed version.
Examples include (but are not limited to):
- tensor_parallel_size
- pipeline_parallel_size
- data_parallel_size
- expert_parallel_size
- block_size
- max_num_batched_tokens
- max_num_seqs
- gpu_memory_utilization
- kv cache parameters
- scheduler parameters
- CUDA graph settings
- kernel optimization flags
You should inspect the vLLM source code if necessary
to discover available configuration options.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
Every experiment MUST append a JSON object to:
tuning_results.jsonl
Example format:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This log file must allow reconstruction of the full tuning trajectory.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
All configurations must be evaluated using the SAME dataset:
first 1200 requests from
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Evaluation mode:
prefill-only
For each configuration measure:
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
All runs must use the same measurement procedure.
--------------------------------------------------
Search Strategy
--------------------------------------------------
Use a structured exploration strategy.
Recommended phases:
1. Initial exploration
Run diverse configurations to understand
the performance landscape.
2. Frontier discovery
Identify promising regions of the configuration space.
3. Local refinement
Fine-tune parameters near the best configurations.
4. Confirmation
Re-run the best configuration to verify stability.
Avoid wasting time on configurations that clearly violate
SLO constraints.
Avoid re-running identical configurations unless verifying
the best result.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
When tuning finishes, output the following.
1. Best configuration
Provide as JSON or YAML.
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Summary table of key tested configurations
6. Path to the full experiment log
tuning_results.jsonl
7. Short explanation of why the chosen configuration
is optimal.
--------------------------------------------------
Goal
--------------------------------------------------
Discover the best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints.
The final configuration must be supported by real
experimental evidence and maximize:
throughput_per_GPU
```
prompt:
```
You are an autonomous systems tuning agent.
Your goal is to tune the configuration of a vLLM serving system to maximize throughput per GPU for a specific workload, while satisfying strict SLO constraints.
You must run real experiments, analyze results, and iteratively improve the configuration.
--------------------------------------------------
Environment Setup
--------------------------------------------------
First activate the Python environment:
source .venv/bin/activate
Use the vLLM installation inside this environment.
Model to serve:
~/models/Qwen/Qwen3-235B-A22B-FP8
Workload trace file:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Use only the FIRST 1200 requests in the trace as the evaluation dataset.
Evaluation mode:
prefill-only
--------------------------------------------------
Important Restrictions
--------------------------------------------------
1. DO NOT use my ai_tuner code or its search logic.
2. You MAY use simple helper utilities such as workload samplers or trace loaders.
3. All tuning logic must be designed and executed by yourself.
4. Every configuration must be evaluated by actually running the system.
5. Avoid repeating identical configurations unless confirming the best result.
--------------------------------------------------
Optimization Goal
--------------------------------------------------
Maximize:
throughput_per_GPU
Subject to SLO constraints:
1. TTFT <= (0.001 * L + 1.0) seconds
where L is the input_length.
2. TPOT <= 0.1 seconds
3. SLO pass rate >= 95%
Only configurations satisfying all SLO constraints are considered valid.
--------------------------------------------------
Parameters You May Tune
--------------------------------------------------
Tune relevant vLLM serving parameters, including (but not limited to):
- tensor_parallel_size
- pipeline_parallel_size (if applicable)
- data_parallel_size (if applicable)
- expert_parallel_size / MoE related options
- block_size
- max_num_batched_tokens
- max_num_seqs
- scheduling / batching parameters
- gpu_memory_utilization
- kv-cache settings
- cuda graph settings
- compilation / kernel optimization knobs
Only tune parameters supported by the installed vLLM version.
--------------------------------------------------
Experiment Logging (MANDATORY)
--------------------------------------------------
You MUST record every experiment result in a file:
tuning_results.jsonl
Each experiment must append one JSON object to this file.
Example schema:
{
"experiment_id": 7,
"config": {
"tensor_parallel_size": 8,
"block_size": 16,
"max_num_batched_tokens": 32768
},
"metrics": {
"throughput": 52.3,
"throughput_per_gpu": 6.54,
"ttft_mean": 0.84,
"ttft_p95": 1.12,
"tpot_mean": 0.041,
"tpot_p95": 0.078,
"slo_pass_rate": 0.972
},
"slo_satisfied": true,
"notes": "Increasing batching improved throughput without violating TTFT."
}
This file must allow someone to reconstruct the entire tuning trajectory and observe how performance evolves during the search.
--------------------------------------------------
Evaluation Methodology
--------------------------------------------------
Use the first 1200 requests from:
qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl
Run the workload in:
prefill-only mode
For each configuration measure:
- throughput
- throughput per GPU
- TTFT statistics
- TPOT statistics
- SLO pass rate
All runs must use the same dataset and measurement procedure.
--------------------------------------------------
Tuning Strategy
--------------------------------------------------
Use a structured exploration strategy.
Suggested approach:
1. Initial exploration
- test a diverse set of configurations to understand the performance landscape
2. Frontier search
- identify promising regions
3. Local refinement
- fine-tune parameters near the best configs
4. Confirmation
- rerun the final best configuration once more
Avoid wasting time on configurations that clearly violate SLO constraints.
--------------------------------------------------
Execution Policy
--------------------------------------------------
You are allowed to:
- inspect repository files
- write helper scripts
- write workload runners
- write measurement scripts
If necessary, create scripts to:
- load the first 1200 requests
- launch vLLM with a given config
- run prefill-only evaluation
- compute metrics
- append results to tuning_results.jsonl
Ensure experiments are reproducible.
--------------------------------------------------
Final Deliverables
--------------------------------------------------
At the end of the tuning process output:
1. Best configuration (JSON or YAML)
2. Exact vLLM launch command
3. Exact evaluation command or script
4. Final measured metrics
- throughput
- throughput_per_GPU
- TTFT mean / p95 / p99
- TPOT mean / p95 / p99
- SLO pass rate
5. Summary table of key tested configurations
6. Path to the full experiment log:
tuning_results.jsonl
7. Brief explanation of why the chosen configuration is optimal.
--------------------------------------------------
Goal
--------------------------------------------------
Find the single best vLLM configuration for serving
Qwen3-235B-A22B-FP8
on this workload under the given SLO constraints, based on real experimental evidence.
```
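All three prompts share the same SLO definition and logging contract. A minimal sketch of both, following the example schema above; the per-request inputs (`input_length`, TTFT/TPOT in seconds) are assumptions about what the workload runner records:
```python
import json

def request_slo_ok(input_length, ttft_s, tpot_s):
    # Per-request SLO from the prompts:
    # TTFT <= 0.001 * L + 1.0 seconds, TPOT <= 0.1 seconds.
    return ttft_s <= 0.001 * input_length + 1.0 and tpot_s <= 0.1

def log_experiment(experiment_id, config, metrics,
                   path="tuning_results.jsonl"):
    # One JSON object per line, appended, so the file preserves
    # the full tuning trajectory (required by all three prompts).
    record = {
        "experiment_id": experiment_id,
        "config": config,
        "metrics": metrics,
        "slo_satisfied": metrics["slo_pass_rate"] >= 0.95,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```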
### Observations
- The long-term memory produced by tuning is too specific to generalize: it looks like the agent directly summarized which configs in the current run were good or bad and what effect changing individual flags had (e.g., a raw insight like "TP=4 is good"). Such memory may actually hurt transfer; the model never distills the higher-level, transferable (model, hardware) -> config knowledge.
- An interesting point: on the thinking trace, a 50 ms TPOT SLO is violated already at the baseline, even with a sufficiently small time_scale (0.0625), so the characteristics of the trace/model/hardware determine which SLOs are feasible in the first place.
- During the trend adjustments, a single core config knob can decisively dominate performance, while the other knobs are largely minor touch-ups (see https://nas.gahow.org/webdav/ai_tuner_logs/dash/ai_tuner_260224-104637.jsonl)
- For Qwen3-30B on the thinking trace, the SLO "TTFT (0.001L + 1s) + 0.02 s TPOT" makes a large share of configs illegal outright; runs then drag on for a long time, eventually accumulating a context longer than the model's maximum context length and throwing an error.
### Long-term TODO
- [x] Support testing the tuned config on a trace from a different time period after tuning
- [x] Re-test the tuned config on a trace from another time period and compare performance
- [ ] A time-scaled trace is regenerated for every benchmark, which is redundant; fixable [low priority]
- [x] Confirm the Qwen3-30B results for the thinking trace with 1200 requests
- [ ] SLO pass rate / TTFT distribution bucketed by L_in (0-1k, 1k-2k, ..., 4k-8k)
- [x] Implement a script that compares two configs directly at the same time_scale, without binary search
- [ ] Compare performance against the internal vllm
- [x] Use `--disable-goodput-best-scale-compare` to get the trends plots
- [ ] Implement explain_tradeoff(a, b) → compare two configs
- [ ] Compare against bailian's vllm
```json
"best": {"run_id": "agent-3-goodput", "success": true, "aggregated": {"latency_ms_mean": 20224.639, "latency_ms_p50": 14211.797, "latency_ms_p90": 45184.48, "latency_ms_p95": 56841.231, "latency_ms_p99": 86661.03, "ttft_ms_mean": 263.386, "ttft_ms_p50": 180.564, "ttft_ms_p90": 542.365, "ttft_ms_p95": 752.534, "ttft_ms_p99": 1173.739, "tpot_ms_mean": 53.722, "tpot_ms_p50": 51.458, "tpot_ms_p90": 93.986, "tpot_ms_p95": 98.752, "tpot_ms_p99": 107.601, "overall_qps": 6.117, "overall_tokens_per_second": 13851.267873746021, "decode_tokens_throughput": 2285.326, "total_requests": 2114.0, "decode_tokens_total": 789757.0, "tokens_per_second_mean": 26.38564782306047, "tokens_per_second_p50": 19.931698614673515, "tokens_per_second_p95": 58.066780649606464, "tokens_per_second_p99": 85.59372084463882, "request_throughput": 6.117, "notes_count": 1.0, "failed_request_count": 0.0, "ttft_slo_pass_rate": 0.9981078524124882, "goodput_qps": 26.534677249842353, "goodput_qps_per_gpu": 3.316834656230294, "goodput_tps": 55405.071494984084, "goodput_tps_per_gpu": 6925.6339368730105, "goodput_slo_ok": 1.0}, "config": {"tensor_parallel_size": 2, "pipeline_parallel_size": 1, "data_parallel_size": 1, "prefill_context_parallel_size": null, "decode_context_parallel_size": null, "block_size": 32, "max_num_batched_tokens": 131072, "max_num_seqs": 1024, "scheduling_policy": null, "gpu_memory_utilization": null, "enable_expert_parallel": null, "all2all_backend": null, "max_model_len": 65536, "enable_prefix_caching": true, "extra": null}}
```
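Sanity check on the record above: goodput_qps_per_gpu = goodput_qps / 8 = 26.534 / 8 ≈ 3.317, i.e., the per-GPU figure is normalized by all 8 GPUs of the replica-scaled node, not by the TP=2 of a single instance.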
Test TODO
- [x] qwen3-30b-a3b | traceA | 60min
- [x] qwen3-30b-a3b | thinking | 60min
- [x] qwen3-30b-a3b | coder | 60min
- [x] qwen3-235b-a22b | traceA | 60min
- [x] qwen3-235b-a22b | thinking | 60min
- [x] qwen3-235b-a22b | coder | 60min
#### Misc
Comparison of vllm latest vs. the bailian vllm: vllm latest performs better (on the same workload: TTFT mean 458 ms vs. 826 ms, TPOT mean 35.8 ms vs. 58.5 ms)
- vllm latest
```json
{
"latency_ms_mean": 13343.749,
"latency_ms_p50": 10129.267,
"latency_ms_p90": 28361.724,
"latency_ms_p95": 36335.392,
"latency_ms_p99": 62868.229,
"ttft_ms_mean": 458.187,
"ttft_ms_p50": 154.414,
"ttft_ms_p90": 1132.518,
"ttft_ms_p95": 2089.66,
"ttft_ms_p99": 3884.956,
"tpot_ms_mean": 35.788,
"tpot_ms_p50": 32.182,
"tpot_ms_p90": 59.733,
"tpot_ms_p95": 64.583,
"tpot_ms_p99": 72.093,
"overall_qps": 2.735,
"overall_tokens_per_second": 5624.112937416796,
"decode_tokens_throughput": 982.428,
"total_requests": 926.0,
"decode_tokens_total": 332623.0,
"tokens_per_second_mean": 35.821798532876514,
"tokens_per_second_p50": 32.8715120938958,
"tokens_per_second_p95": 61.36676048962492,
"tokens_per_second_p99": 120.0,
"request_throughput": 2.735,
"notes_count": 0.0,
"failed_request_count": 0.0,
"ttft_slo_pass_rate": 0.9503239740820735,
"goodput_qps": 24.5244744278846,
"goodput_qps_per_gpu": 3.065559303485575,
"goodput_tps": 44992.90349933437,
"goodput_tps_per_gpu": 5624.112937416796,
"goodput_slo_ok": 1.0
}
```
- bailian
```json
{
"duration_ms": "361609.694",
"e2e_mean_ms": "22361.979",
"e2e_p90_ms": "44006.550",
"e2e_p95_ms": "51007.482",
"e2e_p99_ms": "87502.531",
"output_tokens_total": "332623",
"requests_success": "579",
"requests_total": "926",
"throughput_rps": "2.561",
"throughput_tps": "919.840",
"tpot_mean_ms": "58.537",
"tpot_p90_ms": "84.703",
"tpot_p95_ms": "104.625",
"tpot_p99_ms": "132.758",
"ttft_mean_ms": "826.176",
"ttft_p90_ms": "2040.190",
"ttft_p95_ms": "5247.990",
"ttft_p99_ms": "6321.622"
}
```
```json
{
"duration_ms": "357189.716",
"e2e_mean_ms": "24407.132",
"e2e_p90_ms": "46008.208",
"e2e_p95_ms": "57420.132",
"e2e_p99_ms": "91012.090",
"output_tokens_total": "789757",
"requests_success": "969",
"requests_total": "2114",
"throughput_rps": "5.918",
"throughput_tps": "2211.030",
"tpot_mean_ms": "69.447",
"tpot_p90_ms": "119.914",
"tpot_p95_ms": "135.565",
"tpot_p99_ms": "179.668",
"ttft_mean_ms": "316.557",
"ttft_p90_ms": "790.313",
"ttft_p95_ms": "970.607",
"ttft_p99_ms": "1743.996"
}
```
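A side note on the comparison above: the two summaries use different field names (`ttft_ms_mean` vs. `ttft_mean_ms`) and bailian serializes numbers as strings, so a direct diff needs a small normalization shim. A minimal sketch; the alias map covers only the fields shown above, and the file paths are hypothetical:
```python
import json

# Map bailian's field names onto the vllm-latest schema.
ALIASES = {
    "ttft_mean_ms": "ttft_ms_mean",
    "ttft_p95_ms": "ttft_ms_p95",
    "tpot_mean_ms": "tpot_ms_mean",
    "tpot_p95_ms": "tpot_ms_p95",
    "throughput_rps": "overall_qps",
}

def load_summary(path):
    with open(path) as f:
        raw = json.load(f)
    # Normalize names and coerce bailian's string-encoded numbers.
    return {ALIASES.get(k, k): float(v) for k, v in raw.items()}

def compare(path_a, path_b,
            keys=("ttft_ms_mean", "ttft_ms_p95", "tpot_ms_mean", "overall_qps")):
    a, b = load_summary(path_a), load_summary(path_b)
    for k in keys:
        print(f"{k}: {a[k]:.3f} vs {b[k]:.3f} ({a[k] / b[k]:.2f}x)")
```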
---
## Test Commands
```bash
python3 scripts/print_vllm_launch_cmd.py --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --config-file test_config.json
```
```bash
vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-235B-A22B --port 10086 --tensor-parallel-size 4 --block-size 64 --max-num-batched-tokens 8192 --max-num-seqs 64 --gpu-memory-utilization 0.8 --max-model-len 131072 --enable-prefix-caching --cudagraph-capture-sizes 16 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608 640 672 704 736 768 800 832 864 896 928 960 992 1024 --disable-hybrid-kv-cache-manager --enable-chunked-prefill --kv-cache-dtype fp8 --quantization fp8 --swap-space 0
```
- cmd template
```bash
./run_ai.sh --trace-minutes 60 --max-requests 1200 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-235b-a22b_prefill.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-max-requests 1200 --model-path ~/models/Qwen/Qwen3-235B-A22B-FP8 --trace-replayer-engine-ready-timeout 3000 --prefill-only
```
```bash
./run_ai.sh --trace-minutes 60 --max-requests 3600 --max-llm-calls 100 --max-steps 100 --ttft-slo 20 --tpot-slo 0.05 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen-bailian-usagetraces-anon/qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-30b-a3b.json --min-proposed-runs 16 --disable-memory-context --disable-cache --compare-best-vs-baseline --compare-trace-minutes 60 --compare-max-requests 3600
```
```sh
./run_ai.sh --trace-minutes 5 --max-llm-calls 100 --max-steps 100 --ttft-slo 5 --tpot-slo 0.1 --ttft-slo-linear "k=0.001,b=1.0,pass_rate=0.95" --trace-path qwen_traceA_blksz_16.jsonl --time-scale 8.0 --objective 'maximize goodput per GPU' --llm-model 'qwen3-max' --replica-eval-mode replica_scaling --baseline-config baseline/qwen3-30b-a3b.json --min-proposed-runs 16 --disable-memory-context --disable-cache
```
```sh
./run_ai.sh --trace-minutes 60 --max-llm-calls 30 --ttft-slo 5 --tpot-slo 0.1 --trace-path qwen_traceA_blksz_16.jsonl --time-scale 4.0 --objective 'maximize goodput per GPU' --llm-model 'deepseek-r1' --replica-eval-mode replica_scaling
```
```json
{
"run_config": {
"run_id": "agent-1-probe-0",
"env": {
"hardware_sig": "NVIDIA H20-47121c5fff",
"model_sig": "Qwen3-30B-A3B-f1f030d7e6",
"workload_sig": "qwen_traceA_blksz_16-f006af7ea7"
},
"engine": "trace-replayer",
"vllm_config": {
"tensor_parallel_size": 8,
"pipeline_parallel_size": null,
"data_parallel_size": 1,
"prefill_context_parallel_size": null,
"decode_context_parallel_size": null,
"block_size": 32,
"max_num_batched_tokens": 65536,
"max_num_seqs": 512,
"scheduling_policy": null,
"gpu_memory_utilization": null,
"enable_expert_parallel": true,
"all2all_backend": "allgather_reducescatter",
"max_model_len": 65536,
"enable_prefix_caching": null,
"extra": null
},
"workload": {
"workload_name": "qwen_traceA_blksz_16",
"qps": 40.65387538500513,
"avg_prompt_tokens": 2146.006614190445,
"p95_prompt_tokens": 8515.0,
"avg_decode_tokens": 413.138952662075,
"p95_decode_tokens": 997.0,
"request_type": "trace_replay",
"source_file": "qwen_traceA_blksz_16.jsonl",
"duration_minutes": 7.4999,
"time_scale": "8.0",
"p99_total_tokens": 14357.0
},
"objective": "maxmize goodput per GPU",
"extra": {
"trace_path": "qwen_traceA_blksz_16.jsonl",
"trace_minutes": 5.0,
"time_scale": 8.0,
"min_input_length": null,
"max_input_length": null,
"min_requests": 100,
"show_progress": true,
"model_path": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"trace_replayer_bin": "./trace-replayer/target/release/client",
"trace_replayer_endpoint": "http://localhost:10086/v1/chat/completions",
"trace_replayer_tokenizer": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B//tokenizer.json",
"trace_replayer_tokenizer_config": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B//tokenizer_config.json",
"trace_replayer_api": "openai",
"trace_replayer_dataset": "bailian",
"trace_replayer_dataset_path": "qwen_traceA_blksz_16.jsonl",
"trace_replayer_scale_factor": 8.0,
"trace_replayer_time_secs": 300.0,
"trace_replayer_num_producer": "32",
"trace_replayer_channel_capacity": "40960",
"trace_replayer_metric_percentile": "50,90,95,99",
"trace_replayer_early_stop_error_threshold": 3,
"trace_replayer_model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/",
"trace_replayer_engine_launch_cmd": "vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/ --port 10086",
"trace_replayer_engine_launch_cwd": null,
"trace_replayer_engine_ready_timeout_s": 600.0,
"trace_replayer_engine_ready_url": "http://localhost:10086/v1/models",
"trace_replayer_ttft_slo": 5.0,
"trace_replayer_tpot_slo": 0.1,
"early_stop_ttft_ms": null,
"early_stop_tpot_ms": null,
"early_stop_window_s": 60.0,
"early_stop_factor": 10.0,
"early_stop_min_requests": 10,
"goodput_slo_percentile": "p95",
"replica_eval_mode": "replica_scaling",
"workload_qps": 3.2163245151324062,
"workload_cache_key": {
"trace_source": "qwen_traceA_blksz_16.jsonl",
"trace_minutes": 5.0,
"time_scale": 8.0,
"min_input_length": null,
"max_input_length": null,
"sample_policy": "head",
"sample_seed": null,
"max_requests": null,
"trace_replayer_dataset": "bailian",
"trace_replayer_api": "openai"
},
"engine_cache_key": {
"model_path": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"trace_replayer_model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/",
"trace_replayer_engine_launch_cmd": "vllm serve /mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B/ --port 10086"
},
"evaluation_group_key": "f2831aa3c793",
"replica_scale_factor": 1,
"goodput_total_gpus": 8
},
"use_cache": false
},
"hardware": {
"gpu_type": "NVIDIA H20",
"num_gpus": 8,
"hbm_gb": 95.5771484375,
"nvlink_topology": "nvlink",
"cpu_cores": 160,
"system_memory_gb": 927.5078773498535,
"world_size": 8
},
"model": {
"model_name": "/mnt/debugger/wjh/models/Qwen/Qwen3-30B-A3B",
"param_count": 20767571968,
"hidden_size": 2048,
"num_layers": 48,
"num_heads": 32,
"is_moe": true,
"num_experts": 128,
"is_mla": false,
"num_kv_heads": 4,
"num_experts_per_tok": 8,
"vocab_size": 151936,
"max_position_embeddings": 65536
},
"workload": {
"workload_name": "qwen_traceA_blksz_16",
"qps": 40.65387538500513,
"avg_prompt_tokens": 2146.006614190445,
"p95_prompt_tokens": 8515.0,
"avg_decode_tokens": 413.138952662075,
"p95_decode_tokens": 997.0,
"request_type": "trace_replay",
"source_file": "qwen_traceA_blksz_16.jsonl",
"duration_minutes": 7.4999,
"time_scale": "8.0",
"p99_total_tokens": 14357.0
},
"environment": {
"hardware_sig": "NVIDIA H20-47121c5fff",
"model_sig": "Qwen3-30B-A3B-f1f030d7e6",
"workload_sig": "qwen_traceA_blksz_16-f006af7ea7"
}
}
```
---
# Misc
- GPU over-provisioning data from Ali's internal heterogeneous resource pool
- Survey academic work on the trade-offs between slicing the resource pool horizontally (load balancing) and vertically (vacating whole machines where possible)
Challenges:
- On heterogeneous hardware, how to quickly find a suitable QPS (naive: binary search)
- Heterogeneous hardware differs in both compute and bandwidth, so it may need different parallelism; how to find it quickly?
- Different traffic / request patterns need different parallelism; how to cope?
- Benchmarks differ from online traffic; how to guarantee the fidelity of the performance measured at benchmark time?
# Test Plan
1. workload timestamps: timestamp, sim_timestamp_v2, exponential_timestamp
- [x] H20 40min 1.0 scale, est. 40min * 3 = 2h+
- [ ] 5090 40min 0.5 scale, est. 4h+
- [x] 5090 120min 0.7 scale
- [x] H20 120min 0.8 scale
- [x] 5090 120min 0.6 scale
- [x] H20 120min 0.7 scale
2. configs: TP: 1, 2, 4, 8, EP: True or False
- [x] H20 30min 1.0 scale, est. 4h+
- [x] H20 10min 2.0 scale
- [ ] update dash H20 60min 1.0 scale
- [x] 5090 10min 0.5 scale
- [x] H20 5min 0.5 scale
- [x] H20 10min 1 scale
Workload modification: so far only the timestamps are changed. Would it matter if, within each timespan, we additionally used the real workload's avg input length and avg output length? (A time-scaling sketch follows below.)
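A sketch of the timestamp-only rewrite mentioned above; the flat jsonl layout with a per-request `timestamp` field is an assumption about the trace format, and the divide-by-`time_scale` convention (larger scale = faster replay) mirrors the replayer's `--scale-factor`:
```python
import json

def scale_trace(in_path, out_path, time_scale):
    # time_scale > 1 replays the trace faster: inter-arrival gaps
    # shrink by 1/time_scale, raising the effective QPS.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            req = json.loads(line)
            req["timestamp"] = req["timestamp"] / time_scale
            fout.write(json.dumps(req) + "\n")
```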
Under heavy load, disabling EP beats enabling EP.
> gpu_monitor pid: 1315891
### Test command backup
```sh
./run_ai.sh --trace-minutes 60 --max-llm-calls 30 --ttft-slo 12000 --trace-path qwen_traceA_blksz_16.jsonl --time-scale 1.0 --objective 'minimize the latency mean and p95 and keep the throughput as high as possible' --objective-score-expr '-ttft_ms_p95 - 10 * tpot_ms_mean + 10 * request_throughput' --llm-model 'qwen3-max'
```
```json
{"latency_ms_mean": 40470.771, "latency_ms_p50": 21644.147, "latency_ms_p90": 95732.058, "latency_ms_p95": 144859.163, "latency_ms_p99": 266144.603, "ttft_ms_mean": 313.929, "ttft_ms_p50": 200.65, "ttft_ms_p90": 482.333, "ttft_ms_p95": 781.285, "ttft_ms_p99": 3014.18, "tpot_ms_mean": 49.839, "tpot_ms_p50": 54.346, "tpot_ms_p90": 68.521, "tpot_ms_p95": 69.732, "tpot_ms_p99": 75.753, "overall_qps": 3.623, "overall_tokens_per_second": 23075.872520269895, "decode_tokens_throughput": 2907.208, "total_requests": 2615.0, "decode_tokens_total": 2098488.0, "tokens_per_second_mean": 26.997015256930624, "tokens_per_second_p50": 18.874719191241816, "tokens_per_second_p95": 73.02218276995958, "tokens_per_second_p99": 120.0, "request_throughput": 3.623, "notes_count": 0}
```
---
Below is a page that can be pasted directly into the paper/slides (**1 comparison table + conclusion bullets**), plus a paragraph written in an academic register for the *Methodology / Discussion / Related Work (LLM-as-agent)* section.
---
- Table 1: comparison summary of the three logs (paste-ready for paper/slides)
> Note: all metrics come from the provided `final_summary.best` / `best_run` / `iter_result.aggregated`. Units: latency/ttft/tpot in ms; QPS in req/s.
- Markdown version (suitable for README/slides)
| Case | Workload / Trace | Agent LLM | Best config (TP/DP/PP, block, mbt, mseq, mem, EP) | Key metrics (mean / p95) | Key system-level observations | Key agent-behavior observations |
| ---- | ---------------- | --------- | ------------------------------------------------- | ------------------------ | ----------------------------- | ------------------------------- |
| A | **TraceA** (60min, time_scale=1.0) `qwen_traceA_blksz_16.jsonl` | **deepseek-r1-0528** | **TP=8, DP=1, PP=1**, block=32, mbt=32768, mseq=128, mem=0.95, EP=on | latency **6722 / 17210**; TTFT p95 **241**; TPOT p95 **23.29**; QPS **5.073** | DP-heavy points show **catastrophic queueing/throughput collapse** (QPS 2~3, latency/TTFT on the 10^6 ms scale); some DP>1 runs **run_failed (total_time=None)** | Exploration is more "global coverage" (TP/DP extremes + wide spans of block/mbt/mseq), but **inconsistent metric names → proposal_eval missing_metric**; the baseline was also once polluted by an "unstable run" |
| B | **Thinking trace** (30min, time_scale=0.5) `anony-qwen-max-thinking-…blsz16.jsonl` | **deepseek-r1-0528** | **TP=4, DP=2, PP=1**, block=64, mbt=65536, mseq=128, mem=0.95, EP=on | latency **105986 / 385999**; TTFT p95 **738**; TPOT p95 **35.01**; QPS **1.016** | Extremely long outputs (≈3.7k decode tokens/req) → **concurrency demand L≈λW≈108**; a too-small mseq causes **minute-scale TTFT/latency blow-ups**; DP+mseq is a structural necessity | Gradually pulls the system out of the "queue blow-up region" into the "stable region", then fine-tunes block/mbt, but with a **proposal vs. executed config mismatch** risk (some run fields are self-contradictory) |
| C | **TraceA** (60min, time_scale=1.0) `qwen_traceA_blksz_16.jsonl` | **qwen3-max** | **TP=8, DP=1, PP=1**, block=16, mbt=32768, mseq=256, mem=0.9, EP=on | latency **6590 / 16925**; TTFT p95 **242**; TPOT p95 **23.08**; QPS **5.073** | Under TraceA, **throughput is essentially bounded by the arrival rate** (QPS≈5.073 everywhere, weak capacity-tuning gradient) → most parameter perturbations land on a **plateau** | The baseline already uses fairly high mseq/mbt → **near-optimal from the start**; exploration is more "local perturbation" (mostly varying block/mbt/mseq around TP=8), but a **config persistence/execution mismatch** was found (tested block=8 never actually took effect) |
---
# Ali
- qwen3-30b-a3b, long requests
Single node, 8× H20 96G; 2 model instances
```json
{
"gpu_memory_utilization": 0.6,
"tensor_parallel_size": 4,
"enable_think": 0,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": true,
"enable_chunked_prefix_caching": true,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 131072,
"max_model_len": 1048576,
"max_seq_len_to_capture": 1048576,
"max_num_seqs": 1,
"compilation_config": {
"cudagraph_capture_sizes": [
1
]
},
"hf_overrides": {
"architectures": [
"Qwen3MoeForCausalLM"
]
},
"kv_cache_dtype": "fp8_e4m3",
"quantization": "fp8",
"rope_scaling": {
"type": "yarn",
"factor": 8,
"original_max_position_embeddings": 131072,
"semi_dynamic": false,
"dynamic": true
}
}
```
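Sanity check on the rope_scaling above: YaRN factor 8 on original_max_position_embeddings 131072 gives 8 × 131072 = 1048576, matching max_model_len and max_seq_len_to_capture.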
- qwen3-30b-a3b, short requests (30k and below)
Single node, 8× H20 96G; 8 model instances
```json
{
"gpu_memory_utilization": 0.8,
"tensor_parallel_size": 1,
"enable_think": 1,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": true,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 4096,
"max_model_len": 262144,
"max_seq_len_to_capture": 262144
}
```
```json
{
"compilation_config": {
"use_inductor": false,
"full_cuda_graph": false
},
"gpu_memory_utilization": 0.9,
"tensor_parallel_size": 1,
"enable_think": 0,
"think_mode": "auto",
"enforce_eager": false,
"enable_prefix_caching": false,
"distributed_executor_backend": "mp",
"block_size": 256,
"enable_chunked_prefill": true,
"max_num_batched_tokens": 4096,
"max_model_len": 278784,
"rope_scaling": {
"type": "yarn",
"factor": 4,
"original_max_position_embeddings": 262144,
"semi_dynamic": false,
"dynamic": true
},
"speculative_config": {
"method": "eagle3",
"model": "/home/admin/resource/model/464482ce.qwen3-coder-flash-eagle/state_4/",
"num_speculative_tokens": 2,
"draft_tensor_parallel_size": 1,
"rope_scaling": {
"type": "yarn",
"factor": 64,
"original_max_position_embeddings": 4096,
"semi_dynamic": false,
"dynamic": true
}
},
"max_seq_len_to_capture": 131072,
"hf_overrides": {
"architectures": [
"Qwen3MoeForCausalLM"
],
"model_type": "qwen3_moe"
},
"quantization": "fp8"
}
```
# Backup
```
cmd=['trace-replayer/target/release/client', '--stream', '--tokenizer', '/mnt/debugger/wjh/models/Qwen3-30B-A3B//tokenizer.json', '--tokenizer-config', '/mnt/debugger/wjh/models/Qwen3-30B-A3B//tokenizer_config.json', '--endpoint', 'http://localhost:10086/v1/chat/completions', '--api', 'openai', '--dataset', 'bailian', '--dataset-path', './qwen_traceA_blksz_16.jsonl', '--scale-factor', '2.0', '--time-in-secs', '600', '--model-name', '/mnt/debugger/wjh/models/Qwen3-30B-A3B/', '--metric-percentile', '50,90,95,99', '--output-path', 'runs/ai_tuner_store/nvidia-h20-47121c5fff__mnt-debugger-wjh-models-qwen3-30b-a3b-10c4a49394__qwen_tracea_blksz_16-d451afd9af/260108-172937/client.jsonl', '--summary-path', 'runs/ai_tuner_store/nvidia-h20-47121c5fff__mnt-debugger-wjh-models-qwen3-30b-a3b-10c4a49394__qwen_tracea_blksz_16-d451afd9af/260108-172937/summary.json', '--num-producer', '32', '--channel-capacity', '40960']
```
Could a small number of tested (HW, model, workload) combinations serve as prior knowledge that guides the AI to quickly find the optimal config for a new (HW, model, workload), greatly accelerating the table-filling process?
Set an optimization objective: instead of letting the AI try things at random, have it analyze in a targeted way (memory/compute/comm/etc. bound) and modify the config starting from that objective.
Is performance tuning of models on a heterogeneous hardware resource pool a non-problem?
- Head models take the overwhelming majority of traffic and are basically served only on the largest fixed GPU type
- As for long-tail models: although there are hundreds of them and they may run on whatever GPUs happen to be available, they account for less than 1.5% of total traffic, so optimizing them yields almost no benefit (the models themselves are small, so meeting the SLO is easy; the cost of tuning may not be worth the online traffic it serves)
---
### LaTeX version (paste-ready for the paper)
> Adjust `\small`/column widths to fit the layout; this is a `table*`-style table suitable for a two-column paper.
```latex
\begin{table*}[t]
\centering
\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{p{1.1cm} p{3.2cm} p{2.0cm} p{5.4cm} p{4.7cm} p{4.7cm}}
\hline
Case & Workload / Trace & Agent LLM &
Best config (TP/DP/PP, block, mbt, mseq, mem, EP) &
Key metrics (mean / p95) &
Notable observations (system \& agent) \\
\hline
A &
TraceA (60min, scale=1.0) \texttt{qwen\_traceA\_blksz\_16} &
DeepSeek-R1 &
TP=8, DP=1, PP=1; block=32; mbt=32768; mseq=128; mem=0.95; EP=on &
lat 6722 / 17210 ms; TTFT p95 241 ms; TPOT p95 23.29 ms; QPS 5.073 &
DP-heavy points show instability (QPS drops, queue blow-up); some DP>1 runs fail (missing total\_time). Agent explores globally but suffers metric-schema mismatch and unstable baselines. \\
\hline
B &
Thinking trace (30min, scale=0.5) \texttt{anony-qwen-max-thinking-...} &
DeepSeek-R1 &
TP=4, DP=2, PP=1; block=64; mbt=65536; mseq=128; mem=0.95; EP=on &
lat 105986 / 385999 ms; TTFT p95 738 ms; TPOT p95 35.01 ms; QPS 1.016 &
Long outputs imply high concurrency demand ($L \approx \lambda W \approx 108$); small mseq triggers minute-level TTFT/latency blow-ups. Agent converges via DP+mseq, but config provenance inconsistency is observed. \\
\hline
C &
TraceA (60min, scale=1.0) \texttt{qwen\_traceA\_blksz\_16} &
Qwen3-Max &
TP=8, DP=1, PP=1; block=16; mbt=32768; mseq=256; mem=0.9; EP=on &
lat 6590 / 16925 ms; TTFT p95 242 ms; TPOT p95 23.08 ms; QPS 5.073 &
Arrival-rate-limited regime yields a flat performance plateau; agent starts near-optimal and performs local perturbations. Execution-time config mismatch observed (intended block change not applied). \\
\hline
\end{tabular}
\caption{Comparison across three tuning logs. TraceA exhibits an arrival-rate-limited plateau where throughput provides little gradient; thinking trace is concurrency-limited and requires DP+mseq to avoid queue blow-up. Agent backbones differ in schema compliance and exploration style; system-level guardrails are necessary for reliable learning.}
\label{tab:agent-compare}
\end{table*}
```
---
- One page of conclusion bullets (for slides / the paper's "Findings")
Workload side (determines *what to search*)
* **TraceA (QPS≈5) is arrival-rate-limited**: most stable points sit at `overall_qps≈5.07`, so throughput provides no optimization gradient; the main optimization signal comes from the **TTFT tail / TPOT / queueing structure**.
* **The thinking trace is concurrency-limited**: long outputs make `L=λW` large (about 100+ in this log), and a too-small `max_num_seqs` triggers **minute-scale TTFT/latency blow-ups**; **DP + mseq** must be core search dimensions (see the worked estimate after this list).
* **On the same model/hardware, the optimal parallelism shape changes with the workload**: TraceA favors **TP=8, DP=1**, Thinking favors **TP=4, DP=2** (first satisfy the concurrency capacity, then protect TPOT).
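A worked version of that `L=λW` estimate, using the Case B numbers from Table 1 (QPS 1.016 req/s, mean latency 105,986 ms ≈ 106 s):
```latex
% Little's law: in-flight requests = arrival rate x mean residence time
\[
L \;=\; \lambda W \;\approx\; 1.016\ \mathrm{req/s} \times 106\ \mathrm{s} \;\approx\; 108
\]
% Hence max_num_seqs must stay comfortably above ~108, otherwise
% requests queue and TTFT/latency blow up to the minute scale.
```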
Agent side (determines *how to search*)
* **DeepSeek-R1 leans toward global-coverage exploration**: it readily covers dangerous regions (instability/failures) and thereby reveals boundaries, but it is more sensitive to the metric schema / stability gate (otherwise the feedback loop is distorted).
* **Qwen3-Max leans toward a strong baseline + local perturbations**: it converges faster in plateau scenarios (TraceA), but if moved to a concurrency-limited scenario while still searching locally, it may linger at the edge of the unstable region for a long time.
* **Schema compliance determines whether auto-tuning works at all**: inconsistent metric field names directly distort `proposal_eval`; this affects system behavior before "how smart the model is" does.
System/toolchain side (determines reproducibility and learnability)
* **A stability gate is needed**: runs with divergent queues (µ<λ) must be excluded from the comparison/learning set, otherwise the baseline/reward gets polluted.
* **Plateau detection and early-stop are needed**: the TraceA workload easily produces plateau regions where many points have nearly identical metrics; continuing to explore there wastes budget.
* **Config provenance (proposed = executed = logged) must be guaranteed**: both logs show "proposed parameters not actually executed / inconsistently persisted", which destroys ablations and interpretable attribution.
---
- Paper-ready paragraph (Methodology / Discussion)
> The paragraph below is written in a "methodology + experimental observations" style and does not depend on external citations; it can go straight into the paper (model/trace names can be replaced with more abstract symbols to avoid excessive engineering detail).
**Academic phrasing (translated from the Chinese draft):**
In this work we use a large language model (LLM) as a tuning agent: through a closed loop of proposing candidate configurations, running real replay benchmarks, and iterating on metric feedback, it automatically searches the inference engine's parallelism strategies and batching parameters. We contrast the behavior of different agent backbones: a DeepSeek-R1-style agent leans toward global-coverage exploration (e.g., spanning TP/DP extremes and wide batch/seq settings), quickly exposing unstable regions and system boundaries, whereas a Qwen3-Max-style agent prefers local perturbations from a strong baseline and rapidly reaches the performance plateau on arrival-rate-limited workloads. We further observe that agent capability is not the only deciding factor: an inconsistent metric schema (e.g., metric names used in proposals differing from the actual output fields) or inconsistently persisted/executed configurations distort the feedback signal and significantly weaken the effectiveness and interpretability of the automated search. We therefore argue that three classes of guardrails must be introduced at the system level: first, a stability gate that removes queue-divergent runs to avoid baseline/reward pollution; second, plateau detection with adaptive early stopping to reduce wasted exploration in arrival-rate-limited scenarios; and third, a configuration provenance mechanism that keeps proposal, execution, and logging consistent, enabling rigorous ablations and reproducible conclusions.
**English phrasing (optional, for an international venue):**
We treat an LLM as an optimization agent that proposes serving configurations, executes trace-replay benchmarks, and iterates based on measured latency/throughput feedback. Across agent backbones, we observe distinct search styles: a DeepSeek-R1-based agent tends to perform global exploration over extreme TP/DP and batching settings, which quickly surfaces instability boundaries but is sensitive to metric-schema and stability handling; a Qwen3-Max-based agent starts from a strong baseline and applies local perturbations, reaching the performance plateau rapidly under arrival-rate-limited workloads. Importantly, agent capability alone does not guarantee effective tuning: metric-schema mismatches (proposal fields vs. logged metrics) and configuration provenance inconsistencies (proposed vs. executed settings) can corrupt the feedback loop and hinder explainability. We therefore introduce three system-level guardrails for reliable LLM-driven tuning: (1) a stability gate to exclude queue-divergent runs from comparison/learning, (2) plateau detection with adaptive early stopping to avoid wasting budget in flat regions, and (3) configuration provenance enforcement to ensure proposal/execution/logging consistency and enable rigorous ablations.
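A minimal sketch of the first guardrail (the stability gate); treating a run as queue-divergent when its completed throughput stays well below the offered arrival rate is one possible µ<λ test, and the field name follows the summaries logged above:
```python
def queue_divergent(summary, offered_qps, tol=0.9):
    """Stability gate: flag runs whose service rate cannot keep up
    with the arrival rate (mu < lambda) so they are excluded from
    the comparison/learning set before polluting baseline/reward."""
    return summary["request_throughput"] < tol * offered_qps

# Usage: stable = [r for r in runs
#                  if not queue_divergent(r["aggregated"], offered_qps)]
```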