Initial commit: obsidian to gitea
57
projects/auto-tuner/Abs.md
Normal file
@@ -0,0 +1,57 @@

Existing LLM inference frameworks are all implemented in Python, which brings several problems:

- Python is slow, and the scheduling overhead is not negligible
	- demonstrated with torch / nsys profiling
- Building on PyTorch makes cold start slow

> Candle's core goal is to _make serverless inference possible_. Full machine learning frameworks like PyTorch are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight binaries.

- The existing Python architectures lack a good abstraction: the data plane and the control plane are intertwined, so dynamically adjusting a configuration basically requires a restart
- The scheduling layer is coupled with the inference-engine layer
- etc...


Problems with the existing inference frameworks implemented in efficient (native) languages:

- They cannot keep up with the newest accelerator kernels. Many of today's best-performing kernels (flash-infer, ...) are pushed by their authors directly into communities such as vLLM/sglang, the documentation for using them from C++/Rust is close to nonexistent, and porting them by hand is a major engineering challenge
- Without a mature community ecosystem, these native-language frameworks are slow to support new models, which creates a vicious cycle
- These native-language frameworks are not sufficiently validated. New models are only tested thoroughly on the mainstream frameworks (vLLM, ...), which guarantees stable, correct token generation; compared with PyTorch/vLLM, the native frameworks have accuracy issues and can collapse into garbage tokens on long generations


Because of these problems, we believe that overcoming the issues introduced by Python inference frameworks by re-implementing a Rust inference framework from scratch is unrealistic (the community already has plenty of such attempts). Instead, we take vLLM as the reference for correct inference and define an IR, so that we can go vLLM -> IR -> Rust framework: every model vLLM supports is supported at essentially zero cost, and keeping the results identical to vLLM guarantees correctness.
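
As a very rough illustration of what the extracted IR could look like (a minimal sketch only; all names below are hypothetical and do not correspond to any existing vLLM or Candle API), each kernel launch observed in a vLLM forward pass could become a node in a small dataflow graph that a native backend later replays:

```python
# Hypothetical sketch of the vLLM -> IR -> native-backend idea; none of these types exist in vLLM.
from dataclasses import dataclass, field

@dataclass
class IRNode:
    op: str                    # e.g. "rms_norm", "flash_attn", "fused_moe"
    inputs: list[str]          # symbolic tensor names
    outputs: list[str]
    attrs: dict = field(default_factory=dict)  # dtype, shapes, parallel rank, kernel backend, ...

@dataclass
class IRGraph:
    nodes: list[IRNode] = field(default_factory=list)

    def add(self, op: str, inputs: list[str], outputs: list[str], **attrs) -> None:
        self.nodes.append(IRNode(op, inputs, outputs, attrs))

# One decoder layer expressed against the hypothetical IR:
g = IRGraph()
g.add("rms_norm", ["hidden"], ["normed"])
g.add("flash_attn", ["normed", "kv_cache"], ["attn_out"], backend="flash-infer")
g.add("fused_moe", ["attn_out"], ["ffn_out"], num_experts=128, expert_parallel=True)
# A Rust runtime would then load the same compiled kernels (e.g. dumped PTX) and replay this graph.
```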
### Feedback

An IR could serve different backend engines. Existing inference frameworks include:

- [vLLM](https://github.com/vllm-project/vllm/tree/main)
- [sglang](https://github.com/sgl-project/sglang)
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [Text Generation Inference](https://github.com/huggingface/text-generation-inference)
- [lightLLM](https://github.com/ModelTC/lightllm)
- [lorax](https://github.com/predibase/lorax)
- [punica](https://github.com/punica-ai/punica)
- [MLC LLM](https://github.com/mlc-ai/mlc-llm)

How our niche differs from MLC LLM / TVM:

- TVM is strongest on static graphs, while LLM inference shapes are highly dynamic
- The core idea of MLC LLM / TVM is "compile once, run everywhere", which is not necessarily the most efficient

vLLM's core strengths:

- Resource management, paged attention
- Dynamic scheduling

From one angle, vLLM can be seen as the front-end scheduler, resource manager, and dispatch decision-maker for LLM inference.

Our core goal is efficient inference for cloud-serving scenarios, a different niche from TVM, so the IR we abstract should be:

- Oriented towards **distributed** deployment tuning of LLMs, because single-node kernels already reach essentially peak performance through cuBLAS and cuDNN

There is a tension between the highly dynamic shapes at inference time and work like Alpa that computes an optimal (static) partition of the computation.

---

The above is the current stage of the work, but my own experience is not enough to pin down the value and contribution of this vllm -> IR -> Rust framework pipeline; its real challenges have to be discovered by actually building it. If that process turns out to offer mostly engineering challenges rather than enough research challenges, I believe the IR we define can still help us do **automated tuning of LLM inference deployment plans**.

Existing work on automatic tuning of LLM deployment plans is confined to a limited search space (the TP/PP/EP sizes) and can be summarized as simple profiling plus linear programming. We hope to search for genuinely better deployment plans rather than tune within a fixed space, i.e. improvements that are mechanistic rather than merely parametric.

So while thinking about how to abstract this IR, I will also keep in mind how it can later serve such an automated deployment-tuning effort; at the moment a computation-graph abstraction looks like a relatively plausible option.
766
projects/auto-tuner/Ali Deployment.md
Normal file
@@ -0,0 +1,766 @@

- AITuner Trace A | 10 min

https://nas.gahow.org/webdav/ai_tuner_logs/dash_prod/ai_tuner_260307-152305.jsonl

```
2026-03-08 06:28:06 [INFO] run_trace_replayer summary: success=True best_goodput_qps_per_gpu=0.798171666969416 selected_time_scale=0.4678486930206418
2026-03-08 06:28:18 [INFO] Agent loop finished. Total runs=5, LLM calls=6
2026-03-08 06:28:18 [INFO] Original vs best production config diff:
2026-03-08 06:28:18 [INFO] engine_args.block_size: 64 -> 32
2026-03-08 06:28:18 [INFO] engine_args.enable_expert_parallel: False -> True
2026-03-08 06:28:18 [INFO] engine_args.max_num_batched_tokens: 8192 -> 32768
2026-03-08 06:28:18 [INFO] engine_args.tensor_parallel_size: 4 -> 2
2026-03-08 06:28:18 [INFO] envs.VLLM_MOE_USE_DEEPEP: '0' -> '1'
2026-03-08 06:28:18 [INFO] Baseline vs best production config diff:
2026-03-08 06:28:18 [INFO] engine_args.block_size: 64 -> 32
2026-03-08 06:28:18 [INFO] engine_args.enable_expert_parallel: False -> True
2026-03-08 06:28:18 [INFO] engine_args.max_num_batched_tokens: 8192 -> 32768
2026-03-08 06:28:18 [INFO] engine_args.tensor_parallel_size: 4 -> 2
2026-03-08 06:28:18 [INFO] envs.VLLM_MOE_USE_DEEPEP: '0' -> '1'
2026-03-08 06:28:18 [INFO] Persisted best production config to /usr/local/lib/python3.12/dist-packages/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20.config
2026-03-08 06:29:02 [INFO] AI tuner JSONL log: runs/ai_tuner_logs/ai_tuner_260307-152305.jsonl
```


```
(codex-tuner) wjh@ds-dda11ac6-1-847f4dd4c5-vg6mr:~/auto-tuner/tuner$ cat micro_trace_replayer/comm_tp4_noep_traceA_10min.jsonl.summary.json
{
  "duration_ms": "600078.484",
  "e2e_mean_ms": "429.185",
  "e2e_p90_ms": "1163.643",
  "e2e_p95_ms": "1557.812",
  "e2e_p99_ms": "2594.410",
  "output_tokens_total": "2048",
  "requests_success": "2048",
  "requests_total": "2048",
  "throughput_rps": "3.413",
  "throughput_tps": "3.413",
  "tpot_mean_ms": "0.000",
  "ttft_mean_ms": "0.000"
}
(codex-tuner) wjh@ds-dda11ac6-1-847f4dd4c5-vg6mr:~/auto-tuner/tuner$ cat micro_trace_replayer/ali_tp4_noep_traceA_10min.jsonl.summary.json
{
  "duration_ms": "600082.137",
  "e2e_mean_ms": "523.779",
  "e2e_p90_ms": "1478.267",
  "e2e_p95_ms": "2008.634",
  "e2e_p99_ms": "3050.164",
  "output_tokens_total": "2048",
  "requests_success": "2048",
  "requests_total": "2048",
  "throughput_rps": "3.413",
  "throughput_tps": "3.413",
  "tpot_mean_ms": "0.000",
  "ttft_mean_ms": "0.000"
}
```
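
Both summaries replay the same 10-minute trace at the same request rate, so the interesting comparison is latency. A small sketch of that comparison (it only assumes the two summary files shown above are present locally):

```python
# Compare the two trace-replay summaries above (same QPS, so latency is the signal).
import json

def load(path: str) -> dict[str, float]:
    with open(path) as f:
        return {k: float(v) for k, v in json.load(f).items()}

comm = load("micro_trace_replayer/comm_tp4_noep_traceA_10min.jsonl.summary.json")
ali = load("micro_trace_replayer/ali_tp4_noep_traceA_10min.jsonl.summary.json")

for key in ("e2e_mean_ms", "e2e_p90_ms", "e2e_p99_ms"):
    delta = (ali[key] - comm[key]) / ali[key]
    print(f"{key}: {comm[key]:.0f} vs {ali[key]:.0f} ms ({delta:.0%} lower)")
# From the numbers above: mean 429 vs 524 ms (~18% lower), p99 2594 vs 3050 ms (~15% lower).
```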

## Outstanding issues

The request path tested in Alibaba's production environment is `API -> Gateway -> Chat Serving (CPU machine for encoding) -> Model Serving (vLLM)`.
This raises a problem: AITuner can no longer directly read the performance-related metrics of the vLLM engine (kvcache hit ratio / kvcache usage / queuing ratio / avg running / avg queuing).

`"VLLM_ATTENTION_BACKEND": "FLASH_ATTN_VLLM_V1" -> "FLASH_ATTN"` [link](https://code.alibaba-inc.com/PAI-LLM/vllm/commit/2f5a20b76049a33d79bbe2a4ab8abd8ae84581c7?spm=21540d8c.2ca2ac6d.0.0.62cf790b5rvf9T)

Internally there are many more intermediate state values with no clear documentation; the tuning system cannot perceive them, and each knob lacks a clearly defined value range and meaning.
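
When the engine is reachable directly, one way to recover some of these signals is to poll vLLM's Prometheus endpoint. A minimal sketch, assuming the server is local; the exact metric names vary across vLLM versions and the names below are illustrative:

```python
# Minimal sketch: poll the vLLM server's /metrics endpoint for scheduler/cache signals.
# Metric names differ between vLLM versions; the ones below are illustrative, not guaranteed.
import requests

VLLM_METRICS_URL = "http://localhost:8000/metrics"   # placeholder endpoint
WANTED = ("vllm:num_requests_running", "vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc")

def scrape_vllm_metrics() -> dict[str, float]:
    text = requests.get(VLLM_METRICS_URL, timeout=5).text
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        for name in WANTED:
            if line.startswith(name):
                # Lines look like: vllm:num_requests_running{model_name="..."} 3.0
                metrics[name] = float(line.rsplit(" ", 1)[-1])
    return metrics

if __name__ == "__main__":
    print(scrape_vllm_metrics())
```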

## Current-state summary

`git log --oneline -- util/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20_prefill.config`

The prefill config of a single model (Qwen3-235B-A22B) received 10 commits between 25.11.18 and 26.01.08.


## Additional data

All commits that touched the configs:

`git log --oneline -- util/vllmgen/configs`

From 25.07.31 (the first vllmgen commit) to 26.02.03, roughly six months, there were 92 commits in total (about one commit every two days).

## configs commit 总结
|
||||
|
||||
ecc7b539136cd34d29bf4cdb215ae298c019be76
|
||||
initial commit: common.config
|
||||
|
||||
36faa64b969b633187645b29fa997b177d120f37
|
||||
添加 util/vllmgen/configs/qwen3-235b-a22b-h20-4.config
|
||||
|
||||
929e6bc9c9ac17fdd087ed74c9dadc41c2812d43
|
||||
添加 util/vllmgen/configs/qwen3-235b-a22b/128k-0426-hf-fp4/h20.config、util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config
|
||||
|
||||
cb65b0c128c07725817278f2ddb020ddd2df0722
|
||||
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config,加长 max_seq_len
|
||||
|
||||
22d382a13baf68e278dd1c616b379d59f34dcdbe
|
||||
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config,更新 use vllm v1
|
||||
|
||||
==09b005191f10e2b93c3159c6d0d0d6735287269b==
|
||||
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config 添加:
|
||||
```
|
||||
"compilation_config": {
|
||||
"use_inductor": false,
|
||||
"custom_ops": [
|
||||
"all"
|
||||
]
|
||||
},
|
||||
```
|
||||
|
||||
9007270573a7a6c3a8d88bcc45261751fd2379f3
|
||||
添加 util/vllmgen/configs/draft-models/qwen3-235b-a22b/0723-thinking-eagle-0821/h20.config
|
||||
|
||||
7f392d7e608974fc56f9fd6913d6be3f52ff8ec9
|
||||
更新 util/vllmgen/configs/qwen3-235b-a22b/256k-0723-thinking-fp4/h20.config 和 util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config 的 `max_num_batched_tokens`
|
||||
添加 util/vllmgen/configs/qwen3-vl-30b-a3b/128k-0717-fp4/h20.config
|
||||
|
||||
4e90f0582cd7f86312e5ae01c9cd7d59132723fd
|
||||
util/vllmgen/configs/qwen3-vl-30b-a3b/128k-0717-fp4/h20.config 中 `enable_prefix_caching` 开启
|
||||
|
||||
4535fd0e39f8a181c73a7669eaa379feae26fe57
|
||||
添加 util/vllmgen/configs/qwen3-vl-235b-a22b/256k-thinking-0920-fp4/h20.config
|
||||
|
||||
60296b817616aaeec919fc9dc68b11dc9f9e9124
|
||||
删除 VLLM_ENABLE_TORCH_COMPILE
|
||||
|
||||
39a9bb43feb6e6b22f7ed8cefe6442d973bf6057
|
||||
typo
|
||||
|
||||
781fee1a1a9ba86228684c1f2a1e0f9726ab3e2b
|
||||
支持 qwen-coder/max/next
|
||||
|
||||
b1c5efe09ef9df162d442fc7b71932d41f3a426b
|
||||
change meta
|
||||
|
||||
a3389997e420de0b9c45342be6f806ff9179e193
|
||||
use `PIECEWISE` compilation_config for cudagraph_mode
|
||||
|
||||
dc0761d070e9068f64e3c33926fe3bc0c7d62f7b
|
||||
remove unused TP size
|
||||
|
||||
dc83943aa9f3b2d4c90dde3c6f2fe49458877ba1
|
||||
bug fix set VLLM_USE_UPSTREAM_FA3=0
|
||||
|
||||
216194960c290d64914ffad88338587069889e6c
|
||||
删除 `PIECEWISE`
|
||||
|
||||
b75c0cf11dedcf4ee0396ec939e1e0e80267593f
|
||||
typo
|
||||
|
||||
a09d8da72fc9ca837862b61850cdb324223cfd76
|
||||
typo
|
||||
|
||||
34b32eb885ecb07296d7a2bbe58d5e0ee028d30c
|
||||
添加:
|
||||
```
|
||||
"VLLM_USE_FLASHINFER_SAMPLER": "0"
|
||||
"VLLM_ENABLE_TORCH_COMPILE": "1"
|
||||
```
|
||||
|
||||
adba559382c4ab60738a0b2783e27acc801d2ab7
|
||||
设置 `"VLLM_FLASH_ATTN_USE_UPSTREAM": "0"`
|
||||
|
||||
517bc31caaf26c6f1cd119a5283f2e5b2cb141c7
|
||||
使用
|
||||
```
|
||||
"compilation_config": {
|
||||
"cudagraph_mode": "PIECEWISE",
|
||||
"use_inductor": false,
|
||||
"custom_ops": [
|
||||
"all"
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
4259a99557467b48cbae606c0092bf0b75ab3f84
|
||||
meta
|
||||
|
||||
e1d1b64420426406afcfe3a36d10172b6473ead8
|
||||
修改 gpu memory util 避免 kvs swap 的 OOM
|
||||
|
||||
13294741780ff79bbab8a54e98286e07a1ba4b0e
|
||||
meta
|
||||
|
||||
5b2f35fe49033711c9cfcbd2afe7036905f462a6
|
||||
`"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`
|
||||
|
||||
ea95b933221ba192d9e0a2a06e6c84bee4a32544
|
||||
`"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`
|
||||
|
||||
4994e414d8ee812feebe23eed473abcf5166f246
|
||||
```
|
||||
"VLLM_FLASH_ATTN_FP8_ATTENTION": "1",
|
||||
"VLLM_ENABLE_TORCH_COMPILE": "1"
|
||||
```
|
||||
|
||||
985ffa6156c37f8a299870722f51bdfec9fba2f4
|
||||
添加 5090 support util/vllmgen/configs/qwen3-30b-code-switch/v2/5090.config
|
||||
|
||||
ae600ea07735cbe03b1b862054133d805f85ed5e
|
||||
支持 PD 分离:util/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20_prefill.config
|
||||
|
||||
98038e62d8f3f7b94079cc287bdf10e4f4c64267
|
||||
支持一系列 qwen-vl
|
||||
|
||||
2f5a20b76049a33d79bbe2a4ab8abd8ae84581c7
|
||||
typo fix
|
||||
|
||||
cc3484b98cfeccc8c9d6426c4f1238545832cc14
|
||||
删除 `"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`
|
||||
|
||||
d71559b5f3fd6084bc738c4dc22af0566cf133eb
|
||||
meta
|
||||
|
||||
3ec865d4b2db0cd98b537124d5ca791e458cba93
|
||||
支持一系列闭源 qwen-vl 模型
|
||||
|
||||
31a07b6eceb0bda01dcd0db4da0e4d390251facb
|
||||
no bladnn gdn by default
|
||||
|
||||
896de6b72d079bf4d9c568667189c3d711e9ebde
|
||||
meta
|
||||
|
||||
213d89408b1c45d6ae5ed6593268dc41d07eb12e
|
||||
use `VLLM_FLASH_ATTN_FP8_ATTENTION`
|
||||
|
||||
de918bd2d52e30cba3cff640af3f1e73fdb841da
|
||||
add qwen3-vl-235b-a22b pd
|
||||
|
||||
7835de1c4fffb8f2c5ad1a8af3501d1dfcc60728
|
||||
support qwen-vl-max dashgen on L20X and A800
|
||||
|
||||
b3d465adb53e993bc4bfc069ad869232fb36823d
|
||||
make draft model support PD decoupling
|
||||
|
||||
dcc0ab7fc9ed21c0763086f1ae1428633f53ba8e
|
||||
fix rope scaling 的位置
|
||||
|
||||
==f66a954d5e68c829f7d841ed9aa020ec45d11436==
|
||||
修改 `gpu_memory_utilization`, `max_num_batched_tokens`, `cuda_graph_sizes` (P 和 D 不同), `cudagraph_mode`
|
||||
|
||||
75ab97b48ffab568b69f9dfd34ffbee4d10c771b
|
||||
meta
|
||||
|
||||
f67b2e2ede2b88bf27827a8edd96e38188bccb69
|
||||
移除 "async_scheduling": false 和 "disable_custom_all_reduce": true
|
||||
|
||||
9c497637ad23eb9a9abaf53327e4ff263a4630ea
|
||||
Encode-Prefill-Decode Disaggregation on v6d
|
||||
|
||||
b739c437bdac2ba0751853ed4335019f7b7e8486
|
||||
移除 "async_scheduling": false
|
||||
|
||||
85a66eb3a241f900ea785ae0a9b2ac5981f9e9b6
|
||||
支持 util/vllmgen/configs/qwen3-30b-a3b/1m-thinking-0728-fp4-cs-gate/h20.config
|
||||
|
||||
44c01d5177f36e4b0c2802b6aac8439136f79685
|
||||
支持 util/vllmgen/configs/qwen3-omni/qwen3_omni_final_ckpt_multilingual_0915/h20.config
|
||||
|
||||
936f71e642edd3a87286deabff38ae73c7bd3aea
|
||||
修改 mm_processor_cache_gb 50 -> 10
|
||||
|
||||
==0c8dab3e33fc93e98fa15a37b4aa784aa1316abc==
|
||||
降低 decode max_num_batched_tokens 8192 -> 1024, 提高 decode max_num_seqs 128 -> 192, 提高 prefill max_num_batched_tokens 4096 -> 8192,修改 decode 的 cuda_graph_sizes,增加 128 和 192
|
||||
|
||||
ca65f85a37efa5ef95912d1618b0f5405df53c7c
|
||||
support moe eplb
|
||||
|
||||
741d586dd1fc0bc98c16fd16146d1f77e72ba4e8
|
||||
支持 qwen3-max-fp pd
|
||||
|
||||
49486c6187263204427074780ac22b4aef475a55
|
||||
meta
|
||||
|
||||
d62a67eeb1eba461fa434a0297717c838d1e7d7d
|
||||
支持 kvs,gpu_memory_utilization 0.89 -> 0.88
|
||||
|
||||
e1fdbf4629f8b70463ae45cc2b0e408712e60af4
|
||||
add qwen3-vl-plus pd
|
||||
|
||||
6b18610c39857c90d5e076b8b693ab847b92a107
|
||||
PD use eth1
|
||||
|
||||
0cb6b11757a5eb46f6a021cee8b7f08b84256410
|
||||
支持 qwen3-plusp
|
||||
|
||||
8915d3fe5c587c2268dfa49dfbf022d2bfd6c7b0
|
||||
更新 qwen-vl-max
|
||||
|
||||
383ce3e60eb9904faa61006452b2be4ff250ec24
|
||||
chore:mm_processor_cache_gb 含义 false -> 0
|
||||
|
||||
8b30846690623a80b51e975718a84a0d571a97ed
|
||||
添加支持:util/vllmgen/configs/qwen-vl-max/epd_disagg_llm/h20.config
|
||||
|
||||
de80aaf239033cf0d59e32e0bf3beee16651fed7
|
||||
移除 qwen-vl-max 在 a800 上使用 fp8(不被 a800 支持)
|
||||
|
||||
9d63c50b3328a1d748929fb4f76cff49ac883b7d
|
||||
chore:mm_processor_cache_gb 含义 false -> 0
|
||||
|
||||
==2fe8ab568315c227cfbade4e7759b1767cfa003c==
|
||||
util/vllmgen/configs/qwen3-coder-plus/1m-0922-re-fp8/h20.config 设置 TP4,关闭 kv transfer
|
||||
|
||||
5a12337e3f38b0d11b21604a1f3f5e739d6b126b
|
||||
chore: 移除 num_experts: 0
|
||||
|
||||
5f591ffe74f15fe9e5b4cd5864471b3ffa76874a
|
||||
将模型解码配置中的松散 top-k 验证关闭,并调整了异步调度为禁用状态,同时略微增加了 GPU 内存利用率
|
||||
|
||||
2e87a5d3e86e597efea8478cc74dd92a9d6dba0b
|
||||
支持 coder plus 0923 pd + kvs
|
||||
|
||||
bdbdf44d4c67462ca5d120b25684a5b8804a7a18
|
||||
因为今天的 commit 更新了 fp4 perchannel 的加载逻辑,所以需要对模型配置进行更新
|
||||
|
||||
85c50bf76194bf52dc2e3e1ce7c8b2e0e17cbfc1
|
||||
添加 loose topk 相关
|
||||
```
|
||||
"use_loose_topk_verify": 1,
|
||||
"loose_topk_verify_threshold": 0.95,
|
||||
"loose_topk_verify_max_tokens": 2000,
|
||||
```
|
||||
|
||||
d350466cf7d71b4dbcd03bb8ff1ec60b648940a6
|
||||
chore: VLLM_KVT_MAX_DELAY_MS
|
||||
|
||||
936a3b7286425fdd7dc996b6b4f23f19cabdb83b
|
||||
支持 util/vllmgen/configs/draft-models/ci-test/model-version-v1/h20_batch.config
|
||||
|
||||
b3d96f257d229ab7b8d6d6a1b5f92f7d740acd7b
|
||||
util/vllmgen/configs/qwen3-30b-a3b-with-gate-next-fp4/instruct-fp4/h20.config 支持 kvs,`max_num_batched_tokens`, 1024 -> 4096
|
||||
|
||||
1ca19f8042d081c437bf036ca469c946c6c30868
|
||||
添加
|
||||
```
|
||||
"pass_config": {
|
||||
"fuse_norm_quant": false,
|
||||
"fuse_act_quant": false,
|
||||
"fuse_attn_quant": false
|
||||
}
|
||||
```
|
||||
|
||||
80e602a32f0cd59f84e834049161dff5a1aaa6a8
|
||||
支持 util/vllmgen/configs/qwen3-30b-code-switch/v2/5090.config,移除 flashinfer
|
||||
|
||||
7018965a139a5763809e888eb0c35ef9743ad3d8
|
||||
添加 util/vllmgen/configs/qwen3-235b-a22b/256k-0723-think-cs/h20_prefill.config,更新所有 model 的 PD connection 相关环境变量
|
||||
|
||||
6a1b40876d4bc884a805effa0cc7c7a28c38ea01
|
||||
plusp 等模型支持 kvs
|
||||
|
||||
613d885a17360b9f90c10bedf8d900815d8b5668
|
||||
util/vllmgen/configs/qwen3-plusp/256k-1106/h20_prefill.config,移除 "enforce_eager": false 与 cuda_graph_sizes
|
||||
|
||||
29f88ea3e6addf5a155c2bf000e7819171be0b39
|
||||
fix
|
||||
|
||||
1ad174a24cad5e51e2ccf71a828751d4e65c51a6
|
||||
util/vllmgen/configs/qwen3-omni/qwen3_omni_final_ckpt_multilingual_1124/h20.config, enable_chunked_prefill 开启,添加 max_model_len: 65535
|
||||
|
||||
c8f776e277a6545b3e1696a4e9261282bca681f9
|
||||
Disable the preprocess cache for vl model by default
|
||||
|
||||
9d011bf9ec443f7d3ccecd7e44b6d9a526a0089a
|
||||
增加了任务绕过配置项 `VLLM_ENABLE_BYPASS_TASK` 并调整了异步调度、GPU内存利用率及前缀缓存的相关设置,关闭 enable_prefix_caching
|
||||
|
||||
dab633c1871054cfe823669e970b04d89ba3b7b5
|
||||
删除:
|
||||
```
|
||||
"DS_LLM_ENABLE_DISAGGREGATED_VIT": "1",
|
||||
"DS_LLM_LAUNCH_VINEYARD": "0",
|
||||
"SRPC_STREAM_DISABLE_BAREX": "1"
|
||||
```
|
||||
|
||||
67ab53a5f673f4fc629c2a4354e2b3ddf556dc08
|
||||
Support Qwen3.5 VL MTP
|
||||
|
||||
27d64ee33ecac2b94efaa44f5bbefa5998574d05
|
||||
Add KVS config for Qwen3-VL-Plus
|
||||
|
||||
8fb2d746069f17e5c6e951b0fd2f5da4c608f3e1
|
||||
util/vllmgen/configs/qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config, max_num_seqs 256 -> 512
|
||||
|
||||
ebf2b71630984c383f3824eb7e4e45a617416274
|
||||
"max_num_batched_tokens": 8192 for util/vllmgen/configs/qwen3-vl-plus, prefill
|
||||
|
||||
c18e84e20cc15bbc2d14ddb17b26aaf7dc78e94f
|
||||
添加 `speculative_config`:
|
||||
```
|
||||
"num_speculative_tokens": 32,
|
||||
"hf_overrides": {
|
||||
"max_position_embeddings": 262144
|
||||
}
|
||||
```
|
||||
|
||||
3bef0c622d1ec18e652e82cbc24ebf481e9025d5
|
||||
支持 qwen3-max on l20c
|
||||
|
||||
ce495cc242affcfb321cf6dcabf190c3f1589667
|
||||
qwen3-vl-flash: enable async_schedule
|
||||
|
||||
8152532eeb0a191e69b868cc888392a0c74ab0fe
|
||||
qwen3-vl-flash 添加
|
||||
```
|
||||
"pass_config": {
|
||||
"fuse_norm_quant": false,
|
||||
"fuse_act_quant": false,
|
||||
"fuse_attn_quant": false
|
||||
},
|
||||
```
|
||||
|
||||
|
||||
|
||||
The conclusions below are a categorization based on **the commit summaries above**, not a line-by-line forensic review of every diff; but from a systems perspective they are enough to support a clear judgment: **most of the config churn in this repository is not real performance tuning; it is model onboarding, feature enablement, compatibility fixes, and stability patching.**

# Commit Analysis

## 1. Headline conclusion

The list contains **91 commits** in total.

I suggest looking at the "performance-related share" under two criteria:

| Criterion | Definition | Count | Share |
| --------- | ---------- | ----: | ----: |
| Strict | **Directly changes the value of a performance flag in an existing config**, e.g. `max_num_batched_tokens`, `max_num_seqs`, `gpu_memory_utilization`, `async_scheduling`, `cuda_graph_sizes`, `torch compile`, `prefix caching`, etc. | **28** | **30.8%** |
| Broad | Strict criterion plus commits that **introduce a new performance-relevant mechanism/backend/parallel form**, e.g. speculative decoding, FP8 attention, PD/KVS-related config onboarding, etc. | **~40** | **44.0%** |

In other words:

* **If you care about "commits that genuinely do tuning"**, the share is only about **30%**;
* **If you also count performance-related feature enablement**, it is still only a bit over **40%**;
* So the observation that "**the large majority of commits are unrelated to performance optimization itself**" holds under the strict criterion.

This is actually a valuable finding: it says that **the churn in the production config repository is mostly control-plane churn, not data-plane tuning**.
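
A rough way to reproduce counts of this kind from the one-line commit summaries (a sketch only; the keyword lists are illustrative approximations of the manual classification above, not the actual method):

```python
# Illustrative sketch: tag commit one-liners as strict / broad / other by keyword.
STRICT_KNOBS = [
    "max_num_batched_tokens", "max_num_seqs", "gpu_memory_utilization",
    "async_scheduling", "cuda_graph_sizes", "cudagraph_mode",
    "VLLM_ENABLE_TORCH_COMPILE", "enable_prefix_caching", "mm_processor_cache_gb",
]
BROAD_EXTRA = ["speculative", "fp8", "eplb", "kvs", "deepep"]

def classify(summary: str) -> str:
    s = summary.lower()
    if any(k.lower() in s for k in STRICT_KNOBS):
        return "strict"          # directly touches an existing performance flag
    if any(k in s for k in BROAD_EXTRA):
        return "broad"           # enables a performance-relevant mechanism
    return "other"               # onboarding / compat / chore / meta

commits = ["decode max_num_batched_tokens 8192 -> 1024", "support qwen-vl-max dashgen on L20X", "typo"]
print([classify(c) for c in commits])   # ['strict', 'other', 'other']
```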
---
|
||||
|
||||
## 2. 从第一性原理看,为什么会这样
|
||||
|
||||
一个线上 vLLM per-model config 仓库,commit 变多,未必意味着“性能状态空间巨大”。更常见的是四类变化:
|
||||
|
||||
1. **接入新模型 / 新硬件 / 新模式**
|
||||
例如新增 qwen3-vl、omni、5090、A800、L20C、PD、KVS、MTP。
|
||||
这类变化本质上是“能不能跑 / 怎么接入”,不是“已经能跑后如何把 goodput 调到最优”。
|
||||
|
||||
2. **兼容性与正确性修复**
|
||||
例如 typo、rope scaling 位置修复、fp4 per-channel 加载逻辑变化、某后端 upstream 不可用、某硬件不支持 fp8。
|
||||
这类变化的目标是“避免错 / 避免 crash / 避免不支持”。
|
||||
|
||||
3. **稳定性与资源安全边界调整**
|
||||
例如 `gpu_memory_utilization` 从 0.89 到 0.88,避免 KVS swap OOM。
|
||||
这类变化和性能有关系,但很多时候**优先级是 survivability,不是 peak throughput**。
|
||||
|
||||
4. **真正的 steady-state 性能调优**
|
||||
例如:
|
||||
|
||||
* `max_num_batched_tokens`
|
||||
* `max_num_seqs`
|
||||
* `async_scheduling`
|
||||
* `cuda_graph_sizes`
|
||||
* `cudagraph_mode`
|
||||
* `torch compile`
|
||||
* `prefix caching`
|
||||
* `chunked_prefill`
|
||||
* `TP`
|
||||
|
||||
这类 commit 才是你论文里最该抓住的部分。
|
||||
|
||||
---
|
||||
|
||||
## 3. 哪些 commit 真正改了性能相关 flag
|
||||
|
||||
我把它们按系统机制分组。
|
||||
|
||||
### A. batching / scheduling / graph capture 相关
|
||||
|
||||
这组是最“像 tuning”的。
|
||||
|
||||
* `7f392d7e...`
|
||||
修改 `max_num_batched_tokens`
|
||||
|
||||
* `f66a954d...`
|
||||
修改 `gpu_memory_utilization`、`max_num_batched_tokens`、`cuda_graph_sizes`、`cudagraph_mode`,并区分 P / D
|
||||
|
||||
* `0c8dab3e...`
|
||||
decode: `max_num_batched_tokens 8192 -> 1024`
|
||||
decode: `max_num_seqs 128 -> 192`
|
||||
prefill: `max_num_batched_tokens 4096 -> 8192`
|
||||
同时修改 decode `cuda_graph_sizes`
|
||||
|
||||
* `f67b2e2e...`
|
||||
移除 `"async_scheduling": false`、`"disable_custom_all_reduce": true`
|
||||
|
||||
* `b739c437...`
|
||||
移除 `"async_scheduling": false`
|
||||
|
||||
* `5f591ffe...`
|
||||
关闭 loose top-k verify,关闭 async scheduling,略微增加 `gpu_memory_utilization`
|
||||
|
||||
* `8fb2d746...`
|
||||
decode `max_num_seqs 256 -> 512`
|
||||
|
||||
* `ebf2b716...`
|
||||
prefill `max_num_batched_tokens = 8192`
|
||||
|
||||
* `ce495cc2...`
|
||||
qwen3-vl-flash 开启 `async_schedule`
|
||||
|
||||
* `1ad174a2...`
|
||||
开启 `enable_chunked_prefill`,增加 `max_model_len`
|
||||
|
||||
* `613d885a...`
|
||||
移除 `enforce_eager: false` 与 `cuda_graph_sizes`
|
||||
|
||||
#### 这类变化背后的理由
|
||||
|
||||
这是最典型的系统 tradeoff:
|
||||
|
||||
* `max_num_batched_tokens` 决定**单 step 的 token 工作量上限**。
|
||||
增大它,通常有助于提高 GPU occupancy 和算子摊薄开销;但也会带来更高显存压力、更长单步尾延迟、更难 graph capture。
|
||||
|
||||
* `max_num_seqs` 决定**并发宽度**。
|
||||
decode 场景中,增大它往往是为了让更多序列并发推进,提升 tokens/s;但太大时 scheduler overhead、KV 压力、尾延迟都会上来。
|
||||
|
||||
* `prefill` 和 `decode` 的最优点本来就不一样。
|
||||
`0c8dab3e...` 这种“prefill 提大 token budget,decode 降低 token budget 但提高 seq 数”的改法,系统上非常合理:
|
||||
|
||||
* prefill 更偏大矩阵、吃吞吐、适合更大的 batch-tokens
|
||||
* decode 更偏 memory/bandwidth、每步 token 少、适合更高并发序列数而不是更大 token lump
|
||||
|
||||
* `cuda_graph_sizes` / `cudagraph_mode` 本质是在赌**shape 是否足够稳定**。
|
||||
如果 workload 形状集中,graph capture 能减少 launch overhead;
|
||||
如果 shape 太散,graph 容易失效甚至成为负担,所以会看到 `PIECEWISE` 被加上又删掉。
|
||||
|
||||
* `async_scheduling` 不是“永远更快”。
|
||||
它在高并发下可能更好,但也可能引入调度抖动、与某些校验逻辑冲突、或在某些模型/路径上带来额外复杂性,所以会出现有的模型开启、有的关闭。
|
||||
|
||||
---
|
||||
|
||||
### B. memory / cache / 长上下文相关
|
||||
|
||||
* `cb65b0c1...`
|
||||
加长 `max_seq_len`
|
||||
|
||||
* `4e90f058...`
|
||||
开启 `enable_prefix_caching`
|
||||
|
||||
* `e1d1b644...`
|
||||
调整 `gpu_memory_utilization`,避免 KVS swap OOM
|
||||
|
||||
* `d62a67ee...`
|
||||
支持 KVS 时,`gpu_memory_utilization 0.89 -> 0.88`
|
||||
|
||||
* `936f71e6...`
|
||||
`mm_processor_cache_gb 50 -> 10`
|
||||
|
||||
* `383ce3e6...`, `9d63c50b...`
|
||||
`mm_processor_cache_gb false -> 0`
|
||||
|
||||
* `c8f776e2...`
|
||||
默认关闭 VL preprocess cache
|
||||
|
||||
* `9d011bf9...`
|
||||
调整 `gpu_memory_utilization`、关闭 `enable_prefix_caching`
|
||||
|
||||
* `5b2f35fe...`, `ea95b933...`, `cc3484b9...`
|
||||
`VLLM_VL_VISION_NUM_GRID_PER_SIDE = 27`,随后删除
|
||||
|
||||
#### 背后的理由
|
||||
|
||||
这组变化本质上都在处理一个问题:
|
||||
|
||||
> **显存不是只给模型权重和 KV cache 用的。**
|
||||
> 在长上下文、VL、KVS、PD 这些场景里,预处理缓存、视觉网格、prefix cache、KV 传输缓冲都会争显存。
|
||||
|
||||
所以你会看到两个典型方向:
|
||||
|
||||
1. **给显存留安全边界**
|
||||
`gpu_memory_utilization` 下调,不是因为更快,而是因为再高就会触发 swap / OOM / 碎片化恶化。
|
||||
这类 commit 是“性能相关”,但更准确说是**容量边界调优**。
|
||||
|
||||
2. **缓存未必总是划算**
|
||||
`prefix caching`、`preprocess cache`、`mm_processor_cache_gb` 只有在命中率高时才值得。
|
||||
在多模态、长上下文、任务混杂的真实线上场景里,缓存可能带来:
|
||||
|
||||
* 显存占用
|
||||
* 管理开销
|
||||
* 缓存污染
|
||||
* 命中不高
|
||||
|
||||
所以会出现“先开后关”或“默认关闭”的变化,这说明**它不是 universally beneficial 的 knob**。
|
||||
|
||||
---
|
||||
|
||||
### C. kernel / backend / compile path 相关
|
||||
|
||||
* `09b00519...`
|
||||
`compilation_config.use_inductor = false`,`custom_ops = ["all"]`
|
||||
|
||||
* `60296b81...`
|
||||
删除 `VLLM_ENABLE_TORCH_COMPILE`
|
||||
|
||||
* `a3389997...`
|
||||
使用 `PIECEWISE` 的 `cudagraph_mode`
|
||||
|
||||
* `21619496...`
|
||||
删除 `PIECEWISE`
|
||||
|
||||
* `34b32eb8...`
|
||||
`VLLM_USE_FLASHINFER_SAMPLER = 0`
|
||||
`VLLM_ENABLE_TORCH_COMPILE = 1`
|
||||
|
||||
* `adba5593...`
|
||||
设置 `VLLM_FLASH_ATTN_USE_UPSTREAM = 0`
|
||||
|
||||
* `dc83943a...`
|
||||
设置 `VLLM_USE_UPSTREAM_FA3 = 0`
|
||||
|
||||
* `4994e414...`, `213d8940...`
|
||||
启用 `VLLM_FLASH_ATTN_FP8_ATTENTION`
|
||||
|
||||
* `1ca19f80...`, `8152532e...`
|
||||
`pass_config.fuse_* = false`
|
||||
|
||||
* `31a07b6e...`
|
||||
no bladnn gdn by default
|
||||
|
||||
#### 背后的理由
|
||||
|
||||
这一组非常像现实生产环境中的“后端路径钉死”:
|
||||
|
||||
* 理论上更激进的 backend / fusion / compile path **不一定更快,更不一定更稳**;
|
||||
* 一旦涉及 FP4/FP8、VL、quant、特殊注意力后端、特定硬件,实际最优路径往往依赖非常具体的组合条件。
|
||||
|
||||
所以这类 commit 的真实含义通常不是:
|
||||
|
||||
> “发现了一个更高级的性能优化”
|
||||
|
||||
而是:
|
||||
|
||||
> “某条后端路径在这个模型/精度/硬件组合上不稳、不可用、或者综合收益不好,所以钉到已验证的实现上”。
|
||||
|
||||
这类 commit 当然影响性能,但其本质更接近**后端选择与兼容性收敛**,不是典型意义上的 black-box tuning。
|
||||
|
||||
---
|
||||
|
||||
### D. 并行形态 / 系统结构相关
|
||||
|
||||
* `2fe8ab56...`
|
||||
`TP=4`,关闭 KV transfer
|
||||
|
||||
* `85c50bf7...`, `5f591ffe...`
|
||||
loose top-k verify 相关开关,后来关闭
|
||||
|
||||
* `c18e84e2...`
|
||||
增加 `speculative_config.num_speculative_tokens = 32`
|
||||
|
||||
* `ae600ea0...`, `de918bd2...`, `b3d465ad...`, `9c497637...`, `741d586d...`, `e1fdbf46...`, `2e87a5d3...`, `27d64ee3...`
|
||||
PD / EPD / KVS 相关支持
|
||||
|
||||
#### 背后的理由
|
||||
|
||||
这一组说明一个更重要的事实:
|
||||
|
||||
> 真正影响 serving config 的,常常不是单个 flag,而是**系统结构变了**。
|
||||
|
||||
例如:
|
||||
|
||||
* 一旦引入 **PD/EPD**,prefill 和 decode 的最优 `max_num_batched_tokens`、`max_num_seqs`、`cuda_graph_sizes` 就不该相同;
|
||||
* 一旦引入 **KVS**,你必须重新分配显存 headroom;
|
||||
* 一旦使用 **speculative decoding**,decode 最优点就会改变;
|
||||
* `TP` 的选择本质上是**算力并行收益 vs 通信开销 vs KV 布局 vs 单机容量**的平衡。
|
||||
|
||||
所以这些 commit 说明的不是“flag 很多”,而是:
|
||||
|
||||
**配置空间真正复杂的来源,是系统形态切换,而不是几十个 flag 都在频繁被精细调参。**
|
||||
|
||||
---
|
||||
|
||||
## 4. 真正被反复调的“核心性能 flag”其实不多
|
||||
|
||||
从你这批 commit 看,反复出现的核心性能 knob 主要就这些:
|
||||
|
||||
* `max_num_batched_tokens`
|
||||
* `max_num_seqs`
|
||||
* `gpu_memory_utilization`
|
||||
* `async_scheduling`
|
||||
* `cuda_graph_sizes`
|
||||
* `cudagraph_mode`
|
||||
* `VLLM_ENABLE_TORCH_COMPILE`
|
||||
* `enable_prefix_caching`
|
||||
* `mm_processor_cache_gb`
|
||||
* `TP`
|
||||
* 若干 backend 选择:
|
||||
|
||||
* `VLLM_FLASH_ATTN_FP8_ATTENTION`
|
||||
* `VLLM_FLASH_ATTN_USE_UPSTREAM`
|
||||
* `VLLM_USE_UPSTREAM_FA3`
|
||||
* `VLLM_USE_FLASHINFER_SAMPLER`
|
||||
* `pass_config.fuse_*`
|
||||
|
||||
这对你的故事很重要:
|
||||
**配置文件里看起来 flag 很多,但真正反复进入性能决策回路的,是一个相对低维、重复出现的核心子集。**
|
||||
|
||||
---
|
||||
|
||||
## 5. 这批 commit 最能支持你的论文/文档里的什么结论
|
||||
|
||||
我建议你把结论写成下面这个脉络:
|
||||
|
||||
### 结论 A:生产环境 config churn 很大,但大部分不是性能 tuning
|
||||
|
||||
因为大量 commit 属于:
|
||||
|
||||
* 新模型接入
|
||||
* 新硬件适配
|
||||
* PD/KVS/VL/MTP 等功能接入
|
||||
* typo / meta / chore
|
||||
* 正确性与兼容性修复
|
||||
|
||||
所以**不能把“commit 多”直接等价成“性能配置空间巨大且持续被深度优化”**。
|
||||
|
||||
### 结论 B:真正的性能 tuning 集中在少数几个 recurring knobs 上
|
||||
|
||||
主要就是:
|
||||
|
||||
* batching 宽度
|
||||
* scheduling 策略
|
||||
* graph/compile 开关
|
||||
* memory headroom
|
||||
* cache 开关
|
||||
* backend path
|
||||
* TP / PD / KVS 等结构性选择
|
||||
|
||||
也就是说,**性能空间不是“很多杂乱 flag 的完全黑盒空间”,而是有明显系统结构的。**
|
||||
|
||||
### 结论 C:许多所谓“性能相关 commit”,本质是在修资源边界和后端稳定性
|
||||
|
||||
例如:
|
||||
|
||||
* `gpu_memory_utilization` 下调避免 OOM
|
||||
* 关闭 prefix / preprocess cache
|
||||
* 钉死某个 flash-attn 或 compile backend
|
||||
* 删除不稳定的 graph mode
|
||||
|
||||
这说明线上 config 演化的第一目标往往是:
|
||||
|
||||
> **先稳定可运行,再在少数核心 knob 上做局部优化。**
|
||||
|
||||
---
|
||||
|
||||
## 6. My overall judgment

Your original intuition was right, and the observation is actually quite strong:

> **There are many commits to the vLLM per-model configs, but the large majority are not fine-grained performance tuning; what continuously affects performance is a small set of recurring flags plus a few structural system switches (PD/KVS/TP/speculative).**

If you want, the next step could be to turn this into a more formal internal document, structured as:

**Background → Method → Commit classification statistics → High-frequency performance flags → Systems explanation for each flag class → Implications for the AITuner research**

so it can be sent directly to colleagues or a supervisor.
|
||||
|
After Width: | Height: | Size: 2.3 MiB |
|
After Width: | Height: | Size: 228 KiB |
105
projects/auto-tuner/Heterogenous Parallelism Cluster.md
Normal file
@@ -0,0 +1,105 @@

## Background

Online workloads are diverse in many ways, e.g. request type, SLO requirements, etc. Different workloads may have affinity for different parallelism modes: long requests may be better served with SP, large batch sizes may be better served with large EP, and so on.

So one thing we want to try is to run instances with several different parallelism configurations in the cluster and dispatch each online request to the most suitable instance, maximizing resource utilization.

## Challenges

- We need to show that online workloads really benefit from different parallelism modes, rather than one-size-fits-all
- Do online workloads require dynamic reconfiguration of the parallelism mode, or is a static assignment enough?
	- If dynamic reconfiguration is needed, how do we implement it efficiently? In particular, are there new challenges when the reconfiguration has to change the EP size?
- Following the DynamoLLM methodology (profiling + linear programming), does it still work in a larger design space (SP x PP x TP x EP), or do we need a new method? (a toy sketch of that methodology follows this list)

- Request types: length / SLO / **online or batch** / kvcache hit or miss / PD / ...
- Different machines: H20 or A800, compute vs. bandwidth
- How to search automatically over a large design space (request type, machine, parallelism mode, model)
- Trade-off between reconfiguration overhead and the optimal setup for the live load
- depends on context (current system status); how do we define the context?
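
As a toy illustration of the "classify requests, then pick the best-profiled configuration" idea referenced above (a sketch under strong assumptions: the length thresholds and the profile table are invented, and DynamoLLM itself solves an MILP for energy rather than this greedy argmax):

```python
# Toy sketch of "bucket requests into {S,M,L} x {S,M,L}, then pick the best-profiled config".
import bisect

BUCKETS = [256, 768]           # token-count thresholds for S / M / L (assumed)
LABELS = "SML"

def bucket(n_tokens: int) -> str:
    return LABELS[bisect.bisect_right(BUCKETS, n_tokens)]

# Profiled goodput (req/s/GPU) per (input_bucket, output_bucket) and parallel config -- made up.
PROFILE = {
    ("L", "S"): {"tp4": 0.9, "tp2_pp2": 0.7, "dp4": 0.5},
    ("S", "L"): {"tp4": 0.4, "tp2_pp2": 0.5, "dp4": 0.8},
}

def best_config(input_len: int, output_len: int) -> str:
    key = (bucket(input_len), bucket(output_len))
    table = PROFILE.get(key, PROFILE[("L", "S")])   # arbitrary fallback for unprofiled classes
    return max(table, key=table.get)

print(best_config(1024, 128))   # long-in short-out -> "tp4" under this fake profile
print(best_config(128, 1024))   # short-in long-out -> "dp4"
```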

## Roadmap

Prerequisite: process and analyze the latest 2h trace on qwen-235b, to be used for later tests.

Week 1: build a benchmarking harness on vLLM that supports EP/TP/PP = 1, 2, 4, 8 and measures, for the same input, per-GPU throughput / average request latency, etc.
Week 2: analyze how each parallelism mode affects performance, determine whether online workloads show request affinity for particular parallelism modes, and attempt a theoretical analysis of how much headroom a mixed-parallelism deployment has over a single deployment mode (the theoretical upper bound of the improvement).

## TBD

What is the methodology for classifying requests? Besides length and SLO requirements, what other dimensions are there? For online requests there will most likely be no explicit tag marking different SLO requirements, so how do we tell them apart?


## Current status

https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html

> While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed.


vLLM has started to support EPLB:
```
# Single node with EPLB load balancing
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \ # Tensor parallelism
    --data-parallel-size 8 \ # Data parallelism
    --enable-expert-parallel \ # Enable EP
    --enable-eplb \ # Enable load balancer
    --eplb-log-balancedness \ # Log balancing metrics
    --eplb-window-size 1000 \ # Track last 1000 engine steps
    --eplb-step-interval 3000 # Rebalance every 3000 steps
```




## Reference

1. 【HPCA】[DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency](https://arxiv.org/abs/2408.00741v1)

Such variations arise from:
(1) requests with varying input/output token lengths
(2) request load fluctuations
(3) distinct compute properties of different LLMs
(4) different SLOs required by the services using an LLM

Input/output lengths are split into 9 classes, {S, M, L} * {S, M, L};
combined with profiling, an MILP is solved to minimize the energy cost.

![[projects/auto-tuner/Heterogenous Parallelism Cluster.figs/260410-105227.png]]

2. https://www.infracloud.io/blogs/inference-parallelism
3. [Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI](https://arxiv.org/abs/2507.11830v1)


---

Possible further directions:
That work did not involve EP; what new challenges appear under EP?
The reconfiguration overhead is not negligible, and that work mostly applied well-known standard optimizations; what challenges does reconfiguration face under EP?


> Given the performance differences of MoE architectures between vision and language tasks

Why do these differences exist?

Can conclusions drawn on small models transfer to large models?


---

## Materials

https://www.perplexity.ai/hub/blog/lower-latency-and-higher-throughput-with-multi-node-deepseek-deployment



## Dev TBD

vLLM exits abnormally when PP and DP are enabled at the same time; why?
It looks like a vLLM bug: switching back to the v0.10.0 release branch still reliably reproduces it.

![[projects/auto-tuner/Heterogenous Parallelism Cluster.figs/260410-105227-1.png]]

How does DP communicate within a node / across nodes? It seems to need TCP; how does ZMQ work here?

||||
3
projects/auto-tuner/List.md
Normal file
@@ -0,0 +1,3 @@
https://halide-lang.org
https://tvm.apache.org

107
projects/auto-tuner/Ongoing.md
Normal file
@@ -0,0 +1,107 @@
|
||||
## vLLM DBO code structure

```python
class UBatchWrapper:
    def __init__(
        self,
        runnable: Callable,
        vllm_config: VllmConfig,
        runtime_mode: CUDAGraphMode,
        device: torch.cuda.device,
    ):
        self.runnable = runnable
        self.vllm_config = vllm_config
        self.compilation_config = vllm_config.compilation_config
        self.comm_stream = torch.cuda.Stream(device=device)
        # Two ubatch threads plus the main thread
        self.ready_barrier = threading.Barrier(3)

        self.cudagraphs: dict[int, CUDAGraphMetaData] = {}

        self.cudagraph_wrapper = None
        self.graph_pool = None
        if runtime_mode is not CUDAGraphMode.NONE:
            self.cudagraph_wrapper = CUDAGraphWrapper(
                runnable, vllm_config, runtime_mode=runtime_mode
            )
            self.graph_pool = current_platform.get_global_graph_pool()

        self.sm_control = self._create_sm_control_context(vllm_config)
        self.device = device
```
|
||||
|
||||
|
||||
|
||||
https://github.com/vllm-project/vllm-ascend/issues/2599
|
||||
|
||||
in vllm, we can search PR for `[Core/DBO]`
|
||||
|
||||
## notes
|
||||
|
||||
> Could you provide some performance improvement data? I tested DeepSeek V2 Lite locally and observed a negative performance gain, with the per-step latency increasing from 38ms to 49ms. The process of launching vLLM and the test results are shown below.
|
||||
|
||||
> According to the Nsys profile data, after enabling DBO, the execution time of both kernel batched_triton_kerneland vllm::act_and_mul_kernelhas increased significantly.
|
||||
|
||||
Yes this is expected; DBO will increase the GEMM time when running a memory bound workload since the full model weights will have to be loaded twice (once for each microbatch). So DBO is only really beneficial when the communication time is >1x GEMM time; so it's really only intended to be used in multi-node EP setup where the communications costs are much higher. Its not expected to provide speed-up in a single node environment.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
# kernels
|
||||
|
||||
```
|
||||
ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_stages_64x3_tn
|
||||
ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_64x4_tn
|
||||
ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn
|
||||
ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x6_tn
|
||||
fused_moe_kernel
|
||||
std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kernel<c10::BFloat16, (int)8>(T1 *, T1 *, const T1 *, float, int, int)
|
||||
void at::native::<unnamed>::cunn_SoftMaxForward<(int)4, float, float, float, at::native::<unnamed>::SoftMaxForwardEpilogue>(T4 *, const T2 *, int)
|
||||
void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::native::templates::cuda::uniform_and_transform<float, float, at::CUDAGeneratorImpl *, void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T3, T4)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::<unnamed>::distribution_nullary_kernel<float, float, float4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::uniform_and_transform<float, float, at::CUDAGeneratorImpl *, void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T3, T4)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(long, at::PhiloxCudaState, T3, T4)
|
||||
void at::native::<unnamed>::indexSelectLargeIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2, (int)-2, (bool)1>(at::cuda::detail::TensorInfo<T1, T3>, at::cuda::detail::TensorInfo<const T1, T3>, at::cuda::detail::TensorInfo<const T2, T3>, int, int, T3, T3, long)
|
||||
void at::native::<unnamed>::indexSelectSmallIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2, (int)-2>(at::cuda::detail::TensorInfo<T1, T3>, at::cuda::detail::TensorInfo<const T1, T3>, at::cuda::detail::TensorInfo<const T2, T3>, int, int, T3, long)
|
||||
void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::DivFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 12)]::operator ()() const::[lambda(c10::BFloat16) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void at::native::index_kernel_impl<at::native::OpaqueType<(int)2>>(at::TensorIteratorBase &, c10::ArrayRef<long>, c10::ArrayRef<long>)::[lambda(char *, const char *, long) (instance 1)]>(at::TensorIteratorBase &, c10::ArrayRef<long>, c10::ArrayRef<long>, const T1 &)::[lambda(int) (instance 1)]>(long, T3)
|
||||
void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<c10::BFloat16, at::native::func_wrapper_t<c10::BFloat16, at::native::sum_functor<c10::BFloat16, float, c10::BFloat16>::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, c10::BFloat16, (int)4, (int)4>>(T3)
|
||||
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::ArgMaxOps<float>, unsigned int, long, (int)4, (int)4>>(T3)
|
||||
void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_functor<float, float, float>::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4, (int)4>>(T3)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::CUDAFunctorOnSelf_add<int>, std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>, (int)8, TrivialOffsetCalculator<(int)0, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<long>, std::array<char *, (unsigned long)1>, (int)8, TrivialOffsetCalculator<(int)0, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 3)]::operator ()() const::[lambda(int) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(long) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<long>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, float, at::native::binary_internal::DivFunctor<float>>, std::array<char *, (unsigned long)3>>(int, T2, T3)
|
||||
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
void at::native::vectorized_elementwise_kernel<(int)8, at::native::FillFunctor<c10::BFloat16>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, __nv_bfloat16, __nv_bfloat16, float, (bool)0, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, (bool)1, (bool)0, (bool)0>(cublasLt::cublasSplitKParams<T6>, const T4 *, const T9 *, T8 *, T5 *, const T6 *, const T6 *, const T10 *, const T4 *, T10 *, void *, long, T6 *, int *, T6 *, const T6 *, const T6 *, const T6 *, const T6 *)
|
||||
void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, float, __nv_bfloat16, float, (bool)0, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, (bool)1, (bool)0, (bool)0>(cublasLt::cublasSplitKParams<T6>, const T4 *, const T9 *, T8 *, T5 *, const T6 *, const T6 *, const T10 *, const T4 *, T10 *, void *, long, T6 *, int *, T6 *, const T6 *, const T6 *, const T6 *, const T6 *)
|
||||
void cutlass::Kernel2<cutlass_80_tensorop_bf16_s16816gemm_relu_bf16_64x64_64x6_tn_align8>(T1::Params)
|
||||
void cutlass::Kernel2<cutlass_80_tensorop_s16816gemm_bf16_64x64_64x6_tn_align8>(T1::Params)
|
||||
void cutlass::Kernel2<cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_128x2_tn_align8>(T1::Params)
|
||||
void cutlass::Kernel2<cutlass_80_wmma_tensorop_s161616gemm_bf16_32x32_128x1_tn_align8>(T1::Params)
|
||||
void flash::flash_fwd_splitkv_combine_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (int)4, (int)1, (bool)1>(flash::Flash_fwd_params)
|
||||
void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)0, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)0, (bool)0>(flash::Flash_fwd_params)
|
||||
void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)0, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)1, (bool)0>(flash::Flash_fwd_params)
|
||||
void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)1, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)0, (bool)0>(flash::Flash_fwd_params)
|
||||
void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, const T1 *, int)
|
||||
void vllm::moe::count_and_sort_expert_tokens_kernel<int>(const T1 *, int *, int *, unsigned long)
|
||||
void vllm::moe::moe_align_block_size_kernel<int>(const T1 *, int *, int *, int *, int, int, int, int, unsigned long, int *)
|
||||
void vllm::moe::topkGatingSoftmax<(int)4, (int)128, (int)4, (int)16, int>(const float *, const bool *, float *, int, T5 *, int *, int, int, int)
|
||||
void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0>(const T1 *, const T1 *, T2 *, T2 *, const long *, long, long, long, long, long, int, int, int, const float *, const float *)
|
||||
void vllm::rms_norm_kernel<c10::BFloat16>(T1 *, const T1 *, const T1 *, float, int, int)
|
||||
void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, int, long, long, long, int, int, int)
|
||||
```
|
||||
7
projects/auto-tuner/Roadmap.md
Normal file
@@ -0,0 +1,7 @@
Our goals:

Upwards, provide the models with an abstract IR so that new models can be adapted at close to zero cost by compiling the model to the IR.

On top of the IR we define, we can then optimize the model's distributed deployment strategy, the dynamics of the inference workload, etc. at the IR level, and map the resulting optimization plan onto different backend inference frameworks.

BIN
projects/auto-tuner/Sync.figs/260410-105227-1.png
Normal file
|
After Width: | Height: | Size: 1008 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-2.png
Normal file
|
After Width: | Height: | Size: 957 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-3.png
Normal file
|
After Width: | Height: | Size: 952 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-4.png
Normal file
|
After Width: | Height: | Size: 628 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-5.png
Normal file
|
After Width: | Height: | Size: 911 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-6.png
Normal file
|
After Width: | Height: | Size: 131 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-7.png
Normal file
|
After Width: | Height: | Size: 215 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-8.png
Normal file
|
After Width: | Height: | Size: 932 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227-9.png
Normal file
|
After Width: | Height: | Size: 288 KiB |
BIN
projects/auto-tuner/Sync.figs/260410-105227.png
Normal file
|
After Width: | Height: | Size: 982 KiB |
353
projects/auto-tuner/Sync.md
Normal file
@@ -0,0 +1,353 @@

## 0805

Tested different parallelism configurations on vLLM latest, using (input_length, output_length) = (1024, 128) and (128, 1024), with Qwen3-30B-A3B.

Observations:
- Different input/output length ratios do show affinity for different parallelism modes
- Even though vLLM supports DeepEP, small EP indeed brings little improvement over no EP and can even be worse. [perplexity blog](https://www.perplexity.ai/hub/blog/lower-latency-and-higher-throughput-with-multi-node-deepseek-deployment)
- With long input and short output, increasing DP clearly hurts performance
- With short input and long output, increasing PP clearly hurts performance
- Switching configurations requires restarting vLLM, which takes about 6 min even for this 30B model; if we want dynamic configuration switching, a plain restart is probably too expensive

(input_length, output_length) = (128, 1024)

![[projects/auto-tuner/Sync.figs/260410-105227.png]]
![[projects/auto-tuner/Sync.figs/260410-105227-1.png]]

(input_length, output_length) = (1024, 128)

![[projects/auto-tuner/Sync.figs/260410-105227-2.png]]
![[projects/auto-tuner/Sync.figs/260410-105227-3.png]]


Open problems:
- How do we validate these conclusions on large models?
- vLLM latest has a bug: PP+DP together crashes


Feedback
- Look in detail at how vLLM implements the different parallelism modes and why the performance differs
- Think about how conclusions from small models transfer to large models, how the conclusions scale (hardest)
- Test with real traces as well as micro-benchmarks, compare the differences and the correlation
- Control variables and give more qualitative conclusions
- Model or measure some of the trade-offs theoretically/experimentally
	- e.g. under EP, a larger EP shrinks the bubbles but requires more communication and a larger batch size, trading latency for throughput: communication volume vs. compute bubbles, throughput vs. latency

---

## 0812

1. Understand how vLLM implements the different parallelism modes
	Conclusion: many of vLLM's implementation choices are hard to justify.
	For example:
	- In the current PP implementation, vLLM keeps PP virtual engines, each of which schedules one micro-batch and hands it to the executor. Even ignoring balance across micro-batches, there are large bubbles. E.g. with PP=4, each step looks like the diagram below (a rough bubble estimate is sketched after this list):
```
| b0 | b1 | b2 | b3 | xx | xx | xx |
| xx | b0 | b1 | b2 | b3 | xx | xx |
| xx | xx | b0 | b1 | b2 | b3 | xx |
| xx | xx | xx | b0 | b1 | b2 | b3 |
```
	- In DP's JSQ, the cost is defined as the pair `(num_waiting, num_running)`; why not `num_waiting + num_running`?
2. Search over parallelism configurations and controlled-variable test scripts and runs; data analysis not done yet [suspended]
3. For large models there is not enough hardware to profile every parallelism config, so we would like a high-fidelity simulator. The simplest idea is to profile kernels and search based on the profiles => found existing work [Vidur](https://arxiv.org/pdf/2405.05465).
	Vidur's remaining problems: it does not consider EP, and it needs an offline profile over a set of traces to produce a set of configs.
	![[projects/auto-tuner/Sync.figs/260410-105227-4.png]]
4. A new angle on parallelism modes
	The previous view: online workloads are diverse, so we may need different parallel deployment modes to serve them separately.
	The new view: the models on the Bailian platform are also diverse. With diverse models (Qwen-30b/235b/480b, Wanx, DeepSeek-671b, Kimi-K2, ...) and diverse hardware (A800, H800, H20, ...), is it possible to run different parts of different models (attention/MoE/...) in parallel on different hardware? In this view, downwards we abstract the hardware into a resource pool of different capabilities; upwards we give models more flexible composition options.
	Open question: why mix at all? Under PD/AF disaggregation, what is wrong with keeping each model physically isolated?
	Common requirement: a kernel/component-level profiler; from the profiles we should at least be able to compute, in theory, the overlap opportunities and the theoretical headroom from running different modules on different hardware.
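
A back-of-the-envelope estimate of the bubble in the PP schedule drawn in item 1 (a sketch only, assuming the naive schedule above with PP micro-batches per step, equal micro-batch times, and no overlap with the next step):

```python
# Rough bubble estimate for the naive PP schedule sketched in item 1.
pp = 4
slots_per_step = 2 * pp - 1        # 7 columns in the diagram above
busy_slots_per_stage = pp          # each stage runs 4 micro-batches per step
bubble_fraction = 1 - busy_slots_per_stage / slots_per_step
print(f"bubble ≈ {bubble_fraction:.0%}")   # ≈ 43% per step for PP=4
```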

Feedback
- Ideas from batch/stream processing systems => orchestration of the LLM computation graph
- Granularity: model -> layer -> component -> kernel

---
|
||||
## 0819
|
||||
|
||||
### vLLM profile
|
||||
|
||||
1. Python sucks!
|
||||
`<built-in method __getitem__ of dict object at 0x7faf7c36a1c0>` 导致 GPU bubbles
|
||||
![[projects/auto-tuner/Sync.figs/260410-105227-5.png]]
|
||||
![[projects/auto-tuner/Sync.figs/260410-105227-6.png]]
|
||||
|
||||
2. 目前 vLLM 推理的 kernel 已经比较高度的 fused,每一层 layer 的 forward 大概只有一次 flash attention 的调用,和一个大的 FFN 的 CUDA graph 调用
|
||||
|
||||
|
||||
### streaming system v.s. LLM inference system
|
||||
|
||||
[FlexFlow](https://arxiv.org/pdf/1807.05358)
|
||||
- DNN 训练场景的 parallelism search with space: Sample, Operation, Attribute, and Parameter.
|
||||
- uses a MCMC search algorithm that proposes a new parallelization strategy by changing the parallelization configuration of a single operation in the previous strategy
|
||||
|
||||
[APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving](https://arxiv.org/pdf/2411.17651)
|
||||
- 抽象出 transformer IR
|
||||
- 相比 Vidur 支持 EP
|
||||
- 三个组件:IR Converter and Transformer IR, Parallel Templates and Parallel Schemes Generator, Device Mapper
|
||||
|
||||
|
||||
关于计算图排布,可以看到大家目前在 search 时采用的方式都会采用 template,现在不同的 parallelism 本质也是一种专家发现的 template,用于减少 search space。
|
||||
|
||||
|
||||
| | streaming system | LLM inference system |
|
||||
| --- | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
|
||||
| | 持续更新计算结果,需要保存中间状态 [1] | 需要保存 KVCache |
|
||||
| | 重建中间状态需要重新处理整个 stream history [1] | 如果丢失需要重算 KVCache |
|
||||
| | dataflow streaming mode: V 表示 operator,E 表示 data stream,E 中有 forward, broadcast, shuffle, keyBy | V 表示 kernel,E 表示 communication,那么 E 同样有 forward, broadcast |
|
||||
| | 通过 top-down 将整个计算图 split,得到 operator fusion [2] | pre-compiled cuda graph 也是 operator fusion |
|
||||
| | | |
|
||||
| | 根据 workload 动态调整计算图 [3] | 根据 workload/输入 batch 的 pattern 可能也需要动态调整计算图,如何实现? |
|
||||
| | 计算图是固定的,更多考虑的是如何 fuse,如何分区 | 计算图本身不固定,可选的算子是多样的,并行模式是多样的,输入的 pattern 是多样的 |
|
||||
| | 通信是一个相对更简单的 dataflow,可以明确划分出 layers,通信只在 layers 之间进行 | 通信的模式更多样,layer 内的 AR, A2A,layer 之间的通信 |
|
||||
| | | |
|
||||
1) [A Survey on the Evolution of Stream Processing Systems](https://arxiv.org/pdf/2008.00842)
|
||||
2) [COLA: Optimizing Stream Processing Applications Via Graph Partitioning](https://dl.ifip.org/db/conf/middleware/middleware2009/KhandekarHPRWWAG09.pdf)
|
||||
3) [Adaptive Distributed Partitioning in Apache Flink](https://datalab.csd.auth.gr/static/people/gounaris/2020SMDB.pdf)
|
||||
4) [StreamScope: Continuous Reliable Distributed Processing of Big Data Streams](https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-lin-wei.pdf)
|
||||
5) [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://www.usenix.org/system/files/osdi22-zheng-lianmin.pdf)
|
||||
|
||||
|
||||
问题:
|
||||
- 计算图的自动编排与基于模拟器的 search 有什么区别
|
||||
- 如何选择合适的 search space,例如 micro batch 拆分之类的如何考虑
|
||||
- 输入的高度 variety,不确定在拆分到不同长度的 batch 组合时是否会具有相同的计算性质
|
||||
|
||||
### Thinking
|
||||
|
||||
本质来说,我们的工作是去提出更好的抽象从而做更好的 search?
|
||||
|
||||
LLM structure(如 transformer)作为一个程序的话,GPU kernel 就是一种 asm instruction。我们要做的事情是:
|
||||
- choose the best kernel (asm instruction) with lowest cost
|
||||
- run kernel (asm instruction) with best overlap
|
||||
Input: LLM structure, all usable kernels
|
||||
Output: LLM compute flow
|
||||
|
||||
### Feedback
|
||||
|
||||
- trace vLLM CUDA compute flow => native (Rust) framework
|
||||
|
||||
|
||||
|
||||
---
|
||||
## 0826
|
||||
|
||||
工程实现与学习,跑通 demo 可以将 triton -> ptx -> Rust load and launch CUDA kernel
|
||||
|
||||
vLLM 中大多数为 triton/flashinfer 等 lib,都支持 jit,可以使用 jit_module 进行 build & dump
|
||||
|
||||
如何通过 vLLM 做 inference 的计算图,duplicate 一份在一个 native 的 Rust framework 上做?
|
||||
|
||||
挑战:
|
||||
- 如何捕获一个计算图的 kernel flow 和 data flow
|
||||
- 对于不同的硬件、应对不同 input 的不同 size 的 kernel,如何自动化
|
||||
|
||||
|
||||
Feedback
|
||||
|
||||
(硬件,model,workload) ---人工优化---> setup
|
||||
总结 search parallelism 的 work
|
||||
一个 model 的不同量化版本之间的优化方案是否通用?
|
||||
matrix 384 的优化能不能搜出来
|
||||
|
||||
|
||||
---

## 0922

1. Got the most basic LLM inference running with Rust + candle, but there are accuracy problems (generation collapses into garbage tokens once decoding gets long), even when using the official candle-examples.

Problems of the existing frameworks:
- No unified abstraction; adapting new kernels or new models costs a lot of manual effort
- No unified guidance for distributed inference optimization (distributed partitioning, dispatching work across different hardware, different parallelism modes)
- Slow cold start: PyTorch, CUDA graphs
- Problems inherent to Python: missing information makes secondary development hard


Research challenges of building a native framework:
- Extracting the kernel graph of an existing LLM inference engine and reproducing it one-to-one
- How to provide a better, unified abstraction
- How to automatically partition work across different hardware and parallelize automatically

Engineering challenges:
- Rust may lack the ecosystem for the various communication libraries
- Numerical-precision and other low-level details may differ across architectures

---
|
||||
## Ongoing
|
||||
|
||||
### vLLM parallelism 实现细节
|
||||
|
||||
vLLM handles PP bubbles through several key mechanisms:
|
||||
|
||||
1. Virtual Engines: Uses virtual engine IDs to manage micro-batches and enable overlapping computations across pipeline stages
|
||||
2. Efficient Communication:
|
||||
- Uses IntermediateTensors to pass only necessary data between stages
|
||||
- Implements send_tensor_dict/recv_tensor_dict for optimized communication
|
||||
- Overlaps communication with computation to reduce idle time
|
||||
3. Pipeline Coordination:
|
||||
- Carefully manages first/last pipeline stages differently
|
||||
- Initializes pipeline groups for optimal communication patterns
|
||||
- Scheduler handles token passing between stages to prevent round-trips
|
||||
4. Key Files:
|
||||
- vllm/distributed/parallel_state.py: Core pipeline group coordination
|
||||
- vllm/worker/worker_base.py: Pipeline execution flow implementation
|
||||
- vllm/worker/model_runner.py: Pipeline-aware model execution
|
||||
- vllm/v1/core/sched/scheduler.py: Pipeline-parallel scheduling logic
|
||||
|
||||
The system minimizes bubbles through smart scheduling, efficient tensor communication, and overlapping computation/communication, though there's still a TODO for supporting overlapping micro-batches.
|
||||
|
||||
vLLM-v0
|
||||
|
||||
self.scheduler 维护了 virtual engines 的 scheduler 列表,为什么新请求能被加到 min_cost_scheduler 而不是第一个,然后按顺序做?
|
||||
> By having equal cache engines that are separate we create the conditions for the pipeline stages to be even (especially if we enable chunked prefill) provided the cost function for adding requests to schedulers is defined appropriately.
|
||||
|
||||
PP 的 scheduler 和 executor 是独立的两个东西。有 pp 个 virtual engines,每个有一个 scheduler,新请求向 cost 最小的 scheduler 添加。AsyncLLM 每一次 step 时,会对每个 virtual engine 调用 schedule,得到即将执行的请求喂给整个 AsyncLLM 维护的唯一一个 executor,也就是说这一个 executor 会在一次 step 中做 pp 次 execute_model,从而保证一次 step 确实有 token 吐出来。这个角度来说,min_cost_scheduler 的含义十分不明。
|
||||
![[projects/auto-tuner/Sync.figs/260410-105227-7.png]]
|
||||
|
||||
vLLM-v1
|
||||
|
||||
![[projects/auto-tuner/Sync.figs/260410-105227-8.png]]
|
||||
|
||||
https://github.com/vllm-project/vllm/issues/4461
|
||||
https://github.com/vllm-project/vllm/issues/11945
|
||||
|
||||
|
||||
|
||||
### 测试规划
|
||||
|
||||
```python
from tabulate import tabulate

# Each entry is "TP DP PP num_requests"; world size = TP * DP * PP,
# and EP spans the whole world when expert parallelism is enabled.
lst = [
    # 4 cards
    "1 4 1 800",
    "2 2 1 800",
    "4 1 1 800",
    "2 1 2 800",
    # 8 cards
    "1 8 1 1600",
    "2 4 1 1600",
    "4 2 1 1600",
    "8 1 1 1600",
    "4 1 2 1600",
    "2 1 4 1600",

    # illegal config
    # "1 2 2 800",
    # "2 2 2 1600",
    # "1 4 2 1600",
    # "1 2 4 1600",
    # "1 1 8 1600",
]

headers = ["IO", "TP", "DP", "PP", "EP", "qps", "num_requests"]

data = []
for input_len, output_len in [(1024, 128), (128, 1024)]:
    for enable_ep in [True, False]:
        for cfg in lst:
            tp, dp, pp, num_requests = map(int, cfg.split())
            num_requests = num_requests // 800 * 1000
            ep = tp * dp * pp

            if (input_len, output_len) == (1024, 128):
                if ep == 4:
                    qps_list = [10, 11, 12, 13, 14, 15, 16]
                if ep == 8:
                    qps_list = [20, 22, 24, 26, 28, 30, 32]
            if (input_len, output_len) == (128, 1024):
                if ep == 4:
                    qps_list = [2, 3, 4, 5, 6, 7, 8]
                if ep == 8:
                    qps_list = [4, 6, 8, 10, 12, 14, 16]

            for qps in qps_list:
                data.append([f"{input_len}+{output_len}", tp, dp, pp, ep if enable_ep else 0, qps, num_requests])

print(tabulate(data, headers=headers, tablefmt="github"))
```

With TP=2 DP=1 PP=4 EP=8, the server dies after every two test groups are launched.
TP=2 DP=1 PP=4 EP=0 has the same problem.

Problems with the test script:
A vLLM server started via Popen blocks once its stdout/stderr pipe buffers fill up, so the script cannot make progress.
Option 1: discard the server output entirely.
```python
import subprocess

server_process = subprocess.Popen(
    ["./run.sh"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
```
Option 2: redirect the output to a log file.
```python
import subprocess

with open("server.log", "w") as log_file:
    server_process = subprocess.Popen(
        ["./run.sh"],
        stdout=log_file,
        stderr=log_file,
    )
```

### TBD

[Vidur: A Large-Scale Simulation Framework For LLM Inference](https://arxiv.org/abs/2405.05465)
[Llumnix: Dynamic Scheduling for Large Language Model Serving](https://arxiv.org/abs/2406.03243)
[Frontier: Simulating the Next Generation of LLM Inference Systems](https://arxiv.org/abs/2508.03148)
SimAI is also adding support for an inference simulator; what is our differentiation?

---

> Tao: the main gap right now is that no framework can drive resource scheduling while also accounting for request scheduling across different configurations.

- nanoflow
- fork in the road
- ask Prof. Tao what the actual bottleneck of online scale-out/scale-in is

1. Industrial adoption of fast scale-out (first confirm whether 百炼 (Bailian) actually needs fast scale-out; is there data?)
Reference paper, fork in the road: https://www.usenix.org/system/files/osdi25-chai-xiaohu.pdf

Goal: achieve extremely fast scale-out while remaining compatible with mainstream inference frameworks (e.g. vLLM).
Known techniques so far:
- Faster KV cache initialization: docker pause & unpause
- Faster CUDA graph initialization: Medusa: Accelerating Serverless LLM Inference with Materialization.
- Faster weight loading: BlitzScale

2. Analyzing parallelism modes
- First we need to answer: is a single parallelism mode optimal for one system? See nanoflow https://arxiv.org/pdf/2408.12757v2
- Points nanoflow does not consider: inter-node optimization, the complex communication patterns under EP, and the complex context introduced by the KV cache
- On that basis, can we show that 1) existing frameworks do not do well enough, and 2) at optimal performance, do some of the earlier comparisons change?

3. Using the GPU bubbles of a small model to run part of a large model
AlpaServe: https://www.usenix.org/system/files/osdi23-li-zhuohan.pdf
Why not simply run multiple small models on one GPU, physically isolating large and small models and also reducing communication volume?
![[projects/auto-tuner/Sync.figs/260410-105227-9.png]]

### Additional information

30B model: load weights 15 s (cached), torch compile 180 s, Dynamo bytecode transform 30 s

### 0820 ~

> Candle's core goal is to _make serverless inference possible_. Full machine learning frameworks like PyTorch are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight binaries.

Advantage of a native framework: it may also improve cold-start time.

32
projects/auto-tuner/TODO.md
Normal file
89
projects/auto-tuner/Theoretical Analysis.md
Normal file
@@ -0,0 +1,89 @@

### prefill

For attention:

For transformer attention over a sequence of length S (multi-head, with d = H * head_dim):

1. **Q/K/V projection** (the three linear transforms done at once):

$$
\text{FLOPs}_{\text{QKV\_proj}} \approx 2 \times S \times d \times (3d) = 6Sd^2
$$

(the usual matrix-multiplication approximation: 2·m·n·k)

2. **Attention score matmul (Q·K^T)**:

$$
\text{FLOPs}_{QK} \approx 2 \times H \times S \times d_{\text{head}} \times S = 2S^2d
$$

because $H \cdot d_{\text{head}} = d$.

3. **Attention·V (weights times V)**:

$$
\text{FLOPs}_{AV} \approx 2S^2d
$$

4. **Output projection** (concatenate the heads back to d, then a linear transform):

$$
\text{FLOPs}_{\text{out}} \approx 2Sd^2
$$

Total FLOPs: $8Sd^2 + 4S^2d$

$$
T_{\text{comp}} = \frac{\text{FLOPs}_{\text{per\_GPU}}}{\text{peak\_flops\_per\_GPU} \times \text{compute\_utils}}
$$

Total memory traffic: $\text{bytes}_\text{prefill} \approx N \cdot \alpha \cdot BLd \cdot \text{elem\_bytes}$, with $\alpha \sim 6$

$$
T_{\text{mem}} = \frac{\text{bytes}_{\text{per\_GPU}}}{\text{bandwidth\_per\_GPU} \times \text{mem\_utils}}
$$
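
A small sketch of how these two formulas combine into a roofline-style prefill time estimate. All numeric values (GPU peak FLOPs and bandwidth, utilization factors, model shape) are illustrative placeholders, and the assumption that the FLOP formula applies per layer is mine, not stated above.

```python
# Roofline-style prefill time estimate (sketch; all numbers are placeholders).
# Assumptions: the attention FLOP formula above is per layer, N = layers,
# B = batch size, L = prompt length, d = hidden size; TP sharding is ignored.

def prefill_time_estimate(
    B=1, L=2048, d=4096, N=32,
    peak_flops=989e12, hbm_bw=3.35e12,   # placeholder peak FLOP/s and bytes/s
    compute_utils=0.5, mem_utils=0.8,
    elem_bytes=2, alpha=6,
):
    S = B * L                                     # total prefill tokens
    flops = N * (8 * S * d**2 + 4 * L * S * d)    # per-layer formula summed over layers
    t_comp = flops / (peak_flops * compute_utils)

    bytes_moved = N * alpha * B * L * d * elem_bytes
    t_mem = bytes_moved / (hbm_bw * mem_utils)

    # Roofline: the slower of the compute-bound and memory-bound estimates wins.
    return max(t_comp, t_mem)

print(f"estimated prefill time: {prefill_time_estimate():.4f} s")
```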

### decode

Total FLOPs: $8d^2 + 4dL$

Total memory traffic: $\text{bytes}_\text{decode} \approx N \cdot \beta \cdot BLd \cdot \text{elem\_bytes}$, with $\beta \sim 4$

$\text{output} = \text{SiLU}(xW_1)W_2$

Under TP, for each token $T$ and its activated expert $E$, the communication is:
1. The input $x$ is AllGather'ed to all TP ranks; volume: hidden_size * (TP - 1)
2. Each TP rank independently computes $xW_1'$
3. After an AllGather, every rank holds the full $xW_1$; volume: moe_intermediate_size / TP * (TP - 1)
4. Each rank computes the SiLU and $IW_2'$, and an AllReduce gives every rank the full output; volume: hidden_size * (TP - 1)

Under EP: dispatch + combine

2 * hidden_size * (EP - 1) / EP (assuming a balanced load)

A quick comparison of these volumes is sketched below.
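
A quick sketch that evaluates the per-token communication volumes derived above for TP versus EP; the hidden and intermediate sizes are placeholder values, not taken from any specific model.

```python
# Per-token MoE communication volumes from the derivation above
# (element counts, not bytes; multiply by dtype size for bytes).

def tp_moe_comm_per_token(hidden_size, moe_intermediate_size, tp):
    allgather_x   = hidden_size * (tp - 1)                  # step 1
    allgather_xw1 = moe_intermediate_size // tp * (tp - 1)  # step 3
    allreduce_out = hidden_size * (tp - 1)                  # step 4
    return allgather_x + allgather_xw1 + allreduce_out

def ep_moe_comm_per_token(hidden_size, ep):
    # dispatch + combine, assuming perfectly balanced expert load
    return 2 * hidden_size * (ep - 1) / ep

hidden, inter = 4096, 1536    # placeholder sizes
for world in (4, 8):
    print(world,
          tp_moe_comm_per_token(hidden, inter, world),
          ep_moe_comm_per_token(hidden, world))
```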
---

With batch_size=2000, seq_len=2048:

| Metric | Qwen-235B (EP=8) | Qwen-30B (EP=8) | Qwen-235B (EP=64) |
| --- | --- | --- | --- |
| attention comp time | 0.06944874306412531 | 0.01736218682573421 | 0.06944874306412531 |
| moe combine comm time | 0.00028672 | 0.00014336 | 0.00032256 |
| moe comp time | 0.0004069259060131379 | 0.00010173147650328448 | 0.0004069259060131379 |
| moe comm time with TP | 0.00045056 | 0.00022528 | 0.00045056 |

732
projects/auto-tuner/Untitled 1.md
Normal file
@@ -0,0 +1,732 @@
|
||||
你这个批评是对的。
|
||||
上一个版本把问题讲成了“单请求 service time + 泛泛 queueing”,但对 vLLM 这类 **continuous batching / bulk-service** 系统来说,这样建模不够锋利。
|
||||
|
||||
## 先把问题修正:正确的服务对象不是 request,而是 batch
|
||||
|
||||
对 TTFT 来说,真正被执行的是 **prefill batch**,不是单个请求。
|
||||
因此更准确的建模方式是:
|
||||
|
||||
* 先定义 **batch runtime** $T_b(t)$
|
||||
* 再定义 request 的 TTFT 是“排队等到自己所在 batch 被执行”加上“自己所在 batch 的执行时间”
|
||||
|
||||
而不是先写一个模糊的 $S_{\text{comp}} + S_{\text{comm}} + S_{\text{sched}}$。
|
||||
|
||||
---
|
||||
|
||||
## 一、最小而精确的建模框架
|
||||
|
||||
设:
|
||||
|
||||
* 总 GPU 数固定为 $G$
|
||||
* TP 为 $t$
|
||||
* 则可部署 replica 数为
|
||||
|
||||
$$
|
||||
m_t = \frac{G}{t}
|
||||
$$
|
||||
|
||||
对某个 replica,上面的第 $b$ 个 prefill batch 记为 $\mathcal{B}_b$。
|
||||
|
||||
对 batch 中每个请求 $i$:
|
||||
|
||||
* prompt length 为 $x_i$
|
||||
* 该 batch 的总 prefill token 数为
|
||||
|
||||
$$
|
||||
Z_b = \sum_{i \in \mathcal{B}_b} x_i
|
||||
$$
|
||||
|
||||
---
|
||||
|
||||
## 二、batch runtime $T_b(t)$ 应该怎么写
|
||||
|
||||
### 2.1 总式子
|
||||
|
||||
更准确地,应该写成:
|
||||
|
||||
$$
|
||||
T_b(t) = T_b^{\text{comp}}(t) + T_b^{\text{comm}}(t) + T_b^{\text{rt}}(t)
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $T_b^{\text{comp}}(t)$:真正算子计算时间
|
||||
* $T_b^{\text{comm}}(t)$:TP collective 通信时间
|
||||
* $T_b^{\text{rt}}(t)$:runtime 固定开销,例如 launch gap、executor 调度、图切换、kernel 边界等
|
||||
|
||||
这里我故意不用模糊的 $S_{\text{sched}}$,而把它收敛为 **runtime overhead residual**。
|
||||
因为对这类系统,“调度”大量体现为 **batch 之间的时间缝隙和执行边界成本**,而不是某个独立的物理项。
|
||||
|
||||
---
|
||||
|
||||
## 三、$T_b^{\text{comp}}(t)$ 到底怎么算
|
||||
|
||||
### 3.1 prefill 的计算量不是只和 $\sum x_i$ 成正比
|
||||
|
||||
对 decoder-only Transformer,一个 prompt length 为 $x$ 的请求,其 prefill 计算量可以抽象成:
|
||||
|
||||
$$
|
||||
F(x) = a x d^2 + b x^2 d
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $d$ 是 hidden size
|
||||
* $a x d^2$ 对应 dense 部分:QKV 投影、O 投影、MLP 等
|
||||
* $b x^2 d$ 对应 self-attention 部分
|
||||
* $a,b$ 吸收了层数 $L$、MLP expansion ratio、head 结构等常数
|
||||
|
||||
因此,一个 batch 的总 FLOPs 是:
|
||||
|
||||
$$
|
||||
F_b = \sum_{i \in \mathcal{B}_b} F(x_i)
= a d^2 \sum_{i \in \mathcal{B}_b} x_i + b d \sum_{i \in \mathcal{B}_b} x_i^2
|
||||
$$
|
||||
|
||||
如果把层数 $L$ 显式写出来,则是:
|
||||
|
||||
$$
|
||||
F_b = L \left( a' d^2 \sum_{i \in \mathcal{B}_b} x_i + b' d \sum_{i \in \mathcal{B}_b} x_i^2 \right)
|
||||
$$
|
||||
|
||||
这比“$S_{\text{comp}}$ 是计算项”要具体得多,因为这里直接告诉你:
|
||||
|
||||
> **prefill batch 的成本对长度分布是凸的,关键不是只有总 token 数 $Z_b$,而是还取决于 $\sum x_i^2$。**
|
||||
|
||||
---
|
||||
|
||||
### 3.2 一个非常关键的结论:length variance 直接进入 batch cost
|
||||
|
||||
因为:
|
||||
|
||||
$$
|
||||
\sum_{i \in \mathcal{B}_b} x_i^2
|
||||
= n_b \left( \bar{x}_b^2 + \operatorname{Var}_b(x) \right)
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $n_b = |\mathcal{B}_b|$
|
||||
* $\bar{x}_b$ 是 batch 内平均 prompt length
|
||||
* $\operatorname{Var}_b(x)$ 是 batch 内长度方差
|
||||
|
||||
所以在固定 $n_b$ 和固定平均长度 $\bar{x}_b$ 下:
|
||||
|
||||
* 方差越大
|
||||
* $\sum x_i^2$ 越大
|
||||
* prefill compute cost 越高
|
||||
|
||||
这就是一个很强的、可计算的 insight:
|
||||
|
||||
> **heterogeneity 不只是“调度难”,而是直接提高了 attention 的物理计算量。**
|
||||
|
||||
这也解释了为什么 coder window 往往更敏感:
|
||||
它通常不是只有更高平均长度,而是更高的 $\operatorname{Var}(x)$ 和更高的长尾。
|
||||
|
||||
---
|
||||
|
||||
### 3.3 把 FLOPs 映射成计算时间:roofline 形式
|
||||
|
||||
给定 batch $b$ 和 TP=$t$,其计算时间可以写成:
|
||||
|
||||
$$
|
||||
T_b^{\text{comp}}(t) = \max\{\frac{F_b}{\Pi_t}, \frac{Q_b}{B_t}\}
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $\Pi_t$ 是 TP=$t$ 时该 instance 的有效算力
|
||||
* $Q_b$ 是这个 batch 的总 memory traffic
|
||||
* $B_t$ 是 TP=$t$ 时该 instance 的有效 HBM 带宽
|
||||
|
||||
再进一步写:
|
||||
|
||||
$$
|
||||
\Pi_t = t \Pi_1 \eta_t^{\text{comp}}
|
||||
$$
|
||||
|
||||
$$
|
||||
B_t = t B_1 \eta_t^{\text{mem}}
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $\eta_t^{\text{comp}} \le 1$ 是 TP 扩大后由于 GEMM shape 变化、kernel utilization 降低带来的效率损失
|
||||
* $\eta_t^{\text{mem}} \le 1$ 是内存系统扩展效率
|
||||
|
||||
于是:
|
||||
|
||||
$$
|
||||
T_b^{\text{comp}}(t)
|
||||
=
|
||||
\max
|
||||
\{
|
||||
\frac{F_b}{t \Pi_1 \eta_t^{\text{comp}}},
|
||||
\frac{Q_b}{t B_1 \eta_t^{\text{mem}}}
|
||||
\}
|
||||
$$
|
||||
|
||||
这已经比“$S_{\text{comp}}$ 会随 TP 降低”精确很多了。
|
||||
它明确说明:
|
||||
|
||||
* 理想情况下,compute/memory 时间会按 $1/t$ 降
|
||||
* 但实际只能按 $1/(t \eta_t)$ 降
|
||||
* 而 $\eta_t$ 通常随 $t$ 变大而下降,所以是 **次线性加速**
|
||||
|
||||
---
|
||||
|
||||
## 四、$T_b^{\text{comm}}(t)$ 到底怎么算
|
||||
|
||||
### 4.1 TP 的通信本质是 activation collective
|
||||
|
||||
对标准 tensor parallel,每层通常会有若干个 activation 同步。
|
||||
如果用 ring all-reduce 近似,大小为 $n$ bytes 的一次 collective 时间是:
|
||||
|
||||
$$
|
||||
T_{\text{AR}}(n,t)
|
||||
=
|
||||
2 (t-1) \alpha
|
||||
+
|
||||
2 \frac{t-1}{t} \frac{n}{\beta}
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $\alpha$ 是每 hop latency
|
||||
* $\beta$ 是链路带宽
|
||||
|
||||
如果每层平均有 $k$ 次这样的 collective,且每次 payload 与 batch activation 大小成正比:
|
||||
|
||||
$$
|
||||
n_b \approx q d Z_b
|
||||
$$
|
||||
|
||||
其中 $q$ 是 dtype bytes,那么:
|
||||
|
||||
$$
|
||||
T_b^{\text{comm}}(t)
|
||||
\approx
|
||||
L k
|
||||
\left(
|
||||
2 (t-1)\alpha
|
||||
+
|
||||
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
|
||||
\right)
|
||||
$$
|
||||
|
||||
这就是你要的更具体形式。
|
||||
|
||||
---
|
||||
|
||||
### 4.2 这个式子带来的核心含义
|
||||
|
||||
它说明 TP 通信开销有两个部分:
|
||||
|
||||
#### 固定项
|
||||
|
||||
$$
|
||||
2 (t-1)\alpha
|
||||
$$
|
||||
|
||||
这是 latency-driven 的,batch 小时特别痛。
|
||||
|
||||
#### 线性 payload 项
|
||||
|
||||
$$
|
||||
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
|
||||
$$
|
||||
|
||||
这是 batch token 数越大越重。
|
||||
|
||||
所以:
|
||||
|
||||
* 低负载、小 batch 时,通信固定项不一定主导,但也不会消失
|
||||
* 高负载、大 batch 时,payload-driven 通信会上升
|
||||
* 更重要的是:**即使 compute 时间下降,communication 不会按 $1/t$ 同步下降**
|
||||
|
||||
这也是 TP 速度提升无法线性的根本原因。
|
||||
|
||||
---
|
||||
|
||||
## 五、把三项合起来:一个真正可用的 $T_b(t)$
|
||||
|
||||
综合起来,一个 batch 的执行时间可以写成:
|
||||
|
||||
$$
|
||||
T_b(t)
=
|
||||
|
||||
\max
|
||||
\{
|
||||
\frac{
|
||||
L ( a' d^2 \sum_i x_i + b' d \sum_i x_i^2 )
|
||||
}{
|
||||
t \Pi_1 \eta_t^{\text{comp}}
|
||||
},
|
||||
\frac{Q_b}{t B_1 \eta_t^{\text{mem}}}
|
||||
\}
|
||||
+
|
||||
L k
|
||||
(
|
||||
2 (t-1)\alpha
|
||||
+
|
||||
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
|
||||
)
|
||||
+
|
||||
T_b^{\text{rt}}(t)
|
||||
$$
|
||||
|
||||
这已经是一个相当具体的 mechanistic model 了。
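
A small numerical sketch of this $T_b(t)$ model; every constant below ($a'$, $b'$, $k$, $\alpha$, $\beta$, the efficiency factors and peak rates) is an illustrative placeholder that would in practice be fitted as described later in this note.

```python
# Evaluate the batch-runtime model T_b(t) assembled above (placeholder constants).

def batch_runtime(xs, t, *, L=32, d=4096, a=6.0, b=4.0, k=2,
                  pi1=3e14, b1=2e12, eta_comp=0.9, eta_mem=0.95,
                  alpha=5e-6, beta=2e11, q=2, t_rt=2e-3):
    Z = sum(xs)                                            # total prefill tokens in the batch
    flops = L * (a * d**2 * Z + b * d * sum(x * x for x in xs))
    q_bytes = q * L * d * Z * 6                            # crude memory-traffic proxy (assumption)
    t_comp = max(flops / (t * pi1 * eta_comp), q_bytes / (t * b1 * eta_mem))
    t_comm = L * k * (2 * (t - 1) * alpha + 2 * (t - 1) / t * q * d * Z / beta)
    return t_comp + t_comm + t_rt

batch = [512, 1024, 4096]                                  # prompt lengths in one prefill batch
for tp in (1, 2, 4):
    print(tp, round(batch_runtime(batch, tp), 4))
```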
|
||||
|
||||
---
|
||||
|
||||
## 六、TTFT 应该怎么精确定义,而不是模糊说 $W_q + E[S_t]$
|
||||
|
||||
### 6.1 request 级别的精确定义
|
||||
|
||||
对请求 $i$,设它被分配到某个 replica,并最终进入 prefill batch $b(i)$。
|
||||
它的 TTFT 精确写成:
|
||||
|
||||
$$
|
||||
\mathrm{TTFT}_i(t)
=
|
||||
|
||||
W_{q,i}(t) + T_{b(i)}(t)
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $W_{q,i}(t)$ 是它在自己 batch 真正开始执行前等待的时间
|
||||
* $T_{b(i)}(t)$ 是它所在 batch 的执行时间
|
||||
|
||||
而 $W_{q,i}(t)$ 又可以精确分解为:
|
||||
|
||||
$$
|
||||
W_{q,i}(t)
=
|
||||
|
||||
R_i(t) + \sum_{u \in \mathcal{H}_i} T_u(t)
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $R_i(t)$ 是请求到达时当前正在跑的那个 batch 的 residual runtime
|
||||
* $\mathcal{H}_i$ 是它前面还排着的 prefill batches 集合
|
||||
|
||||
这个式子是对的,而且比“queueing delay”四个字更有结构。
|
||||
|
||||
---
|
||||
|
||||
### 6.2 为什么上次写的 $E[S_t]$ 不够准确
|
||||
|
||||
因为对 continuous batching 系统,**“request 的 service time”不是一个天然的一维随机变量**。
|
||||
|
||||
更自然的随机变量其实是 batch runtime $T_b(t)$。
|
||||
request 的 TTFT 是:
|
||||
|
||||
* 一个 residual batch
|
||||
* 加上若干个完整 batch
|
||||
* 再加上自己的 batch
|
||||
|
||||
所以更对的期望写法是围绕 $T_b(t)$,而不是围绕某个抽象的 $S_t$。
|
||||
|
||||
---
|
||||
|
||||
### 6.3 如果一定要写平均 queueing term,应该怎么写
|
||||
|
||||
在稳态下,如果把 batch 看成 renewal process,那么 residual life 的精确均值是:
|
||||
|
||||
$$
|
||||
\mathbb{E}[R_t]
=
|
||||
|
||||
\frac{\mathbb{E}[T_b(t)^2]}{2 \mathbb{E}[T_b(t)]}
|
||||
$$
|
||||
|
||||
这比“queueing increases with variance”更具体,因为它明确告诉你:
|
||||
|
||||
> **batch runtime 的二阶矩直接进入等待时间。**
|
||||
|
||||
如果再做一个独立性近似,则:
|
||||
|
||||
$$
|
||||
\mathbb{E}[W_q(t)]
|
||||
\approx
|
||||
\frac{\mathbb{E}[T_b(t)^2]}{2 \mathbb{E}[T_b(t)]}
|
||||
+
|
||||
\mathbb{E}[|\mathcal{H}|] \, \mathbb{E}[T_b(t)]
|
||||
$$
|
||||
|
||||
其中 $|\mathcal{H}|$ 是到达时前方排着的 batch 数。
|
||||
|
||||
这就已经比之前的 $E[S_t]$、$W_q(t,w)$ 精确得多了。
|
||||
|
||||
---
|
||||
|
||||
## 七、真正关键的量不是 $E[S_t]$,而是 cluster capacity
|
||||
|
||||
如果你想分析“为什么高负载时小 TP 更好”,最关键的不是某个 request 的平均 service time,而是:
|
||||
|
||||
> **固定总 GPU 数下,cluster 能提供的总服务能力到底是多少。**
|
||||
|
||||
---
|
||||
|
||||
### 7.1 单 instance 的 batch token throughput
|
||||
|
||||
对 workload window $w$,定义 TP=$t$ 时单 instance 的平均 prefill token throughput 为:
|
||||
|
||||
$$
|
||||
\mu_{t,w}
=
|
||||
|
||||
\mathbb{E}
|
||||
\left[
|
||||
\frac{Z_b}{T_b(t)}
|
||||
\middle| w
|
||||
\right]
|
||||
$$
|
||||
|
||||
这里:
|
||||
|
||||
* $Z_b$ 是 batch token 数
|
||||
* $T_b(t)$ 是 batch runtime
|
||||
|
||||
这是一个比 “$E[S_t]$” 更自然的核心量。
|
||||
|
||||
---
|
||||
|
||||
### 7.2 cluster 总 capacity
|
||||
|
||||
因为总 replica 数是 $m_t = G/t$,所以 cluster 级别的总 prefill capacity 是:
|
||||
|
||||
$$
|
||||
\Lambda_{t,w}
=
|
||||
|
||||
\frac{G}{t} \, \mu_{t,w}
|
||||
$$
|
||||
|
||||
这个式子非常关键。
|
||||
它说明:TP 的 tradeoff 不是抽象的,而是直接落在
|
||||
|
||||
* 单 instance throughput:$\mu_{t,w}$
|
||||
* replica 数:$G/t$
|
||||
|
||||
两者的乘积上。
|
||||
|
||||
---
|
||||
|
||||
## 八、为什么高负载下大 TP 很难赢:一个几乎是定理级别的结论
|
||||
|
||||
比较 TP=$t$ 和 TP=1 的 cluster capacity:
|
||||
|
||||
$$
|
||||
\frac{\Lambda_{t,w}}{\Lambda_{1,w}}
=
|
||||
|
||||
\frac{\mu_{t,w}}{t \mu_{1,w}}
|
||||
$$
|
||||
|
||||
因此,大 TP 只有在下面这个条件成立时,才能在固定总 GPU 数下提升 cluster capacity:
|
||||
|
||||
$$
|
||||
\mu_{t,w} > t \mu_{1,w}
|
||||
$$
|
||||
|
||||
也就是说,**单 instance throughput 必须超过线性扩展**。
|
||||
|
||||
但这是非常苛刻的,因为:
|
||||
|
||||
* compute 只能理想到线性
|
||||
* 实际有 $\eta_t^{\text{comp}} < 1$
|
||||
* 还有额外的 $T_b^{\text{comm}}(t)$
|
||||
* 还有 runtime overhead
|
||||
|
||||
所以在稳态下,通常只能有:
|
||||
|
||||
$$
|
||||
\mu_{t,w} < t \mu_{1,w}
|
||||
$$
|
||||
|
||||
从而:
|
||||
|
||||
$$
|
||||
\Lambda_{t,w} \le \Lambda_{1,w}
|
||||
$$
|
||||
|
||||
这就是为什么在**固定总 GPU 数**下:
|
||||
|
||||
> **大 TP 几乎不可能提升饱和状态下的 cluster total capacity。**
|
||||
|
||||
它能提升的是:
|
||||
|
||||
* 单请求 latency
|
||||
* 单 instance 的 service time
|
||||
|
||||
但它通常不能提升**总集群容量**。
|
||||
|
||||
这就是你图里最核心的 principle。
|
||||
|
||||
---
|
||||
|
||||
## 九、为什么低负载时 TP4 更好,但高负载时 TP1/TP2 更好
|
||||
|
||||
### 9.1 低负载:系统是 arrival-limited
|
||||
|
||||
观测到的 goodput 近似为:
|
||||
|
||||
$$
|
||||
g_{t,w}^{\text{obs}}
|
||||
\approx
|
||||
\frac{1}{G}
|
||||
\min
|
||||
\left\{
|
||||
\lambda_w^{\text{good}},
|
||||
\Lambda_{t,w}
|
||||
\right\}
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $\lambda_w^{\text{good}}$ 是外部 offered good-token arrival rate
|
||||
* $\Lambda_{t,w}$ 是系统 capacity
|
||||
|
||||
当低负载时:
|
||||
|
||||
$$
|
||||
\lambda_w^{\text{good}} \ll \Lambda_{t,w}
|
||||
$$
|
||||
|
||||
对所有 TP 都成立,所以:
|
||||
|
||||
$$
|
||||
g_{t,w}^{\text{obs}}
|
||||
\approx
|
||||
\frac{\lambda_w^{\text{good}}}{G}
|
||||
$$
|
||||
|
||||
因此不同 TP 的 observed goodput/GPU 看起来几乎一样。
|
||||
|
||||
这就解释了你在 $time_scale=0.5$ 看到的现象:
|
||||
|
||||
* goodput/GPU 差异很小
|
||||
* 但 TP4 latency 更好
|
||||
|
||||
因为这时真正显现出来的是:
|
||||
|
||||
$$
|
||||
T_b(4) < T_b(1)
|
||||
$$
|
||||
|
||||
而 capacity 差异被 arrival-limited 掩盖了。
|
||||
|
||||
---
|
||||
|
||||
### 9.2 高负载:系统开始 capacity-limited
|
||||
|
||||
当:
|
||||
|
||||
$$
|
||||
\lambda_w^{\text{good}} \approx \Lambda_{t,w}
|
||||
$$
|
||||
|
||||
或者超过它时,系统进入高利用率区。
|
||||
此时:
|
||||
|
||||
* observed goodput 开始接近 capacity
|
||||
* queueing term 开始爆炸
|
||||
|
||||
因为利用率可以写成:
|
||||
|
||||
$$
|
||||
\rho_{t,w}
=
|
||||
|
||||
\frac{\lambda_w^{\text{good}}}{\Lambda_{t,w}}
|
||||
$$
|
||||
|
||||
而大 TP 通常让 $\Lambda_{t,w}$ 下降,所以会让 $\rho_{t,w}$ 上升。
|
||||
|
||||
一旦 $\rho_{t,w}$ 接近 $1$,你就会看到:
|
||||
|
||||
* queueing delay 急剧上升
|
||||
* p95 TTFT 急剧恶化
|
||||
* SLO pass 数下降
|
||||
* goodput/GPU 反而输给小 TP
|
||||
|
||||
这就解释了为什么高负载下 TP1/TP2 更优。
|
||||
|
||||
---
|
||||
|
||||
## 十、为什么 coder 比 chat 更容易出现这个现象
|
||||
|
||||
这可以从两个层面解释。
|
||||
|
||||
### 10.1 batch cost 的凸性更强地惩罚高方差长度分布
|
||||
|
||||
前面已经有:
|
||||
|
||||
$$
|
||||
F_b
=
|
||||
|
||||
a d^2 \sum_i x_i + b d \sum_i x_i^2
|
||||
$$
|
||||
|
||||
而:
|
||||
|
||||
$$
|
||||
\sum_i x_i^2
=
|
||||
|
||||
n_b \left( \bar{x}_b^2 + \operatorname{Var}_b(x) \right)
|
||||
$$
|
||||
|
||||
所以 coder 如果具有更高的:
|
||||
|
||||
* long tail
|
||||
* length variance
|
||||
* long fraction
|
||||
|
||||
那么 batch runtime 就会更高,而且不是线性地更高。
|
||||
|
||||
---
|
||||
|
||||
### 10.2 queueing tail 还会额外放大 runtime variance
|
||||
|
||||
你真正关心的是 p95 TTFT。
|
||||
而 p95 受的不仅是 $\mathbb{E}[T_b]$,还受 $\mathbb{E}[T_b^2]$ 甚至更高阶尾部分布影响。
|
||||
|
||||
如果 coder 的 batch runtime 分布更 heavy-tail,那么:
|
||||
|
||||
$$
|
||||
\mathbb{E}[R_t]
=
|
||||
|
||||
\frac{\mathbb{E}[T_b(t)^2]}{2 \mathbb{E}[T_b(t)]}
|
||||
$$
|
||||
|
||||
也会更大。
|
||||
|
||||
所以 coder 更早进入:
|
||||
|
||||
* residual time 大
|
||||
* queue 深
|
||||
* p95 爆掉
|
||||
|
||||
的 regime。
|
||||
|
||||
---
|
||||
|
||||
## 十一、因此你应该怎么总结这条 principle
|
||||
|
||||
你现在最该写的不是:
|
||||
|
||||
> 负载越大,TP 越小越好。
|
||||
|
||||
而应该写成下面这个更强、更准确的形式:
|
||||
|
||||
---
|
||||
|
||||
### Principle
|
||||
|
||||
在固定总 GPU 数下,TP 同时影响两个不同层面的量:
|
||||
|
||||
1. **单 instance 的 batch runtime $T_b(t)$**
|
||||
更大的 TP 往往降低 compute time,因此改善低负载下的 TTFT。
|
||||
|
||||
2. **集群总 capacity $\Lambda_{t,w} = \frac{G}{t}\mu_{t,w}$**
|
||||
由于 TP 扩展通常只有次线性速度提升,且引入额外 collective 开销,因此 $\mu_{t,w}$ 很难超过线性增长,导致 cluster capacity 通常随 TP 增大而不升反降。
|
||||
|
||||
因此:
|
||||
|
||||
* 在低负载区,系统是 arrival-limited,capacity 差异被隐藏,TTFT 主要由 $T_b(t)$ 决定,所以较大 TP 更优。
|
||||
* 在高负载区,系统是 capacity- / queueing-limited,$\Lambda_{t,w}$ 和 $\rho_{t,w}$ 主导表现,因此较小或中等 TP 更优。
|
||||
|
||||
---
|
||||
|
||||
## 十二、把你图里的 observation 严格落到这个模型上
|
||||
|
||||
你现在的图,其实支持的是下面这句话:
|
||||
|
||||
### 低载区
|
||||
|
||||
在 sampled low-load windows 上,
|
||||
|
||||
$$
|
||||
\lambda_w^{\text{good}} \ll \Lambda_{t,w}, \quad \forall t \in {1,2,4}
|
||||
$$
|
||||
|
||||
因此 observed goodput/GPU 相近;同时
|
||||
|
||||
$$
|
||||
T_b(4) < T_b(1)
|
||||
$$
|
||||
|
||||
所以 TP4 的 p95 TTFT 更好。
|
||||
|
||||
---
|
||||
|
||||
### 高载区
|
||||
|
||||
随着 $time_scale$ 增大,$\lambda_w^{\text{good}}$ 被放大,系统开始逼近:
|
||||
|
||||
$$
|
||||
\lambda_w^{\text{good}} \approx \Lambda_{t,w}
|
||||
$$
|
||||
|
||||
而由于通常有:
|
||||
|
||||
$$
|
||||
\Lambda_{4,w} < \Lambda_{2,w} \lesssim \Lambda_{1,w}
|
||||
$$
|
||||
|
||||
或者至少:
|
||||
|
||||
$$
|
||||
\Lambda_{4,w} < \Lambda_{2,w}
|
||||
$$
|
||||
|
||||
所以 TP4 更早进入高利用率区,queueing tail 更早爆炸,于是:
|
||||
|
||||
* goodput/GPU 开始输给小 TP
|
||||
* p95 TTFT 恶化更明显
|
||||
|
||||
这就是你观察到的 phase change。
|
||||
|
||||
---
|
||||
|
||||
## 十三、如果你要把这个模型真正用于预测,下一步该怎么做
|
||||
|
||||
你现在可以不再泛泛地说“workload affects TP”,而是直接拟合下面这些量。
|
||||
|
||||
### 需要拟合的量
|
||||
|
||||
| 量 | 含义 | 如何估计 |
|
||||
| ---------------------- | --------------------------------- | ----------------------------------------------------------------------- |
|
||||
| $a,b$ | prefill FLOP 模型常数 | 用 isolation microbench 拟合 batch runtime 对 $\sum x_i$ 和 $\sum x_i^2$ 的系数 |
|
||||
| $\eta_t^{\text{comp}}$ | TP 计算扩展效率 | 对固定 batch 组合测 $T_b^{\text{comp}}$ 反推 |
|
||||
| $\alpha,\beta$ | collective latency / bandwidth 参数 | 用 NCCL microbench 或 server log 中 collective 时间拟合 |
|
||||
| $T_b^{\text{rt}}(t)$ | runtime residual | 用实测 $T_b - T_b^{\text{comp}} - T_b^{\text{comm}}$ 得到 |
|
||||
| $\mu_{t,w}$ | 单 instance token throughput | 直接从 trace replay 实测 batch token / runtime 统计 |
|
||||
| $\Lambda_{t,w}$ | cluster capacity | 用 $\frac{G}{t}\mu_{t,w}$ 计算 |
|
||||
|
||||
---
|
||||
|
||||
## 十四、最本质的一句话
|
||||
|
||||
这条 principle 的本质其实可以压缩成一句话:
|
||||
|
||||
> **大 TP 优化的是“单个 batch 跑多快”,小 TP 优化的是“固定 GPU 预算下整个 cluster 同时能接多少活”。**
|
||||
|
||||
在低负载时,前者主导。
|
||||
在高负载时,后者主导。
|
||||
|
||||
---
|
||||
|
||||
如果你愿意,我下一条可以继续直接把这个模型再往前推一步,给你一版**可写进论文的小节**,包括:
|
||||
|
||||
* 精确定义
|
||||
* Proposition
|
||||
* 证明思路
|
||||
* 为什么会出现 TP4 $\rightarrow$ TP2 $\rightarrow$ TP1 的 regime shift。
|
||||
601
projects/auto-tuner/Untitled 2.md
Normal file
@@ -0,0 +1,601 @@
|
||||
## 先把整篇 paper 的核心脉络修正成一个严谨版本
|
||||
|
||||
你现在的直觉是对的,但原始表述里有一个关键逻辑漏洞:
|
||||
|
||||
> 不能简单说“$hardware \times engine \times model$ 交给 AITuner,$workload \times SLO$ 交给 principles”。
|
||||
|
||||
这样说的问题在于,**principles 也不可能脱离 $hardware \times engine \times model$ 独立成立**。
|
||||
更准确的说法应该是:
|
||||
|
||||
> **$hardware \times engine \times model$ 决定性能曲面的形状与可行配置集合;$workload \times SLO$ 决定系统当前落在哪个 operating regime,以及哪些 tradeoff 正在主导最优配置。**
|
||||
|
||||
也就是说,这两组变量的角色不同,不是“一个重要,一个不重要”,而是:
|
||||
|
||||
* **$hardware, engine, model$ 是 platform-defining axes**
|
||||
* **$workload, SLO$ 是 regime-defining axes**
|
||||
|
||||
这是你整篇 paper 最应该抓住的结构。
|
||||
|
||||
---
|
||||
|
||||
## 一、把问题形式化:整篇 paper 真正在求什么
|
||||
|
||||
设:
|
||||
|
||||
* $h$ 表示 hardware
|
||||
* $e$ 表示 engine
|
||||
* $m$ 表示 model
|
||||
* $w$ 表示 workload signature
|
||||
* $s$ 表示 SLO profile
|
||||
* $\theta$ 表示 serving configuration
|
||||
|
||||
那么最优配置其实是:
|
||||
|
||||
$$
|
||||
\theta^*(h,e,m,w,s) =
|
||||
\arg\max_{\theta \in \Theta(h,e,m)} G(\theta; h,e,m,w)
|
||||
\quad
|
||||
\text{s.t.} \quad L(\theta; h,e,m,w) \le s
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
* $\Theta(h,e,m)$ 是 **在特定平台上合法且可部署的配置空间**
|
||||
* $G(\cdot)$ 是 throughput / goodput 类目标
|
||||
* $L(\cdot)$ 是 latency tail,例如 $p95$ TTFT
|
||||
|
||||
这个式子直接说明了两件事:
|
||||
|
||||
### **1. $h,e,m$ 不能被 principles 忽略**
|
||||
|
||||
因为它们决定:
|
||||
|
||||
* 哪些 knobs 存在
|
||||
* 哪些组合合法
|
||||
* 每个 knob 改动的收益和代价
|
||||
* crossover point 在哪里
|
||||
|
||||
### **2. $w,s$ 不能被 simulator/emulator 忽略**
|
||||
|
||||
因为它们决定:
|
||||
|
||||
* 哪个 bottleneck 被激活
|
||||
* latency headroom 是否紧张
|
||||
* 当前 regime 是 service-time-dominated 还是 queueing-dominated
|
||||
* 哪种 tradeoff 才是“当前最重要的”
|
||||
|
||||
---
|
||||
|
||||
## 二、你真正想表达的 thesis,应该改写成下面这个版本
|
||||
|
||||
## **Refined thesis**
|
||||
|
||||
> **A full predictive emulator over $hardware \times engine \times model \times workload \times SLO$ is brittle in the face of rapid engine/model evolution, changing knob semantics, and shifting legality constraints. However, the entire space is not equally hard: $hardware \times engine \times model$ primarily shapes the response surface, while $workload \times SLO$ determines which tradeoff regime is active. We therefore extract directional principles over workload--SLO regimes, and use an online AITuner to instantiate and calibrate those principles for each concrete hardware--engine--model setting.**
|
||||
|
||||
这段话比“把前者交给 AITuner,后者交给 principles”严谨很多,因为它明确了:
|
||||
|
||||
* principles 不是在替代 HEM
|
||||
* AITuner 也不是在替代 WS principles
|
||||
* 两者是 **分层协作**
|
||||
|
||||
---
|
||||
|
||||
## 三、为什么“全空间 emulator”这条路不稳
|
||||
|
||||
这里不能只说“因为 engine 和 model 演化快”,这还不够。
|
||||
更严谨的论证应该有四层。
|
||||
|
||||
### **(1) 配置空间本身在变**
|
||||
|
||||
随着 engine / model 演化:
|
||||
|
||||
* 新 knobs 被引入
|
||||
* 老 knobs 语义改变
|
||||
* knob 间约束变化
|
||||
* 某些组合从合法变非法,或反过来
|
||||
|
||||
因此你不是在一个固定的 $\Theta$ 上做预测,而是在一个不断变化的 $\Theta(h,e,m)$ 上做预测。
|
||||
|
||||
---
|
||||
|
||||
### **(2) 机制系数在变**
|
||||
|
||||
即使 knobs 名字相同,性能响应也会变。比如:
|
||||
|
||||
* engine 改了 scheduler
|
||||
* collective backend 改了
|
||||
* CUDA graph 路径改了
|
||||
* model 改了 MoE routing / attention kernel / KV layout
|
||||
|
||||
这会直接改变:
|
||||
|
||||
* TP 的收益曲线
|
||||
* EP 的通信代价
|
||||
* batching knobs 的排队行为
|
||||
* runtime knobs 的边际收益
|
||||
|
||||
也就是性能曲面的“几何形状”在变。
|
||||
|
||||
---
|
||||
|
||||
### **(3) 你需要的不只是 ranking,还需要 feasibility**
|
||||
|
||||
对于 serving,最关键的不只是“谁更快”,而是:
|
||||
|
||||
> **在给定 SLO 下谁 still feasible**
|
||||
|
||||
而 feasibility 边界往往是最脆弱、最难模拟的部分,因为它对:
|
||||
|
||||
* tail latency
|
||||
* burstiness
|
||||
* queueing
|
||||
* runtime jitter
|
||||
|
||||
高度敏感。
|
||||
|
||||
---
|
||||
|
||||
### **(4) simulator/emulator 的维护成本会越来越高**
|
||||
|
||||
即使某一代 HEM 上 emulator 有效,后续也要持续追:
|
||||
|
||||
* 新模型
|
||||
* 新 kernel
|
||||
* 新 engine release
|
||||
* 新 hardware interconnect
|
||||
* 新 serving path
|
||||
|
||||
所以问题不是“能不能建 emulator”,而是:
|
||||
|
||||
> **能否持续维护一个在 rapidly evolving stack 上仍然可信的 emulator**
|
||||
|
||||
这就是你 paper 的关键动机之一。
|
||||
|
||||
---
|
||||
|
||||
## 四、你 paper 的真正贡献,不是“不要模型”,而是“换一个抽象层”
|
||||
|
||||
你不是在说:
|
||||
|
||||
* 不做建模
|
||||
* 不做 white-box
|
||||
* 全交给在线搜索
|
||||
|
||||
你真正应该说的是:
|
||||
|
||||
> **我们放弃对整个五维空间做脆弱的精确响应预测,转而提炼更稳定的、面向 operating regime 的 directional principles。**
|
||||
|
||||
这是一个很重要的层次转换:
|
||||
|
||||
### **不是**
|
||||
|
||||
预测整个函数:
|
||||
|
||||
$$
|
||||
(h,e,m,w,s,\theta) \mapsto \text{performance}
|
||||
$$
|
||||
|
||||
### **而是**
|
||||
|
||||
提炼一组更稳定的规则:
|
||||
|
||||
$$
|
||||
(w,s) \mapsto \text{which tradeoff regime is active}
|
||||
$$
|
||||
|
||||
再让 AITuner 在给定 $(h,e,m)$ 下去确定:
|
||||
|
||||
* crossover point
|
||||
* feasible boundary
|
||||
* exact winner
|
||||
|
||||
所以 AITuner 的角色不是“暴力兜底”,而是:
|
||||
|
||||
> **在具体平台上校准 principles 的边界,并完成最后一段精确搜索。**
|
||||
|
||||
---
|
||||
|
||||
## 五、这条 story 最关键的一个区分
|
||||
|
||||
这是我建议你在 paper 里明确讲出来的一句话:
|
||||
|
||||
## **Platform shapes the surface; regime selects the active tradeoff.**
|
||||
|
||||
更展开一点:
|
||||
|
||||
* **$hardware \times engine \times model$**
|
||||
决定配置空间是否合法、各 knob 的局部响应系数、以及不同 tradeoff 的边界位置。
|
||||
|
||||
* **$workload \times SLO$**
|
||||
决定当前系统更像是:
|
||||
|
||||
* latency-headroom-limited
|
||||
* capacity-limited
|
||||
* queueing-tail-limited
|
||||
* routing-communication-limited
|
||||
等哪一类 regime。
|
||||
|
||||
这句话非常适合作为 principles section 的开头句,也适合作为 intro 里的一句概括。
|
||||
|
||||
---
|
||||
|
||||
## 六、把整篇 paper 的主线整理成 5 个逻辑命题
|
||||
|
||||
你可以把全文逻辑压成下面五个命题。
|
||||
|
||||
### **Claim 1: The full tuning problem is five-dimensional**
|
||||
|
||||
最优配置依赖于:
|
||||
|
||||
$$
|
||||
(h,e,m,w,s)
|
||||
$$
|
||||
|
||||
任何把其中某些维度当作常量的做法,都只能在局部成立。
|
||||
|
||||
---
|
||||
|
||||
### **Claim 2: The five dimensions play different roles**
|
||||
|
||||
不是每个维度都同样适合被“提前模拟”。
|
||||
|
||||
* $h,e,m$:定义平台、决定机制细节、变化快
|
||||
* $w,s$:定义 regime、决定 tradeoff 是否激活、在 operational semantics 上更稳定
|
||||
|
||||
---
|
||||
|
||||
### **Claim 3: Full-stack emulation is brittle**
|
||||
|
||||
因为它必须同时追踪:
|
||||
|
||||
* evolving legality
|
||||
* evolving semantics
|
||||
* evolving coefficients
|
||||
* evolving feasibility boundaries
|
||||
|
||||
---
|
||||
|
||||
### **Claim 4: Regime-level principles are more stable and more actionable**
|
||||
|
||||
我们不试图预测每个点的精确 performance,
|
||||
而是预测:
|
||||
|
||||
* 当前 regime 下哪些 knob 更值得调
|
||||
* 哪个方向更可能好
|
||||
* 哪些区域根本 infeasible
|
||||
|
||||
---
|
||||
|
||||
### **Claim 5: AITuner uses principles as structured search priors**
|
||||
|
||||
也就是说,principles 不是结论陈列,而是 tuner 的先验。
|
||||
它们决定:
|
||||
|
||||
* 搜哪些 knobs
|
||||
* 先搜哪些方向
|
||||
* 哪些组合可以剪枝
|
||||
* 何时停止继续搜并报告 infeasible
|
||||
|
||||
这五个命题连起来,你的故事就完整了。
|
||||
|
||||
---
|
||||
|
||||
# 七、在这个主线下,principles section 应该如何展开
|
||||
|
||||
现在这一节已经不该叫 **workload-to-configuration principles**,
|
||||
而应该叫:
|
||||
|
||||
## **Regime-to-Configuration Principles**
|
||||
|
||||
或者更完整一点:
|
||||
|
||||
## **Configuration Principles under Joint Workload--SLO Regimes**
|
||||
|
||||
我更推荐前者,短而有力。
|
||||
|
||||
---
|
||||
|
||||
## 这一节的职责
|
||||
|
||||
这一节不是要给出全空间 predictive model,
|
||||
而是要回答:
|
||||
|
||||
1. **为什么最优配置必须 jointly depend on workload and SLO**
|
||||
2. **这些 joint regimes 会激活哪些核心 tradeoff**
|
||||
3. **这些 tradeoff 如何映射到不同 knob families**
|
||||
4. **AITuner 如何把这些 principles 变成 structured search prior**
|
||||
|
||||
---
|
||||
|
||||
# 八、principles section 的推荐结构
|
||||
|
||||
## **Section X: Regime-to-Configuration Principles**
|
||||
|
||||
### **X.1 From full-space tuning to regime-guided search**
|
||||
|
||||
这是全节的 framing 小节。
|
||||
|
||||
它做三件事:
|
||||
|
||||
* 给出五维问题定义
|
||||
* 说明为什么不做全空间 emulator
|
||||
* 说明为什么 principles 聚焦在 $w \times s$
|
||||
|
||||
这一小节里最重要的一句话是:
|
||||
|
||||
> We do not model away hardware, engine, or model diversity; instead, we let AITuner resolve them online, while using workload--SLO principles to identify which tradeoff regime is active and which parts of the configuration space are worth exploring.
|
||||
|
||||
这句话能把你整篇 paper 和 simulator/emulator work 区分开。
|
||||
|
||||
---
|
||||
|
||||
### **X.2 Why workload-only principles are insufficient**
|
||||
|
||||
这一节用你刚刚那两张图来引入。
|
||||
|
||||
核心观察是:
|
||||
|
||||
* 同一个 trace window
|
||||
* 只改 SLO profile
|
||||
* winner TP 会变化
|
||||
* 甚至会从 feasible 变成 none
|
||||
|
||||
所以不能只写:
|
||||
|
||||
> high load $\rightarrow$ small TP
|
||||
|
||||
而必须写成:
|
||||
|
||||
> under a given load and heterogeneity profile, the preferred TP still depends on SLO tightness.
|
||||
|
||||
这一节主要负责把 **SLO 引入 principles story**。
|
||||
|
||||
---
|
||||
|
||||
### **X.3 Principle I: TP trades latency headroom for aggregate concurrency**
|
||||
|
||||
这是 TP 小节。
|
||||
|
||||
它不再只是 workload principle,而是 regime principle:
|
||||
|
||||
* **tight SLO** 偏向大 TP,因为需要更低单-replica latency
|
||||
* **relaxed SLO + high load** 偏向小/中 TP,因为需要更高 aggregate concurrency
|
||||
* **heterogeneity** 会让转折更早发生
|
||||
|
||||
这一节最好只保留一句极简机制:
|
||||
|
||||
$$
|
||||
\text{replica count} = \frac{G}{t}
|
||||
$$
|
||||
|
||||
再结合图讲:
|
||||
|
||||
* latency headroom
|
||||
* queue buildup
|
||||
* infeasible regions
|
||||
|
||||
---
|
||||
|
||||
### **X.4 Principle II: EP helps only when expert traffic amortizes routing cost under the target SLO**
|
||||
|
||||
这是 EP 小节。
|
||||
|
||||
建议结构是:
|
||||
|
||||
* workload signal:
|
||||
|
||||
* MoE token volume
|
||||
* expert skew
|
||||
* prefill/decode mix
|
||||
* SLO signal:
|
||||
|
||||
* strict SLO 时 routing jitter 更危险
|
||||
* relaxed SLO 时更能容忍通信换吞吐
|
||||
* principle:
|
||||
|
||||
* 只有当 expert-side compute 足够大,且 routing/communication 能被摊薄时,EP 才值得
|
||||
* 否则不开 EP 更稳
|
||||
|
||||
这会和 TP 小节形成平行结构。
|
||||
|
||||
---
|
||||
|
||||
### **X.5 Principle III: Batching knobs reshape queueing tails under heterogeneity and SLO pressure**
|
||||
|
||||
这是 batching 小节。
|
||||
|
||||
建议聚焦:
|
||||
|
||||
* workload signal:
|
||||
|
||||
* length CV
|
||||
* long-request fraction
|
||||
* burstiness
|
||||
* SLO signal:
|
||||
|
||||
* strict tail SLO 不容忍长短请求互相拖累
|
||||
* principle:
|
||||
|
||||
* strict SLO 下,batching 通常更保守
|
||||
* relaxed SLO 下,可以更 aggressive packing 追吞吐
|
||||
|
||||
这一节很适合连接你后面的 queueing story。
|
||||
|
||||
---
|
||||
|
||||
### **X.6 Principle IV: Runtime-overhead knobs matter only when latency headroom is scarce**
|
||||
|
||||
这一节讲 CUDA graph、launch amortization、capture sizes 一类 knobs。
|
||||
|
||||
核心是:
|
||||
|
||||
* 这些 knobs 不是一阶 knobs in all regimes
|
||||
* 它们主要在:
|
||||
|
||||
* short requests
|
||||
* small batches
|
||||
* strict SLO
|
||||
* overhead-sensitive regime
|
||||
中决定 feasibility
|
||||
* 在 heavy prefill 或 communication-dominated regime 下,它们往往不是首要问题
|
||||
|
||||
---
|
||||
|
||||
### **X.7 Summary: Principles as structured search priors**
|
||||
|
||||
这一节非常关键。
|
||||
|
||||
它要把前面的 principle 收束成:
|
||||
|
||||
* 哪些 regime signal 决定先调哪个 knob family
|
||||
* 哪些 region 可以直接剪枝
|
||||
* 哪些 regime 应该直接报告 infeasible,而不是继续搜
|
||||
|
||||
这节最好配一个 summary table。
|
||||
|
||||
---
|
||||
|
||||
# 九、这一节的统一模板
|
||||
|
||||
为了让 TP、EP、batching、runtime 四个小节看起来像同一类东西,建议每节严格遵循同一模板。
|
||||
|
||||
| 小节组成 | 内容 |
|
||||
| -------------------------- | -------------------------------------- |
|
||||
| **Observation** | 图里看到什么 regime shift |
|
||||
| **Mechanism** | 一个最核心的 tradeoff,不展开长推导 |
|
||||
| **Regime dependence** | workload feature 和 SLO feature 各自怎么起作用 |
|
||||
| **Implication for tuning** | 如何缩小 search space / 识别 infeasible |
|
||||
|
||||
这样 TP 不会写成独立论文,EP/batching/runtime 也容易保持风格一致。
|
||||
|
||||
---
|
||||
|
||||
# 十、这一节里最值得保留的全局公式
|
||||
|
||||
正文里我建议只保留一个全局目标公式,用来统一整节。
|
||||
|
||||
$$
|
||||
\theta^*(h,e,m,w,s)
=
|
||||
|
||||
\arg\max_{\theta \in \Theta(h,e,m)}
|
||||
\; G(\theta; h,e,m,w)
|
||||
\quad
|
||||
\text{s.t.}
|
||||
\quad
|
||||
L(\theta; h,e,m,w) \le s
|
||||
$$
|
||||
|
||||
然后紧接着给一句解释:
|
||||
|
||||
* $h,e,m$ 定义 feasible space 和 local response surface
|
||||
* $w,s$ 选择当前 active regime
|
||||
* principles 作用于后者,AITuner 校准前者
|
||||
|
||||
这就够了。
|
||||
其他公式都尽量移到 appendix。
|
||||
|
||||
---
|
||||
|
||||
# 十一、推荐的整节 LaTeX 骨架
|
||||
|
||||
```latex
|
||||
\section{Regime-to-Configuration Principles}
|
||||
\label{sec:principles}
|
||||
|
||||
Serving performance depends on the joint space of hardware, engine, model,
|
||||
workload, and SLO. The optimal configuration is therefore
|
||||
\[
|
||||
\theta^*(h,e,m,w,s)
|
||||
=
|
||||
\arg\max_{\theta \in \Theta(h,e,m)}
|
||||
\; G(\theta; h,e,m,w)
|
||||
\quad
|
||||
\text{s.t.}
|
||||
\quad
|
||||
L(\theta; h,e,m,w) \le s .
|
||||
\]
|
||||
Rather than building a brittle full-stack emulator over this entire space, we
|
||||
separate the problem into two roles. Hardware, engine, and model determine the
|
||||
feasible configuration set and shape the local performance surface. Workload
|
||||
and SLO determine which operating regime is active, and thus which tradeoff is
|
||||
most likely to govern the optimum. We therefore extract regime-to-configuration
|
||||
principles over workload--SLO regimes, and let AITuner instantiate them online
|
||||
for each concrete hardware--engine--model setting.
|
||||
|
||||
\subsection{Why workload-only principles are insufficient}
|
||||
\label{sec:principles-why-not-workload-only}
|
||||
% \TODO{Use the multi-SLO TP figure.}
|
||||
% \TODO{Quantify how often the winner changes as the SLO changes.}
|
||||
|
||||
\subsection{Principle I: TP trades latency headroom for aggregate concurrency}
|
||||
\label{sec:principles-tp}
|
||||
% \TODO{Use TP winner heatmap and one supporting line chart.}
|
||||
|
||||
\subsection{Principle II: EP helps only when expert traffic amortizes routing cost under the target SLO}
|
||||
\label{sec:principles-ep}
|
||||
% \TODO{Insert EP figure.}
|
||||
|
||||
\subsection{Principle III: Batching knobs reshape queueing tails under heterogeneity and SLO pressure}
|
||||
\label{sec:principles-batching}
|
||||
% \TODO{Insert batching figure.}
|
||||
|
||||
\subsection{Principle IV: Runtime-overhead knobs matter when latency headroom is scarce}
|
||||
\label{sec:principles-runtime}
|
||||
% \TODO{Insert runtime-overhead figure.}
|
||||
|
||||
\subsection{Summary: principles as structured search priors}
|
||||
\label{sec:principles-summary}
|
||||
% \TODO{Insert summary table mapping regime signals to knob priorities,
|
||||
% candidate directions, and infeasibility actions.}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# 十二、建议你在 summary 小节里放的表
|
||||
|
||||
这张表会非常有力量,因为它把你的 principles 直接连接到 tuner design。
|
||||
|
||||
| Regime signal | Tight SLO effect | Dominant bottleneck | Preferred knob direction | Tuner action |
|
||||
| ----------------------------- | ------------------------------------- | ----------------------- | ------------------------------- | --------------------------------- |
|
||||
| Low load, low queueing | Headroom scarce | Single-replica latency | Larger TP | Search larger TP first |
|
||||
| High load, near saturation | Headroom less important | Aggregate concurrency | Smaller / intermediate TP | Search smaller TP first |
|
||||
| High expert traffic, low skew | Tight SLO may penalize routing jitter | Expert compute | Consider EP only if amortizable | Probe EP boundary |
|
||||
| High length heterogeneity | Tight SLO amplifies tails | HOL blocking / queueing | More conservative batching | Reduce batch aggressiveness |
|
||||
| Small batches, short requests | Tight SLO exposes launch overhead | Runtime overhead | Tune graph/capture knobs | Prioritize overhead knobs |
|
||||
| No feasible config | Any | Budget-limited regime | None | Scale out / relax SLO / shed load |
|
||||
|
||||
最后这一行非常重要,它把 `none` 合法化为系统输出,而不是“实验失败”。
|
||||
|
||||
---
|
||||
|
||||
# 十三、最关键的一句 paper-level message
|
||||
|
||||
我建议你把整篇 paper 的主线凝练成下面这句:
|
||||
|
||||
> **We do not attempt to emulate the full five-dimensional tuning space. Instead, we exploit the asymmetry between platform axes and operating-regime axes: hardware, engine, and model determine the local response surface, while workload and SLO determine which tradeoff regime is active. This lets us replace brittle full-space prediction with regime-guided search, instantiated online by AITuner.**
|
||||
|
||||
这句话足够当:
|
||||
|
||||
* abstract 里的核心句
|
||||
* intro 末尾的 problem statement
|
||||
* principles section 的 framing sentence
|
||||
|
||||
---
|
||||
|
||||
# 十四、最后一个关键提醒
|
||||
|
||||
你现在这条 story 已经很接近正确了,但一定不要过度声称:
|
||||
|
||||
> workload--SLO principles are universal across all hardware, engines, and models
|
||||
|
||||
更稳的说法应该是:
|
||||
|
||||
> **the directional principles are stable, while the exact crossover points and feasibility boundaries are platform-dependent and are resolved online by AITuner.**
|
||||
|
||||
这句话非常重要。
|
||||
它既保住了你的 principle story,又不会被 reviewer 用“换个 engine 就未必成立”轻易击穿。
|
||||
|
||||
---
|
||||
|
||||
下一步最自然的动作,是把这一整节直接写成**论文可用的完整 prose**:先写 **X.1 From full-space tuning to regime-guided search**,再写 **X.2 Why workload-only principles are insufficient**,这样整节开头就真正立住了。
|
||||
46
projects/auto-tuner/Untitled 3.md
Normal file
@@ -0,0 +1,46 @@
|
||||
R1 default run.sh, 4 GPUs, 1.0x
QPS 0.110594 Goodput 0.108720 Goodput/GPU 0.027180
TTFT 1171.53 / 2566.92 ms TPOT 7.56 / 11.30 ms Pass 98.31%
Diagnosis: underutilized, but this round had a client-side stream-line bug.
Action: fix harness and continue with a clean confirmation later.

R2 GPU_MEMORY_UTILIZATION=0.8, 4.0x
QPS 0.442377 Goodput 0.322410 Goodput/GPU 0.080603
TTFT 2306.44 / 5880.85 ms TPOT 15.51 / 41.96 ms Pass 72.88%
Diagnosis: prefill/queueing-limited.
Action: reduce offered load to find the knee.

R3 GPU_MEMORY_UTILIZATION=0.8, 3.0x
QPS 0.331783 Goodput 0.269925 Goodput/GPU 0.067481
TTFT 1835.28 / 5026.43 ms TPOT 12.29 / 23.83 ms Pass 81.36%
Diagnosis: still prefill/queueing-limited.
Action: try larger prefill batch and remove speculative overhead.

R4 GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, intended no-spec, 3.0x
QPS 0.331783 Goodput 0.264301 Goodput/GPU 0.066075
TTFT 1882.44 / 5071.41 ms TPOT 12.16 / 24.34 ms Pass 79.66%
Diagnosis: still prefill-limited; change did not help.
Action: patch run.sh so empty SPECULATIVE_CONFIG really disables speculation.

R5 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 2.0x
QPS 0.221188 Goodput 0.202444 Goodput/GPU 0.050611
TTFT 1464.60 / 3545.68 ms TPOT 10.00 / 25.96 ms Pass 91.53%
Diagnosis: improved, but still TTFT/pass-rate limited.
Action: retry 2.0x with real no-spec + larger prefill batch.

R6 GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, SPECULATIVE_CONFIG='', 2.0x
QPS 0.221188 Goodput 0.198695 Goodput/GPU 0.049674
TTFT 1485.97 / 4219.81 ms TPOT 17.64 / 29.77 ms Pass 89.83%
Diagnosis: no-spec reduced decode step time but did not improve SLO pass rate.
Action: stop chasing config knobs; search lower rate frontier.

R7 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.5x
QPS 0.165891 Goodput 0.157456 Goodput/GPU 0.039364
TTFT 1338.11 / 3048.92 ms TPOT 8.60 / 14.51 ms Pass 94.92%
Diagnosis: still TTFT-limited; frontier is below 1.5x.
Action: run a clean 1.0x confirmation.

R8 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.0x
QPS 0.110594 Goodput 0.110594 Goodput/GPU 0.027649
TTFT 1202.72 / 2596.63 ms TPOT 7.53 / 11.25 ms Pass 100.00%
Diagnosis: compliant and underutilized.
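
The goodput numbers above are consistent with goodput = QPS x SLO pass rate and goodput/GPU = goodput / 4 (these runs use 4 GPUs); a quick check for three of the runs, matching the logged values up to the rounding of the pass rate:

```python
# Consistency check: goodput = offered QPS * pass rate; goodput/GPU divides by 4.
runs = {
    "R2": (0.442377, 0.7288),
    "R5": (0.221188, 0.9153),
    "R8": (0.110594, 1.0000),
}
for name, (qps, pass_rate) in runs.items():
    goodput = qps * pass_rate
    print(name, round(goodput, 6), round(goodput / 4, 6))
```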
|
||||
362
projects/auto-tuner/Untitled.md
Normal file
@@ -0,0 +1,362 @@
|
||||
You are an expert Python systems engineer working on an LLM inference auto-tuning project.
|
||||
|
||||
The repository layout currently looks like:
|
||||
|
||||
```
|
||||
llm_autotune/
|
||||
├── adapter/
|
||||
│ └── __init__.py
|
||||
├── config_generator.py
|
||||
├── config_space/
|
||||
│ └── __init__.py
|
||||
├── feasibility/
|
||||
│ ├── filter_configs.py
|
||||
│ ├── __init__.py
|
||||
│ ├── memory_model.py
|
||||
│ ├── model_meta.py
|
||||
│ ├── topology_rules.py
|
||||
│ └── __pycache__/...
|
||||
├── harness/
|
||||
│ └── __init__.py
|
||||
├── inspector/
|
||||
│ ├── hardware_inspector.py
|
||||
│ ├── probe.py
|
||||
│ ├── workload_profiler.py
|
||||
│ └── __pycache__/...
|
||||
├── search/
|
||||
│ ├── heuristic.py
|
||||
│ └── __init__.py
|
||||
├── store/
|
||||
│ ├── json_store.py
|
||||
│ └── __init__.py
|
||||
├── util/
|
||||
│ └── __init__.py
|
||||
└── scripts/
|
||||
├── run_inspect.py
|
||||
├── run_search.py
|
||||
├── run_vllm_benchmark.py
|
||||
└── run_workload_benchmarks.py
|
||||
```
|
||||
|
||||
# Goal
|
||||
|
||||
Implement the core infrastructure for an AI-driven LLM inference config auto-tuner, with four concrete capabilities:
|
||||
|
||||
1. A robust profiling & logging tool that can run vLLM benchmarks and record detailed logs for later bottleneck analysis.
|
||||
2. A basic config generator that produces VALID vLLM configs for a given hardware + model + workload.
|
||||
3. A unified harness to run vidur (simulation) and vLLM (real system) with the same interface, writing detailed profiling logs.
|
||||
4. An AI-driven iterative loop that reads historical results, diagnoses bottlenecks, proposes new configs, and calls the harness repeatedly.
|
||||
|
||||
# General requirements
|
||||
|
||||
- Python 3.12+ with type hints and dataclasses where appropriate.
|
||||
- Keep dependencies minimal; assume we can add small, well-known libs if necessary, but prefer the standard library.
|
||||
- Avoid ad-hoc global state; pass objects explicitly.
|
||||
- Logging:
|
||||
- Use the `logging` module, not `print`, for internal logs.
|
||||
- User-facing scripts may still print concise summaries to stdout.
|
||||
- All public functions should have clear docstrings.
|
||||
|
||||
# Part 0: Core types & JSON store
|
||||
|
||||
1. Create or extend a module `harness/types.py` (or `harness/__init__.py` if you prefer) to define the core data classes:
|
||||
|
||||
- `HardwareProfile`
|
||||
- High-level fields like: gpu_type, num_gpus, hbm_gb, nvlink_topology (string or simple struct), cpu_cores, system_memory_gb, etc.
|
||||
- This should be compatible with what `inspector/hardware_inspector.py` can produce.
|
||||
|
||||
- `ModelProfile`
|
||||
- Fields like: model_name, param_count, hidden_size, num_layers, num_heads, is_moe, num_experts, is_mla, max_position_embeddings, etc.
|
||||
- This should be compatible with `feasibility/model_meta.py`.
|
||||
|
||||
- `WorkloadProfile`
|
||||
- Fields like: workload_name, qps, avg_prompt_tokens, p95_prompt_tokens, avg_decode_tokens, p95_decode_tokens, request_type, etc.
|
||||
- This should be compatible with `inspector/workload_profiler.py`.
|
||||
|
||||
- `VLLMConfig`
|
||||
- A structured representation of the core vLLM config knobs we care about:
|
||||
- tensor_parallel_size, pipeline_parallel_size, expert_parallel_size, data_parallel_size
|
||||
- block_size, max_num_batched_tokens, max_num_seqs
|
||||
- gpu_memory_utilization
|
||||
- scheduling_policy (string)
|
||||
- router/admission knobs if applicable
|
||||
- any other important vLLM engine args we need.
|
||||
|
||||
- `BenchmarkRunConfig`
|
||||
- Fields:
|
||||
- run_id: str
|
||||
- engine: Literal["vllm", "vidur"]
|
||||
- vllm_config: VLLMConfig
|
||||
- workload: WorkloadProfile
|
||||
- objective: str
|
||||
- extra: dict[str, Any] | None (for future extensions)
|
||||
|
||||
- `BenchmarkResult`
|
||||
- Fields:
|
||||
- run_id: str
|
||||
- success: bool
|
||||
- aggregated_metrics: dict[str, float] # e.g., {"qps": ..., "p95_latency_ms": ..., "ttft_ms": ...}
|
||||
- hw: HardwareProfile
|
||||
- model: ModelProfile
|
||||
- workload: WorkloadProfile
|
||||
- vllm_config: VLLMConfig
|
||||
- error_message: str | None
|
||||
- trace_paths: list[str] # paths to detailed traces if any
|
||||
- started_at: datetime
|
||||
- finished_at: datetime
|
||||
|
||||
- `BottleneckReport`
|
||||
- A simple summary object that the AI loop could use later, with fields like:
|
||||
- primary_bottleneck: Literal["memory", "compute", "communication", "scheduler", "unknown"]
|
||||
- secondary_bottlenecks: list[str]
|
||||
- notes: str
|
||||
|
||||
2. Extend `store/json_store.py` to implement a minimal JSON-based store:
|
||||
|
||||
- Provide at least these methods (or similar, well-documented alternatives):
|
||||
|
||||
```python
|
||||
class JsonStore:
|
||||
def __init__(self, root: Path): ...
|
||||
|
||||
def create_run_dir(self, run_id: str) -> Path: ...
|
||||
def save_run_config(self, cfg: BenchmarkRunConfig) -> None: ...
|
||||
def save_run_result(self, result: BenchmarkResult) -> None: ...
|
||||
def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None: ...
|
||||
```
|
||||
|
||||
- The layout on disk should roughly be:
|
||||
|
||||
- `root/run_id/config.json`
|
||||
- `root/run_id/metrics.json`
|
||||
- `root/run_id/traces.jsonl` (optional)
|
||||
|
||||
- These files should be valid JSON and easy to parse later.
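
As one possible shape for this store (a sketch only, standard library throughout; `BenchmarkRunConfig` / `BenchmarkResult` are the dataclasses specified in Part 0):

```python
# Sketch of a minimal JsonStore matching the interface and on-disk layout above:
# root/run_id/{config.json, metrics.json, traces.jsonl}. Illustration only.
import json
from dataclasses import asdict
from pathlib import Path
from typing import Any


class JsonStore:
    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create_run_dir(self, run_id: str) -> Path:
        run_dir = self.root / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        return run_dir

    def save_run_config(self, cfg: "BenchmarkRunConfig") -> None:
        path = self.create_run_dir(cfg.run_id) / "config.json"
        path.write_text(json.dumps(asdict(cfg), indent=2, default=str))

    def save_run_result(self, result: "BenchmarkResult") -> None:
        path = self.create_run_dir(result.run_id) / "metrics.json"
        path.write_text(json.dumps(asdict(result), indent=2, default=str))

    def append_trace_record(self, run_id: str, record: dict[str, Any]) -> None:
        path = self.create_run_dir(run_id) / "traces.jsonl"
        with path.open("a") as f:
            f.write(json.dumps(record) + "\n")
```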
|
||||
|
||||
# Part 1: vLLM benchmark runner with profiling
|
||||
|
||||
Implement a robust vLLM runner that:
|
||||
|
||||
- Takes a `BenchmarkRunConfig`, `HardwareProfile`, and `ModelProfile`.
|
||||
- Runs a benchmark against vLLM.
|
||||
- Collects aggregated metrics and (optionally) time-series traces.
|
||||
- Persists everything through `JsonStore`.
|
||||
|
||||
1. In `harness/vllm_runner.py` implement:
|
||||
|
||||
```python
|
||||
from .types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
|
||||
from store.json_store import JsonStore
|
||||
|
||||
def run_vllm_benchmark(
|
||||
run_config: BenchmarkRunConfig,
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
store: JsonStore,
|
||||
) -> BenchmarkResult:
|
||||
"""
|
||||
Launch a vLLM benchmark according to run_config, collect metrics, and store logs.
|
||||
|
||||
This function is allowed to:
|
||||
- Use an in-process vLLM Engine OR
|
||||
- Start a vLLM HTTP server as a subprocess and send requests to it.
|
||||
|
||||
It MUST:
|
||||
- Generate a unique run directory via JsonStore.
|
||||
- Save config and metrics via JsonStore.
|
||||
- Return a BenchmarkResult object populated with aggregated metrics.
|
||||
"""
|
||||
```
|
||||
2. Integrate existing modules:
|
||||
- Use inspector/workload_profiler.py to generate the request stream and WorkloadProfile.
|
||||
- Optionally call a helper (in a new module harness/profiler.py) that periodically samples GPU utilization, memory usage, etc., and writes records via JsonStore.append_trace_record.
|
||||
3. Update scripts/run_vllm_benchmark.py to:
|
||||
- Parse CLI arguments (model, workload description, objective, etc.).
|
||||
- Instantiate HardwareProfile via inspector/hardware_inspector.py.
|
||||
- Instantiate ModelProfile via feasibility/model_meta.py.
|
||||
- Instantiate a JsonStore rooted at e.g. ./runs.
|
||||
- Build a BenchmarkRunConfig.
|
||||
- Call run_vllm_benchmark.
|
||||
- Print a concise summary plus the run_id.
|
||||
# Part 2: Config generator & validity checking
|
||||
We already have feasibility/filter_configs.py, memory_model.py, model_meta.py, and topology_rules.py. Now we need to:
|
||||
1. Extend feasibility/filter_configs.py to expose a single, well-typed validator:
|
||||
```python
|
||||
from harness.types import VLLMConfig, HardwareProfile, ModelProfile, WorkloadProfile
|
||||
|
||||
def validate_config(
|
||||
cfg: VLLMConfig,
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
workload: WorkloadProfile | None = None,
|
||||
) -> tuple[bool, list[str]]:
|
||||
"""
|
||||
Check whether a vLLMConfig is valid for the given hardware/model/workload.
|
||||
|
||||
Returns:
|
||||
(is_valid, reasons_if_invalid)
|
||||
"""
|
||||
```
|
||||
- Use memory_model to estimate HBM usage and enforce an upper bound.
|
||||
- Use topology_rules to reject obviously bad parallelism combinations.
|
||||
- Use simple logical constraints (tp * pp * ep * dp == world_size, etc.).
|
||||
2. Implement a minimal "repair" helper in feasibility/filter_configs.py:
|
||||
```python
|
||||
def repair_config(
|
||||
cfg: VLLMConfig,
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
workload: WorkloadProfile,
|
||||
) -> VLLMConfig:
|
||||
"""
|
||||
Attempt to minimally modify cfg to make it valid.
|
||||
|
||||
Strategies:
|
||||
- Adjust gpu_memory_utilization downwards if memory is too tight.
|
||||
- Reduce max_num_batched_tokens or max_num_seqs if KV cache is too large.
|
||||
- Fix tp/pp/ep/dp product to match world_size.
|
||||
- Raise a clear exception if repair is impossible.
|
||||
"""
|
||||
```
|
||||
3. Implement config_generator.py as the main entry for generating seed configs:
|
||||
- Provide:
|
||||
```python
|
||||
def generate_seed_configs(
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
workload: WorkloadProfile,
|
||||
objective: str,
|
||||
max_configs: int = 8,
|
||||
) -> list[VLLMConfig]:
|
||||
"""
|
||||
Generate a small set of reasonable, VALID seed vLLM configs for the given environment.
|
||||
|
||||
Use:
|
||||
- Handcrafted "template families" for dense/MoE/MLA models.
|
||||
- Different templates for latency vs throughput oriented objectives.
|
||||
- validate_config(...) to filter out invalid configs.
|
||||
"""
|
||||
```
|
||||
- Use the validator and repair helper when constructing these configs.
|
||||
# Part 3: Unified runner for vidur(simulation) and vLLM(real)
|
||||
We want a single harness function that can run either vidur or vLLM with the same input types and logging behavior.
|
||||
1. In adapter/ add two modules:
|
||||
- adapter/vllm_adapter.py
|
||||
- Responsible for low-level integration with vLLM (EngineArgs / HTTP service).
|
||||
- Should expose simple primitives such as "apply VLLMConfig" and "send requests".
|
||||
- adapter/vidur_adapter.py
|
||||
- Responsible for integrating with the vidur simulator.
|
||||
- Should expose an interface that, given a VLLMConfig and WorkloadProfile, produces metrics comparable to vLLM.
|
||||
Both adapters can be thin now; they can evolve later.
|
||||
1. In harness/runner.py implement:
|
||||
```python
|
||||
from harness.types import BenchmarkRunConfig, BenchmarkResult, HardwareProfile, ModelProfile
|
||||
from store.json_store import JsonStore
|
||||
|
||||
def run_benchmark(
|
||||
run_config: BenchmarkRunConfig,
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
store: JsonStore,
|
||||
) -> BenchmarkResult:
|
||||
"""
|
||||
Dispatch to the appropriate engine runner ("vllm" or "vidur") based on run_config.engine,
|
||||
then return a BenchmarkResult.
|
||||
|
||||
The interface and logging semantics should be identical for both engines.
|
||||
"""
|
||||
```
|
||||
- Internally call run_vllm_benchmark for "vllm" and run_vidur_simulation for "vidur" (you should create run_vidur_simulation similar to run_vllm_benchmark).
|
||||
3. Update / implement scripts/run_workload_benchmarks.py:
|
||||
- Accept CLI options like --engine {vllm,vidur}.
|
||||
- Loop over a list of configs (seed configs from config_generator) and call run_benchmark for each.
|
||||
- Summarize results to stdout.
|
||||
# Part 4: AI-driven iterative search loop
|
||||
Implement an iterative search loop that calls an external LLM (e.g., OpenAI API) to:
|
||||
- Inspect previous configs + metrics.
|
||||
- Diagnose bottlenecks.
|
||||
- Propose new configs or modifications.
|
||||
- Iterate until convergence or a step budget is reached.
|
||||
1. In search/ai_loop.py implement:
|
||||
```python
|
||||
from harness.types import (
|
||||
HardwareProfile,
|
||||
ModelProfile,
|
||||
WorkloadProfile,
|
||||
VLLMConfig,
|
||||
BenchmarkRunConfig,
|
||||
BenchmarkResult,
|
||||
)
|
||||
|
||||
def ai_search_loop(
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
workload: WorkloadProfile,
|
||||
objective: str,
|
||||
engine: str = "vllm",
|
||||
max_steps: int = 20,
|
||||
) -> list[BenchmarkResult]:
|
||||
"""
|
||||
Core loop:
|
||||
|
||||
- Generate seed configs via config_generator.generate_seed_configs.
|
||||
- For each config:
|
||||
- Create a BenchmarkRunConfig and run a benchmark via harness.run_benchmark.
|
||||
- Store BenchmarkResult objects in a `history` list.
|
||||
|
||||
- For step in range(max_steps):
|
||||
- Summarize `history` into a textual prompt for an external LLM.
|
||||
- Call an LLM client (you may stub this out, or define a placeholder function) that returns:
|
||||
- A bottleneck analysis
|
||||
- A small set of new candidate configs (or edits to existing configs)
|
||||
- For each candidate:
|
||||
- Validate and repair via feasibility.validate_config / repair_config.
|
||||
- Run a benchmark and append to `history`.
|
||||
- Optionally, ask the LLM (or use simple heuristics) to check for convergence.
|
||||
- Return the full history for further analysis.
|
||||
"""
|
||||
```
|
||||
- You do NOT need to implement the actual OpenAI API call in detail; you can define a placeholder interface like:
|
||||
```python
|
||||
def ask_llm_for_new_configs(
|
||||
hw: HardwareProfile,
|
||||
model: ModelProfile,
|
||||
workload: WorkloadProfile,
|
||||
objective: str,
|
||||
history: list[BenchmarkResult],
|
||||
) -> tuple[str, list[VLLMConfig]]:
|
||||
"""
|
||||
Returns (bottleneck_analysis_text, candidate_configs).
|
||||
This can be implemented later with a real LLM backend.
|
||||
"""
|
||||
```
|
||||
2. Update scripts/run_search.py to:
|
||||
- Parse arguments for model, workload, objective, engine, max_steps.
|
||||
- Use hardware_inspector / model_meta / workload_profiler to build the profiles.
|
||||
- Instantiate JsonStore.
|
||||
- Call ai_search_loop.
|
||||
- Print the best configuration and its metrics at the end.
|
||||
# Coding style & quality expectations

- Use clear, descriptive function and variable names.
- Add type hints to all public functions and dataclasses.
- Include docstrings that explain the intent, not just restate the name.
- Use pathlib.Path for filesystem operations.
- Use logging.getLogger(__name__) in each module.
# Deliverables summary

Implement or update the following modules:

- harness/types.py (or equivalent)
- store/json_store.py (extend)
- harness/vllm_runner.py
- harness/runner.py
- adapter/vllm_adapter.py
- adapter/vidur_adapter.py
- feasibility/filter_configs.py (extend)
- config_generator.py (implement seed generation)
- search/ai_loop.py
- scripts/run_vllm_benchmark.py (update)
- scripts/run_workload_benchmarks.py (update)
- scripts/run_search.py (update)

Focus on clean architecture and clear boundaries between:

- Config generation & validation (config_generator, feasibility)
- Execution & profiling (harness, adapter, inspector, store)
- AI-driven reasoning (search/ai_loop.py)
129
projects/auto-tuner/Utils.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# For ali vLLM

- Use the launch scripts under `https://code.alibaba-inc.com/algo/llm_scripts/tree/main/vllm_v1/pd_ep_qwen_dlc_mpirun`.
- Modify the check in the vLLM source so that BLADNN_ATTN can be used.
- Turn off eager to avoid build timeouts.
- Install the latest GEMM package for FP8 support.

`['BLADNN_ATTN', 'FLASH_ATTN', 'TRITON_ATTN', 'XFORMERS', 'ROCM_ATTN', 'ROCM_AITER_MLA', 'ROCM_AITER_FA', 'TORCH_SDPA', 'FLASHINFER', 'FLASHINFER_MLA', 'TRITON_MLA', 'CUTLASS_MLA', 'FLASHMLA', 'FLASHMLA_SPARSE', 'FLASH_ATTN_MLA', 'PALLAS', 'IPEX', 'DUAL_CHUNK_FLASH_ATTN', 'SPARSE_FLASH_ATTN', 'NO_ATTENTION', 'FLEX_ATTENTION', 'TREE_ATTN', 'ROCM_AITER_UNIFIED_ATTN']`

dashllm:deepep_cp312_test_v1_deepep_274

# Nsys

```
nsys profile -o candle_trace --trace=cuda,nvtx \
  cargo run --features cuda --release --example qwen -- --prompt "Hello there " --tokenizer-file /mnt/debugger/wjh/models/Qwen2-7B/tokenizer.json --weight-files /mnt/debugger/wjh/models/Qwen2-7B/model-00001-of-00004.safetensors,/mnt/debugger/wjh/models/Qwen2-7B/model-00002-of-00004.safetensors,/mnt/debugger/wjh/models/Qwen2-7B/model-00003-of-00004.safetensors,/mnt/debugger/wjh/models/Qwen2-7B/model-00004-of-00004.safetensors --model 2-7b
```

embedding: satisfies `abs(candle[i] - torch[i]) < 1e-7`

```
|
||||
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
|
||||
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
|
||||
60.2 44635338548 371520 120142.5 101728.0 33984 688448 49045.5 fused_moe_kernel
|
||||
5.0 3690934670 182880 20182.3 20160.0 19392 27424 244.2 void cutlass::Kernel2<cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_128x2_tn_align8>(T1::Par…
|
||||
4.2 3094961330 182880 16923.5 16800.0 15936 22688 672.4 ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x6_tn
|
||||
3.8 2842715582 737280 3855.7 3904.0 3328 9696 190.2 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
|
||||
3.6 2664312838 184320 14454.8 14432.0 5632 23904 947.3 void cutlass::Kernel2<cutlass_80_tensorop_s16816gemm_bf16_64x64_64x6_tn_align8>(T1::Params)
|
||||
2.0 1468689277 185760 7906.4 7872.0 7775 11040 126.8 void vllm::moe::moe_align_block_size_kernel<int>(const T1 *, int *, int *, int *, int, int, int, in…
|
||||
1.9 1409326000 5310 265409.8 356352.0 16320 367328 151370.8 ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn
|
||||
1.9 1382453159 185760 7442.1 7392.0 6240 13472 369.3 void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<c10::BFloat16, at::native::fu…
|
||||
1.8 1362029871 371520 3666.1 3648.0 3392 7584 229.3 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
|
||||
1.8 1360460198 189630 7174.3 7008.0 5504 17856 718.6 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
|
||||
1.6 1194245862 136800 8729.9 8704.0 8192 10848 148.8 void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (…
|
||||
1.5 1140946679 375390 3039.4 3136.0 2592 12096 478.5 void vllm::rms_norm_kernel<c10::BFloat16>(T1 *, const T1 *, const T1 *, float, int, int)
|
||||
1.4 1018883192 185760 5484.9 5472.0 5024 8000 61.5 void vllm::moe::topkGatingSoftmax<(int)4, (int)128, (int)4, (int)16, int>(const float *, const bool…
|
||||
1.2 860705869 561117 1533.9 1376.0 1247 2688 261.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<int>, std::array<cha…
|
||||
1.1 826910189 185760 4451.5 4448.0 4256 9216 259.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
|
||||
1.0 756608821 185760 4073.0 4064.0 3840 6048 104.7 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
|
||||
0.8 566094488 185760 3047.5 2944.0 2592 15872 876.4 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
|
||||
0.7 537884109 185760 2895.6 2880.0 2752 3968 45.8 void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, float, __nv_bfloat16, float, (bool)0, __n…
|
||||
0.7 494377950 185760 2661.4 2656.0 2592 4672 104.9 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
|
||||
0.6 460223727 182880 2516.5 2592.0 2208 3264 131.7 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
|
||||
0.6 433242669 46080 9402.0 9408.0 9056 10176 108.0 void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (…
|
||||
0.6 414754860 189630 2187.2 2048.0 2015 9792 827.6 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
|
||||
0.5 337532512 185760 1817.0 1792.0 1759 3616 93.6 void at::native::vectorized_elementwise_kernel<(int)8, at::native::FillFunctor<c10::BFloat16>, std:…
|
||||
0.4 328896261 185760 1770.5 1760.0 1568 3584 131.0 void vllm::moe::count_and_sort_expert_tokens_kernel<int>(const T1 *, int *, int *, unsigned long)
|
||||
0.4 263914824 46080 5727.3 5696.0 5440 8320 81.1 void flash::flash_fwd_splitkv_combine_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (…
|
||||
0.3 247119823 3870 63855.3 63744.0 62272 103456 1384.2 void at::native::<unnamed>::cunn_SoftMaxForward<(int)4, float, float, float, at::native::<unnamed>:…
|
||||
0.1 79960104 2880 27763.9 30464.0 23712 38752 3833.5 ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_stages_64x3_tn
|
||||
0.1 51639494 3870 13343.5 13344.0 10240 16704 288.1 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::ArgMaxOps<…
|
||||
0.1 44284309 2880 15376.5 16224.0 10080 24864 4748.7 void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (…
|
||||
0.0 36073106 3870 9321.2 9344.0 5344 11744 346.2 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
|
||||
0.0 32250094 3810 8464.6 8448.0 6880 10496 264.8 void at::native::<unnamed>::indexSelectSmallIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2…
|
||||
0.0 25319479 3870 6542.5 6560.0 5632 8224 146.0 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
|
||||
0.0 22787737 3870 5888.3 5888.0 3008 7584 279.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
|
||||
0.0 22601173 7740 2920.0 3008.0 2688 3968 165.1 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
|
||||
0.0 20332087 1296 15688.3 15584.0 15008 20128 526.2 void cutlass::Kernel2<cutlass_80_tensorop_bf16_s16816gemm_relu_bf16_64x64_64x6_tn_align8>(T1::Param…
|
||||
0.0 12874940 2880 4470.5 4512.0 3199 7136 1228.3 void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, __nv_bfloat16, __nv_bfloat16, float, (boo…
|
||||
0.0 8327452 1440 5783.0 5760.0 5536 7200 218.4 void cutlass::Kernel2<cutlass_80_wmma_tensorop_s161616gemm_bf16_32x32_128x1_tn_align8>(T1::Params)
|
||||
0.0 8267196 3870 2136.2 2144.0 1920 2912 63.2 void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIterator…
|
||||
0.0 7524221 3816 1971.8 1984.0 1920 2752 27.5 void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<long>, std::array<ch…
|
||||
0.0 6278206 3870 1622.3 1632.0 1599 2016 23.2 void at::native::unrolled_elementwise_kernel<at::native::CUDAFunctorOnSelf_add<int>, std::array<cha…
|
||||
0.0 5873566 3870 1517.7 1504.0 1376 2048 25.7 void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<int>, std::array<char *, (unsi…
|
||||
0.0 2356448 144 16364.2 16320.0 15648 18720 413.9 ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_64x4_tn
|
||||
0.0 396544 60 6609.1 6608.0 4160 10464 2270.3 void at::native::<unnamed>::indexSelectLargeIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2…
|
||||
0.0 105408 54 1952.0 1952.0 1920 2400 64.3 void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<long>, std::array<char *, (uns…
|
||||
0.0 50944 33 1543.8 1536.0 1408 2080 101.5 void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<int>, std::array<cha…
|
||||
```
|
||||
|
||||
|
||||
kernels:
|
||||
|
||||
```
|
||||
1: ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_stages_64x3_tn
|
||||
2: ampere_bf16_s16816gemm_bf16_128x64_ldg8_f2f_stages_64x4_tn
|
||||
3: ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn
|
||||
4: ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x6_tn
|
||||
5: fused_moe_kernel
|
||||
6: std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kernel<c10::BFloat16, (int)8>(T1 *, T1 *, const T1 *, float, int, int)
|
||||
7: void at::native::<unnamed>::cunn_SoftMaxForward<(int)4, float, float, float, at::native::<unnamed>::SoftMaxForwardEpilogue>(T4 *, const T2 *, int)
|
||||
8: void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::native::templates::cuda::uniform_and_transform<float, float, at::CUDAGeneratorImpl *, void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T3, T4)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::<unnamed>::distribution_nullary_kernel<float, float, float4, at::CUDAGeneratorImpl *, void at::native::templates::cuda::uniform_and_transform<float, float, at::CUDAGeneratorImpl *, void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T3, T4)::[lambda(curandStatePhilox4_32_10 *) (instance 2)], void at::native::templates::cuda::exponential_kernel<at::CUDAGeneratorImpl *>(at::TensorIteratorBase &, double, T1)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 2)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, T4, const T5 &, T6)::[lambda(int, float) (instance 1)]>(long, at::PhiloxCudaState, T3, T4)
|
||||
9: void at::native::<unnamed>::indexSelectLargeIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2, (int)-2, (bool)1>(at::cuda::detail::TensorInfo<T1, T3>, at::cuda::detail::TensorInfo<const T1, T3>, at::cuda::detail::TensorInfo<const T2, T3>, int, int, T3, T3, long)
|
||||
10: void at::native::<unnamed>::indexSelectSmallIndex<c10::BFloat16, long, unsigned int, (int)2, (int)2, (int)-2>(at::cuda::detail::TensorInfo<T1, T3>, at::cuda::detail::TensorInfo<const T1, T3>, at::cuda::detail::TensorInfo<const T2, T3>, int, int, T3, long)
|
||||
11: void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::DivFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
12: void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
13: void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 12)]::operator ()() const::[lambda(c10::BFloat16) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
|
||||
14: void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void at::native::index_kernel_impl<at::native::OpaqueType<(int)2>>(at::TensorIteratorBase &, c10::ArrayRef<long>, c10::ArrayRef<long>)::[lambda(char *, const char *, long) (instance 1)]>(at::TensorIteratorBase &, c10::ArrayRef<long>, c10::ArrayRef<long>, const T1 &)::[lambda(int) (instance 1)]>(long, T3)
|
||||
15: void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<c10::BFloat16, at::native::func_wrapper_t<c10::BFloat16, at::native::sum_functor<c10::BFloat16, float, c10::BFloat16>::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, c10::BFloat16, (int)4, (int)4>>(T3)
|
||||
16: void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::ArgMaxOps<float>, unsigned int, long, (int)4, (int)4>>(T3)
|
||||
17: void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::sum_functor<float, float, float>::operator ()(at::TensorIterator &)::[lambda(float, float) (instance 1)]>, unsigned int, float, (int)4, (int)4>>(T3)
|
||||
18: void at::native::unrolled_elementwise_kernel<at::native::CUDAFunctorOnSelf_add<int>, std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
19: void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>, (int)8, TrivialOffsetCalculator<(int)0, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
20: void at::native::unrolled_elementwise_kernel<at::native::FillFunctor<long>, std::array<char *, (unsigned long)1>, (int)8, TrivialOffsetCalculator<(int)0, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithoutCast, at::native::memory::StoreWithoutCast>(int, T1, T2, T4, T5, T6, T7)
|
||||
21: void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 3)]::operator ()() const::[lambda(int) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
22: void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 4)]::operator ()() const::[lambda(long) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
23: void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 3)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)], std::array<char *, (unsigned long)2>, (int)8, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T4, T5, T6, T7)
|
||||
24: void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
25: void at::native::vectorized_elementwise_kernel<(int)2, at::native::FillFunctor<long>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
26: void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, float, at::native::binary_internal::DivFunctor<float>>, std::array<char *, (unsigned long)3>>(int, T2, T3)
|
||||
27: void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<int>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
28: void at::native::vectorized_elementwise_kernel<(int)8, at::native::FillFunctor<c10::BFloat16>, std::array<char *, (unsigned long)1>>(int, T2, T3)
|
||||
29: void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, __nv_bfloat16, __nv_bfloat16, float, (bool)0, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, (bool)1, (bool)0, (bool)0>(cublasLt::cublasSplitKParams<T6>, const T4 *, const T9 *, T8 *, T5 *, const T6 *, const T6 *, const T10 *, const T4 *, T10 *, void *, long, T6 *, int *, T6 *, const T6 *, const T6 *, const T6 *, const T6 *)
|
||||
30: void cublasLt::splitKreduce_kernel<(int)32, (int)16, int, float, __nv_bfloat16, float, (bool)0, __nv_bfloat16, __nv_bfloat16, __nv_bfloat16, (bool)1, (bool)0, (bool)0>(cublasLt::cublasSplitKParams<T6>, const T4 *, const T9 *, T8 *, T5 *, const T6 *, const T6 *, const T10 *, const T4 *, T10 *, void *, long, T6 *, int *, T6 *, const T6 *, const T6 *, const T6 *, const T6 *)
|
||||
31: void cutlass::Kernel2<cutlass_80_tensorop_bf16_s16816gemm_relu_bf16_64x64_64x6_tn_align8>(T1::Params)
|
||||
32: void cutlass::Kernel2<cutlass_80_tensorop_s16816gemm_bf16_64x64_64x6_tn_align8>(T1::Params)
|
||||
33: void cutlass::Kernel2<cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_128x2_tn_align8>(T1::Params)
|
||||
34: void cutlass::Kernel2<cutlass_80_wmma_tensorop_s161616gemm_bf16_32x32_128x1_tn_align8>(T1::Params)
|
||||
35: void flash::flash_fwd_splitkv_combine_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (int)4, (int)1, (bool)1>(flash::Flash_fwd_params)
|
||||
36: void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)0, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)0, (bool)0>(flash::Flash_fwd_params)
|
||||
37: void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)0, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)1, (bool)0>(flash::Flash_fwd_params)
|
||||
38: void flash::flash_fwd_splitkv_kernel<Flash_fwd_kernel_traits<(int)128, (int)64, (int)128, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, Flash_kernel_traits<(int)128, (int)64, (int)128, (int)4, cutlass::bfloat16_t>>, (bool)1, (bool)0, (bool)0, (bool)0, (bool)1, (bool)0, (bool)0, (bool)0>(flash::Flash_fwd_params)
|
||||
39: void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, const T1 *, int)
|
||||
40: void vllm::moe::count_and_sort_expert_tokens_kernel<int>(const T1 *, int *, int *, unsigned long)
|
||||
41: void vllm::moe::moe_align_block_size_kernel<int>(const T1 *, int *, int *, int *, int, int, int, int, unsigned long, int *)
|
||||
42: void vllm::moe::topkGatingSoftmax<(int)4, (int)128, (int)4, (int)16, int>(const float *, const bool *, float *, int, T5 *, int *, int, int, int)
|
||||
43: void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0>(const T1 *, const T1 *, T2 *, T2 *, const long *, long, long, long, long, long, int, int, int, const float *, const float *)
|
||||
44: void vllm::rms_norm_kernel<c10::BFloat16>(T1 *, const T1 *, const T1 *, float, int, int)
|
||||
45: void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, int, long, long, long, int, int, int)
|
||||
```
|
||||
414
projects/auto-tuner/ali trace.md
Normal file
@@ -0,0 +1,414 @@
|
||||
|
||||
## TODO
|
||||
|
||||
done:
|
||||
- codex-chat
|
||||
- codex-chat-5090: codex resume 019d4945-4991-7331-a848-1be6fd702e9f
|
||||
- codex-coder
|
||||
|
||||
- scoot-chat
|
||||
- scoot-thinking-prefill
|
||||
- scoot-thinking-decode
|
||||
|
||||
dash1: codex-thinking-decode
|
||||
dash2: codex-thinking-prefill
|
||||
dash3: scoot-coder
|
||||
dash5: scoot-chat-5090
|
||||
|
||||
|
||||
```bash
|
||||
# chat
|
||||
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl
|
||||
|
||||
# prefill-only
|
||||
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_thinking_day0_t0p04_fixedcount/sampled_traces/thinking_w20260323_peak_1000.jsonl
|
||||
|
||||
# coder
|
||||
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_coder_peak_7day_fixedcount/sampled_traces/coder_w20260311_peak_1000.jsonl
|
||||
|
||||
# decode-only
|
||||
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/plans/dash0123_8gpu__qwen235b__internal__decode_only__thinking__legal11_thinking_decode_only_weekly0321_0327_peak_local8/traces/thinking_w20260321_peak_1000.jsonl
|
||||
```
|
||||
|
||||
So I will hard-code the internal profile's "must use chunked prefill" rule as a hard constraint in the code, rather than keep relying on timeout failures to learn it.

Add to Fig 7/8 a semi-real variant that matches the real trace.

ongoing:
dash0: qwen235b decode-only test
dash1/2: qwen235b thinking 30-min test
dash3: qwen-coder-next coder 30-min test
dash5: 5090 qwen27b chat-0-32k test


4.1
Show that data from one workload cannot be used to tune a different cluster.

✅ 4.2
✅ Performance comparison data for synthetic / semi-real / real:
83.91, 98.19, 98.4
65.22, 86.03, 98.28

✅ Similarity comparison data for synthetic / semi-real / real

Comparison data of the tuned best under chat/thinking/coder prefixes
Similarity comparison data under chat/thinking/coder prefixes

4.3
agent harness summary

5
tuner vs baseline

✅ default config


```bash
|
||||
# Qwen3.5-27B
|
||||
# https://huggingface.co/Qwen/Qwen3.5-27B
|
||||
vllm serve Qwen/Qwen3.5-27B \
|
||||
--tensor-parallel-size 8 \
|
||||
--max-model-len 262144 \
|
||||
--reasoning-parser qwen3
|
||||
|
||||
# Qwen3-Coder-Next
|
||||
# https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html#basic-multi-gpu-setup
|
||||
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
|
||||
--tensor-parallel-size 4 \
|
||||
--enable-prefix-caching
|
||||
|
||||
# Qwen3-235B-A22B-FP8
|
||||
# https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-FP8
|
||||
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
|
||||
--tensor-parallel-size 4 \
|
||||
--max-model-len 262144
|
||||
|
||||
# https://github.com/aliez-ren/vllm-qwen3.5-nvfp4-sm120?utm_source=chatgpt.com
|
||||
vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 \
|
||||
--max-model-len 234567 \
|
||||
--gpu-memory-utilization 0.89 \
|
||||
--max-num-seqs 4 \
|
||||
--max-num-batched-tokens 4096
|
||||
|
||||
vllm serve Qwen/Qwen3.5-27B-FP8 \
|
||||
--quantization fp8 \
|
||||
--dtype auto \
|
||||
--gpu-memory-utilization 0.85 \
|
||||
--max-model-len 131072 \
|
||||
--max-num-seqs 2 \
|
||||
--max-num-batched-tokens 2048 \
|
||||
--tensor-parallel-size 1
|
||||
|
||||
# https://huggingface.co/Qwen/Qwen3.5-27B-FP8?utm_source=chatgpt.com
|
||||
vllm serve Qwen/Qwen3.5-27B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
```bash
|
||||
|
||||
# 跑 qwen27b batching
|
||||
# running on dash2
|
||||
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/launch_qwen35_27b_tp2dp1_epoff_batching_chat0_32k_weekly_peak.sh
|
||||
|
||||
|
||||
# 跑 evaluator 的对比
|
||||
# running on dash1
|
||||
CASE_KIND=chat TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
|
||||
|
||||
CASE_KIND=coder TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
|
||||
|
||||
CASE_KIND=thinking_prefill TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
|
||||
|
||||
CASE_KIND=thinking_decode TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
|
||||
|
||||
|
||||
|
||||
# [x] 对 qwen35_27b,对齐线上 trace,search 0~4k threshold 后再测试
|
||||
./workflow threshold-search \
|
||||
--hardware dash0123_8gpu \
|
||||
--model qwen35_27b \
|
||||
--engine internal \
|
||||
--workload chat \
|
||||
--phase prefill_decode \
|
||||
--trace-type chat-0-4k \
|
||||
--max-threshold 0.5
|
||||
|
||||
|
||||
# qwen3-coder-next 跑不了 EP?
|
||||
|
||||
|
||||
# [x] dash3 上的要重新放到 dash2 跑
|
||||
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
|
||||
bash workflow_output/plans/dash0123_8gpu__qwen35_27b__internal__prefill_decode__chat__legal10_chat0_4k/run_results_v2_trace_dash3.sh --machine-label dash2
|
||||
|
||||
|
||||
# [x] 跑 qwen35_27b 0~32k
|
||||
./launch_qwen35_27b_chat_0_32k_after_trace_prepare.sh
|
||||
|
||||
```
|
||||
|
||||
## Runnability

qwen-235b ✅
qwen27b ✅
qwen-coder: needs to switch to the matching vllm build
qwen-30b: needs to switch to the container version that supports flash-infer

```
|
||||
pip install -U flashinfer-python
|
||||
flashinfer >= 0.7
|
||||
|
||||
wjh@ds-f74814b6-1-65cd484875-256zt:~$ pip list | grep flashinfer
|
||||
flashinfer-cubin 0.6.4
|
||||
flashinfer-jit-cache 0.6.4
|
||||
flashinfer-python 0.6.4
|
||||
|
||||
```
|
||||
|
||||
## Production performance

qwen27b:
40 instances: Mean: 4.00 qps, Max: 5.67 qps

prefill: Mean: 193k tpm, Max: 339k tpm
decode: Mean: 72.4k tpm, Max: 119k tpm
first-token latency: Mean: 1.59 s, Max: 11.3 s
tail latency: Mean: 23.6 s, Max: 46.2 s


qwen30b-a3b:
Mean: 0.00267 qps, Max: 0.109 qps


## 模型
|
||||
|
||||
名称:qwen3-235b-a22b版本:256k-0717
|
||||
名称:qwen3-235b-a22b版本:0717-eagle-0820
|
||||
|
||||
qwen3-30b-a3b版本:1m-instruct-0726-fp4
|
||||
名称:qwen3-30b-a3b版本:1m-thinking-0728-fp4
|
||||
|
||||
名称:qwen3-coder-next版本:1m-20260129-re-mtp-fp8-torch-dtype
|
||||
名称:qwen3-coder-next版本:1m-20260129-xml-tool-parser-fix
|
||||
|
||||
名称:qwen3.5-27b版本:256k-0223-internal
|
||||
名称:qwen3.5-27b版本:256k-0223-internal-nvfp4-inputscale-fp8-attn
|
||||
|
||||
|
||||
```
|
||||
"cache_volume": {
|
||||
"enabled": true,
|
||||
"scope": "application"
|
||||
},
|
||||
"cpfs_file_system_id": "bmcpfs-290qtyip73f85z7zt9t"
|
||||
```
|
||||
|
||||
|
||||
## dashllm_cmd serving
|
||||
|
||||
```
|
||||
[INFO] 2026-03-27 18:42:51,933869: {"message":"vllm engine_args: {'model': '/dev/shm/dashllm_model_2', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 1, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swap_space': 1, 'max_num_seqs': 256, 'max_num_batched_tokens': 4096, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'quantization': None, 'max_model_len': 262144, 'compilation_config': {'use_inductor': False, 'custom_ops': ['all']}, 'enable_prefix_caching': True, 'distributed_executor_backend': 'mp', 'enable_chunked_prefill': True, 'max_seq_len_to_capture': 262144}","time":"2026-03-27 18:42:51.933"}
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
## Ali model env

qwen3.5-27b strictly requires BLADNN to support the VL attention kernel.
qwen3-30b/235b/coder can all start without BLADNN.
For the already FP8-quantized 235b/coder models, enabling BLADNN raises errors.


- qwen3-coder
|
||||
```
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
|
||||
```
|
||||
|
||||
- qwen3.5-27b
|
||||
```
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
|
||||
```
|
||||
|
||||
```bash
|
||||
####################################
|
||||
# Qwen3.5-27B
|
||||
####################################
|
||||
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 1000000 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0
|
||||
|
||||
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 40960 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0
|
||||
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --max-num-seqs 64 --max-num-batched-tokens 40960 #--long-prefill-token-threshold 30000 #--skip_mm_profiling --mm-processor-cache-gb 0
|
||||
|
||||
#--long_context_threshold 30000
|
||||
#Qwen3_5ForConditionalGeneration
|
||||
|
||||
|
||||
####################################
|
||||
# Qwen3-Coder
|
||||
####################################
|
||||
VLLM_MOE_EXPERTS_OVERLAP=1 TORCH_CUDA_ARCH_LIST="9.0+PTX" VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
|
||||
# ok
|
||||
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
|
||||
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
|
||||
|
||||
|
||||
####################################
|
||||
# Qwen3-30B-A3B
|
||||
####################################
|
||||
# ok
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B --tensor-parallel-size 2
|
||||
|
||||
|
||||
####################################
|
||||
# Qwen3-235B-A22B
|
||||
####################################
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
|
||||
# Ok
|
||||
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
|
||||
|
||||
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve resource/model/464482ce.qwen3-235b-a22b/128k-0426/ --tensor-parallel-size 4
|
||||
|
||||
|
||||
'{"gpu_memory_utilization": 0.9, "max_model_len": 262144, "enable_chunked_prefill": true, "enable_think": 1, "think_mode": "auto", "tensor_parallel_size": 1, "dtype": "bfloat16", "enforce_eager": false, "enable_prefix_caching": true, "mamba_cache_mode": "light", "distributed_executor_backend": "mp", "block_size": 64, "max_num_batched_tokens": 8192, "disable_cascade_attn": true, "speculative_config": {"method": "qwen3_next_vl_mtp", "num_speculative_tokens": 3}, "mm_processor_cache_gb": 0, "limit_mm_per_prompt": {"image": 256, "video": 64}, "compilation_config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": false, "pass_config": {"fuse_norm_quant": false, "fuse_act_quant": false, "fuse_attn_quant": false}}, "mamba_cache_dtype": "float32", "skip_mm_profiling": true, "quantization": "fp8"}'
|
||||
```
|
||||
|
||||
|
||||
1 GPU: TP1DP1
2 GPU: (TP2DP1, TP1DP2) x (EPON, EPOFF)
4 GPU: (TP4DP1, TP2DP2, TP1DP4) x (EPON, EPOFF)
8 GPU: (TP8DP1, TP4DP2, TP2DP4, TP1DP8) x (EPON, EPOFF)
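A small helper that enumerates these combinations programmatically (purely illustrative; EP is modeled here as an on/off flag attached to each TP/DP split, and the key names mirror the vLLM engine args used elsewhere in these notes):

```python
def parallel_combos(num_gpus: int) -> list[dict]:
    """Enumerate (TP, DP, EP) settings whose TP*DP product equals the GPU count."""
    combos = []
    tp = num_gpus
    while tp >= 1:
        dp = num_gpus // tp
        if tp * dp == num_gpus:
            ep_options = [False] if num_gpus == 1 else [True, False]
            for ep in ep_options:
                combos.append({
                    "tensor_parallel_size": tp,
                    "data_parallel_size": dp,
                    "enable_expert_parallel": ep,
                })
        tp //= 2
    return combos

# parallel_combos(8) -> TP8DP1, TP4DP2, TP2DP4, TP1DP8, each with EP on/off
```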
|
||||
## E2E 测试
|
||||
|
||||
【qwen3-coder】【0-30k】【kvs】【h20-96-d】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-nosparse-model/deployments/qwen3-coder-nosparse-model-ba4a
|
||||
|
||||
【qwen3-coder-flash】【0-30k】【kvs】【h20-96-d】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-flash-2025-07-28-nosparse-model/deployments/qwen3-coder-flash-2025-07-28-nosparse-model-1553
|
||||
|
||||
|
||||
【qwen3-30b-a3b-instruct】【H20-96G-4】
|
||||
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-instruct-2507-model/deployments/qwen3-30b-a3b-instruct-2507-model-a06c
|
||||
- 0.9.0
|
||||
|
||||
【qwen3-30b-a3b-thinking】【H20-96G-4】
|
||||
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-thinking-2507-model?spm=43a6e6f6.2e152c3f.0.0.6d4c103cudzmEy
|
||||
- 0.10.1rc2.dev397+g312aa870b
|
||||
|
||||
【qwen3-235b-a22b-thinking】【P】【H20-96G-8】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-4945
|
||||
- 0.11.2.dev1732+gd694e5c71.d20251208
|
||||
|
||||
【qwen3-235b-a22b-thinking】【D】【H20-96G-8】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode-21fd
|
||||
- 0.11.2.dev1732+gd694e5c71.d20251208
|
||||
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.622e103cCLyFsA
|
||||
|
||||
【qwen3.5-27b】【0-32k】【H20-96G-8】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-e277
|
||||
cuda128_cp312_test_vllm_87905ee0_20260222_202123
|
||||
0.13.0rc2.dev2067+g486e99474.d20260222.cu128
|
||||
|
||||
【qwen3.5-27b】【0-32k】【5090-8】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-f462
|
||||
cuda129_cp312_test_vllm_11606
|
||||
0.13.0rc2.dev2111+gb44b43f43.d20260309
|
||||
|
||||
【qwen3-coder-next】【0-32k】【H20-96G-8】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-e776?spm=43a6e6f6.5b0a3d6a.0.0.413d103cIwReWg
|
||||
- 0.10.2rc2.dev168+g8f0fc60c9.d20251204
|
||||
|
||||
|
||||
1. Hardware:5090, H20
|
||||
2. Model:Qwen3.5-27B, Qwen3-30B-A3B, Qwen3-235B-A22B-FP8, Qwen3-Coder-Next-FP8
|
||||
3. Trace: Chat, Thinking, Coder
|
||||
|
||||
测试组合:
|
||||
Hardware 实验
|
||||
- 【qwen3.5-27b + 5090】
|
||||
- 【qwen3.5-27b + H20】
|
||||
Model 实验
|
||||
- 【qwen3.5-27b + H20】
|
||||
- 【qwen3-30b-a3b + H20】
|
||||
- 【qwen3-235b-a22b + H20】
|
||||
Trace 实验
|
||||
- 【qwen3-30b-a3b + H20 + Chat】
|
||||
- 【qwen3-30b-a3b + H20 + Thinking】
|
||||
- 【qwen3-235b-a22b + H20 + Chat】
|
||||
- 【qwen3-235b-a22b + H20 + Thinking】
|
||||
- 【qwen3-coder-next + H20 + Coder】
|
||||
|
||||
|
||||
|
||||
---
|
||||
【qwen3-235b-a22b-instruct】【P】【8-32k】【H20-96G-8】
|
||||
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-5966?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
|
||||
2. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-9f59?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
|
||||
- 0.13.0rc2.dev1948+g613d885a1.d20260108.cu128
|
||||
|
||||
|
||||
## 部署
|
||||
|
||||
【qwen3-max-2026-01-23-chat-aa8c】qwen3-max nonthinking
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
|
||||
|
||||
|
||||
|
||||
【qwen3-max-2026-01-23-chat-9bf8】qwen3-max thinking
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ
|
||||
|
||||
首包耗时:Mean: 2.93 s; Max: 9.27 s
|
||||
尾包耗时:Mean: 1.60 min; Max: 2.57 min
|
||||
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 10.7 qps - Max: 20.0 qps
|
||||
|
||||
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.73 s - Max: 8.83 s
|
||||
|
||||
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.42 min - Max: 2.22 min
|
||||
|
||||
|
||||
【qwen3-max-qwenapp-crit-50e9】
|
||||
|
||||
|
||||
【qwen3-max-qwenapp-crit-decode】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-qwenapp-crit-decode?spm=43a6e6f6.29ced41b.0.0.4efb103cjsq29c
|
||||
|
||||
Input: Mean: 9131 itpr ; Max: 10817 itpr
|
||||
Output: Mean: 823 otpr ; Max: 987 otpr
|
||||
|
||||
weighted tps - Mean: 46.6 otpsr - Min: 44.5 otpsr - Max: 48.2 otpsr
|
||||
- Mean: 2.83 万tpm - Max: 3.48 万tpm
|
||||
|
||||
tail: [ 2c3bc7a4 | cn-beijing ] - Mean: 35.0 s - Max: 1.40 min
|
||||
|
||||
|
||||
【qwen3-max-2025-10-30-thinking-model】
|
||||
Input: Mean: 6074 itpr ; Max: 16790 itpr
|
||||
Output: Mean: 2062 otpr Max: 4153 otpr
|
||||
|
||||
|
||||
|
||||
1. Qwen3-Chat 【nonthinking】:
|
||||
qwen3-max-2026-01-23-chat-aa8c-info【v1: 包含 input/timestamp 等】
|
||||
qwen3-max-2026-01-23-chat-aa8c-info【v2: 可一次性采集一周,包含 output_length】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
|
||||
2. Qwen3-Coder:
|
||||
qwen3-coder-next-model【包含 input/timestamp 等】
|
||||
qwen3-coder-next-model-8130-info【包含 output_length】
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-8130
|
||||
3. Qwen3-Chat 【thinking】:
|
||||
qwen3-max-2026-01-23-chat-9bf8
|
||||
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ
|
||||
|
||||
|
||||
|
||||
0319~0324: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-694a?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
|
||||
0324~0326: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
|
||||
0326+: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-3201?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
|
||||
|
||||
95
projects/auto-tuner/ali-optimization.md
Normal file
@@ -0,0 +1,95 @@
|
||||
## TL;DR

From the Qwen-related optimization commits collected here:

Most optimization points are fairly engineering-driven (following the usual playbook: dataflow optimization, operator-level optimization, initialization, long context, etc.) — hammering whatever nail sticks out, fixing case by case the performance problems of one model under one configuration.

What is still missing is system-level "automatic optimization" and "dynamic tuning": every optimization is a static configuration plus manual tuning (e.g., hand-written fused_moe JSON configs, hard-coded warp/block sizes), aimed at static tuning for a known GPU topology, with no dynamic optimization layer based on runtime profiling.
> By comparison, DynamoLLM, NanoFlow, OrcaServe, AutoTP, and MorphServe have already explored automatic parallel-topology search, asynchronous scheduling redesign, and runtime-adaptive FP8 strategies.

The production workflow is closer to: testing reveals a performance problem -> find the bottleneck -> fix it.

Core challenge: the classic systems problem of generality of abstraction vs. specificity of optimization.
With an abstraction: the system is uniform and general optimizations come easily, but each individual model may not reach its best performance.
Without an abstraction: every model can be hand-tuned flexibly on any component, but the cost is high and nothing generalizes.

MoE-specific optimizations: overlapping shared-expert computation with communication, and kernel fusion.

## 模型优化点总结
|
||||
|
||||
### 1. 并行化与数据流优化
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| ------------------------------ | --------------------- | ---------------------------------------------------------------------- | -------------------------------------- | ------------------------- | -------------------------------------- |
|
||||
| Vision Data-Parallel 编码器路径 | Qwen2-VL, Qwen3-VL | `c98be0a23`, `70b808fe1`, `3127274d0` | 支持在视觉塔中关闭TP、改为DP运行 | 视觉编码器张量大、TP通信过重 | 是,本质为 parallelism config search |
|
||||
| Sequence-Parallel MoE dispatch | Qwen3-Next, Qwen3-MoE | `vllm/model_executor/models/qwen3_next.py:183`, `3127274d0` | 令 tokens 在 TP rank 之间切分后再送 EP,防止重复专家调度 | DeepEP / TP×EP 并行导致重复计算 | 是,本质需要的是类似 DynamoLLM,根据 token 负载调整通信策略 |
|
||||
| Shared Fused MoE 重叠优化 | Qwen3-Next | `shared_fused_moe.py`, `vllm/model_executor/models/qwen3_next.py:161` | 避免重复计算共享专家,节省计算 | Shared expert 与 EP 重叠浪费算力 | 是,本质属于 DBO 搜索的一环 |
|
||||
| Fused MoE 内部 all-reduce | Qwen3-MoE | `4f510bc2` | 将 all-reduce 内嵌进专家执行阶段 | TP>1 时额外一次 all-reduce 过慢 | 是,本质属于 DBO 搜索的一环<br> |
|
||||
| 非阻塞数据流 + pinned buffer | Qwen3-VL, Qwen2.5-VL | `b2155ed31`, `2c1c7dfb3`, `0426e3c5e`, `67da5720d4`, `e283976f3` | 主机异步构建 seqlens 并异步拷贝到 GPU | 避免 cudaSync 阻塞,多帧视频管线更流畅 | 否,取决于 runtime benchmark 观测 H2D/D2H 延迟 |
|
||||
| DeepEP 通信修正 (TP×EP) | Qwen3-Next, Qwen3-MoE | `vllm/model_executor/models/qwen3_next.py:183`, `qwen3_moe.py:139,192` | 消除 EP 重复调度,避免多余 all-to-all | 多维并行模式中重复专家调用 | 是,本质属于 DBO 搜索的一环 |
|
||||
|
||||
|
||||
### 2. 内核与算子级优化
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| ------------------------------ | ------------------------ | ------------------------------------- | ------------------------------- | --------------------------- | --------------------------- |
|
||||
| fast_pos_embed_interpolate 向量化 | Qwen3-VL | `30d08911f`, `af7dfb0d1`, `a6049be7` | 将 Python 循环替换为 meshgrid 张量操作 | 大图像/视频分辨率下插值耗时过高 | 否,过于 specific |
|
||||
| Triton Interleaved MRoPE 核 | Qwen3-VL | `cea91a32f`, `3127274d0`, `c242c9803` | 用 Triton kernel 实现交织 3D RoPE | 视觉-时序交错嵌入需 GPU 融合旋转 | 否 |
|
||||
| Fused RMSNorm 替代多次 norm | Qwen3 dense / MoE / Next | `f80ae5bd`, `82e64c7` | RMSNorm 融合为单 kernel 以减少 launch | 长上下文下 norm 成为热点 | 是,类似 NanoFlow 等可以自动搜索进行算子融合 |
|
||||
| O(n) inverse permutation | Qwen2.5-VL | `67da5720d4`, `e283976f3` | 取代 argsort 排序以降低 O(n log n) 复杂度 | 视觉窗口注意力频繁重排 | 否 |
|
||||
| Bool-mask → index_select | Qwen3-Next | `785d8b6` | 改为纯 GPU 索引避免 host copy | MTP 多 token 预测频繁索引 | 否 |
|
||||
| FP8 batched expert kernels | Qwen3-MoE | `compressed_tensors_moe.py:937,991` | 自动选择 FP8 Cutlass / Triton 专家核 | MoE 中 expert 众多需 batched 执行 | 通用 |
|
||||
| LayerNorm tile 化 与 SM cache | Qwen3-Next | `82e64c7` | Triton LN 按行块 tile 计算 | 减少 kernel launch + 提升 占用率 | 通用 |
|
||||
|
||||
### 3. 精度与存储路径优化(FP8 / 量化 / KV Cache)
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| -------------------- | --------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------- | ------------------------------- | --------------- |
|
||||
| FP8 KV-Cache 存储 | Qwen2 | `da971ec7` | 允许 FP8 缓存 KV 对 | 长上下文 KV 占显存大 | 是 |
|
||||
| FP8 KV-Scale 重映射 | Qwen2 MoE | `bd4397352` | 修正 FP8 缓存比例 加载 | 防止量化漂移 | 是 |
|
||||
| 分离 QKVZ / BA 投影 | Qwen3-Next | `ef7eefe1` (`2025-09-18`) | 拆分 in-proj 以支持 FP8 checkpoint | FP8 blockwise 加载需结构匹配 | 否 |
|
||||
| FP8 精度 guard 修正 | Qwen3-MoE | `a258ad8b` | 调整量化 scale 计算 | FP8 精度漂移 | 工程实践 |
|
||||
| 4-bit bnb 预量化加载 | Qwen3-MoE | `bitsandbytes_loader.py:467` | 支持 4bit BNB 权重 | 降低权重存储带宽 | trivial |
|
||||
| FP8 / Fused MoE 配置矩阵 | Qwen3-Next, Qwen3-MoE | `238c4c17`, `482e52f56`, `75334956c`, `9f04d9d55`, `12a8414d8`, `f82f7a899`, `7a70a7189`, `569bf1c9c`, `c733bd5e8` | 针对 GB200 / H200 / H100 等 GPU 提供 FP8 调参 json | 不同 GPU SM 结构 差异大需 warp/block 适配 | 离线 profile 进行调优 |
|
||||
| ROCm FP8 配置 (MI300X) | Qwen3 / MoE | `2007d4d5`, `f5a3c655` | ROCm 专用 Triton 块 配置 | 兼容 AMD 栈 | 工程实践 |
|
||||
|
||||
### 4. 初始化与加载
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| ------------------------------------------- | ------------ | ---------------------- | ----------------------- | ----------------- | -------- |
|
||||
| Max-token heuristics | Qwen2/2.5-VL | `2c5302fad` | 通过启发式计算最大 token 代替伪输入 | 启动时避免生成假图像 | 否 |
|
||||
| Cached profiling inputs + fast HF processor | Qwen2-VL | `1298c677`, `d49adea1` | 缓存启动探测 数据 以减少 初始化 | 模型启动耗时高 | trivial |
|
||||
| Rotary dispatch abstraction (CUDA/ROCm) | Qwen series | `5e4a8223c` | 动态选择后端 FlashAttn kernel | 兼容 ROCm 与 CUDA 堆栈 | 工程实践 |
|
||||
|
||||
### 5. 推理路径与长上下文优化
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| ------------------------------- | ----------- | -------------------------------------------------------------- | ----------------------------------- | ---------------------- | ------------------------ |
|
||||
| Dual-chunk attention | Qwen3 dense | `qwen3.py:118,199` | 支持 >128K 上下文 分块 KV | 长上下文 KV 膨胀 | 是,根据负载在线自动决定是否切分 |
|
||||
| Gated DeltaNet linear attention | Qwen3-Next | `vllm/model_executor/models/qwen3_next.py:206`, `1266`, `1292` | 融合 conv + recurrent 层 线性化 attention | Prefill 阶段 计算 O(n²) 太高 | 否,需要结合模型调优选择合适的算法 |
|
||||
| Mamba-style state cache | Qwen3-Next | `1218`, `vllm/model_executor/layers/mamba/abstract.py:50` | 状态缓存 高效 布局 + 允许 speculative decode | GDN/Mamba 混合 需要 状态 重用 | 否 |
|
||||
| Multi-Token Prediction (MTP) | Qwen3-Next | `785d8b6` (相关 MTP 路径) | 重用 decoder 层 用于草稿 token 预测 | 提升 spec decode 吞吐 | 是,根据 metrics 自动决定 MTP 深度 |
|
||||
| Speculative metadata 构建 | Qwen3-Next | `gdn_attn.py:22,61` `gpu_model_runner.py:1374` | 预建 元数据 避免 draft 接受 重算 | 减少 prefill 延迟 | 工程实践 |
|
||||
|
||||
### 6. 多模态视觉流水线优化
|
||||
|
||||
| 优化点 | 适用模型 | Commit(s) | 优化说明 | 模型特性出发点 | 是否可以自动调优 |
|
||||
| ----------------------------- | --------------------- | -------------------------------------------------- | ------------------------- | ----------------- | ---------------------- |
|
||||
| reshape 替代 concat 拼接 | Qwen3-VL, Qwen2.5-VL | `0426e3c5`, `2c1c7dfb` | 减少内存重新分配 | 图像批次拼接昂贵 | 是,图优化 |
|
||||
| 缓存 vision dims 与 deepstack 拆分 | Qwen3-VL | `1dfea5f4` | 避免重复 .contiguous() 与 维度计算 | 多尺度视觉特征 频繁分块 | 是,runtime shape memory |
|
||||
| Flash / xFormers / SDPA 适配 | Qwen2.5-VL / Qwen3-VL | `02ed8a1fb`, `70b808fe1`, `47c712621`, `c242c9803` | 统一不同 attention 后端 | 不同 GPU 和 视频 长度 需求 | 工程实践 |
|
||||
| Rotary window pipeline GPU 重写 | Qwen2.5-VL | `67da5720d4`, `e283976f3` | 预建 窗口 索引,减少 cudaMemcpy | 重复 CPU→GPU 拷贝 | 否 |
|
||||
| Memoized seqlens 缓存 | Qwen2-VL / Qwen3-VL | `70b808fe1`, `3127274d0` | 重用序列长度 元数据 | 视频帧结构 重复 计算多 | 工程实践 |
|
||||
|
||||
---
|
||||
|
||||
### 总结视图(跨类对照)
|
||||
|
||||
| 优化类别 | 代表模型 | 核心收益 | 代表 Commits |
|
||||
| ------- | ---------------------------------- | --------------------- | ----------------------------------------------- |
|
||||
| 并行化与数据流 | Qwen3-VL / Qwen3-Next / Qwen3-MoE | 异步、少通信、高并发 | `b2155ed31`, `0426e3c5e`, `3127274d0` |
|
||||
| 内核与算子 | Qwen3-VL / Qwen3-Next / Qwen2.5-VL | GPU 融合计算 | `30d08911f`, `cea91a32f`, `82e64c7` |
|
||||
| 精度与存储 | Qwen2 / Qwen3-MoE / Next | FP8 高效推理 | `da971ec7`, `bd4397352`, `ef7eefe1`, `238c4c17` |
|
||||
| 初始化加载 | Qwen2-VL / 全系 | 快速启动 / 多后端 | `2c5302fad`, `5e4a8223c` |
|
||||
| 推理优化 | Qwen3-Next / dense | 线性化注意力、Spec Decode 加速 | `785d8b6`, `1266`, `1374` |
|
||||
| 视觉流水线 | Qwen2.5-VL / 3-VL | GPU 端视频处理 吞吐 | `0426e3c5`, `67da5720d4`, `1dfea5f4` |
|
||||
| 跨平台 | Qwen3-Next / 全系 | ROCm / Blackwell 兼容 | `qwen3_next.py:306` |
|
||||
24
projects/auto-tuner/codex-problems.md
Normal file
@@ -0,0 +1,24 @@
|
||||
The configs it tests are extremely limited: it gives up on a knob entirely after a single failure.

For example, once TP8 fails, it is never tried again, even though TP8 + EP does run, and runs well. This shows codex does not yet fully understand the engine configuration space.

```
- The real reason "why TP=8 cannot be used" is only stated explicitly later: [codex_tuning_v2.jsonl (line 270)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) and [codex_tuning_v2.jsonl (line 272)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#) explain that on this FP8 Qwen3-MoE checkpoint, TP=8 makes the TP-sharded MoE gate/up output size 192, while FP8 weight quantization requires a block size of 128; since 192 is not divisible by 128, model initialization fails.
- The same conclusion is repeated in the final summary: [codex_tuning_v2.jsonl (line 1154)](https://file+.vscode-resource.vscode-cdn.net/Users/gahow/.vscode/extensions/openai.chatgpt-26.304.20706-darwin-arm64/webview/#).
```

```
|
||||
| Rank | Exp | Config | Tput/GPU | TTFT p95 | SLO pass |
|
||||
|---|---:|---|---:|---:|---:|
|
||||
| 1 | 8 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724839 | 1.337592 | 97.58% |
|
||||
| 2 | 13 | tp4 b16 bt16384 s64 gm0.92 pc on ep off | 1448.724767 | 1.328592 | 97.58% |
|
||||
| 3 | 16 | tp4 b16 bt12288 s48 gm0.92 pc on ep off | 1448.722459 | 1.346786 | 97.75% |
|
||||
| 4 | 14 | tp4 b32 bt16384 s64 gm0.92 pc on ep off | 1448.722451 | 1.382931 | 97.58% |
|
||||
| 5 | 12 | tp4 b16 bt32768 s128 gm0.92 pc on ep off | 1448.719743 | 1.324819 | 97.58% |
|
||||
| 6 | 11 | tp4 b16 bt8192 s32 gm0.92 pc on ep off | 1448.718885 | 1.314879 | 97.83% |
|
||||
| 7 | 15 | tp4 b16 bt16384 s64 gm0.95 pc on ep off | 1448.715778 | 1.368400 | 97.50% |
|
||||
| 8 | 17 | tp4 b16 bt16384 s64 gm0.92 pc on ep on | 1448.714795 | 1.864526 | 95.58% |
|
||||
| 9 | 10 | tp4 b16 bt16384 s64 gm0.92 pc off ep off | 1448.437961 | 1.764754 | 95.50% |
|
||||
| 10 | 9 | tp2 b16 bt16384 s64 gm0.92 pc on ep off | startup failed | - | - |
|
||||
```
|
||||
|
||||
210
projects/auto-tuner/draft.md
Normal file
@@ -0,0 +1,210 @@
|
||||
### 4.X Load-Dependent Optimal Tensor Parallelism under a Fixed GPU Budget

A central question in our setting is why the preferred tensor parallelism (TP) changes with workload intensity, even when the model and hardware remain unchanged. Our key observation is that TP controls two fundamentally different quantities at the same time: the runtime of a single prefill batch, and the number of replicas that can be deployed under a fixed GPU budget. This creates a regime-dependent tradeoff between per-request latency and cluster-level service capacity.

#### Setup

We consider a cluster with a fixed total GPU budget $G$. Choosing tensor parallelism degree $t$ implies that each model replica consumes $t$ GPUs, so the number of deployable replicas is

$$
m_t = \frac{G}{t}.
$$

We focus on the prefill stage, since it dominates TTFT in our traces and is the primary source of the performance shift observed in Figure~X. For a prefill batch $b$ executed on one replica, let $\mathcal{B}_b$ denote the set of requests in the batch, and let $x_i$ be the prompt length of request $i \in \mathcal{B}_b$. The total number of prefill tokens in the batch is

$$
Z_b = \sum_{i \in \mathcal{B}_b} x_i.
$$

The TTFT of a request is determined not by an isolated request-level service time, but by the runtime of the prefill batch it belongs to and the waiting time before that batch starts. Accordingly, we model the system at the batch level.

#### Batch Runtime Model

Let $T_b(t)$ denote the runtime of prefill batch $b$ under TP degree $t$. We decompose it as

$$
T_b(t) = T_b^{\mathrm{comp}}(t) + T_b^{\mathrm{comm}}(t) + T_b^{\mathrm{rt}}(t),
$$

where $T_b^{\mathrm{comp}}(t)$ is the operator compute time, $T_b^{\mathrm{comm}}(t)$ is the TP communication cost, and $T_b^{\mathrm{rt}}(t)$ captures remaining runtime overheads such as launch gaps and executor overhead.

For decoder-only Transformers, the prefill FLOPs of a request with prompt length $x$ can be approximated as

$$
F(x) = a x d^2 + b x^2 d,
$$

where $d$ is the hidden dimension, the $a x d^2$ term captures dense projections and MLPs, and the $b x^2 d$ term captures self-attention. For a batch $b$, the total FLOPs are therefore

$$
F_b = \sum_{i \in \mathcal{B}_b} F(x_i)
    = a d^2 \sum_{i \in \mathcal{B}_b} x_i + b d \sum_{i \in \mathcal{B}_b} x_i^2.
$$

This form is important because the batch cost depends not only on the total token count $Z_b$, but also on the second moment of request lengths. Using

$$
\sum_{i \in \mathcal{B}_b} x_i^2 = n_b \left( \bar{x}_b^2 + \mathrm{Var}_b(x) \right),
$$

where $n_b = |\mathcal{B}_b|$ and $\bar{x}_b$ is the batch mean prompt length, we see that higher within-batch length variance directly increases the physical prefill cost, even when the mean length is fixed.

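As a quick numerical illustration of the second-moment effect (the coefficients $a$, $b$, $d$ below are placeholders, not fitted values):

```python
def batch_prefill_flops(prompt_lens: list[int], a: float, b: float, d: int) -> float:
    """F_b = a*d^2*sum(x_i) + b*d*sum(x_i^2): variance in x_i inflates the attention term."""
    s1 = sum(prompt_lens)
    s2 = sum(x * x for x in prompt_lens)
    return a * d * d * s1 + b * d * s2

# Two batches with the same total tokens (8192) but different length variance:
# batch_prefill_flops([1024] * 8, a, b, d) < batch_prefill_flops([64] * 4 + [7936], a, b, d)
```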
We then express compute time using a roofline-style model:

$$
T_b^{\mathrm{comp}}(t)
=
\max
\left\{
\frac{F_b}{t \Pi_1 \eta_t^{\mathrm{comp}}},
\frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}}
\right\},
$$

where $\Pi_1$ and $B_1$ are the effective single-GPU compute and memory bandwidth, $Q_b$ is the batch memory traffic, and $\eta_t^{\mathrm{comp}}, \eta_t^{\mathrm{mem}} \le 1$ are TP efficiency terms. In the ideal case, compute time would decrease as $1/t$; in practice, the decrease is only sublinear due to degraded kernel efficiency and shape effects.

TP also introduces collective communication overhead. Approximating each collective with a ring all-reduce over payload size $n$, its cost is

$$
T_{\mathrm{AR}}(n,t)
=
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{n}{\beta},
$$

where $\alpha$ is the per-hop latency and $\beta$ is the effective link bandwidth. If each layer performs, on average, $k$ such collectives and the activation payload is proportional to batch token volume, $n_b \approx q d Z_b$, then the communication term becomes

$$
T_b^{\mathrm{comm}}(t)
\approx
L k
\left(
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
\right),
$$

where $L$ is the number of layers and $q$ is the bytes per element. Unlike compute, this term does not scale down with $1/t$; instead, it grows with the TP degree and the batch size.

Combining the two, the prefill batch runtime is

$$
T_b(t)
=
\max
\left\{
\frac{a d^2 \sum_i x_i + b d \sum_i x_i^2}{t \Pi_1 \eta_t^{\mathrm{comp}}},
\frac{Q_b}{t B_1 \eta_t^{\mathrm{mem}}}
\right\}
+
L k
\left(
2 (t-1)\alpha
+
2 \frac{t-1}{t} \frac{q d Z_b}{\beta}
\right)
+
T_b^{\mathrm{rt}}(t).
$$

This equation makes the TP tradeoff explicit: larger TP reduces the compute-dominated component of a single batch, but it also introduces collective cost and provides only sublinear speedup.

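A direct transcription of this batch-runtime model into code can be handy for sanity-checking the regimes; all hardware constants below are stand-in values rather than measurements, the efficiency terms are held constant for simplicity, and Q_b is approximated crudely:

```python
def batch_runtime(prompt_lens, t, *,
                  a=2.0, b=1.0, d=4096, L=64, k=2, q=2,
                  pi1=1e15, b1=2e12, alpha=5e-6, beta=4e11,
                  eta_comp=0.9, eta_mem=0.9, t_rt=2e-3):
    """T_b(t) = max(compute, memory) + communication + runtime overhead (illustrative)."""
    s1 = sum(prompt_lens)                       # sum of x_i  (= Z_b)
    s2 = sum(x * x for x in prompt_lens)        # sum of x_i^2
    flops = a * d * d * s1 + b * d * s2         # F_b
    mem_traffic = q * d * s1                    # crude stand-in for Q_b
    t_comp = max(flops / (t * pi1 * eta_comp), mem_traffic / (t * b1 * eta_mem))
    payload = q * d * s1                        # n_b ~= q * d * Z_b
    t_comm = L * k * (2 * (t - 1) * alpha + 2 * (t - 1) / t * payload / beta)
    return t_comp + t_comm + t_rt
```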
#### From Batch Runtime to TTFT
|
||||
|
||||
For a request $i$ assigned to replica $r$, let $b(i)$ denote the prefill batch that eventually serves it. Its TTFT can be written as
|
||||
|
||||
$$
|
||||
\mathrm{TTFT}_i(t) = W_{q,i}(t) + T_{b(i)}(t),
|
||||
$$
|
||||
|
||||
where $W_{q,i}(t)$ is the waiting time before the batch starts. More precisely,
|
||||
|
||||
$$
|
||||
W_{q,i}(t) = R_i(t) + \sum_{u \in \mathcal{H}_i} T_u(t),
|
||||
$$
|
||||
|
||||
where $R_i(t)$ is the residual runtime of the currently executing batch at arrival, and $\mathcal{H}_i$ is the set of batches queued ahead of request $i$ on the same replica.
|
||||
|
||||
Under a renewal approximation, the expected residual time is
|
||||
|
||||
$$
|
||||
\mathbb{E}[R_t] = \frac{\mathbb{E}[T_b(t)^2]}{2 \mathbb{E}[T_b(t)]}.
|
||||
$$
|
||||
|
||||
Thus, tail TTFT is affected not only by the mean batch runtime, but also by its second moment. This is crucial for heterogeneous workloads: higher length variance increases both $T_b(t)$ and $\mathbb{E}[T_b(t)^2]$, which amplifies queueing tails.
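
A small sketch of how this residual term can be estimated from observed batch runtimes (the runtimes below are hypothetical) shows that a heavier tail raises it even at a fixed mean:

```python
# Estimate E[R] = E[T^2] / (2 E[T]) from a sample of batch runtimes.
def expected_residual(batch_times):
    n = len(batch_times)
    m1 = sum(batch_times) / n
    m2 = sum(t * t for t in batch_times) / n
    return m2 / (2 * m1)

print(expected_residual([0.5, 0.5, 0.5, 0.5]))  # 0.25: half of the (constant) batch time
print(expected_residual([0.1, 0.1, 0.1, 1.7]))  # ~0.73: same mean, heavier tail
```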
|
||||
|
||||
#### Cluster Capacity under a Fixed GPU Budget
|
||||
|
||||
The key systems-level quantity is not the latency of one replica, but the total service capacity of the cluster under fixed $G$. Let
|
||||
|
||||
$$
|
||||
\mu_{t,w}
|
||||
=
|
||||
\mathbb{E}
|
||||
\left[
|
||||
\frac{Z_b}{T_b(t)}
|
||||
\mid w
|
||||
\right]
|
||||
$$
|
||||
|
||||
denote the average prefill token throughput of a single replica under workload window $w$. Since the cluster can host only $m_t = G/t$ replicas, the aggregate prefill capacity is
|
||||
|
||||
$$
|
||||
\Lambda_{t,w} = \frac{G}{t} \mu_{t,w}.
|
||||
$$
|
||||
|
||||
This expression exposes the central tradeoff. Increasing TP may improve single-replica throughput $\mu_{t,w}$ by reducing batch runtime, but it simultaneously reduces the replica count by a factor of $t$. Therefore, larger TP improves cluster-wide capacity only if
|
||||
|
||||
$$
|
||||
\mu_{t,w} > t \mu_{1,w},
|
||||
$$
|
||||
|
||||
i.e., if the per-replica throughput gain is superlinear in $t$. In practice, this is rarely achievable: compute speedup is at best linear, while communication and runtime overheads make it strictly sublinear. As a result, under saturation, larger TP usually cannot improve total cluster capacity and often reduces it.
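
A toy sketch of this capacity comparison, using hypothetical per-replica throughputs, makes the point concrete: the sublinear per-replica gain does not compensate for the $1/t$ loss in replica count.

```python
# Check the cluster-capacity condition: larger TP pays off only if
# mu_{t,w} > t * mu_{1,w}. The per-replica throughputs below are placeholders;
# in practice they would come from the model above or from measurement.
def best_tp(per_replica_throughput, gpu_budget):
    """Pick the TP degree maximizing aggregate capacity Lambda = (G / t) * mu_t."""
    return max(per_replica_throughput,
               key=lambda t: (gpu_budget / t) * per_replica_throughput[t])

mu = {1: 10_000.0, 2: 17_000.0, 4: 26_000.0}   # tokens/s per replica (hypothetical)
print(best_tp(mu, gpu_budget=8))                # TP=1 wins: the 2x/4x speedup is sublinear
```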
|
||||
|
||||
#### Regime Shift: Why Large TP Helps at Low Load but Hurts at High Load
|
||||
|
||||
This model immediately explains the phase transition in our measurements. Let $\lambda_w$ denote the offered prefill-token arrival rate of workload window $w$. The observed goodput per GPU is approximately
|
||||
|
||||
$$
|
||||
g_{t,w}^{\mathrm{obs}}
|
||||
\approx
|
||||
\frac{1}{G}
|
||||
\min \{ \lambda_w, \Lambda_{t,w} \}.
|
||||
$$
|
||||
|
||||
In the light-load regime, where
|
||||
|
||||
$$
|
||||
\lambda_w \ll \Lambda_{t,w},
|
||||
$$
|
||||
|
||||
all TP choices have sufficient cluster capacity, so the observed goodput per GPU is nearly identical across TP settings. In this regime, TTFT is dominated by the runtime of a single prefill batch, and larger TP is beneficial because it reduces $T_b(t)$.
|
||||
|
||||
In contrast, in the high-load regime, where
|
||||
|
||||
$$
|
||||
\lambda_w \approx \Lambda_{t,w},
|
||||
$$
|
||||
|
||||
the system becomes capacity- and queueing-limited. Since larger TP typically lowers $\Lambda_{t,w}$, it pushes the system closer to saturation, increasing queue depth and amplifying tail TTFT. Consequently, smaller or intermediate TP becomes preferable because it provides more replicas, higher aggregate concurrency, and lower queueing pressure.
|
||||
|
||||
The model also explains why the effect is stronger for heterogeneous workloads such as coder traces. Because batch cost depends on $\sum_i x_i^2$ rather than only $\sum_i x_i$, workloads with larger prompt-length variance induce larger batch runtime variance, which increases both execution time and residual waiting time. Such workloads therefore enter the queueing-dominated regime earlier, causing the optimal TP to shift toward smaller values at lower offered load.
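
Sketching this with hypothetical numbers (reusing the toy capacities from the previous sketch): at light load all TP settings deliver the same per-GPU goodput, while near saturation the lower-capacity large-TP settings lose first.

```python
# g_obs ~= min(offered load, capacity) / G, with illustrative numbers and G = 8 GPUs.
def goodput_per_gpu(offered, capacity, gpus):
    return min(offered, capacity) / gpus

capacity = {1: 80_000, 2: 68_000, 4: 52_000}       # Lambda_{t,w}, tokens/s
for offered in (20_000, 60_000):                   # light vs near-saturation load
    print({f"TP{t}": goodput_per_gpu(offered, c, 8) for t, c in capacity.items()})
# light load: all TP tie at 2500 tok/s/GPU; heavy load: TP4 drops to 6500 while TP1/TP2 hold 7500
```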
|
||||
|
||||
#### Implication
|
||||
|
||||
The above analysis suggests that TP should not be tuned as a static model-specific constant. Instead, the preferred TP is workload-regime dependent: larger TP is favored in service-time-dominated regimes, while smaller or intermediate TP is favored in queueing-dominated regimes. This observation is the basis for our tuner design: rather than searching TP blindly, we first infer which regime the workload belongs to, and then restrict the TP search to the region consistent with the predicted latency-capacity tradeoff.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
308
projects/auto-tuner/eval setup.md
Normal file
@@ -0,0 +1,308 @@
|
||||
## TODO
|
||||
|
||||
- [x] Define what peak/valley means; add a plot of the workload fluctuation over one day
- [ ] Need cross-hour (+1h) trace similarity

- [ ] "How long do we need to tune? Effectiveness of the prefix trace window": this part needs a 30-minute experiment

- [ ] Add a valley-setup comparison to evaluator-reliability-compare [5/10]
|
||||
|
||||
|
||||
```bash
|
||||
# probe the trace sample threshold for qwen235b
|
||||
# [x] cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle && ./start_qwen235b_tp4dp1_threshold_dash0123_tmux.sh
|
||||
|
||||
|
||||
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_refresh_step1_dash0123_tmux.sh
|
||||
# after all of step1 has finished
|
||||
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_threshold_experiment_step2_dash0123_tmux.sh
|
||||
# the qwen235b 10-parallel-configs experiments above have finished; observations are summarized in paper/workload_pattern_to_config_principles.md
|
||||
|
||||
|
||||
|
||||
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_tmux.sh
|
||||
# merge once it finishes
|
||||
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/tmp/qwen30b_0311_peak_valley_19parallel_threshold_chat_dash0123_ts100/merge_results_v2_trace_tables.sh
|
||||
|
||||
# TODO: have codex run the threshold version of synthetic/semi-real
|
||||
# done
|
||||
|
||||
# Ongoing
|
||||
# [x] bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/start_qwen235b_0311_peak_threshold_batching_tp4dp1_epoff19_dash0123_tmux.sh
|
||||
|
||||
|
||||
# decode-only
|
||||
# TBD
|
||||
MODEL=qwen30b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
|
||||
|
||||
# going
|
||||
MODEL=qwen235b bash ./start_decode_peak_parallel_sweep_dash0123_tmux.sh
|
||||
|
||||
|
||||
# Ongoing
|
||||
# [x] bash ./start_decode_peak_thresholds_dash0123_tmux.sh
|
||||
# MODEL=qwen30b bash ./search_decode_peak_thresholds.sh
|
||||
# MODEL=qwen235b bash ./search_decode_peak_thresholds.sh
|
||||
|
||||
|
||||
|
||||
#### TBD
|
||||
# smoke test of the 19 parallel configs for qwen3-coder
|
||||
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
|
||||
tmux new -s codernext_parallel_smoke 'bash ./start_qwen_coder_next_internal_parallel_only_smoke.sh'
|
||||
|
||||
```
|
||||
|
||||
|
||||
qwen30b:
|
||||
chat: 19 parallel configs x 5 days
|
||||
coder: 19 parallel configs x 5 days
|
||||
|
||||
qwen235b:
|
||||
chat: 10 parallel configs x 5 days
|
||||
coder: 10 parallel configs x 5 days
|
||||
|
||||
|
||||
✅:TP4DP1EPOFF, TP8DP1EPOFF
|
||||
|
||||
TP4DP2EPOFF
|
||||
|
||||
|
||||
- L is now 3-dimensional
|
||||
- log_mean_raw_len
|
||||
- log_p95_over_mean_raw_len
|
||||
- cv_raw_len
|
||||
- C is now 4-dimensional
|
||||
- log_mean_hit_len
|
||||
- log_p95_over_mean_hit_len
|
||||
- cv_hit_len
|
||||
- cache_saving_ratio
|
||||
- A is now 3-dimensional
|
||||
- log_qps
|
||||
- cv_interarrival
|
||||
- log_fano_1s_request_counts
|
||||
|
||||
|
||||
## One-line principles
|
||||
|
||||
For prefill, always use DP=1.
For prefill, EP off beats EP on in nearly all cases; we still need to find and analyze the cases where EP on > EP off.
|
||||
|
||||
|
||||
## prefill node
|
||||
|
||||
|
||||
- Show that the config has a large impact on performance and that the performance optimum differs across workloads: `data/qwen30b-config-performance-spread-v1.csv`
|
||||
|
||||
qwen30b
|
||||
361 configs (19 parallel x 19 batching (LPT=20480/32768))
|
||||
chat/coder 0311 10:00~10:03
|
||||
timescale/GPU=0.5
|
||||
linear SLO 0.001L + 1.0
|
||||
|
||||
|
||||
- Cross-day similarity of the workload: `data/weekday-workload-similarity.csv`
|
||||
|
||||
0311~0317 5 weekdays
|
||||
chat/coder, peak/valley (10:00~10:30/22:00~22:30)
|
||||
|
||||
|
||||
- Similarity of the tuned top-5 configs to future windows, `data/qwen30b-high-perf-configs-jaccard.csv`
|
||||
|
||||
qwen30b
|
||||
19 configs (parallel)
|
||||
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
|
||||
timescale/GPU=0.5
|
||||
SLO: 0;8k;32k;=2s;4s;6s
|
||||
|
||||
|
||||
- Tuned best configs can reach near-oracle performance on future days, `data/qwen30b-tuned-best-config-perf-across-5days.csv`
|
||||
|
||||
qwen30b
|
||||
19 configs (parallel)
|
||||
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
|
||||
timescale/GPU=0.5
|
||||
SLO: 0;8k;32k;=2s;4s;6s
|
||||
|
||||
|
||||
- Synthetic / semi-real / real comparison, `data/qwen30b-evaluator-reliability-compare.csv`
|
||||
|
||||
qwen30b
|
||||
19 configs (parallel)
|
||||
0311~0317 5 weekdays, chat/coder, peak (10:00~10:10)
|
||||
timescale/GPU=0.5
|
||||
SLO: 0;8k;32k;=2s;4s;6s
|
||||
|
||||
|
||||
- How long does tuning need? Effectiveness of the prefix trace window, `data/qwen30b-prefix-trace-window-stability.csv`
|
||||
|
||||
qwen30b
|
||||
19 configs (parallel)
|
||||
0311~0317 5 weekdays, chat/coder, peak/valley (10:00~10:10/22:00~22:10)
|
||||
timescale/GPU=0.5
|
||||
SLO: 0;8k;32k;=1s;2s;4s
|
||||
|
||||
## decode node
|
||||
|
||||
|
||||
TBD
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Similarity computation


For the algorithm behind this figure, just look at these Python files:
|
||||
|
||||
- Main script: [plot_similarity_heatmap_custom_windows.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py)
- Normalization definition: [compute_signatures.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py)
- Trace ordering and catalog: [trace_catalog.py](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py)
|
||||
|
||||
If you mean the `Similarity: 10:00-10:30 / 22:00-22:30` figure in particular, the core algorithm is in:
- Metric extraction: [plot_similarity_heatmap_custom_windows.py:132](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L132)
- Global robust normalization: [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64)
- Similarity matrix: [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247)
|
||||
|
||||
**How it is actually computed**

1. Enumerate all traces

From [trace_catalog.py:77](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/trace_catalog.py#L77), which collects:
- `chat peak`
- `chat valley`
- `coder peak`
- `coder valley`
and sorts them by `trace_family, date, day_part`.


2. For each trace, cut out the specified window

At [plot_similarity_heatmap_custom_windows.py:179](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L179).
Peak traces use the `10:00-10:30` window you pass in; valley traces use `22:00-22:30`.
|
||||
|
||||
3. Compute 5 raw features for each window

At [plot_similarity_heatmap_custom_windows.py:168](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L168):
|
||||
- `load_tokens_per_s = total_input_tokens / total_duration_seconds`
|
||||
- `mean_input_length = mean(input_lengths)`
|
||||
- `p95_input_length = quantile(input_lengths, 0.95)`
|
||||
- `input_length_cv = std(input_lengths) / mean_input_lengths`
|
||||
- `burstiness = std(inter_arrivals) / mean(inter_arrivals)`
|
||||
|
||||
4. Pool all windows together and apply a global robust normalization per dimension

At [compute_signatures.py:64](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/compute_signatures.py#L64):
- for each dimension `x`
- compute `median, q1, q3` over all windows
- `iqr = q3 - q1`
- `global_z_x = (x - median) / iqr`
- if `iqr <= 0`, force it to `1.0`
|
||||
|
||||
5. Each window yields a 5-dimensional normalized vector

At [plot_similarity_heatmap_custom_windows.py:248](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L248):
|
||||
```python
|
||||
[
|
||||
global_z_load_tokens_per_s,
|
||||
global_z_mean_input_length,
|
||||
global_z_p95_input_length,
|
||||
global_z_input_length_cv,
|
||||
global_z_burstiness,
|
||||
]
|
||||
```
|
||||
|
||||
6. Compute pairwise Euclidean distances, then map them to similarities

At [plot_similarity_heatmap_custom_windows.py:247](/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workload-compare/plot_similarity_heatmap_custom_windows.py#L247):
|
||||
- `distance(i, j) = ||v_i - v_j||_2`
|
||||
- `similarity(i, j) = exp(-distance(i, j))`
|
||||
|
||||
So the mathematical definition behind this figure is:
|
||||
|
||||
```text
|
||||
G(w) = [
|
||||
z(load_tokens_per_s),
|
||||
z(mean_input_length),
|
||||
z(p95_input_length),
|
||||
z(input_length_cv),
|
||||
z(burstiness),
|
||||
]
|
||||
|
||||
z(x) = (x - median(x_all_windows)) / IQR(x_all_windows)
|
||||
|
||||
d(a,b) = ||G(a) - G(b)||_2
|
||||
|
||||
sim(a,b) = exp(-d(a,b))
|
||||
```
|
||||
|
||||
Here `x_all_windows` means all windows included in this plot, not the values within a single window.


**Pseudocode**
|
||||
```python
|
||||
specs = build_trace_catalog() # chat/coder x peak/valley x all days
|
||||
|
||||
rows = []
|
||||
for spec in specs:
|
||||
if spec.day_part == "peak":
|
||||
window = peak_window # 10:00-10:30
|
||||
else:
|
||||
window = valley_window # 22:00-22:30
|
||||
|
||||
reqs = load_requests_in_window(spec.trace_path, window)
|
||||
|
||||
input_lengths = [r.input_length for r in reqs]
|
||||
inter_arrivals = diff([r.timestamp for r in reqs])
|
||||
|
||||
row = {
|
||||
"trace_family": spec.trace_family,
|
||||
"day_part": spec.day_part,
|
||||
"date": spec.date,
|
||||
"load_tokens_per_s": sum(input_lengths) / window_duration_sec,
|
||||
"mean_input_length": mean(input_lengths),
|
||||
"p95_input_length": p95(input_lengths),
|
||||
"input_length_cv": std(input_lengths) / mean(input_lengths),
|
||||
"burstiness": std(inter_arrivals) / mean(inter_arrivals),
|
||||
}
|
||||
rows.append(row)
|
||||
|
||||
frame = DataFrame(rows)
|
||||
|
||||
for col in [
|
||||
"load_tokens_per_s",
|
||||
"mean_input_length",
|
||||
"p95_input_length",
|
||||
"input_length_cv",
|
||||
"burstiness",
|
||||
]:
|
||||
med = median(frame[col])
|
||||
iqr = p75(frame[col]) - p25(frame[col])
|
||||
if iqr <= 0:
|
||||
iqr = 1.0
|
||||
frame[f"global_z_{col}"] = (frame[col] - med) / iqr
|
||||
|
||||
vectors = frame[[global_z_5_dims]].to_numpy()
|
||||
|
||||
for i in range(len(vectors)):
|
||||
for j in range(len(vectors)):
|
||||
dist[i,j] = l2_norm(vectors[i] - vectors[j])
|
||||
sim[i,j] = exp(-dist[i,j])
|
||||
|
||||
plot_heatmap(sim)
|
||||
```
|
||||
|
||||
**Two additional notes**

- This figure is not drawn with `SIGNATURE_WEIGHTS`. The `0.35/0.2/...` weights are only used for `signature_score`, not for the pairwise heatmap similarity.
- The figure uses `global_robust_scale`; it is not scaled separately for `chat/coder`, nor is each window normalized internally on its own.


If needed, the next step is to write out the exact command line that produces this figure and map its outputs (`metrics csv`, `normalization csv`, `similarity matrix csv`) to concrete files.
|
||||
|
||||
|
||||
## Semi-real definition
|
||||
|
||||
- Arrival process: generate Poisson arrivals at the source trace's average req_rate (a sketch of this construction is given after this list)
  Code in prepare_figure08_evaluator_assets.py (line 115) and prepare_figure08_evaluator_assets.py (line 537)
- Request count: not fixed; it depends on how many arrivals the Poisson process actually generates this time
- Length distribution: for each generated arrival, sample one source request from the source real trace's request list, with replacement
  Code in prepare_figure08_evaluator_assets.py (line 272)
- Inherited fields: input_length, output_length, turn/type
- The original request identity is not preserved: hash_ids are regenerated, not the hash_ids of the original real trace
  Code in prepare_figure08_evaluator_assets.py (line 169)
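
A minimal sketch of this construction, under the assumption that `source_requests` carry `input_length` / `output_length` / `turn_type` fields (illustrative names, not the exact schema used by prepare_figure08_evaluator_assets.py):

```python
# Semi-real trace sketch: Poisson arrivals at the source trace's mean request
# rate, with lengths sampled (with replacement) from the real trace.
import random

def make_semi_real_trace(source_requests, duration_s, seed=0):
    rng = random.Random(seed)
    req_rate = len(source_requests) / duration_s    # mean rate of the source window
    t, trace = 0.0, []
    while True:
        t += rng.expovariate(req_rate)               # exponential inter-arrivals -> Poisson process
        if t >= duration_s:
            break
        src = rng.choice(source_requests)            # sample a source request with replacement
        trace.append({
            "timestamp": t,
            "input_length": src["input_length"],     # inherited fields
            "output_length": src["output_length"],
            "turn_type": src.get("turn_type"),
            "hash_ids": None,                        # regenerated later, not inherited
        })
    return trace
```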
|
||||
|
||||
|
||||
|
||||
421
projects/auto-tuner/outline.md
Normal file
@@ -0,0 +1,421 @@
|
||||
# Draft
|
||||
|
||||
Challenges:
1. The gap between a workload's offline benchmark performance and its online behavior: how to minimize this gap so that offline benchmark data can guide online resource configuration with minimal error
2. When using an AI tuner, how to give the LLM a clear optimization objective, rather than letting it propose configs arbitrarily with no way to judge which is better
3. How to minimize the GPU hours needed during the evolve loop, reducing resource demand so that the cost of the AI tuner does not end up exceeding the human cost it replaces
|
||||
|
||||
|
||||
Our mechanisms:
1. An abstract definition of workload that can explain why different setups perform differently under different workloads
2. An automated scheme for adjusting configs/models within a heterogeneous hardware resource pool
|
||||
|
||||
|
||||
|
||||
|
||||
# 11. Related Work (structure suggestion: organize by paradigm rather than listing by area)
|
||||
|
||||
**11.1 Inference autotuning**
|
||||
|
||||
- Black-box optimization for serving configs
|
||||
|
||||
- Model-based fast search (operation-level / database-driven)
|
||||
**TODO:** insert concrete citations and differences
|
||||
|
||||
|
||||
**11.2 Agentic tuning for training/storage/PFS**
|
||||
|
||||
- StorageXTuner / ASAP / STELLAR line
|
||||
|
||||
- Position: our novelty is inference-specific feasibility + verifiable loop + drift reuse
|
||||
|
||||
|
||||
**11.3 Systems optimization methodology**
|
||||
|
||||
- measurement-driven tuning, adaptive control, safe exploration
|
||||
|
||||
|
||||
|
||||
|
||||
---
|
||||
# GPT
|
||||
|
||||
## 一、把问题抽象成“论文级别”的 Research Problem
|
||||
|
||||
在顶会视角里,这个问题不是“压测不准”,而是:
|
||||
|
||||
> 在**高度异构硬件 + 多模型混部 + 动态调度**的 LLM 推理集群中,
|
||||
> 怎样在满足多级 SLO(P95/P99 延迟)的前提下,**系统化地完成容量规划与资源编排**,同时让 offline benchmark 与 online 行为之间的误差可控?
|
||||
|
||||
可以总结出 **3–4 个核心挑战(Challenges)**,每个挑战都要能撑得起一个小机制:
|
||||
|
||||
### Challenge 1:Workload 维度极高,传统 trace / benchmark 已经不够用了
|
||||
|
||||
* 传统 web/microservice workload:请求大小 / CPU 时间分布 + 到达过程,维度很低。
|
||||
* LLM workload 至少包括:
|
||||
|
||||
* prefill_tokens、decode_tokens 的长尾分布(甚至多峰);
|
||||
* sampling 参数(temperature, top_p, tools)对计算路径和 latency 的影响;
|
||||
* multi-turn conversation / tool-call 带来的 stateful pattern;
|
||||
* dynamic batching + KV cache 使得同一请求在不同系统状态下 cost 差异巨大。
|
||||
* 离线压测往往用 **“单点分布 + 恒定并发”**,无法覆盖真实的高维空间。
|
||||
|
||||
**论文话术**:
|
||||
LLM inference workload is inherently *high-dimensional* and *stateful*, breaking the assumptions behind existing capacity planning and benchmarking techniques for web and microservice workloads.
|
||||
|
||||
---
|
||||
|
||||
### Challenge 2:Multi-stage Pipeline & Scheduler 行为导致强状态依赖
|
||||
|
||||
* LLM 推理天然是 **两阶段**甚至多阶段:
|
||||
|
||||
* prefill(矩阵大、吞吐友好);
|
||||
* decode(矩阵小、step-by-step,容易被 tail 拖垮)。
|
||||
* dynamic batching / speculative decoding / MoE 激活 / KV cache 驱逐等行为,使得:
|
||||
|
||||
* 单个请求的 latency 并不是其“独立 cost”,而是**队列状态 + 当前 batch 其他请求的函数**。
|
||||
* offline 单请求 / 小批量 benchmark 很难反映真实 scheduler 行为。
|
||||
|
||||
**论文话术**:
|
||||
LLM inference systems are *queueing systems with complex stateful schedulers*. Per-request cost is a function of *system state* (batch composition, cache state, competing models), invalidating the independence assumptions used in prior capacity planning models.
|
||||
|
||||
---
|
||||
|
||||
### Challenge 3:模型和硬件双重异构,性能曲线高度非线性
|
||||
|
||||
* **模型异构**:
|
||||
|
||||
* dense vs MoE vs MLA,FP16 vs FP8/FP4,不同架构(Qwen, LLaMA, DeepSeek, GPT…);
|
||||
* 不同层数 / hidden size / experts 数,导致 compute / memory / bandwidth 比例完全不同。
|
||||
* **硬件异构**:
|
||||
|
||||
* A100 vs H100 vs L40S vs 4090 vs 国内卡,SM 数、HBM 带宽、NVLink 拓扑各不相同;
|
||||
* 部分硬件支持特定指令 / kernel(比如 FP8/FP4、特定 GEMM kernel)。
|
||||
* 传统 “QPS = f(硬件 FLOPS)” 一类的简单模型完全失效。
|
||||
|
||||
**论文话术**:
|
||||
Model and hardware heterogeneity yields highly non-linear and non-monotonic performance surfaces, making extrapolation from a small set of microbenchmarks fundamentally unreliable.
|
||||
|
||||
---
|
||||
|
||||
### Challenge 4:多目标 & 反馈回路:SLO、成本、干扰、稳定性
|
||||
|
||||
* 目标不再是“单纯最大吞吐”,而是:
|
||||
|
||||
* SLO(不同租户/业务线、P95/P99);
|
||||
* 成本(GPU 小时 / 能耗);
|
||||
* 干扰与隔离(hot model vs cold model,RL / search 等长 query 对其他服务的拖累)。
|
||||
* 容量规划的结果会反过来改变系统行为:
|
||||
|
||||
* 调整 batch 大小 / 并发 / 混部方案,会改变 latency 和 cost;
|
||||
* 不同模型在同一节点上的 placement 会产生“不可线性叠加”的干扰。
|
||||
|
||||
**论文话术**:
|
||||
The system must solve a multi-objective optimization problem under closed-loop feedback: capacity decisions alter scheduler behavior and interference patterns, shifting the workload characteristics themselves.
|
||||
|
||||
---
|
||||
|
||||
### Challenge 5:强非平稳(non-stationary),offline 结果迅速“过期”
|
||||
|
||||
* 模型会迭代,kernel 会升级,量化策略会变;
|
||||
* 业务 pattern 也会变(比如长上下文应用越来越多,工具调用变多等)。
|
||||
* 一个月前测出来的 “最优配置” 可能已经不再优。
|
||||
|
||||
**论文话术**:
|
||||
LLM workloads and serving stacks are *non-stationary*: model upgrades, kernel changes, and evolving user patterns quickly invalidate static benchmark-based capacity plans.
|
||||
|
||||
---
|
||||
|
||||
## 二、如果当成 SOSP/OSDI 论文,storyline 可以这样讲
|
||||
|
||||
一个典型的“系统 + 测量 + 机制”型 paper,可以这样结构化:
|
||||
|
||||
### 1. Introduction(讲痛点 + 一组 killer numbers)
|
||||
|
||||
* 现实背景:某大规模 LLM 服务平台(xx 个模型、yy 种 GPU、每天 zz 亿 tokens)。
|
||||
* 观测:当前行业普遍用 synthetic benchmark 做容量规划,导致:
|
||||
|
||||
* 为满足 tail latency SLO,**平均 over-provision 30–60% GPU**;
|
||||
* 但在复杂高峰场景下,仍然会出现 SLO violation。
|
||||
* 提出论文目标:
|
||||
|
||||
> 我们要构建一个 **trace-driven、heterogeneity-aware、online-calibrated** 的容量规划与资源编排系统,将 offline benchmark 与 online 行为闭环起来。
|
||||
|
||||
最后给出 contributions bullet list,比如:
|
||||
|
||||
1. 第一个系统性测量 LLM inference offline benchmark vs online gap 的工作;
|
||||
2. 一个 token-centric 的 workload 抽象与 trace 压缩方法;
|
||||
3. 一个 heterogeneity-aware 的性能模型和容量规划器;
|
||||
4. 一个在线校准 + shadow serving 的闭环控制框架;
|
||||
5. 真实大规模部署,xx 模型 yy 集群上,节省 zz% GPU 成本且保持 SLO。
|
||||
|
||||
---
|
||||
|
||||
### 2. Background & Motivation / Measurement Study
|
||||
|
||||
* 简要介绍现有 LLM serving stack(例如 vLLM / TensorRT-LLM / internal stack):
|
||||
|
||||
* pipeline:prefill / decode,dynamic batching,KV cache;
|
||||
* typical scheduler 行为。
|
||||
* Measurement 部分:
|
||||
|
||||
* 对某些典型模型 M1…Mn,在 A100/H100/4090 等上,用行业常见压测方法(fixed prompt, fixed concurrency)测试;
|
||||
* 然后对比真实生产 trace 的表现:
|
||||
|
||||
* latency 分布差异;
|
||||
* GPU 利用率差异;
|
||||
* per-token cost 的状态依赖(batch size / cache hit 等)。
|
||||
* 这一节的目标是“打破读者的直觉”,让他们相信:
|
||||
**传统 benchmark + naive 容量规划是 fundamentally flawed**。
|
||||
|
||||
---
|
||||
|
||||
### 3. System Overview:整体 architecture
|
||||
|
||||
画一个系统图,把你的方案抽象成 4 个逻辑组件:
|
||||
|
||||
1. **Trace Collector & Profiler**
|
||||
从线上采集 token-level trace,构建 token-centric workload profile。
|
||||
|
||||
2. **Trace Synthesizer & Replayer**
|
||||
|
||||
* 将高维 trace 压缩成少量 profile;
|
||||
* 支持生成 synthetic workload、重放真实 trace。
|
||||
|
||||
3. **Performance Model & Capacity Planner**
|
||||
|
||||
* 解析模型 + 校准点;
|
||||
* 异构硬件 / 多模型资源分配。
|
||||
|
||||
4. **Online Calibrator & Shadow Serving**
|
||||
|
||||
* 线上 metric 对比预测,更新模型;
|
||||
* 利用 shadow 流量验证新配置。
|
||||
|
||||
---
|
||||
|
||||
### 4. 关键“创新机制”可以怎么讲?
|
||||
|
||||
下面是可以写进论文、又不只是“工程 best practice”的那种机制:
|
||||
|
||||
---
|
||||
|
||||
#### 机制 1:Token-centric Workload Abstraction(令牌中心工作负载抽象)
|
||||
|
||||
**痛点对应**:Challenge 1(高维 workload)
|
||||
|
||||
**思路**:
|
||||
|
||||
* 提出一个“token-centric”的 trace 表示,把请求抽象为:
|
||||
|
||||
* prefill_token_trace:每个时刻注入多少 prefill tokens;
|
||||
* decode_token_trace:每个时刻系统需要生成多少 decode tokens;
|
||||
* 结合 arrival process,把原本复杂的请求行为用“token 流量(token flow)”来建模。
|
||||
|
||||
**创新点可以讲**:
|
||||
|
||||
* 提出一种**多粒度 trace 表示**:
|
||||
|
||||
* conversation-level;
|
||||
* request-level;
|
||||
* token-bucket-level;
|
||||
* 并设计压缩算法,将实际数十亿条请求压缩成少量 profile,同时保持对系统行为(batching、队列长度)的预测 精度。
|
||||
|
||||
**论文卖点**:
|
||||
从“request-centric”转向“token-centric”的 workload 抽象,显著提高了仿真的表达力和压缩率,是 LLM 特定的新的 workload 模型。
|
||||
|
||||
---
|
||||
|
||||
#### 机制 2:State-aware Performance Model(状态感知性能模型)
|
||||
|
||||
**痛点对应**:Challenge 2 & 3(scheduler 状态依赖 + 异构)
|
||||
|
||||
**思路**:
|
||||
|
||||
* 不用“全黑盒 curve fitting”,也不用纯理论 queueing,而是:
|
||||
|
||||
* 把模型的 cost 分解成几个 operator 类别(prefill matmul / decode matmul / MoE gating / KV read/write);
|
||||
* 用少量 microbench 估计每类 operator 在不同硬件上的 throughput;
|
||||
* 再通过**scheduler 的状态特征**(batch_size、cache_hit_rate、active_experts)组合出 per-request cost。
|
||||
|
||||
**更系统一点说**:
|
||||
|
||||
$$
|
||||
T_\text{req} \approx f(L_\text{prefill}, L_\text{decode}, B_\text{prefill}, B_\text{decode}, \text{cache\_stats}, \text{experts\_activated}, \text{hw\_features})
|
||||
$$
|
||||
|
||||
把 `f` 做成:
|
||||
|
||||
* 一个参数化公式 + 少量校准项;
|
||||
* 或者一个小模型(比如 GBDT/MLP),输入是上述 feature,输出是 per-request latency / tokens/s。
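
A minimal sketch of the second option, a small learned cost model over scheduler/hardware state features; the feature names and the gradient-boosted regressor (scikit-learn, purely for illustration) are assumptions, not the actual design:

```python
# State-aware cost model sketch: features are scheduler/hardware state, the
# target is measured per-request latency. All names are illustrative.
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["prefill_len", "decode_len", "prefill_batch", "decode_batch",
            "cache_hit_rate", "experts_activated", "hw_tflops", "hw_hbm_gbps"]

def fit_cost_model(rows):
    """rows: dicts containing the FEATURES keys plus 'latency_s' (measured)."""
    X = [[r[f] for f in FEATURES] for r in rows]
    y = [r["latency_s"] for r in rows]
    model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
    model.fit(X, y)
    return model

def predict_latency(model, state):
    return model.predict([[state[f] for f in FEATURES]])[0]
```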
|
||||
|
||||
**论文卖点**:
|
||||
|
||||
* 这是一个**state-aware、operator-aware、hardware-aware** 的统一模型;
|
||||
* 能在很少的测量点下,对未测配置做 reasonably accurate 的预测;
|
||||
* 能清楚解释“为什么这个 config 在 H100 上更好,但在 4090 上反而更差”。
|
||||
|
||||
---
|
||||
|
||||
#### 机制 3:Heterogeneity-aware Capacity Planning(异构感知的容量规划)
|
||||
|
||||
**痛点对应**:Challenge 3 & 4
|
||||
|
||||
**思路**:
|
||||
|
||||
* 把每个 (model, hw, config) 抽象成一条“性能曲线”:
|
||||
|
||||
* 在不同 traffic level 下的 {tokens/s, P95 latency};
|
||||
* 再把多模型、多硬件的组合,看成一个优化问题:
|
||||
|
||||
> 在预算 B(或目标 tokens/s)下,选择每种硬件上的实例数和每个模型的 config,使
|
||||
>
|
||||
> * 所有 (model, traffic_class) 的 SLO 满足;
|
||||
> * 总成本最低 / 利用率最高。
|
||||
|
||||
可以引入的技术点:
|
||||
|
||||
* 把性能曲线离散化成 piecewise-linear segments,用 MILP / ILP 近似求解;
|
||||
* 或者用 heuristic(greedy、local search),但要有理论或实验解释“好在哪”。
|
||||
|
||||
**论文卖点**:
|
||||
|
||||
* 这是一个针对 LLM inference 特化的 “multi-model, multi-hardware capacity planning” formulation;
|
||||
* 支持把不同模型视作**token 流**而非“QPS”,更符合 LLM 的真实代价。
|
||||
|
||||
---
|
||||
|
||||
#### 机制 4:Online Calibration with Safety Guardrails(带安全边界的在线校准)
|
||||
|
||||
**痛点对应**:Challenge 5(non-stationary)
|
||||
|
||||
**思路**:
|
||||
|
||||
* 运行时持续收集:
|
||||
|
||||
* per (model, hw, config) 的 measured latency / tokens/s;
|
||||
* 维护一个“预测 vs 实测”的误差估计,比如:
|
||||
|
||||
* 每个组合一个信心水平和 bias term;
|
||||
* 用类似 bandit / Bayesian update 的方式更新模型:
|
||||
|
||||
* 如果长期 under-predict(预测太乐观) → 自动增加安全裕度;
|
||||
* 如果长期 over-predict → 可以安全收紧资源。
|
||||
|
||||
**创新点可以讲**:
|
||||
|
||||
* 不是简单手工加 20% buffer,而是**自动学习每个组合的安全系数**;
|
||||
* 对高风险配置引入更严格的 conservative bound,对老练稳定配置更激进。
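
A minimal sketch of this confidence-aware calibration idea; the EWMA update, the margin rule, and the constants are illustrative assumptions rather than a worked-out design:

```python
# Per-(model, hw, config) bias tracking: widen the safety margin when the
# planner has been under-predicting latency, tighten it otherwise.
class SafetyCalibrator:
    def __init__(self, alpha=0.2, base_margin=1.05):
        self.alpha = alpha          # EWMA smoothing for the bias estimate
        self.base_margin = base_margin
        self.bias = {}              # key -> smoothed measured/predicted ratio

    def update(self, key, predicted_s, measured_s):
        ratio = measured_s / predicted_s
        prev = self.bias.get(key, 1.0)
        self.bias[key] = (1 - self.alpha) * prev + self.alpha * ratio

    def safety_margin(self, key):
        # Under-prediction (ratio > 1) inflates the margin; over-prediction
        # lets the planner fall back toward the base margin.
        return self.base_margin * max(1.0, self.bias.get(key, 1.0))
```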
|
||||
|
||||
**论文卖点**:
|
||||
|
||||
* 提出一个 **confidence-aware capacity planner**,在线适配 non-stationary 的 LLM 堆栈;
|
||||
* 实验中展现:在保证 SLO violation rate 不上升的前提下,把平均 GPU over-provision 从 X% 降到 Y%。
|
||||
|
||||
---
|
||||
|
||||
#### 机制 5:Shadow Serving + Semantic-equivalent Replay(影子服务 + 语义等价回放)
|
||||
|
||||
(如果你愿意把“功能正确性 + 性能”一起讲)
|
||||
|
||||
**痛点**:新模型 / 新 config 不仅要看性能,还要保证语义/答案质量不退化。
|
||||
|
||||
**思路**:
|
||||
|
||||
* shadow 集群接收真实流量;
|
||||
* 对于采样类任务(开放对话),做 semantic diff(embedding / reward model);
|
||||
* 把“性能 / 质量 / 资源 cost”放在一起,形成三维 tradeoff。
|
||||
|
||||
这块可以是 optional,但如果你做了,会让论文更“完整”——从系统工程到用户体验。
|
||||
|
||||
---
|
||||
|
||||
## 三、论文的“创新性定位”可以怎么说?
|
||||
|
||||
顶会 reviewer 会问:**这和现有的 LLM serving, autoscaling, capacity planning 工作有什么根本不同?**
|
||||
|
||||
你可以把创新性定位在三层:
|
||||
|
||||
1. **问题定义层**:
|
||||
|
||||
* 从“QPS-capacity for microservices” 迁移到 “token-centric, stateful LLM inference capacity planning”;
|
||||
* 首次系统性刻画 offline benchmark vs real-world LLM workload 的巨大 gap。
|
||||
|
||||
2. **建模与抽象层**:
|
||||
|
||||
* token-centric workload 模型;
|
||||
* state-aware performance model;
|
||||
* heterogeneity-aware capacity planning formulation。
|
||||
|
||||
3. **系统与实践层**:
|
||||
|
||||
* 在真实大规模集群落地(xx GPUs, yy models);
|
||||
* 提供可观测的数据点(数十亿请求 trace)、open-source 工具(如果能开源更好)。
|
||||
|
||||
---
|
||||
|
||||
## 四、如果你要落到具体 paper skeleton,可以是这样(中英文混合大纲)
|
||||
|
||||
1. **Introduction**
|
||||
|
||||
* 背景:LLM inference at scale
|
||||
* 痛点 & measurement teaser
|
||||
* Our approach & contributions
|
||||
|
||||
2. **Background & Motivation**
|
||||
|
||||
* LLM serving pipeline & scheduler
|
||||
* Workload characteristics
|
||||
* Why existing capacity planning breaks
|
||||
|
||||
3. **Measurement: The Gap between Offline Benchmarks and Production**
|
||||
|
||||
* Methodology (trace collection, baseline benchmarks)
|
||||
* Key findings(2–3 条最重要的 gap)
|
||||
* Design implications(为后面机制买伏笔)
|
||||
|
||||
4. **Token-centric Workload Abstraction**
|
||||
|
||||
* Trace model(conversation/request/token 三级)
|
||||
* Compression & profile generation
|
||||
* Accuracy vs size tradeoff
|
||||
|
||||
5. **State-aware Performance Modeling**
|
||||
|
||||
* Operator-level decomposition
|
||||
* Hardware heterogeneity encoding
|
||||
* Online calibration mechanism
|
||||
|
||||
6. **Heterogeneity-aware Capacity Planning**
|
||||
|
||||
* Optimization formulation
|
||||
* Solving strategy(MILP / heuristic)
|
||||
* Integration with workload forecaster
|
||||
|
||||
7. **Online Calibration and Shadow Serving**
|
||||
|
||||
* Confidence-aware model update
|
||||
* Shadow evaluation pipeline
|
||||
* Safety guardrails
|
||||
|
||||
8. **Implementation**
|
||||
|
||||
* Integration with existing LLM serving stack (vLLM/…)
|
||||
* Engineering details(overhead, compatibility)
|
||||
|
||||
9. **Evaluation**
|
||||
|
||||
* Micro-bench: modeling accuracy, calibration speed
|
||||
* Cluster-level: GPU savings, SLO satisfaction, robustness
|
||||
* Ablation: no calibration / no heterogeneity / no token-centric
|
||||
|
||||
10. **Related Work & Discussion**
|
||||
|
||||
11. **Conclusion**
|
||||
|
||||
---
|
||||
|
||||
如果你愿意,下一步我们可以更具体地“代写”两块:
|
||||
|
||||
* **Measurement 部分**:一起设计 3–4 个关键 finding(包括可能的图,比如“offline latency vs online latency”的散点图、“token-centric profile 压缩比 vs 误差”的图);
|
||||
* **Design 部分**:把你已经有的 auto-tuner / vLLM 实验管线映射到上面的几个机制中,形成真正可落地的“系统 + 机制”叙事。
|
||||
224
projects/auto-tuner/principles model.md
Normal file
@@ -0,0 +1,224 @@
|
||||
## 五、怎么从理论上建模这个 principle
|
||||
|
||||
这个现象很适合用一个“**固定 GPU 预算下的排队模型**”来解释。
|
||||
|
||||
### 1. 固定总 GPU 数
|
||||
|
||||
设总 GPU 数固定为 $G$,选择的 tensor parallel size 为 $t$。
|
||||
|
||||
那么可部署的实例数是:
|
||||
|
||||
$$
|
||||
m_t = \frac{G}{t}
|
||||
$$
|
||||
|
||||
这是整个问题最核心的约束。
|
||||
因为 TP 变大,带来的不是“免费加速”,而是“用更多 GPU 服务一个 instance”,因此 instance 数会减少。
|
||||
|
||||
---
|
||||
|
||||
### 2. 单请求 service time
|
||||
|
||||
对某个 workload window $w$,在 TP 为 $t$ 时,一个请求的 service time 可以写成:
|
||||
|
||||
$$
|
||||
S_t(w) = S_{\text{comp}}(w,t) + S_{\text{comm}}(t) + S_{\text{sched}}(w,t)
|
||||
$$
|
||||
|
||||
其中:
|
||||
|
||||
#### 计算项
|
||||
|
||||
$$
|
||||
S_{\text{comp}}(w,t) \approx \frac{A(w)}{\eta_t}
|
||||
$$
|
||||
|
||||
* $A(w)$ 表示该窗口请求的纯计算量
|
||||
* $\eta_t$ 表示 TP 带来的 speedup,通常是次线性的,即 $\eta_t < t$
|
||||
|
||||
#### 通信项
|
||||
|
||||
$$
|
||||
S_{\text{comm}}(t)
|
||||
$$
|
||||
|
||||
这个项通常随 TP 增大而上升,包括:
|
||||
|
||||
* all-reduce
|
||||
* synchronization
|
||||
* cross-GPU launch / coordination
|
||||
* pipeline / collective overhead
|
||||
|
||||
#### 调度与干扰项
|
||||
|
||||
$$
|
||||
S_{\text{sched}}(w,t)
|
||||
$$
|
||||
|
||||
这个项反映的是:
|
||||
|
||||
* batch 内请求相互干扰
|
||||
* 长请求阻塞短请求
|
||||
* 调度器等待
|
||||
* cache / memory / token budget 竞争
|
||||
|
||||
---
|
||||
|
||||
### 3. 为什么低负载下大 TP latency 更好
|
||||
|
||||
低负载时,排队几乎可以忽略,因此:
|
||||
|
||||
$$
|
||||
\text{TTFT}_{p95}(t,w) \approx S_t(w)
|
||||
$$
|
||||
|
||||
此时系统 tail latency 基本由 service time 决定。
|
||||
|
||||
如果 workload 比较 prefill-heavy,或者单请求计算量较大,那么:
|
||||
|
||||
* 计算项占主导
|
||||
* TP 的并行收益大于通信损失
|
||||
|
||||
于是会出现:
|
||||
|
||||
$$
|
||||
S_{TP4}(w) < S_{TP1}(w)
|
||||
$$
|
||||
|
||||
这正对应你在低负载图里看到的现象:
|
||||
$TP4$ 的 $p95_{ttft}$ 明显优于 $TP1$。
|
||||
|
||||
---
|
||||
|
||||
### 4. 为什么低负载时 $good\_token/s/GPU$ 差异很小
|
||||
|
||||
定义 TP 为 $t$ 时,单实例可提供的最大处理能力为 $R_t(w)$。
|
||||
那么整个集群总能力是:
|
||||
|
||||
$$
|
||||
m_t R_t(w) = \frac{G}{t} R_t(w)
|
||||
$$
|
||||
|
||||
按 GPU 数归一化后,单位 GPU 能力为:
|
||||
|
||||
$$
|
||||
C_t(w) = \frac{1}{G} \cdot \frac{G}{t} R_t(w) = \frac{R_t(w)}{t}
|
||||
$$
|
||||
|
||||
如果 TP 加速接近线性,即:
|
||||
|
||||
$$
|
||||
R_t(w) \approx t R_1(w)
|
||||
$$
|
||||
|
||||
那么:
|
||||
|
||||
$$
|
||||
C_t(w) \approx R_1(w)
|
||||
$$
|
||||
|
||||
也就是说单位 GPU 吞吐几乎不变。
|
||||
|
||||
但更关键的是,在低负载下,真实 goodput 不是被 capacity 限制,而是被外部到达流量限制。于是:
|
||||
|
||||
$$
|
||||
\text{goodput}_t(w) \approx \text{offered load}
|
||||
$$
|
||||
|
||||
因此不同 TP 的 $good\_token/s/GPU$ 看起来会非常接近。
|
||||
这正是你在 $time\_scale=0.5$ 里看到的现象。
|
||||
|
||||
---
|
||||
|
||||
### 5. 为什么高负载时更小 TP 更占优
|
||||
|
||||
一旦负载升高,问题就不再只是单请求 service time,而变成“每个 instance 会不会排队”。
|
||||
|
||||
假设整个集群的总到达率是 $\lambda$,那么在 TP 为 $t$ 时,每个 instance 平均承担的到达率是:
|
||||
|
||||
$$
|
||||
\lambda_t = \frac{\lambda}{m_t} = \frac{\lambda t}{G}
|
||||
$$
|
||||
|
||||
这非常关键,因为它说明:
|
||||
|
||||
* TP 越大
|
||||
* instance 数越少
|
||||
* 每个 instance 的流量越重
|
||||
|
||||
于是实例利用率约为:
|
||||
|
||||
$$
|
||||
\rho_t = \lambda_t \cdot \mathbb{E}[S_t]
|
||||
= \frac{\lambda t}{G} \mathbb{E}[S_t]
|
||||
$$
|
||||
|
||||
当 $\rho_t$ 接近 $1$ 时,排队延迟会急剧放大。
|
||||
因此 tail latency 可以近似理解为:
|
||||
|
||||
$$
|
||||
\text{TTFT}_{p95}(t,w)
|
||||
\approx S_t(w) + W_q(t,w)
|
||||
$$
|
||||
|
||||
其中 $W_q$ 是 queueing delay。
|
||||
|
||||
而 $W_q$ 会随着以下因素快速上升:
|
||||
|
||||
* $\rho_t$ 上升
|
||||
* service time variance 上升
|
||||
* instance 数 $m_t$ 减少
|
||||
* workload burstiness 增加
|
||||
|
||||
这就是为什么高负载下,虽然大 TP 可能让单请求更快,但整体上却不一定更优:
|
||||
因为它牺牲了多实例并发能力,把更多流量压到更少的 instance 上,最终 tail latency 和 SLO-goodput 反而变差。
|
||||
|
||||
---
|
||||
|
||||
### 6. 为什么 coder 比 chat 更敏感
|
||||
|
||||
你的图里最明显的区别就是:
|
||||
|
||||
* chat 对 TP 更“温和”
|
||||
* coder 更早进入“小 TP / 中 TP 更优”的 regime
|
||||
|
||||
这通常意味着 coder workload 有更高的异质性,比如:
|
||||
|
||||
* request length variance 更大
|
||||
* long-request fraction 更高
|
||||
* burstiness 更强
|
||||
* prefill / decode mix 更复杂
|
||||
|
||||
从排队论角度,一个很重要的量是 service time 的二阶矩。
|
||||
如果服务时间随机变量为 $S$,那么 queueing delay 对 $\mathbb{E}[S^2]$ 很敏感。
|
||||
当 length variance 变大时,$\mathbb{E}[S^2]$ 会明显上升,因此 tail latency 会更容易爆掉。
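
A small M/G/1-style sketch (using the Pollaczek-Khinchine mean waiting time $W_q = \lambda \mathbb{E}[S^2] / (2(1-\rho))$, with illustrative numbers) shows how a larger second moment inflates queueing delay at the same mean service time and load:

```python
# Pollaczek-Khinchine mean wait for an M/G/1 queue; inputs are illustrative.
def mg1_mean_wait(lam, s_mean, s_second_moment):
    rho = lam * s_mean                     # per-instance utilization
    assert rho < 1.0, "instance overloaded"
    return lam * s_second_moment / (2.0 * (1.0 - rho))

# Same mean service time (0.5 s) and arrival rate, different variance:
# E[S^2] = Var(S) + E[S]^2.
print(mg1_mean_wait(lam=1.5, s_mean=0.5, s_second_moment=0.25))  # Var=0    -> 0.75 s wait
print(mg1_mean_wait(lam=1.5, s_mean=0.5, s_second_moment=1.00))  # Var=0.75 -> 3.00 s wait
```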
|
||||
|
||||
所以 coder 更早出现:
|
||||
|
||||
* $TP4$ goodput 不占优
|
||||
* 最优 TP 向 $TP2$ 或 $TP1$ 漂移
|
||||
|
||||
这是很合理的。
|
||||
|
||||
|
||||
## 八、还差什么实验,才能把这个变成更强的 paper claim
|
||||
|
||||
现在这组图已经足够支持“趋势”,但如果你想把它上升到更强的 principle,最好再补一个 **regime boundary** 实验。
|
||||
|
||||
不要只用 $time\_scale$ 当横轴,而是用更本质的 offered-load 指标,比如:
|
||||
|
||||
* offered prefill tokens/s per GPU
|
||||
* offered requests/s per instance
|
||||
* 或者更 general 的 normalized utilization proxy
|
||||
|
||||
然后画:
|
||||
|
||||
* 横轴:offered load
|
||||
* 纵轴:best TP
|
||||
* 或者纵轴:相对 $TP4$ 的 SLO-goodput gain
|
||||
|
||||
这样你就能真正得到一个“转折点”:
|
||||
|
||||
> 当 load 低于某阈值时,大 TP 更优;超过阈值后,最优 TP 向中等或更小 TP 漂移。
|
||||
|
||||
这个图会非常强,因为它把现在的“观察”变成了“phase transition / regime transition”。
|
||||
BIN
projects/auto-tuner/related-works.figs/260410-105227.png
Normal file
|
After Width: | Height: | Size: 146 KiB |
451
projects/auto-tuner/related-works.md
Normal file
@@ -0,0 +1,451 @@
|
||||
Related Work Matrix — Workload Pattern → Serving Engine Config
|
||||
|
||||
| Paper | Venue / Year | 主要研究对象 | Workload signals / patterns considered | Optimized knobs / decisions | Objective | 方法类型 | 与“workload→config”关系 | 对 AITuner 的直接启发 | 是否已提出关键 insight |
|
||||
| ------------------------------------------------------------------------------------------------------ | ------------ | --------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------------------------- | ------------------------------------------------- | ------------------------------------ | ----------------------- | ------------------------------------------------------------------- | -------------------------------------------- |
|
||||
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | 多框架 LLM serving config 优化 | production serving workloads;模型类型;硬件平台;SLA/TTFT/TPOT需求 | TP/PP/EP、CUDA graphs、KV-cache memory fraction、max token capacity、framework-specific flags | latency / throughput / goodput 风格目标 | 性能建模 + 快速配置搜索 | 最直接命中 | 证明“最优 engine config 明显 workload-dependent”,而且 config 空间不止 scheduler | 是,且最直接 |
|
||||
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | KV cache 分层存储配置 | real-world traces;KV block access pattern;reuse;不同优化目标 | HBM/DRAM/disk tier config;eviction/group-specific cache mgmt | latency / throughput / cost Pareto frontier | simulator + pruning + adaptive tuner | 直接命中,但只在 KV/storage 子空间 | 说明 memory-side knobs 必须 workload-aware,且天然是多目标优化 | 是,但局限于 KV tiering |
|
||||
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | 抢占式云环境下的 LLM serving | fluctuating workload;实例可用性变化;preemption trace | distributed parallelization config;migration strategy | latency / throughput / monetary cost | 系统设计 + 动态重配置 | 部分命中 | 说明 config = f(workload, resource-state),不是只由 workload 决定 | 是,但偏部署/并行配置 |
|
||||
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | workload-aware runtime adaptation | bursty workload;real-time memory pressure;load surge | layer quantization/swapping;KV cache resizing | SLO violations、TTFT、quality-preserving efficiency | runtime adaptive serving | 部分命中 | 说明静态最优 config 不适合 bursty workload,online adaptive config 很重要 | 是,但偏 runtime morphology |
|
||||
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | heterogeneous requests 的动态调度 | heterogeneous / unpredictable requests;不同优先级/SLO | request placement、migration、instance-level scheduling | tail latency、priority acceleration、cost saving | 动态 rescheduling + live migration | 邻近,不是 config tuning | 说明高异质 workload 下,先做实例级隔离/迁移可能比局部 knob tuning 更重要 | 提出了 workload→architecture/scheduling insight |
|
||||
|
||||
|
||||
|
||||
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | phase-wise 资源划分 | request rate;phase contention;SLO pressure | prefill/decode partition ratio;resource controller;unified storage | latency、SLO attainment | 系统设计 + dynamic partitioning | 部分命中 | 说明 phase split ratio 本身就是重要 config knob,而且 workload-dependent | 是,但聚焦 phase partition |
|
||||
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | 真实 workload characterization | per-client composition;temporal structure;model mix;multimodal/reasoning;large-scale production traces | 不直接优化 config | benchmark realism;avoid under-provisioning | characterization + workload generator | 基础支撑,不直接命中 | 说明如果 workload 表征不真实,任何 workload→config 规律都可能失真 | 提出了 workload realism 的关键 insight |
|
||||
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | 真实 serving traces 数据集 | burstiness;conversation patterns;response lengths;system failures | 不直接优化 config | evaluation realism;stress realistic serving behavior | trace dataset / characterization | 基础支撑,不直接命中 | 说明 burstiness、会话结构、output-length joint stats 都是重要 workload features | 是,提出真实 workload 重要性 |
|
||||
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | serving framework 对比 | concurrency level;interactive vs batch use case;model size | framework choice(vLLM vs TGI) | throughput、tail latency、memory、scalability | empirical comparison | 粗粒度命中 | 说明 engine choice 本身就是 coarse-grained config decision | 部分提出 |
|
||||
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | multi-stage inference pipeline | reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | stage-specific batching / HW-SW design choices | end-to-end latency;pipeline optimization | simulator / design-space exploration | 扩展边界相关 | 提醒你:未来 workload→config 问题会扩展到 multi-stage,不再只看 prefill/decode | 部分提出,多阶段视角 |
|
||||
|
||||
我建议你在论文里把它们再压成 4 类
|
||||
|
||||
A. Workload characterization / benchmark realism
|
||||
|
||||
这些 paper 不直接调 config,但决定你的论证是否站得住。
|
||||
|
||||
|Paper|你该怎么引用它|
|
||||
|---|---|
|
||||
|ServeGen|用来证明:真实 workload 很复杂,synthetic trace 容易误导配置结论|
|
||||
|BurstGPT|用来证明:burstiness、conversation pattern、response length 这些 workload feature 不能忽略|
|
||||
|
||||
B. Workload-aware scheduling / routing / partitioning
|
||||
|
||||
这些不是在调 engine knobs,但给了你 workload pattern 的核心结构。
|
||||
|
||||
|Paper|核心已知 insight|
|
||||
|---|---|
|
||||
|Llumnix|heterogeneous workload 需要跨实例 rescheduling / isolation|
|
||||
|EWSJF|mixed workload 应先分组/分 regime,再优化|
|
||||
|CascadeInfer|length heterogeneity 是核心 bottleneck|
|
||||
|Sarathi-Serve|prefill/decode 冲突是第一性瓶颈|
|
||||
|semi-PD|phase split ratio 应随 workload / SLO 改变|
|
||||
|
||||
C. Workload-aware configuration / adaptive resource shaping
|
||||
|
||||
这些和你最接近。
|
||||
|
||||
|Paper|命中子空间|
|
||||
|---|---|
|
||||
|AIConfigurator|通用 serving config search|
|
||||
|Kareto|KV / storage configuration|
|
||||
|SpotServe|distributed parallelization configuration|
|
||||
|MorphServe|online adaptive resource / precision configuration|
|
||||
|
||||
D. Framework / pipeline-level coarse configuration
|
||||
|
||||
这些告诉你研究边界在哪里。
|
||||
|
||||
|Paper|边界意义|
|
||||
|---|---|
|
||||
|Comparative vLLM vs TGI|engine choice 本身就是 coarse-grained config|
|
||||
|HERMES|multi-stage pipeline 会让 workload→config 问题更复杂|
|
||||
|
||||
如果你想在 paper 里更突出 gap,可以直接用这张“缺口矩阵”
|
||||
|
||||
|维度|现有工作覆盖情况|你的空间|
|
||||
|---|---|---|
|
||||
|真实 workload characterization|强|不必重复造数据集,但要吸收其 feature 定义|
|
||||
|workload-aware scheduling|很强|不要和它们正面重合|
|
||||
|单一子空间 config adaptation(KV / parallelism / runtime precision)|中等|可以统一这些子空间 insight|
|
||||
|通用 workload→engine-knob mapping|弱|这是你的主要机会|
|
||||
|在 length-dependent TTFT + TPOT + goodput 统一目标下的配置优化|很弱|这是你最强的位置|
|
||||
|
||||
|
||||
|
||||
我帮你提炼成一句话版 related work positioning
|
||||
|
||||
你后面写论文时,可以这么说:
|
||||
|
||||
Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
|
||||
|
||||
这句话基本能把你和上面这些 work 的关系讲清楚。
|
||||
|
||||
我建议你下一步补一张“feature × knob”矩阵
|
||||
|
||||
也就是:
|
||||
|
||||
- 行:workload features
|
||||
|
||||
|
||||
- arrival burstiness
|
||||
|
||||
|
||||
- input length mean/variance/tail
|
||||
|
||||
|
||||
- output length mean/variance/tail
|
||||
|
||||
|
||||
- mixed-workload ratio
|
||||
|
||||
|
||||
- prefix/KV locality
|
||||
|
||||
|
||||
- SLO strictness
|
||||
|
||||
|
||||
- 列:config knobs
|
||||
|
||||
|
||||
- max_num_batched_tokens
|
||||
|
||||
|
||||
- scheduler mode
|
||||
|
||||
|
||||
- prefill/decode split
|
||||
|
||||
|
||||
- KV memory fraction
|
||||
|
||||
|
||||
- block size
|
||||
|
||||
|
||||
- parallelism
|
||||
|
||||
|
||||
- disaggregation mode
|
||||
|
||||
|
||||
- cache/tier policy
|
||||
|
||||
|
||||
然后标每篇 paper 实际覆盖了哪些 feature→knob 边。
|
||||
|
||||
这个会比 paper matrix 更直接服务你的项目。
|
||||
|
||||
如果你要,我下一条就直接给你整理 feature × knob insight matrix。
|
||||
|
||||
|
||||
|
||||
[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)
|
||||
|
||||
[Glia: A Human-Inspired AI for Automated Systems Design and Optimization](https://arxiv.org/pdf/2510.27176)
|
||||
[VIDUR: A LARGE-SCALE SIMULATION FRAMEWORK FOR LLM INFERENCE](https://arxiv.org/pdf/2405.05465)
|
||||
|
||||
[DynamoLLM](https://arxiv.org/pdf/2408.00741):profile(不同的 TP、workload) + search(target:满足 SLO 下最小化能耗) + dynamic 调整 parallelism configuration
|
||||
[NanoFlow](https://arxiv.org/pdf/2408.12757):Theoretical model + automated pipeline search(split sequence / operation,target:最大化 compute/memory/network 利用率)
|
||||
|
||||
|
||||
https://github.com/llm-d/llm-d
|
||||
|
||||
AIBrix 实验性的提出了异构 GPU 推理,workload 有不同的 size,不同的 GPU 适合不同 request size 的甜点区间,offline profile,online scheduling,异构硬件 serving 时最小化 cost
|
||||
这存在老生常谈的问题:offline profile 与实际 serving 的 gap
|
||||
https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst
|
||||
|
||||
|
||||
agent related:
|
||||
- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
|
||||
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
|
||||
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887
|
||||
|
||||
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288
|
||||
|
||||
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323
|
||||
https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning
|
||||
|
||||
> Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
|
||||
>
|
||||
> SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.
|
||||
|
||||
![[projects/auto-tuner/related-works.figs/260410-105227.png]]
|
||||
|
||||
SLO-Aware Scheduling for Large Language Model Inferences
|
||||
https://arxiv.org/pdf/2504.14966
|
||||
|
||||
|
||||
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf
|
||||
|
||||
|
||||
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving
|
||||
https://arxiv.org/abs/2602.22593
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
The "unified paradigm" we are after has in fact been described very plainly in several top-venue papers: **define a search space / actions → run trials → measure the objective → use an algorithm to pick the next action**.
|
||||
|
||||
- **ATC'18 (Cao et al.)** describes the mechanism of black-box auto-tuning in a very standardized way: _"iteratively try different configurations, measure the objective function, and choose the next batch of configurations based on what has been learned."_
|
||||
https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
|
||||
|
||||
- **OSDI'23 (Hydro)** likewise gives a "workflow definition" for hyperparameter/config tuning: the user specifies a search space; an algorithm generates trials; the system coordinates their execution until the best configuration is found.
|
||||
https://www.usenix.org/system/files/osdi23-hu.pdf
|
||||
|
||||
- **ATC'18 (Metis)** positions itself as a "black-box optimization service", tuning parameters on real production systems with tail latency as the primary evaluation metric.
|
||||
https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
|
||||
|
||||
- **OSDI'18 (µTune)** belongs to the "online adaptive" school, but it is still the same closed loop: monitor/estimate load → use a model to predict tail latency under each configuration → switch to the predicted-best configuration.
|
||||
https://www.usenix.org/system/files/osdi18-sriraman.pdf
|
||||
|
||||
- **SOSP'21 (POP)** is in the "mathematical optimization / solver" school; it explicitly formulates systems resource allocation as something "that can be written as a mathematical optimization problem", and discusses the tradeoff between solving speed and SLAs.
|
||||
https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf
|
||||
|
||||
**Conclusion**: whether black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems-optimization work basically follows the same meta-structure:
|
||||
|
||||
> **Under a budget and constraints, iterate a closed loop around some "executable trial", producing actions and updating the strategy.**
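
A generic sketch of this closed-loop meta-structure (propose a config, run a trial, measure, update), where the random-search proposal is only a stand-in for BO / RL / solver-based strategies and all names are illustrative:

```python
# Generic black-box tuning loop under a trial budget.
import random

def tune(search_space, run_trial, budget, seed=0):
    """search_space: dict knob -> list of candidate values.
    run_trial: callable(config) -> objective value (higher is better)."""
    rng = random.Random(seed)
    best_cfg, best_obj, history = None, float("-inf"), []
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in search_space.items()}   # propose an action
        obj = run_trial(cfg)                                        # execute the trial
        history.append((cfg, obj))                                  # record learned information
        if obj > best_obj:
            best_cfg, best_obj = cfg, obj
    return best_cfg, best_obj, history
```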
|
||||
|
||||
---
|
||||
# related works
|
||||
|
||||
[DynamoLLM](https://arxiv.org/pdf/2408.00741):profile(不同的 TP、workload) + search(target:满足 SLO 下最小化能耗) + dynamic 调整 parallelism configuration
|
||||
[NanoFlow](https://arxiv.org/pdf/2408.12757):Theoretical model + automated pipeline search(split sequence / operation,target:最大化 compute/memory/network 利用率)
|
||||
[MorphServe](https://arxiv.org/pdf/2506.02006v1)
|
||||
[Autocomp](https://arxiv.org/pdf/2505.18574v3)
|
||||
|
||||
|
||||
|
||||
# Questions
|
||||
|
||||
1. [Mooncake 访谈](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK) 10:00 处,业务场景变得极为丰富,需要不同的配置,PD 分离时几P几D,P 和 D 节点内分别跑什么样的并行模式都需要调优,Qwen 线上是否有明确的需要不同配置的需求?或者目前 Qwen 的现状还是根据人工测试调整,找到一个整体相对比较优的配置固定作为一个模型的配置,并不涉及线上的负载感知与动态调整?我认为阿里这里有「异构硬件 x 多模型 x 多样的负载」线上是否需要对每个维度都人工 finetune 找一个好的推理配置?还是有一些更好的工作流?
|
||||
2. Qwen 系列模型是否已经上线 EP?如果上线了 EP,为什么不使用 DeepSeek 的 DBO (dual batch overlap) 方案?我理解 DBO 在 EP 的场景总是能做到一定程度的 overlap 计算和通信从而提高性能?
|
||||
3. Qwen-Next 可以看到针对 linear-attention 的优化 (GDNAttention)、针对超大稀疏 experts 的 EP 优化,可以认为这是一个普遍的趋势吗?
|
||||
|
||||
|
||||
---
|
||||
# Backup
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
> 说明:这里的“自动优化”聚焦**自动化调度/并行/批处理/能耗-成本/KV-Cache 管理**等系统层机制,而非单纯模型改造或手工参数调优。每条给出一句话核心观点与公开报告的性能主张(以作者给出的 headline 为准)。
|
||||
|
||||
### 速览对照表
|
||||
|
||||
| 工作(年份/来源) | 一句话核心 | 关键自动化机制 | 公开性能主张(作者给出) | 与相近工作的差异点 | 在 vLLM / SGLang 落地 |
|
||||
| ----------------------------------------------------------------------------- | ------------------------------------------------------ | ---------------------------------------------------------------------- | ------------------------------------------- | ---------------------------------------------- | ---------------------------------------------------------- |
|
||||
| **DynamoLLM** (2024, HPCA’25) ([arXiv][1]) | 面向**集群级**的能耗/成本最优化:在满足 SLO 的前提下**自动重构**推理集群配置 | 负载-功耗感知的**动态集群重构**、分层控制 | 节能 ~52–53%,运营碳减 38%,成本降 61%,SLO 保持 | 目标是**服务级/集群级**运行点最优化,而非 GPU 内核/批处理细节 | 与 vLLM/SGLang **并行部署**(外部编排层),非内嵌;可用作上层资源控制器 |
|
||||
| **NanoFlow** (2024→2025) ([arXiv][2]) | 在**单 GPU**上把一次请求切成多个**nano-batches**并自动搜寻**并行-重叠流水线** | **自动搜索**nano-batch 的数量/大小/顺序/资源配额;算子共调度以重叠算力/显存/通信 | 对多基线(vLLM、FastGen、TRT-LLM)最高 **1.91×** 吞吐提升 | 强调**设备内**并行与流水重叠;对 decode 轻/ prefll 重的结构做细粒度流水 | 目前为**独立运行时**(开源实现),未并入 vLLM/SGLang 主干;可作替代后端 ([GitHub][3]) |
|
||||
| **Sarathi & Sarathi-Serve** (2023–2024) ([arXiv][4]) | 把长**prefill**切块,并与**decode**混合成连续混批,减少流水“气泡” | **Chunked prefill** + **continuous hybrid batching**(decode piggyback) | 端到端最高 **1.91×**;decode 吞吐最高 **10×** | 最早系统化提出“**prefill-decode 混合**一批”的通用策略 | vLLM/SGLang 均已支持/吸收该思路(见下文“落地”) ([VLLM Documentation][5]) |
|
||||
| **DeepSpeed-FastGen** (2024) ([arXiv][6]) | **Blocked KV** + **Dynamic SplitFuse 连续批处理**,兼顾低延迟与高吞吐 | 动态拆分/融合批次,KV 分块复用 | 多模型/硬件上优于 vLLM(作者报告) | 方案与 vLLM 的 PagedAttention 路线相近但实现不同 | 作为**独立后端**;与 vLLM/SGLang 互为替代 |
|
||||
| **POD-Attention** (ASPLOS’25) ([Microsoft][7]) | 追求**prefill 与 decode 的全重叠**,降低 Token Break Time(TBT) | Prefill/Decode **完全并行化调度**与相容内核 | 在长上下文/高 TBT 场景显著降停顿(定性+量化) | 把 Sarathi 的“混合批”更推进到**近全重叠** | 学术原型;思想可迁移到 vLLM/SGLang 的调度/内核层 |
|
||||
| **Fluid-Guided Online Scheduling(WAIT/Nested WAIT)** (2025, SSRN) ([SSRN][8]) | 把 LLM 推理抽象为**带 KV 内存约束**的多阶段**在线调度**,给出近似最优策略 | 基于流体模型的**在线批量与内存配给**决策 | 实验优于 vLLM/Sarathi 的吞吐/时延(作者报告) | 强理论导向,**内存-批量-时延**三者联动的在线算法 | 研究性,尚未并入主流实现 |
|
||||
| **Memory-aware 动态批处理** (2025) ([arXiv][9]) | 运行时监控**显存与 SLA**,**自适应**批大小与解码过程 | 显存感知的批调度 + 延迟反馈回路 | 在 Llama-7B+A100 上优于固定超参的吞吐/时延 | 更工程化的**在线批超参**调节方法 | 思路与 vLLM/SGLang 的自调度兼容;尚无主干合并记录 |
|
||||
| **HyGen** (2025) ([arXiv][10]) | **线上/离线**融合:两阶段调度在不破坏线上 SLO 的前提下挤出“离线”算力 | 两阶段 SLO-aware 调度与隔离 | 在离线共置下维持在线延迟并提升整体利用率 | 关注**业务混部**而非单一推理吞吐 | 可与任一后端并行部署;非 vLLM/SGLang 内核改动 |
|
||||
| **PrefillOnly** (2025) ([arXiv][11]) | 针对**只需 1 token 输出**的“prefill-only”工作负载做**极简 KV**与路径优化 | 仅保留**最后一层** KV、轻量运行路径 | 对检索/分类式工作负载显著提速降时延 | 面向特定负载类型的**路径裁剪** | 可作为特型后端/路径对接主流引擎 |
|
||||
| **vLLM / PagedAttention** (2023→) ([arXiv][12]) | **分页化 KV-Cache** + 预占式调度,近零碎片、易做连续批处理 | 块级内存管理、连续批、请求抢占 | 对 HF baseline 最高 **24×** 吞吐(早期报告) | 率先把**内存分页**与**连续批**标准化 | 已成为事实上的开源主流服务引擎之一 |
|
||||
| **Throughput-Optimal Scheduling for LLM Serving** (2025) ([arXiv][13]) | 从理论上给**连续批处理**的吞吐上界/最优策略 | 令牌级排队/匹配策略 | 给出与实践接轨的理论最优性结果 | 偏理论基线,指导工程实现 | 待工程化吸收 |
|
||||
| **Learning-to-Rank Scheduler** (2024) | 预测请求**相对长度排序**,更逼近 SJF 以降延迟/提吞吐 | LTR 训练的调度器,按预测顺序排队/并批 | 相比现有基线更优的延迟/完成时间 | 与两者同属**自动排队/批形成** | 可接入 vLLM/SGLang 的 admission/排队阶段。 ([arXiv][5]) |
|
||||
| **Online Scheduling with KV Constraints** (2025) | 把 LLM 推理抽象为**带 KV 内存约束**的在线调度,给出近优策略 | 流体/队列模型 + 在线批/KV 决策 | 优于 vLLM/Sarathi 基线(作者报告) | 与两者同属**自动在线调度**理论化 | 可做 vLLM/SGLang 的**策略插件**(需低开销遥测)。 ([arXiv][8]) |
|
||||
| **Fairness-Aware Batch Formation** (2025.10) | 在**连续/混合批**下自动平衡新/旧请求的“计算公平性”与吞吐 | 批内配额/重排策略 | 在保持吞吐下显著改善不公平 | 同属批形成自动化(与 Sarathi/Chunked 互补) | 可改造 vLLM/SGLang 的 batcher。 ([arXiv][10]) |
|
||||
| **Drift(PD-Multiplexing)** (2025) | **相位解耦复用**:预填与解码相位分离并“就地”复用,缓解吞吐-时延拉扯 | 相位解耦 + in-place compute 复用 | 在多负载下同时提吞吐与守 SLO | 与 NanoFlow/POD 同属**相位级重叠**路线 | 需要内核/调度联动,适合作为引擎深改方向。 ([Han Zhao 赵涵][11]) |
|
||||
|
||||
[1]: https://arxiv.org/html/2408.00741v1?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."
|
||||
[2]: https://arxiv.org/abs/2408.12757?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
|
||||
[3]: https://github.com/efeslab/Nanoflow?utm_source=chatgpt.com "efeslab/Nanoflow: A throughput-oriented high-performance ..."
|
||||
[4]: https://arxiv.org/abs/2308.16369?utm_source=chatgpt.com "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"
|
||||
[5]: https://docs.vllm.ai/en/v0.4.2/models/performance.html?utm_source=chatgpt.com "Performance and Tuning - vLLM"
|
||||
[6]: https://arxiv.org/pdf/2401.08671?utm_source=chatgpt.com "DeepSpeed-FastGen: High-throughput Text Generation for ..."
|
||||
[7]: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/POD-Attention-ASPLOS25.pdf?utm_source=chatgpt.com "POD-Attention: Unlocking Full Prefill-Decode Overlap for ..."
|
||||
[8]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5195463&utm_source=chatgpt.com "Optimizing LLM Inference: Fluid-Guided Online Scheduling ..."
|
||||
[9]: https://arxiv.org/pdf/2503.05248?utm_source=chatgpt.com "Optimizing LLM Inference Throughput via Memory-aware ..."
|
||||
[10]: https://arxiv.org/html/2501.14808v2?utm_source=chatgpt.com "1 Introduction"
|
||||
[11]: https://arxiv.org/html/2505.07203v1?utm_source=chatgpt.com "PrefillOnly: An Inference Engine for Prefill-only Workloads ..."
|
||||
[12]: https://arxiv.org/pdf/2309.06180?utm_source=chatgpt.com "Efficient Memory Management for Large Language Model ..."
|
||||
[13]: https://arxiv.org/html/2504.07347v1?utm_source=chatgpt.com "Throughput-Optimal Scheduling Algorithms for LLM ..."
|
||||
[14]: https://homes.cs.washington.edu/~arvind/papers/nanoflow.pdf?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
|
||||
[15]: https://discuss.vllm.ai/t/does-the-vllm-v1-support-speculative-decoding-now/191?utm_source=chatgpt.com "Does the vLLM v1 support Speculative Decoding now?"
|
||||
[16]: https://github.com/sgl-project/sglang/issues/2273?utm_source=chatgpt.com "[Kernel] Launch two kernels for mixed chunked prefill #2273"
|
||||
[17]: https://github.com/sgl-project/sglang/issues/6553?utm_source=chatgpt.com "[PD] Support Multi-Process for TokenizerManager #6553"
|
||||
[18]: https://arxiv.org/html/2312.07104v1?utm_source=chatgpt.com "Efficiently Programming Large Language Models using ..."
|
||||
[19]: https://iacoma.cs.uiuc.edu/iacoma-papers/hpca25_2.pdf?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."
|
||||
|
||||
|
||||
|
||||
|
||||
## Qwen optimization
|
||||
|
||||
|
||||
Catalogued how Alibaba’s Qwen-family models have been tuned across the codebase.
|
||||
|
||||
**Optimization Matrix**
|
||||
|
||||
| Model scope | Optimization & commits | Why it was added | Feature / hardware fit |
|
||||
| ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
|
||||
| Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput |
|
||||
| Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations |
|
||||
| Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements |
|
||||
| Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs |
|
||||
| Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads |
|
||||
| Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request |
| Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving |
| Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks |
| Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) → vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 |
| Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning |

Key observations

- Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound.
- Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers.
- Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput.

Next steps

1. Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads.
2. Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology.
| Model | Optimization | Why it was added | Feature / hardware fit |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- |
| Qwen3‑Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments |
| Qwen3‑Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP × EP + DeepEP doesn’t repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all |
| Qwen3‑Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding |
| Qwen3‑Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths |
| Qwen3‑Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks |
| Qwen3‑Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba’s published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper |
| Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers’ dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128 K) on dense Qwen3 checkpoints |
| Qwen3‑MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP × EP launches with DeepEP / allgather-reducescatter all-to-all |
| Qwen3‑MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters |
| Qwen3‑MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64–128 experts |
| Qwen3‑MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models |

Highlights & context

- The TP×EP fix (the sequence-parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active (a minimal sketch of this sharding pattern follows the list).
- SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner.
- Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba’s long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set.
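
A minimal, hypothetical sketch of the sequence-parallel sharding pattern referenced above: each TP rank keeps only its slice of the token dimension before expert dispatch, so routed experts never process the same token twice. The function name and the padding policy are illustrative, not vLLM's actual implementation.

```python
import torch
import torch.nn.functional as F

def sequence_parallel_chunk(hidden: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    """Shard the token dimension across TP ranks before MoE dispatch.

    hidden: (num_tokens, hidden_size). Tokens are padded so the count is
    divisible by tp_size, then each rank takes a disjoint contiguous slice.
    """
    num_tokens = hidden.shape[0]
    pad = (-num_tokens) % tp_size
    if pad:
        # (0, 0, 0, pad): leave hidden_size untouched, append `pad` zero rows.
        hidden = F.pad(hidden, (0, 0, 0, pad))
    return hidden.chunk(tp_size, dim=0)[tp_rank]

# Example: 10 tokens, hidden=4, 4 TP ranks -> each rank gets 3 rows (padded to 12).
x = torch.randn(10, 4)
print(sequence_parallel_chunk(x, tp_rank=1, tp_size=4).shape)  # torch.Size([3, 4])
```
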

Next steps

1. Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput.
2. If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect.
Evidence: when the team first tuned Qwen2’s 57B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured an improvement from 10.53 req/s (11,058 tok/s) to 12.47 req/s (13,089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics for hand-picked block and warp sizes on that GPU. The newer GB200 FP8 tables you’re looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance; that’s why they use more aggressive BLOCK_SIZE_N and higher num_stages.
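
For reference, these tuning tables are small JSON files keyed (as far as I can tell) by batch size, each entry carrying the Triton launch parameters. The sketch below shows the assumed shape of one entry as a Python dict; the specific numbers are illustrative, not copied from the shipped GB200 file.

```python
# Assumed structure of a fused_moe tuning entry (illustrative values only):
# one entry per batch-size key, each holding Triton tile and pipeline knobs.
example_config = {
    "16": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 128,   # larger N tiles on newer SKUs, per the commit notes
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3,       # deeper pipelining where the SMEM budget allows
    },
}
```
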

---

**Runtime Optimizations**

| Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes |
| --------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. |
| Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. |
| Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. |
| Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. |
| Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. |
| Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. |
| Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. |
| Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. |
| Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. |
| Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. |
| Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. |
| Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. |
| Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation (see the sketch after this table) in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. |
| Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. |
| Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. |
| Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. |
| Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. |
| Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. |
| Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. |
| Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. |
| Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. |
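The argsort-to-inverse-permutation swap noted for the Qwen2.5-VL rotary/window pipeline boils down to the following identity; this is a generic sketch of the trick, not the vLLM code.

```python
import torch

def inverse_permutation(perm: torch.Tensor) -> torch.Tensor:
    # O(n) scatter instead of an O(n log n) torch.argsort(perm):
    # if perm maps position i -> perm[i], then inv[perm[i]] = i.
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(perm.numel(), device=perm.device)
    return inv

perm = torch.tensor([2, 0, 3, 1])
assert torch.equal(inverse_permutation(perm), torch.argsort(perm))
```
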
**Kernel Config Tuning (Fused MoE / FP8)**
|Commits (Date)|Model / HW Target|Optimization|Perf Metrics|
|---|---|---|---|
|4d0f2661 (2025-10-20)|Qwen3-30B A3/A3B on H100 (FP8 & BF16)|Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes.|Not reported.|
|f96bc364 (2025-10-15)|Qwen3-Next FP8 on H100 TP=2|Introduced TP2-specific FP8 fused_moe config.|Not reported.|
|238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07)|Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100|New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage.|Not reported.|
|8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28)|Qwen3-Coder-480B-A35B on NVIDIA H20-3e|Added FP8 fused_moe configs for large coder variant.|Not reported.|
|2d40665 (2025-06-11)|Qwen3-30B A3B on NVIDIA B200|Introduced B200-specific FP8 fused_moe config.|Not reported.|
|22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08)|Qwen3-235B A22B on NVIDIA H20-3e & A100|Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100).|Not reported.|
|8fc88d63 (2025-04-28)|Qwen3 MoE (H100/H200/H20 targets)|Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README.|Not reported.|
|dcbac4cb (2025-04-28)|Qwen3 dense FP8|Adjusted linear layers so FP8 compatibility works with fused kernels.|Not reported.|
|2007d4d5 & f5a3c655 (2025-05-01)|Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X|Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]).|Not reported.|
|bd439735 (2024-06-14)|Qwen2-57B-A14B (TP2/TP4)|Tuned fused_moe configs for A100/H100; benchmarks show 10.53→12.47 req/s (+18%) at TP2 and 17.77→20.20 req/s (+14%) at TP4, with similar tokens/s gains.|Throughput gains published in commit message.|

Next steps (optional):

1. Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren’t reported.
2. For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.
Model optimization methods:

- Reduce copies between host and device (see the sketch after this list)
- Tune the fused_moe configs so the Triton kernels run efficiently for a specific E/N/device/quantization combination
- Increase the overlap between CPU and GPU work

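A minimal sketch of the first and third points: staging host tensors in pinned memory and issuing non-blocking copies lets the CPU keep preparing the next batch while the GPU consumes the current one. The tensor names here are illustrative.

```python
import torch

def stage_to_gpu(batch_cpu: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Pinned (page-locked) host memory is required for truly async H2D copies.
    pinned = batch_cpu.pin_memory()
    # non_blocking=True returns immediately; the copy overlaps with host work
    # until a kernel or an explicit synchronize consumes the tensor.
    return pinned.to(device, non_blocking=True)

if torch.cuda.is_available():
    seqlens_gpu = stage_to_gpu(torch.randint(1, 4096, (256,)))
    # ... CPU can build the next batch's metadata here while the copy is in flight
    torch.cuda.synchronize()
```
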
---

## notes

Mingxing Zhang:

Different phases of the same agent can have different business requirements, and different requirements call for different parallelism strategies.

value = function / cost

GPU utilization is already at 70–80%, so there is little room left to improve within a single scenario; the remaining gains come from serving multiple scenarios, where different functions are assigned different costs so that overall value is maximized.

Adapting Mooncake to vLLM and to sglang required different engineering approaches; each inference framework needs its own integration and optimization work.

---

## Model Optimization Summary

### Kernel

fused_moe/configs/xxx provides, for each combination of expert count, N, device, and quantization, a config file that guides how Triton generates the kernel, keeping SM utilization high while ensuring shared memory and registers do not overflow (see the lookup sketch below).

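As an illustration of how such a table might be consumed (a hypothetical helper, not the vLLM loader): pick the entry whose batch-size key is closest to the current token count and hand those parameters to the Triton launch.

```python
import json

def pick_tuned_config(path: str, num_tokens: int) -> dict:
    # Hypothetical lookup: entries are keyed by batch size; take the nearest key.
    with open(path) as f:
        table = json.load(f)
    key = min(table, key=lambda k: abs(int(k) - num_tokens))
    return table[key]

# e.g. pick_tuned_config(".../E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json", 384)
```
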
GDNAttention for Qwen3-Next

### Data Movement

---

Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint’s architecture while staying compatible with speculative decoding and parallelism features.
|Optimization|Match-to-model feature|Performance effect|Evidence/data|
|---|---|---|---|
|Gated DeltaNet linear-attention backend with fused gating kernels|Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs|Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance|Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo|
|Shared Fused MoE with expert parallel load balancing|Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps|Maintains Qwen3 Next’s shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost|Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present|
|NextN multi-token predictor (MTP) path|Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner|Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled|MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published|
|Mamba-style state management for speculative decode|Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default “no speculative” guard specifically for Qwen3 Next|Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding—otherwise disallowed for generic Mamba models|State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied|
No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.
Next steps (optional): 1) run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers; 2) capture profiling traces to confirm the GDN layers hit the fused kernels.

With PD disaggregation introduced, the scheduler matters even more under EP×DP: adjusting the parallel layout online may help less than scheduler adjustments that keep the request pattern each rank receives relatively stable.

17
projects/auto-tuner/scrolling.md
Normal file
@@ -0,0 +1,17 @@
So far, nothing fundamentally different from brute-force enumeration has emerged (i.e., no genuinely more effective discovery).

How can we reliably leverage general-purpose AI models to optimize real systems under noisy measurements, hard safety constraints, and large discrete configuration spaces, while preventing hallucinated actions and ensuring reproducibility?

Some problems with AI Tuner:

Lacks background knowledge: it can misread high GPU memory utilization (~93% HBM) as a fault, when in fact vllm by design pre-allocates and essentially fills GPU memory.

Strengths of AI Tuner:

It can report that a workload is compute-bound (evidence: p95 GPU utilization reaching 100%),
and it can detect poorly tuned scheduling and batching, adjusting max_num_batched_tokens and max_num_seqs (a guarded-update sketch follows).

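A minimal sketch of how such knob adjustments can be kept safe against hallucinated actions: validate every proposed change against an explicit whitelist and numeric bounds before it reaches the engine. The bounds and knob names here are illustrative assumptions, not a fixed policy.

```python
# Illustrative guard for tuner-proposed engine-arg changes (bounds are assumptions).
ALLOWED_KNOBS = {
    "max_num_batched_tokens": (2048, 65536),
    "max_num_seqs": (16, 1024),
}

def validate_proposal(proposal: dict[str, int]) -> dict[str, int]:
    accepted = {}
    for knob, value in proposal.items():
        if knob not in ALLOWED_KNOBS:
            continue  # reject unknown / hallucinated knobs outright
        lo, hi = ALLOWED_KNOBS[knob]
        accepted[knob] = min(max(int(value), lo), hi)  # clamp into the safe range
    return accepted

print(validate_proposal({"max_num_batched_tokens": 32768, "tensor_parallel_size": 999}))
# {'max_num_batched_tokens': 32768}
```
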
BIN
projects/auto-tuner/sync2.figs/: 26 new binary figures (260410-105227.png, 260410-105227-1.png through -15.png, 260410-105228.png, 260410-105228-1.png through -9.png)