Initial commit: obsidian to gitea
- AITuner Trace A | 10 min

https://nas.gahow.org/webdav/ai_tuner_logs/dash_prod/ai_tuner_260307-152305.jsonl

```
2026-03-08 06:28:06 [INFO] run_trace_replayer summary: success=True best_goodput_qps_per_gpu=0.798171666969416 selected_time_scale=0.4678486930206418
2026-03-08 06:28:18 [INFO] Agent loop finished. Total runs=5, LLM calls=6
2026-03-08 06:28:18 [INFO] Original vs best production config diff:
2026-03-08 06:28:18 [INFO]   engine_args.block_size: 64 -> 32
2026-03-08 06:28:18 [INFO]   engine_args.enable_expert_parallel: False -> True
2026-03-08 06:28:18 [INFO]   engine_args.max_num_batched_tokens: 8192 -> 32768
2026-03-08 06:28:18 [INFO]   engine_args.tensor_parallel_size: 4 -> 2
2026-03-08 06:28:18 [INFO]   envs.VLLM_MOE_USE_DEEPEP: '0' -> '1'
2026-03-08 06:28:18 [INFO] Baseline vs best production config diff:
2026-03-08 06:28:18 [INFO]   engine_args.block_size: 64 -> 32
2026-03-08 06:28:18 [INFO]   engine_args.enable_expert_parallel: False -> True
2026-03-08 06:28:18 [INFO]   engine_args.max_num_batched_tokens: 8192 -> 32768
2026-03-08 06:28:18 [INFO]   engine_args.tensor_parallel_size: 4 -> 2
2026-03-08 06:28:18 [INFO]   envs.VLLM_MOE_USE_DEEPEP: '0' -> '1'
2026-03-08 06:28:18 [INFO] Persisted best production config to /usr/local/lib/python3.12/dist-packages/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20.config
2026-03-08 06:29:02 [INFO] AI tuner JSONL log: runs/ai_tuner_logs/ai_tuner_260307-152305.jsonl
```
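The `old -> new` diff lines in the log can be reproduced by a recursive comparison over the two nested config dicts. A minimal sketch (the function name and structure are my illustration, not the tuner's actual code; the values are taken from the log above):

```python
def config_diff(old: dict, new: dict, prefix: str = "") -> list[str]:
    """Recursively diff two nested config dicts into 'path: old -> new' lines."""
    lines = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            lines += config_diff(a, b, prefix=path + ".")
        elif a != b:
            # repr formatting reproduces the log's style: ints unquoted,
            # env-var string values quoted ('0' -> '1').
            lines.append(f"{path}: {a!r} -> {b!r}")
    return lines

original = {"engine_args": {"block_size": 64, "tensor_parallel_size": 4},
            "envs": {"VLLM_MOE_USE_DEEPEP": "0"}}
best = {"engine_args": {"block_size": 32, "tensor_parallel_size": 2},
        "envs": {"VLLM_MOE_USE_DEEPEP": "1"}}
print("\n".join(config_diff(original, best)))
```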

```
(codex-tuner) wjh@ds-dda11ac6-1-847f4dd4c5-vg6mr:~/auto-tuner/tuner$ cat micro_trace_replayer/comm_tp4_noep_traceA_10min.jsonl.summary.json
{
    "duration_ms": "600078.484",
    "e2e_mean_ms": "429.185",
    "e2e_p90_ms": "1163.643",
    "e2e_p95_ms": "1557.812",
    "e2e_p99_ms": "2594.410",
    "output_tokens_total": "2048",
    "requests_success": "2048",
    "requests_total": "2048",
    "throughput_rps": "3.413",
    "throughput_tps": "3.413",
    "tpot_mean_ms": "0.000",
    "ttft_mean_ms": "0.000"
}
(codex-tuner) wjh@ds-dda11ac6-1-847f4dd4c5-vg6mr:~/auto-tuner/tuner$ cat micro_trace_replayer/ali_tp4_noep_traceA_10min.jsonl.summary.json
{
    "duration_ms": "600082.137",
    "e2e_mean_ms": "523.779",
    "e2e_p90_ms": "1478.267",
    "e2e_p95_ms": "2008.634",
    "e2e_p99_ms": "3050.164",
    "output_tokens_total": "2048",
    "requests_success": "2048",
    "requests_total": "2048",
    "throughput_rps": "3.413",
    "throughput_tps": "3.413",
    "tpot_mean_ms": "0.000",
    "ttft_mean_ms": "0.000"
}
```
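A quick side-by-side of the two summaries above (values inlined from the JSON; note the replayer stores numbers as strings, so they must be converted first — the comparison script itself is mine, not part of the replayer):

```python
# Latency fields from comm_tp4_noep vs ali_tp4_noep, copied from the summaries above.
comm = {"e2e_mean_ms": "429.185", "e2e_p95_ms": "1557.812", "e2e_p99_ms": "2594.410"}
ali  = {"e2e_mean_ms": "523.779", "e2e_p95_ms": "2008.634", "e2e_p99_ms": "3050.164"}

for key in comm:
    a, b = float(comm[key]), float(ali[key])
    print(f"{key}: ali/comm = {b / a:.3f} ({100 * (b / a - 1):.1f}% higher)")
```

At identical offered load (same `throughput_rps`), the ali build shows ~20% higher mean and tail e2e latency on this trace.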

## Open issues

The serving path under test in Alibaba's production environment is `API -> Gateway -> Chat Serving (CPU machine for encoding) -> Model Serving (vLLM)`.
This creates a problem: AITuner can no longer directly read the performance metrics exposed by the vLLM engine (KV-cache hit ratio, KV-cache usage, queuing ratio, average running requests, average queued requests).

`"VLLM_ATTENTION_BACKEND": "FLASH_ATTN_VLLM_V1" -> "FLASH_ATTN"` [link](https://code.alibaba-inc.com/PAI-LLM/vllm/commit/2f5a20b76049a33d79bbe2a4ab8abd8ae84581c7?spm=21540d8c.2ca2ac6d.0.0.62cf790b5rvf9T)

The fork also carries many intermediate state values with no clear documented definition. The tuning system cannot reason about them: each knob lacks a clearly specified value range and an explanation of what each value means.

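For reference, when the engine is directly reachable these metrics come from vLLM's Prometheus-style `/metrics` endpoint. A minimal parsing sketch (the metric names below follow upstream vLLM releases and may differ on the internal fork; the sample text is fabricated for illustration):

```python
import re

def parse_prometheus(text: str) -> dict[str, float]:
    """Minimal parser for Prometheus text exposition: keep 'name{labels} value' lines."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blanks
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)", line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen3"} 42.0
vllm:num_requests_waiting{model_name="qwen3"} 7.0
vllm:gpu_cache_usage_perc{model_name="qwen3"} 0.83
"""
print(parse_prometheus(sample))
```

Behind the gateway, the tuner would need these re-exported (or a scrape path into the model-serving pods) to recover the same signals.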
## Current-state summary

`git log --oneline -- util/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20_prefill.config`

The prefill config of a single model (Qwen3-235B-A22B) received 10 commits between 25.11.18 and 26.01.08.

## Supplement

All commits touching configs:

`git log --oneline -- util/vllmgen/configs`

From 25.07.31 (the first vllmgen commit) to 26.02.03, a span of roughly 6 months, there were 92 commits in total (about one commit every 2 days).

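The "one commit every ~2 days" figure is easy to sanity-check from the dates above:

```python
from datetime import date

first_commit = date(2025, 7, 31)  # first vllmgen commit
last_commit = date(2026, 2, 3)    # end of the counted window
days = (last_commit - first_commit).days
print(f"{days} days / 92 commits = {days / 92:.2f} days per commit")
```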
## configs commit summary

ecc7b539136cd34d29bf4cdb215ae298c019be76
initial commit: common.config

36faa64b969b633187645b29fa997b177d120f37
Add util/vllmgen/configs/qwen3-235b-a22b-h20-4.config

929e6bc9c9ac17fdd087ed74c9dadc41c2812d43
Add util/vllmgen/configs/qwen3-235b-a22b/128k-0426-hf-fp4/h20.config and util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config

cb65b0c128c07725817278f2ddb020ddd2df0722
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config: increase max_seq_len

22d382a13baf68e278dd1c616b379d59f34dcdbe
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config: switch to vLLM v1

==09b005191f10e2b93c3159c6d0d0d6735287269b==
util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config adds:
```
"compilation_config": {
    "use_inductor": false,
    "custom_ops": [
        "all"
    ]
},
```

9007270573a7a6c3a8d88bcc45261751fd2379f3
Add util/vllmgen/configs/draft-models/qwen3-235b-a22b/0723-thinking-eagle-0821/h20.config

7f392d7e608974fc56f9fd6913d6be3f52ff8ec9
Update `max_num_batched_tokens` in util/vllmgen/configs/qwen3-235b-a22b/256k-0723-thinking-fp4/h20.config and util/vllmgen/configs/qwen3-30b-a3b/128k-0425-fp4/h20.config
Add util/vllmgen/configs/qwen3-vl-30b-a3b/128k-0717-fp4/h20.config

4e90f0582cd7f86312e5ae01c9cd7d59132723fd
Enable `enable_prefix_caching` in util/vllmgen/configs/qwen3-vl-30b-a3b/128k-0717-fp4/h20.config

4535fd0e39f8a181c73a7669eaa379feae26fe57
Add util/vllmgen/configs/qwen3-vl-235b-a22b/256k-thinking-0920-fp4/h20.config

60296b817616aaeec919fc9dc68b11dc9f9e9124
Remove VLLM_ENABLE_TORCH_COMPILE

39a9bb43feb6e6b22f7ed8cefe6442d973bf6057
typo

781fee1a1a9ba86228684c1f2a1e0f9726ab3e2b
Support qwen-coder/max/next

b1c5efe09ef9df162d442fc7b71932d41f3a426b
change meta

a3389997e420de0b9c45342be6f806ff9179e193
use `PIECEWISE` compilation_config for cudagraph_mode

dc0761d070e9068f64e3c33926fe3bc0c7d62f7b
remove unused TP size

dc83943aa9f3b2d4c90dde3c6f2fe49458877ba1
bug fix: set VLLM_USE_UPSTREAM_FA3=0

216194960c290d64914ffad88338587069889e6c
Remove `PIECEWISE`

b75c0cf11dedcf4ee0396ec939e1e0e80267593f
typo

a09d8da72fc9ca837862b61850cdb324223cfd76
typo

34b32eb885ecb07296d7a2bbe58d5e0ee028d30c
Add:
```
"VLLM_USE_FLASHINFER_SAMPLER": "0"
"VLLM_ENABLE_TORCH_COMPILE": "1"
```

adba559382c4ab60738a0b2783e27acc801d2ab7
Set `"VLLM_FLASH_ATTN_USE_UPSTREAM": "0"`

517bc31caaf26c6f1cd119a5283f2e5b2cb141c7
Use
```
"compilation_config": {
    "cudagraph_mode": "PIECEWISE",
    "use_inductor": false,
    "custom_ops": [
        "all"
    ]
}
```

4259a99557467b48cbae606c0092bf0b75ab3f84
meta

e1d1b64420426406afcfe3a36d10172b6473ead8
Adjust gpu memory util to avoid OOM from KVS swap

13294741780ff79bbab8a54e98286e07a1ba4b0e
meta

5b2f35fe49033711c9cfcbd2afe7036905f462a6
`"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`

ea95b933221ba192d9e0a2a06e6c84bee4a32544
`"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`

4994e414d8ee812feebe23eed473abcf5166f246
```
"VLLM_FLASH_ATTN_FP8_ATTENTION": "1",
"VLLM_ENABLE_TORCH_COMPILE": "1"
```

985ffa6156c37f8a299870722f51bdfec9fba2f4
Add 5090 support: util/vllmgen/configs/qwen3-30b-code-switch/v2/5090.config

ae600ea07735cbe03b1b862054133d805f85ed5e
Support PD disaggregation: util/vllmgen/configs/qwen3-235b-a22b/256k-0717/h20_prefill.config

98038e62d8f3f7b94079cc287bdf10e4f4c64267
Support a series of qwen-vl models

2f5a20b76049a33d79bbe2a4ab8abd8ae84581c7
typo fix

cc3484b98cfeccc8c9d6426c4f1238545832cc14
Remove `"VLLM_VL_VISION_NUM_GRID_PER_SIDE": "27"`

d71559b5f3fd6084bc738c4dc22af0566cf133eb
meta

3ec865d4b2db0cd98b537124d5ca791e458cba93
Support a series of closed-source qwen-vl models

31a07b6eceb0bda01dcd0db4da0e4d390251facb
no bladnn gdn by default

896de6b72d079bf4d9c568667189c3d711e9ebde
meta

213d89408b1c45d6ae5ed6593268dc41d07eb12e
use `VLLM_FLASH_ATTN_FP8_ATTENTION`

de918bd2d52e30cba3cff640af3f1e73fdb841da
add qwen3-vl-235b-a22b pd

7835de1c4fffb8f2c5ad1a8af3501d1dfcc60728
support qwen-vl-max dashgen on L20X and A800

b3d465adb53e993bc4bfc069ad869232fb36823d
make draft model support PD decoupling

dcc0ab7fc9ed21c0763086f1ae1428633f53ba8e
Fix the placement of rope scaling

==f66a954d5e68c829f7d841ed9aa020ec45d11436==
Modify `gpu_memory_utilization`, `max_num_batched_tokens`, `cuda_graph_sizes` (different for P and D), `cudagraph_mode`

75ab97b48ffab568b69f9dfd34ffbee4d10c771b
meta

f67b2e2ede2b88bf27827a8edd96e38188bccb69
Remove "async_scheduling": false and "disable_custom_all_reduce": true

9c497637ad23eb9a9abaf53327e4ff263a4630ea
Encode-Prefill-Decode Disaggregation on v6d

b739c437bdac2ba0751853ed4335019f7b7e8486
Remove "async_scheduling": false

85a66eb3a241f900ea785ae0a9b2ac5981f9e9b6
Support util/vllmgen/configs/qwen3-30b-a3b/1m-thinking-0728-fp4-cs-gate/h20.config

44c01d5177f36e4b0c2802b6aac8439136f79685
Support util/vllmgen/configs/qwen3-omni/qwen3_omni_final_ckpt_multilingual_0915/h20.config

936f71e642edd3a87286deabff38ae73c7bd3aea
Change mm_processor_cache_gb 50 -> 10

==0c8dab3e33fc93e98fa15a37b4aa784aa1316abc==
Lower decode max_num_batched_tokens 8192 -> 1024, raise decode max_num_seqs 128 -> 192, raise prefill max_num_batched_tokens 4096 -> 8192; change decode cuda_graph_sizes, adding 128 and 192

ca65f85a37efa5ef95912d1618b0f5405df53c7c
support moe eplb

741d586dd1fc0bc98c16fd16146d1f77e72ba4e8
Support qwen3-max-fp pd

49486c6187263204427074780ac22b4aef475a55
meta

d62a67eeb1eba461fa434a0297717c838d1e7d7d
Support kvs; gpu_memory_utilization 0.89 -> 0.88

e1fdbf4629f8b70463ae45cc2b0e408712e60af4
add qwen3-vl-plus pd

6b18610c39857c90d5e076b8b693ab847b92a107
PD use eth1

0cb6b11757a5eb46f6a021cee8b7f08b84256410
Support qwen3-plusp

8915d3fe5c587c2268dfa49dfbf022d2bfd6c7b0
Update qwen-vl-max

383ce3e60eb9904faa61006452b2be4ff250ec24
chore: mm_processor_cache_gb semantics, false -> 0

8b30846690623a80b51e975718a84a0d571a97ed
Add support: util/vllmgen/configs/qwen-vl-max/epd_disagg_llm/h20.config

de80aaf239033cf0d59e32e0bf3beee16651fed7
Stop using fp8 for qwen-vl-max on a800 (not supported by A800)

9d63c50b3328a1d748929fb4f76cff49ac883b7d
chore: mm_processor_cache_gb semantics, false -> 0

==2fe8ab568315c227cfbade4e7759b1767cfa003c==
util/vllmgen/configs/qwen3-coder-plus/1m-0922-re-fp8/h20.config: set TP4, disable kv transfer

5a12337e3f38b0d11b21604a1f3f5e739d6b126b
chore: remove num_experts: 0

5f591ffe74f15fe9e5b4cd5864471b3ffa76874a
Disable loose top-k verification in the model's decode config, disable async scheduling, and slightly increase GPU memory utilization

2e87a5d3e86e597efea8478cc74dd92a9d6dba0b
Support coder plus 0923 pd + kvs

bdbdf44d4c67462ca5d120b25684a5b8804a7a18
Today's commits changed the fp4 per-channel loading logic, so the model configs need updating

85c50bf76194bf52dc2e3e1ce7c8b2e0e17cbfc1
Add loose topk settings:
```
"use_loose_topk_verify": 1,
"loose_topk_verify_threshold": 0.95,
"loose_topk_verify_max_tokens": 2000,
```

d350466cf7d71b4dbcd03bb8ff1ec60b648940a6
chore: VLLM_KVT_MAX_DELAY_MS

936a3b7286425fdd7dc996b6b4f23f19cabdb83b
Support util/vllmgen/configs/draft-models/ci-test/model-version-v1/h20_batch.config

b3d96f257d229ab7b8d6d6a1b5f92f7d740acd7b
util/vllmgen/configs/qwen3-30b-a3b-with-gate-next-fp4/instruct-fp4/h20.config: support kvs; `max_num_batched_tokens` 1024 -> 4096

1ca19f8042d081c437bf036ca469c946c6c30868
Add
```
"pass_config": {
    "fuse_norm_quant": false,
    "fuse_act_quant": false,
    "fuse_attn_quant": false
}
```

80e602a32f0cd59f84e834049161dff5a1aaa6a8
Support util/vllmgen/configs/qwen3-30b-code-switch/v2/5090.config; remove flashinfer

7018965a139a5763809e888eb0c35ef9743ad3d8
Add util/vllmgen/configs/qwen3-235b-a22b/256k-0723-think-cs/h20_prefill.config; update the PD-connection environment variables for all models

6a1b40876d4bc884a805effa0cc7c7a28c38ea01
kvs support for plusp and other models

613d885a17360b9f90c10bedf8d900815d8b5668
util/vllmgen/configs/qwen3-plusp/256k-1106/h20_prefill.config: remove "enforce_eager": false and cuda_graph_sizes

29f88ea3e6addf5a155c2bf000e7819171be0b39
fix

1ad174a24cad5e51e2ccf71a828751d4e65c51a6
util/vllmgen/configs/qwen3-omni/qwen3_omni_final_ckpt_multilingual_1124/h20.config: enable enable_chunked_prefill; add max_model_len: 65535

c8f776e277a6545b3e1696a4e9261282bca681f9
Disable the preprocess cache for vl model by default

9d011bf9ec443f7d3ccecd7e44b6d9a526a0089a
Add the task-bypass option `VLLM_ENABLE_BYPASS_TASK`; adjust async scheduling, GPU memory utilization, and prefix-caching settings; disable enable_prefix_caching

dab633c1871054cfe823669e970b04d89ba3b7b5
Remove:
```
"DS_LLM_ENABLE_DISAGGREGATED_VIT": "1",
"DS_LLM_LAUNCH_VINEYARD": "0",
"SRPC_STREAM_DISABLE_BAREX": "1"
```

67ab53a5f673f4fc629c2a4354e2b3ddf556dc08
Support Qwen3.5 VL MTP

27d64ee33ecac2b94efaa44f5bbefa5998574d05
Add KVS config for Qwen3-VL-Plus

8fb2d746069f17e5c6e951b0fd2f5da4c608f3e1
util/vllmgen/configs/qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config: max_num_seqs 256 -> 512

ebf2b71630984c383f3824eb7e4e45a617416274
"max_num_batched_tokens": 8192 for util/vllmgen/configs/qwen3-vl-plus, prefill

c18e84e20cc15bbc2d14ddb17b26aaf7dc78e94f
Add `speculative_config`:
```
"num_speculative_tokens": 32,
"hf_overrides": {
    "max_position_embeddings": 262144
}
```

3bef0c622d1ec18e652e82cbc24ebf481e9025d5
Support qwen3-max on l20c

ce495cc242affcfb321cf6dcabf190c3f1589667
qwen3-vl-flash: enable async_schedule

8152532eeb0a191e69b868cc888392a0c74ab0fe
qwen3-vl-flash adds
```
"pass_config": {
    "fuse_norm_quant": false,
    "fuse_act_quant": false,
    "fuse_attn_quant": false
},
```

The conclusions below are a classification analysis **based on the commit summaries you provided**, not a forensic line-by-line review of every diff; but from a systems perspective, they are enough to support one very clear judgment: **most of the config churn in this repository is not genuine performance tuning, but model onboarding, feature enablement, compatibility fixes, and runtime-stability patching.**

# Commit Analysis

## 1. Conclusions first

The commits you listed total **91**.

I suggest reading the "performance-related share" through two criteria:

| Criterion | Definition | Count | Share |
| ---- | ---- | -------: | --------: |
| Strict | **Directly changes the value of a performance flag in an existing config**, e.g. `max_num_batched_tokens`, `max_num_seqs`, `gpu_memory_utilization`, `async_scheduling`, `cuda_graph_sizes`, `torch compile`, `prefix caching` | **28** | **30.8%** |
| Broad | The strict criterion, plus commits that **newly introduce a performance-relevant mechanism / backend / parallelism layout**, e.g. speculative decoding, FP8 attention, PD/KVS config onboarding | **~40** | **44.0%** |

In other words:

* **If what you care about is "commits that actually do tuning"**, the share is only about **30%**;
* **If you also count performance-related feature enablement**, it is at most **a bit over 40%**;
* So the observation that "**the vast majority of commits have nothing to do with performance optimization itself**" holds under the strict criterion.

This is actually valuable: it shows that **change in the production config repo is mostly control-plane churn, not data-plane tuning**.

---
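The shares in the table follow directly from the counts:

```python
total = 91                  # commits classified above
strict, broad = 28, 40      # "broad" is approximate in the source analysis
print(f"strict: {100 * strict / total:.1f}%  broad: {100 * broad / total:.1f}%")
```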

## 2. Why this happens, from first principles

When an online vLLM per-model config repo accumulates commits, it does not necessarily mean "the performance state space is huge". More commonly, the changes fall into four classes:

1. **Onboarding a new model / new hardware / new mode**
   e.g. adding qwen3-vl, omni, 5090, A800, L20C, PD, KVS, MTP.
   These changes are essentially about "can it run / how to wire it up", not "now that it runs, how do we push goodput to the optimum".

2. **Compatibility and correctness fixes**
   e.g. typos, fixing the placement of rope scaling, fp4 per-channel loading-logic changes, an upstream backend being unavailable, some hardware not supporting fp8.
   The goal here is "avoid being wrong / avoid crashing / avoid the unsupported".

3. **Stability and resource safety-margin adjustments**
   e.g. `gpu_memory_utilization` from 0.89 to 0.88, avoiding OOM from KVS swap.
   These are performance-adjacent, but the priority is usually **survivability, not peak throughput**.

4. **Genuine steady-state performance tuning**
   For example:

* `max_num_batched_tokens`
* `max_num_seqs`
* `async_scheduling`
* `cuda_graph_sizes`
* `cudagraph_mode`
* `torch compile`
* `prefix caching`
* `chunked_prefill`
* `TP`

This last class of commits is the part your paper should really latch onto.

---

## 3. Which commits actually changed performance flags

I group them by system mechanism.

### A. Batching / scheduling / graph capture

This group looks the most like tuning.

* `7f392d7e...`
  Changes `max_num_batched_tokens`

* `f66a954d...`
  Changes `gpu_memory_utilization`, `max_num_batched_tokens`, `cuda_graph_sizes`, `cudagraph_mode`, with distinct P / D values

* `0c8dab3e...`
  decode: `max_num_batched_tokens 8192 -> 1024`
  decode: `max_num_seqs 128 -> 192`
  prefill: `max_num_batched_tokens 4096 -> 8192`
  also changes decode `cuda_graph_sizes`

* `f67b2e2e...`
  Removes `"async_scheduling": false` and `"disable_custom_all_reduce": true`

* `b739c437...`
  Removes `"async_scheduling": false`

* `5f591ffe...`
  Disables loose top-k verify, disables async scheduling, slightly increases `gpu_memory_utilization`

* `8fb2d746...`
  decode `max_num_seqs 256 -> 512`

* `ebf2b716...`
  prefill `max_num_batched_tokens = 8192`

* `ce495cc2...`
  Enables `async_schedule` for qwen3-vl-flash

* `1ad174a2...`
  Enables `enable_chunked_prefill`, increases `max_model_len`

* `613d885a...`
  Removes `enforce_eager: false` and `cuda_graph_sizes`

#### The reasoning behind these changes

This is the classic systems tradeoff:

* `max_num_batched_tokens` caps **the token workload of a single step**.
  Raising it generally helps GPU occupancy and amortizes per-kernel overhead, but it also raises memory pressure, lengthens per-step tail latency, and makes graph capture harder.

* `max_num_seqs` sets **the concurrency width**.
  In decode, raising it usually aims to advance more sequences concurrently and lift tokens/s; push it too far and scheduler overhead, KV pressure, and tail latency all climb.

* The optima for `prefill` and `decode` are simply different.
  The `0c8dab3e...` pattern, raising the prefill token budget while lowering decode's token budget but raising its sequence count, is very sound from a systems standpoint:

  * prefill leans on large matmuls, is throughput-oriented, and benefits from a larger batch-token budget
  * decode is memory/bandwidth-bound with few tokens per step, and benefits from more concurrent sequences rather than a larger token lump

* `cuda_graph_sizes` / `cudagraph_mode` are fundamentally a bet on **whether shapes are stable enough**.
  If workload shapes are concentrated, graph capture trims launch overhead;
  if shapes are too scattered, graphs miss or even become a burden, which is why `PIECEWISE` was added and later removed.

* `async_scheduling` is not "always faster".
  It can win under high concurrency, but it can also introduce scheduling jitter, clash with certain verification logic, or add extra complexity on some model/path combinations, which is why it ends up enabled for some models and disabled for others.

---
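The prefill/decode budget interaction above can be made concrete with a toy step model (this is my illustration of the tradeoff, not vLLM's actual scheduler):

```python
def schedule_step(running_decodes: int, queued_prefill_tokens: int,
                  max_num_batched_tokens: int, max_num_seqs: int) -> tuple[int, int]:
    """One engine step under a token budget: each running decode consumes
    one token of budget; whatever remains goes to (possibly chunked) prefill."""
    decode_tokens = min(running_decodes, max_num_seqs)
    prefill_tokens = max(min(queued_prefill_tokens,
                             max_num_batched_tokens - decode_tokens), 0)
    return decode_tokens, prefill_tokens

# Prefill-style config: a 30k-token prompt fits into a single step.
print(schedule_step(128, 30000, max_num_batched_tokens=32768, max_num_seqs=256))
# Decode-style config: the same prompt must be chunked across many steps,
# but every step stays short, which protects decode latency.
print(schedule_step(128, 30000, max_num_batched_tokens=1024, max_num_seqs=192))
```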

### B. Memory / cache / long context

* `cb65b0c1...`
  Increases `max_seq_len`

* `4e90f058...`
  Enables `enable_prefix_caching`

* `e1d1b644...`
  Adjusts `gpu_memory_utilization` to avoid OOM from KVS swap

* `d62a67ee...`
  With KVS support, `gpu_memory_utilization 0.89 -> 0.88`

* `936f71e6...`
  `mm_processor_cache_gb 50 -> 10`

* `383ce3e6...`, `9d63c50b...`
  `mm_processor_cache_gb false -> 0`

* `c8f776e2...`
  Disables the VL preprocess cache by default

* `9d011bf9...`
  Adjusts `gpu_memory_utilization`, disables `enable_prefix_caching`

* `5b2f35fe...`, `ea95b933...`, `cc3484b9...`
  `VLLM_VL_VISION_NUM_GRID_PER_SIDE = 27`, later removed

#### The reasoning

These changes all deal with one underlying problem:

> **GPU memory is not reserved for model weights and KV cache alone.**
> Under long context, VL, KVS, and PD, the preprocessing cache, vision grid, prefix cache, and KV-transfer buffers all compete for it.

So you see two typical directions:

1. **Leave a safety margin on memory**
   `gpu_memory_utilization` is lowered not to go faster, but because any higher triggers swap / OOM / worsening fragmentation.
   Such commits are "performance-related", but more precisely they are **capacity-boundary tuning**.

2. **Caching does not always pay off**
   `prefix caching`, the `preprocess cache`, and `mm_processor_cache_gb` are only worth it when the hit rate is high.
   In real production with multimodal inputs, long contexts, and mixed workloads, a cache can bring:

* memory consumption
* management overhead
* cache pollution
* low hit rates

So you see "enabled then disabled" or "off by default" transitions, which shows **these are not universally beneficial knobs**.

---
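The `gpu_memory_utilization` moves make sense once the budget is written out. A back-of-envelope sketch (the 96 GB figure is the H20's HBM capacity; the weight/activation/buffer sizes are illustrative placeholders, not measured values):

```python
def kv_cache_gb(total_gb: float, util: float, weights_gb: float,
                activations_gb: float, other_gb: float = 0.0) -> float:
    """Roughly: the engine claims util * total, and whatever is not weights,
    activation workspace, or other buffers is left for KV-cache blocks."""
    return total_gb * util - weights_gb - activations_gb - other_gb

H20_GB = 96  # H20 HBM capacity
# Hypothetical per-GPU shard sizes; "other" models a KVS transfer buffer.
before = kv_cache_gb(H20_GB, 0.89, weights_gb=60, activations_gb=8)
after = kv_cache_gb(H20_GB, 0.88, weights_gb=60, activations_gb=8, other_gb=1.0)
print(f"KV cache: {before:.2f} GB -> {after:.2f} GB once KVS buffers are budgeted")
```

The point is that a 0.01 change in `util` shifts whole gigabytes, and any new resident buffer (KVS, preprocess cache, vision grid) must come out of the same pool.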

### C. Kernel / backend / compile path

* `09b00519...`
  `compilation_config.use_inductor = false`, `custom_ops = ["all"]`

* `60296b81...`
  Removes `VLLM_ENABLE_TORCH_COMPILE`

* `a3389997...`
  Uses `PIECEWISE` for `cudagraph_mode`

* `21619496...`
  Removes `PIECEWISE`

* `34b32eb8...`
  `VLLM_USE_FLASHINFER_SAMPLER = 0`
  `VLLM_ENABLE_TORCH_COMPILE = 1`

* `adba5593...`
  Sets `VLLM_FLASH_ATTN_USE_UPSTREAM = 0`

* `dc83943a...`
  Sets `VLLM_USE_UPSTREAM_FA3 = 0`

* `4994e414...`, `213d8940...`
  Enables `VLLM_FLASH_ATTN_FP8_ATTENTION`

* `1ca19f80...`, `8152532e...`
  `pass_config.fuse_* = false`

* `31a07b6e...`
  no bladnn gdn by default

#### The reasoning

This group looks very much like real-world "backend-path pinning":

* A theoretically more aggressive backend / fusion / compile path is **not necessarily faster, and even less necessarily stable**;
* Once FP4/FP8, VL, quantization, a special attention backend, or specific hardware is involved, the actual optimum often depends on very specific combinations of conditions.

So what these commits usually mean is not:

> "we discovered a more advanced performance optimization"

but rather:

> "some backend path is unstable, unavailable, or a net loss on this model / precision / hardware combination, so pin to the verified implementation".

These commits do affect performance, of course, but their essence is closer to **backend selection and compatibility convergence** than to black-box tuning in the usual sense.

---

### D. Parallelism layout / system structure

* `2fe8ab56...`
  `TP=4`, disables KV transfer

* `85c50bf7...`, `5f591ffe...`
  Loose top-k verify switches, later disabled

* `c18e84e2...`
  Adds `speculative_config.num_speculative_tokens = 32`

* `ae600ea0...`, `de918bd2...`, `b3d465ad...`, `9c497637...`, `741d586d...`, `e1fdbf46...`, `2e87a5d3...`, `27d64ee3...`
  PD / EPD / KVS support

#### The reasoning

This group points to a more important fact:

> What really moves a serving config is often not a single flag, but **a change in the system's structure**.

For example:

* Once **PD/EPD** is introduced, the optimal `max_num_batched_tokens`, `max_num_seqs`, and `cuda_graph_sizes` for prefill and decode should no longer be the same;
* Once **KVS** is introduced, the memory headroom must be re-budgeted;
* Once **speculative decoding** is in use, the decode optimum shifts;
* The choice of `TP` is fundamentally a balance of **parallel-compute gains vs. communication overhead vs. KV layout vs. per-node capacity**.

So what these commits demonstrate is not "there are many flags", but:

**The real source of configuration-space complexity is switches in system topology, not dozens of flags each being fine-tuned all the time.**

---

## 4. The "core performance flags" that actually get retuned are few

Across this batch of commits, the recurring core performance knobs are mainly:

* `max_num_batched_tokens`
* `max_num_seqs`
* `gpu_memory_utilization`
* `async_scheduling`
* `cuda_graph_sizes`
* `cudagraph_mode`
* `VLLM_ENABLE_TORCH_COMPILE`
* `enable_prefix_caching`
* `mm_processor_cache_gb`
* `TP`
* a handful of backend choices:

  * `VLLM_FLASH_ATTN_FP8_ATTENTION`
  * `VLLM_FLASH_ATTN_USE_UPSTREAM`
  * `VLLM_USE_UPSTREAM_FA3`
  * `VLLM_USE_FLASHINFER_SAMPLER`
  * `pass_config.fuse_*`

This matters a lot for your story:
**the config files look flag-heavy, but what repeatedly enters the performance decision loop is a relatively low-dimensional, recurring core subset.**

---
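Written out as a search space, the recurring subset is small even before pruning. A sketch with illustrative candidate values (the candidates are placeholders for the idea, not the production ranges):

```python
from math import prod

# Candidate values are illustrative, not taken from the repo.
search_space = {
    "max_num_batched_tokens": [1024, 4096, 8192, 32768],
    "max_num_seqs": [128, 192, 256, 512],
    "gpu_memory_utilization": [0.85, 0.88, 0.90],
    "async_scheduling": [True, False],
    "cudagraph_mode": ["FULL", "PIECEWISE", None],
    "enable_prefix_caching": [True, False],
    "tensor_parallel_size": [2, 4, 8],
}

size = prod(len(v) for v in search_space.values())
print(f"{len(search_space)} recurring knobs, {size} naive combinations")
```

A few thousand naive combinations is large for hand-tuning but well within reach of a replay-driven tuner, which is the practical upshot of the low dimensionality.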

## 5. What these commits best support in your paper/document

I suggest structuring the conclusions along the following lines:

### Conclusion A: production config churn is high, but most of it is not performance tuning

Because a large share of the commits are:

* new-model onboarding
* new-hardware enablement
* feature onboarding for PD/KVS/VL/MTP and the like
* typo / meta / chore
* correctness and compatibility fixes

So **"many commits" must not be equated with "a huge performance-config space under continuous deep optimization"**.

### Conclusion B: genuine performance tuning concentrates on a few recurring knobs

Mainly:

* batching width
* scheduling policy
* graph/compile switches
* memory headroom
* cache switches
* backend path
* structural choices such as TP / PD / KVS

In other words, **the performance space is not a fully black-box space of many messy flags; it has clear system structure.**

### Conclusion C: many "performance-related" commits are really fixing resource boundaries and backend stability

For example:

* lowering `gpu_memory_utilization` to avoid OOM
* disabling the prefix / preprocess cache
* pinning a particular flash-attn or compile backend
* removing an unstable graph mode

This shows that the first objective of online config evolution is usually:

> **First make it run stably; then do local optimization on a small set of core knobs.**

---

## 6. My final judgment

Your original intuition is right, and the observation is actually a strong one:

> **The vLLM per-model configs see many commits, but the vast majority are not "fine-grained performance tuning"; what continuously affects performance is a small set of recurring flags plus a small number of system-topology switches (PD/KVS/TP/speculative).**

If you like, my next step can be to turn this into a more formal Chinese-language internal document, structured as:

**Background → Method → Commit classification statistics → High-frequency performance flags → Systems explanation per flag class → Implications for AITuner research**

so you can send it directly to colleagues or your supervisor.