# TODO

- [ ] Analyze the root cause of why enabling EP helps or hurts performance
- [ ] Analyze how frequently changing cuda_graph_sizes inside the engine affects performance, and how that maps to engine-level performance

Definition of workload prefill intensity, considering three dimensions (see the sketch after this list):

- Total prefill tokens
- Distribution of request input lengths (how exactly does length matter?)
- Burstiness (coefficient of variation of the QPS)
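
A minimal sketch of how these three dimensions could be measured from a request trace. The record fields (`arrival_ts`, `input_len`, `cached_len`), the `Request` type, and the 1-second windowing are assumptions for illustration, not the benchmark's actual schema.

```python
"""Summarize the prefill intensity of a trace along three axes:
total prefill tokens, input-length distribution, and burstiness (QPS CV)."""
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ts: float    # seconds since trace start (assumed field)
    input_len: int       # prompt tokens (assumed field)
    cached_len: int = 0  # prefix-cache hit tokens, if known


def prefill_intensity(trace: list[Request], window_s: float = 1.0) -> dict:
    lengths = sorted(r.input_len for r in trace)
    total = sum(lengths)
    # "Real" prefill excludes tokens served from the prefix cache.
    real = sum(r.input_len - r.cached_len for r in trace)

    def pct(p: float) -> int:
        return lengths[int(p * (len(lengths) - 1))]

    # Burstiness: coefficient of variation of per-window request counts (QPS CV).
    horizon = max(r.arrival_ts for r in trace)
    counts = [0] * (int(horizon // window_s) + 1)
    for r in trace:
        counts[int(r.arrival_ts // window_s)] += 1
    qps_cv = statistics.pstdev(counts) / statistics.mean(counts)

    return {
        "total_prefill_tokens": total,
        "total_real_prefill_tokens": real,
        "input_len_p50": pct(0.50),
        "input_len_p95": pct(0.95),
        "qps_cv": qps_cv,
    }
```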

`export VLLM_DISABLE_COMPILE_CACHE=1` (disables vLLM's torch.compile cache, so runs do not reuse previously compiled artifacts)

`export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"4":0}'` (per-world-size thresholds, in MB, for FlashInfer's fused allreduce; a 0 MB threshold at world size 4 presumably disables the fusion for TP4)

The total prefill load increases, yet going from TP1 -> TP2 not only raises TPS but also lowers latency:
| **window_id** | trace_type | time_scale | prefill_tokens_per_second | real_prefill_tokens_per_second | prefix_cache_hit_tokens_per_second | total_prefill_tokens | total_real_prefill_tokens | total_prefix_cache_hit_tokens | prefix_cache_hit_rate | config_id | tensor_parallel_size | max_num_seqs | max_num_batched_tokens | goodput | goodput_per_gpu | num_slo_pass | total_requests | num_errors | experiment_duration | mean_ttft | p95_ttft | started_at | finished_at | log_path | server_log_path | num_good_tokens | good_token_per_second | good_token_per_second_per_gpu |
| ------------- | ---------- | ---------- | ------------------------- | ------------------------------ | ---------------------------------- | -------------------- | ------------------------- | ----------------------------- | --------------------- | --------- | -------------------- | ------------ | ---------------------- | ----------------- | ----------------- | ------------ | -------------- | ---------- | ------------------- | ------------------ | ------------------ | ------------------ | ------------------ | ------------------------------------------------- | ------------------------------------------------------ | --------------- | --------------------- | ----------------------------- |
| **chat_w06** | chat | 1.0 | 12045.351666666700 | 6850.685 | 5194.666666666670 | 7227211 | 4110411 | 3116800 | 0.4312590292437840 | A | 1 | 32 | 8192 | 5.426623109964860 | 5.426623109964860 | 3258 | 3436 | 0 | 600.373369216919 | 0.735489982664932 | 2.7120952010154700 | 1773196637.8148000 | 1773197238.188170 | results/experiment_logs/chat_w06__config_A.jsonl | results/experiment_logs/chat_w06__config_A.server.log | 7119724 | 11858.82713166710 | 11858.82713166710 |
| **coder_w02** | coder | 1.0 | 31264.74 | 14433.46 | 16831.280000000000 | 18758844 | 8660076 | 10098768 | 0.5383470324717240 | B | 2 | 64 | 16384 | 5.418584987116260 | 2.709292493558130 | 3252 | 3264 | 1 | 600.1566843986510 | 0.6616424000997540 | 2.0865608930587800 | 1773206129.1402500 | 1773206729.2969400 | results/experiment_logs/coder_w02__config_B.jsonl | results/experiment_logs/coder_w02__config_B.server.log | 18745884 | 31234.98327571430 | 15617.49163785710 |
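
Since config A runs on 1 GPU (TP1) and config B on 2 GPUs (TP2), the per-GPU columns are what make the two rows comparable. Below is a minimal sketch of that normalization using the numbers from the two rows above (rounded); it assumes the GPU count equals tensor_parallel_size, i.e. no data parallelism in these runs.

```python
# Sketch: normalize the two result rows per GPU so the TP1 and TP2 runs
# are comparable. Column names follow the table above; values are rounded.
rows = [
    {"config_id": "A", "tensor_parallel_size": 1,
     "real_prefill_tokens_per_second": 6850.69,
     "good_token_per_second": 11858.83, "mean_ttft": 0.735, "p95_ttft": 2.712},
    {"config_id": "B", "tensor_parallel_size": 2,
     "real_prefill_tokens_per_second": 14433.46,
     "good_token_per_second": 31234.98, "mean_ttft": 0.662, "p95_ttft": 2.087},
]

for r in rows:
    gpus = r["tensor_parallel_size"]  # assumed: one GPU per TP rank, no DP
    print(
        f"{r['config_id']}: "
        f"real prefill tok/s/GPU = {r['real_prefill_tokens_per_second'] / gpus:,.0f}, "
        f"good tok/s/GPU = {r['good_token_per_second'] / gpus:,.0f}, "
        f"mean TTFT = {r['mean_ttft']:.3f}s, p95 TTFT = {r['p95_ttft']:.3f}s"
    )
```

Config B handles a much larger prefill load, yet even per GPU it sustains higher throughput while both mean and p95 TTFT drop, which is the observation stated above.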

- EP off
```
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": false'
qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
38: "enable_expert_parallel": false,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
74: "enable_expert_parallel": false,

qwen3-plusp/256k-1106/h20_prefill.config
43: "enable_expert_parallel": false,

qwen3-235b-a22b/256k-0717/h20_prefill.config
43: "enable_expert_parallel": false,

qwen3-max/256k-0922-fp4/h20_prefill.config
43: "enable_expert_parallel": false,

qwen3-235b-a22b/256k-0723-think-cs/h20_prefill.config
44: "enable_expert_parallel": false,
```

- EP on
```
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": true'
qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
74: "enable_expert_parallel": true,

qwen3-235b-a22b/256k-0717/h20_decode.config
76: "enable_expert_parallel": true,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
76: "enable_expert_parallel": true,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
44: "enable_expert_parallel": true,

qwen3-plusp/256k-1106/h20_decode.config
76: "enable_expert_parallel": true,

qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
77: "enable_expert_parallel": true,
```

- TP
```
gahow@Gahow-MBA configs % rg '"tensor_parallel_size"'
qwen3-max/256k-0922-fp4/h20_prefill.config
51: "tensor_parallel_size": 8,

qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
84: "tensor_parallel_size": 8,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
84: "tensor_parallel_size": 8,

qwen3-coder-plus/1m-0922-re-fp8/h20.config
14: "tensor_parallel_size": 4,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
96: "tensor_parallel_size": 8,

qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
43: "tensor_parallel_size": 4,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
84: "tensor_parallel_size": 4,

qwen3-30b-a3b-with-gate-next-fp4/instruct-fp4/h20.config
28: "tensor_parallel_size": 1,
```

- DP
```
gahow@Gahow-MBA configs % rg '"data_parallel_size"'
qwen3-plusp/256k-1106/h20_decode.config
85: "data_parallel_size": 8,

qwen3-235b-a22b/256k-0717/h20_decode.config
85: "data_parallel_size": 8,

qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
86: "data_parallel_size": 8,

qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
83: "data_parallel_size": 1,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
83: "data_parallel_size": 1,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
83: "data_parallel_size": 1,
```
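
The four ad-hoc greps above could be collapsed into a single pass over the config tree. A sketch, assuming each `*.config` file under `configs/` is plain JSON (possibly nested, hence the recursive walk); the path and key list simply mirror the greps.

```python
"""Summarize EP/TP/DP settings across config files in one pass."""
import json
from pathlib import Path

KEYS = ("enable_expert_parallel", "tensor_parallel_size", "data_parallel_size")


def walk_nested(obj):
    """Yield (key, value) pairs from arbitrarily nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield k, v
            yield from walk_nested(v)
    elif isinstance(obj, list):
        for item in obj:
            yield from walk_nested(item)


for path in sorted(Path("configs").rglob("*.config")):
    cfg = json.loads(path.read_text())
    found = {k: v for k, v in walk_nested(cfg) if k in KEYS}
    if found:
        print(path, found)
```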