
# TODO
- [ ] Analyze the root cause of why enabling EP helps or hurts performance
- [ ] Analyze how frequent internal changes to cuda_graph_sizes affect performance, and how that maps to engine performance

Defining workload prefill intensity, considering three dimensions:
- Total prefill tokens
- Distribution of request input lengths (how exactly does length affect performance?)
- Burstiness: the CV (coefficient of variation) of QPS
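The three dimensions above can be sketched as one summarizer over a request trace. This is a minimal sketch under assumptions: the trace is a list of hypothetical `(arrival_time_s, input_tokens)` tuples, and burstiness is measured as the CV of per-second QPS buckets.

```python
import statistics

def prefill_intensity(requests, window_s):
    """Summarize the three prefill-intensity dimensions for one trace window.

    `requests`: hypothetical list of (arrival_time_s, input_tokens) tuples;
    this layout is an assumption, not a specific trace format.
    """
    lengths = sorted(toks for _, toks in requests)
    # Dimension 1: total prefill tokens in the window.
    total_prefill_tokens = sum(lengths)
    # Dimension 2: distribution of request input lengths.
    length_stats = {
        "mean": statistics.mean(lengths),
        "p95": lengths[int(0.95 * (len(lengths) - 1))],
    }
    # Dimension 3: burstiness as the coefficient of variation
    # (stddev / mean) of per-second request counts.
    buckets = [0] * int(window_s)
    for t, _ in requests:
        buckets[min(int(t), int(window_s) - 1)] += 1
    qps_cv = statistics.pstdev(buckets) / statistics.mean(buckets)
    return total_prefill_tokens, length_stats, qps_cv
```

Two traces with identical total prefill tokens can then still differ sharply on the other two axes, which is the point of keeping all three.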
`export VLLM_DISABLE_COMPILE_CACHE=1`
`export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"4":0}'`
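The second variable takes a JSON map. A minimal sketch of parsing it, assuming (unverified against vLLM's docs) that keys are world sizes and values are allreduce-fusion size thresholds in MB:

```python
import json
import os

# Assumed semantics: key = world size (e.g. TP degree), value = fusion
# threshold in MB; a value of 0 would effectively disable the fused path.
os.environ["VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB"] = '{"4":0}'

thresholds = {
    int(k): float(v)
    for k, v in json.loads(
        os.environ["VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB"]
    ).items()
}
print(thresholds)  # {4: 0.0}
```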
As total prefill load grows, going from TP1 -> TP2 not only raises TPS but also lowers latency.
| **window_id** | trace_type | time_scale | prefill_tokens_per_second | real_prefill_tokens_per_second | prefix_cache_hit_tokens_per_second | total_prefill_tokens | total_real_prefill_tokens | total_prefix_cache_hit_tokens | prefix_cache_hit_rate | config_id | tensor_parallel_size | max_num_seqs | max_num_batched_tokens | goodput | goodput_per_gpu | num_slo_pass | total_requests | num_errors | experiment_duration | mean_ttft | p95_ttft | started_at | finished_at | log_path | server_log_path | num_good_tokens | good_token_per_second | good_token_per_second_per_gpu |
| ------------- | ---------- | ---------- | ------------------------- | ------------------------------ | ---------------------------------- | -------------------- | ------------------------- | ----------------------------- | --------------------- | --------- | -------------------- | ------------ | ---------------------- | ----------------- | ----------------- | ------------ | -------------- | ---------- | ------------------- | ------------------ | ------------------ | ------------------ | ------------------ | ------------------------------------------------- | ------------------------------------------------------ | --------------- | --------------------- | ----------------------------- |
| **chat_w06** | chat | 1.0 | 12045.351666666700 | 6850.685 | 5194.666666666670 | 7227211 | 4110411 | 3116800 | 0.4312590292437840 | A | 1 | 32 | 8192 | 5.426623109964860 | 5.426623109964860 | 3258 | 3436 | 0 | 600.373369216919 | 0.735489982664932 | 2.7120952010154700 | 1773196637.8148000 | 1773197238.188170 | results/experiment_logs/chat_w06__config_A.jsonl | results/experiment_logs/chat_w06__config_A.server.log | 7119724 | 11858.82713166710 | 11858.82713166710 |
| **coder_w02** | coder | 1.0 | 31264.74 | 14433.46 | 16831.280000000000 | 18758844 | 8660076 | 10098768 | 0.5383470324717240 | B | 2 | 64 | 16384 | 5.418584987116260 | 2.709292493558130 | 3252 | 3264 | 1 | 600.1566843986510 | 0.6616424000997540 | 2.0865608930587800 | 1773206129.1402500 | 1773206729.2969400 | results/experiment_logs/coder_w02__config_B.jsonl | results/experiment_logs/coder_w02__config_B.server.log | 18745884 | 31234.98327571430 | 15617.49163785710 |
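The derived columns in the table are internally consistent; a quick check with values copied from the two rows (goodput_per_gpu should equal goodput / tensor_parallel_size, and prefix_cache_hit_rate should equal hit tokens / total prefill tokens):

```python
# (config, tp, goodput, goodput_per_gpu, hit_tokens, total_prefill, hit_rate)
rows = [
    ("A", 1, 5.426623109964860, 5.426623109964860,
     3116800, 7227211, 0.4312590292437840),
    ("B", 2, 5.418584987116260, 2.709292493558130,
     10098768, 18758844, 0.5383470324717240),
]
for cfg, tp, gp, gp_gpu, hit, total, rate in rows:
    assert abs(gp / tp - gp_gpu) < 1e-9, cfg   # per-GPU goodput = goodput / TP
    assert abs(hit / total - rate) < 1e-9, cfg  # hit rate = hit / total prefill
```

So config B's halved goodput_per_gpu is purely the TP=2 divisor; absolute goodput is nearly identical across the two rows.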
- EP off
```
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": false'
qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
38: "enable_expert_parallel": false,
qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
74: "enable_expert_parallel": false,
qwen3-plusp/256k-1106/h20_prefill.config
43: "enable_expert_parallel": false,
qwen3-235b-a22b/256k-0717/h20_prefill.config
43: "enable_expert_parallel": false,
qwen3-max/256k-0922-fp4/h20_prefill.config
43: "enable_expert_parallel": false,
qwen3-235b-a22b/256k-0723-think-cs/h20_prefill.config
44: "enable_expert_parallel": false,
```
- EP on
```
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": true'
qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
74: "enable_expert_parallel": true,
qwen3-235b-a22b/256k-0717/h20_decode.config
76: "enable_expert_parallel": true,
qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
76: "enable_expert_parallel": true,
qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
44: "enable_expert_parallel": true,
qwen3-plusp/256k-1106/h20_decode.config
76: "enable_expert_parallel": true,
qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
77: "enable_expert_parallel": true,
```
- TP
```
gahow@Gahow-MBA configs % rg '"tensor_parallel_size"'
qwen3-max/256k-0922-fp4/h20_prefill.config
51: "tensor_parallel_size": 8,
qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
84: "tensor_parallel_size": 8,
qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
84: "tensor_parallel_size": 8,
qwen3-coder-plus/1m-0922-re-fp8/h20.config
14: "tensor_parallel_size": 4,
qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
96: "tensor_parallel_size": 8,
qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
43: "tensor_parallel_size": 4,
qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
84: "tensor_parallel_size": 4,
qwen3-30b-a3b-with-gate-next-fp4/instruct-fp4/h20.config
28: "tensor_parallel_size": 1,
```
- DP
```
gahow@Gahow-MBA configs % rg '"data_parallel_size"'
qwen3-plusp/256k-1106/h20_decode.config
85: "data_parallel_size": 8,
qwen3-235b-a22b/256k-0717/h20_decode.config
85: "data_parallel_size": 8,
qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
86: "data_parallel_size": 8,
qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
83: "data_parallel_size": 1,
qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
83: "data_parallel_size": 1,
qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
83: "data_parallel_size": 1,
```
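The four `rg` sweeps above could be folded into one pass. A hypothetical sketch, assuming each `*.config` file is a JSON object (which the `"key": value,` lines suggest, though the surrounding file format is unverified):

```python
import json
from pathlib import Path

# The three parallelism knobs grepped for above.
KEYS = ("enable_expert_parallel", "tensor_parallel_size", "data_parallel_size")

def summarize_configs(root):
    """Map each *.config under `root` to whichever of KEYS it sets.

    Assumes every config file parses as a single JSON object; files
    with a different layout would need a line-based fallback.
    """
    summary = {}
    for path in sorted(Path(root).rglob("*.config")):
        cfg = json.loads(path.read_text())
        found = {k: cfg[k] for k in KEYS if k in cfg}
        if found:
            summary[str(path.relative_to(root))] = found
    return summary
```

This makes cross-cutting questions (e.g. "which decode configs enable EP but keep DP at 1?") a dict comprehension instead of three separate greps.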