obsidian/period/daily/26/260312.md

TODO

  • Analyze the root cause of why enabling EP helps or hurts performance
  • Analyze how frequent internal changes to cuda_graph_sizes affect performance, and how that maps to engine performance
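Background for the second TODO item: vLLM captures CUDA graphs at a fixed set of batch sizes and pads smaller batches up to the nearest captured size, so the granularity of cuda_graph_sizes controls how many batch slots are wasted on padding. A minimal illustrative sketch of that padding rule and its waste metric (not vLLM's actual implementation):

```python
import bisect

def padded_batch_size(batch_size: int, capture_sizes: list[int]) -> int:
    """Pad a batch up to the nearest captured CUDA graph size (illustrative)."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    if i == len(sizes):
        return batch_size  # larger than any captured graph: run eagerly
    return sizes[i]

def padding_waste(batch_size: int, capture_sizes: list[int]) -> float:
    """Fraction of slots in the padded batch that carry no real request."""
    padded = padded_batch_size(batch_size, capture_sizes)
    return 1.0 - batch_size / padded

# A coarse capture list wastes more compute at in-between batch sizes:
coarse = [8, 16, 32, 64]
fine = [1, 2, 4, 8, 16, 24, 32, 48, 64]
print(padding_waste(17, coarse))  # 17 -> 32: ~47% of slots are padding
print(padding_waste(17, fine))    # 17 -> 24: ~29% of slots are padding
```

This is one candidate mechanism for why changing cuda_graph_sizes shows up in engine performance: finer lists pad less but cost more capture time and memory.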

Defining the prefill intensity of a workload along three dimensions:

  • total prefill tokens
  • distribution of request input lengths (how exactly does length affect performance?)
  • burstiness: coefficient of variation (CV) of QPS
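The three dimensions above can be computed directly from a request trace. A sketch, assuming each request is an (arrival_ts, input_len) pair (the field names and the mean/p95 summary of the length distribution are my choices, not a fixed schema):

```python
import statistics

def prefill_intensity(requests, window_s: float = 1.0):
    """Summarize a trace's prefill intensity along three dimensions:
    total prefill tokens, input-length distribution, and QPS burstiness."""
    lens = [n for _, n in requests]
    total_prefill_tokens = sum(lens)

    # Input-length distribution: mean / p95 as a coarse shape summary.
    lens_sorted = sorted(lens)
    p95 = lens_sorted[min(len(lens_sorted) - 1, int(0.95 * len(lens_sorted)))]

    # Burstiness: coefficient of variation of per-window QPS.
    t0 = min(ts for ts, _ in requests)
    buckets: dict[int, int] = {}
    for ts, _ in requests:
        w = int((ts - t0) // window_s)
        buckets[w] = buckets.get(w, 0) + 1
    qps = [buckets.get(i, 0) / window_s for i in range(max(buckets) + 1)]
    mean_qps = statistics.mean(qps)
    cv = statistics.pstdev(qps) / mean_qps if mean_qps else 0.0

    return {
        "total_prefill_tokens": total_prefill_tokens,
        "mean_input_len": statistics.mean(lens),
        "p95_input_len": p95,
        "qps_cv": cv,
    }
```

A uniform arrival process gives a CV near 0; bursty traffic (idle windows followed by spikes) pushes it toward or above 1.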

```shell
export VLLM_DISABLE_COMPILE_CACHE=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"4":0}'
```

Total prefill load increases, yet going from TP1 -> TP2 not only raises TPS but also lowers latency:

| field | chat_w06 | coder_w02 |
| --- | --- | --- |
| trace_type | chat | coder |
| time_scale | 1.0 | 1.0 |
| prefill_tokens_per_second | 12045.351666666700 | 31264.74 |
| real_prefill_tokens_per_second | 6850.685 | 14433.46 |
| prefix_cache_hit_tokens_per_second | 5194.666666666670 | 16831.280000000000 |
| total_prefill_tokens | 7227211 | 18758844 |
| total_real_prefill_tokens | 4110411 | 8660076 |
| total_prefix_cache_hit_tokens | 3116800 | 10098768 |
| prefix_cache_hit_rate | 0.4312590292437840 | 0.5383470324717240 |
| config_id | A | B |
| tensor_parallel_size | 1 | 2 |
| max_num_seqs | 32 | 64 |
| max_num_batched_tokens | 8192 | 16384 |
| goodput | 5.426623109964860 | 5.418584987116260 |
| goodput_per_gpu | 5.426623109964860 | 2.709292493558130 |
| num_slo_pass | 3258 | 3252 |
| total_requests | 3436 | 3264 |
| num_errors | 0 | 1 |
| experiment_duration | 600.373369216919 | 600.1566843986510 |
| mean_ttft | 0.735489982664932 | 0.6616424000997540 |
| p95_ttft | 2.7120952010154700 | 2.0865608930587800 |
| started_at | 1773196637.8148000 | 1773206129.1402500 |
| finished_at | 1773197238.188170 | 1773206729.2969400 |
| log_path | results/experiment_logs/chat_w06__config_A.jsonl | results/experiment_logs/coder_w02__config_B.jsonl |
| server_log_path | results/experiment_logs/chat_w06__config_A.server.log | results/experiment_logs/coder_w02__config_B.server.log |
| num_good_tokens | 7119724 | 18745884 |
| good_token_per_second | 11858.82713166710 | 31234.98327571430 |
| good_token_per_second_per_gpu | 11858.82713166710 | 15617.49163785710 |
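To make the comparison concrete: the per-GPU columns are just the totals divided by tensor_parallel_size. A quick check with numbers copied from the two rows above:

```python
# chat_w06 on config A (TP=1) vs coder_w02 on config B (TP=2),
# values taken from the results table above.
chat = {"tp": 1, "good_tps": 11858.83, "mean_ttft": 0.7355, "p95_ttft": 2.7121}
coder = {"tp": 2, "good_tps": 31234.98, "mean_ttft": 0.6616, "p95_ttft": 2.0866}

for name, r in [("chat_w06/A", chat), ("coder_w02/B", coder)]:
    per_gpu = r["good_tps"] / r["tp"]
    print(f"{name}: {per_gpu:.2f} good tok/s per GPU, mean TTFT {r['mean_ttft']:.3f}s")

# Despite ~2.6x the total prefill load, TP2 delivers higher per-GPU
# throughput (~15617 vs ~11859 tok/s) AND lower TTFT (0.662s vs 0.735s).
```

So the TP2 win is superlinear per GPU here, which is the anomaly worth explaining in the EP/TP analysis.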
  • EP off

```text
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": false'
qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
38:        "enable_expert_parallel": false,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
74:        "enable_expert_parallel": false,

qwen3-plusp/256k-1106/h20_prefill.config
43:        "enable_expert_parallel": false,

qwen3-235b-a22b/256k-0717/h20_prefill.config
43:        "enable_expert_parallel": false,

qwen3-max/256k-0922-fp4/h20_prefill.config
43:        "enable_expert_parallel": false,

qwen3-235b-a22b/256k-0723-think-cs/h20_prefill.config
44:        "enable_expert_parallel": false,
```
  • EP on

```text
gahow@Gahow-MBA configs % rg '"enable_expert_parallel": true'
qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
74:        "enable_expert_parallel": true,

qwen3-235b-a22b/256k-0717/h20_decode.config
76:        "enable_expert_parallel": true,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
76:        "enable_expert_parallel": true,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
44:        "enable_expert_parallel": true,

qwen3-plusp/256k-1106/h20_decode.config
76:        "enable_expert_parallel": true,

qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
77:        "enable_expert_parallel": true,
```
  • TP

```text
gahow@Gahow-MBA configs % rg '"tensor_parallel_size"'
qwen3-max/256k-0922-fp4/h20_prefill.config
51:        "tensor_parallel_size": 8,

qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
84:        "tensor_parallel_size": 8,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
84:        "tensor_parallel_size": 8,

qwen3-coder-plus/1m-0922-re-fp8/h20.config
14:        "tensor_parallel_size": 4,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_prefill.config
96:        "tensor_parallel_size": 8,

qwen3-coder-plus/1m-0922-config-fix/h20_prefill.config
43:        "tensor_parallel_size": 4,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
84:        "tensor_parallel_size": 4,

qwen3-30b-a3b-with-gate-next-fp4/instruct-fp4/h20.config
28:        "tensor_parallel_size": 1,
```
  • DP

```text
gahow@Gahow-MBA configs % rg '"data_parallel_size"'
qwen3-plusp/256k-1106/h20_decode.config
85:        "data_parallel_size": 8,

qwen3-235b-a22b/256k-0717/h20_decode.config
85:        "data_parallel_size": 8,

qwen3-235b-a22b/256k-0723-think-cs/h20_decode.config
86:        "data_parallel_size": 8,

qwen3-max/256k-0922-fp4-pc-eplb/h20_decode.config
83:        "data_parallel_size": 1,

qwen3-1.16t-a86b/256k-0922-sq-re-nvfp4/l20c_decode.config
83:        "data_parallel_size": 1,

qwen3-coder-plus/1m-0922-config-fix/h20_decode.config
83:        "data_parallel_size": 1,
```
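The four rg passes above could be collapsed into one pass that tabulates EP/TP/DP per config file. A sketch, assuming the .config files are JSON and the keys may sit at any nesting depth (both are assumptions):

```python
import json
import pathlib

KEYS = ("enable_expert_parallel", "tensor_parallel_size", "data_parallel_size")

def find_keys(obj, keys):
    """Recursively collect the named keys from a nested JSON object."""
    found = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k in keys:
                found[k] = v
            found.update(find_keys(v, keys))
    elif isinstance(obj, list):
        for item in obj:
            found.update(find_keys(item, keys))
    return found

def scan_configs(root: str):
    """Yield (path, parallelism settings) for every .config file under root."""
    for path in sorted(pathlib.Path(root).rglob("*.config")):
        try:
            settings = find_keys(json.loads(path.read_text()), KEYS)
        except (json.JSONDecodeError, OSError):
            continue  # skip non-JSON or unreadable files
        if settings:
            yield path, settings

if __name__ == "__main__":
    for path, s in scan_configs("configs"):
        print(path, s)
```

One pass makes it easy to spot the pattern in the rg output above: prefill configs tend toward EP off / high TP, while decode configs enable EP, often with DP.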