## TODO

done:
- codex-chat
- codex-chat-5090: codex resume 019d4945-4991-7331-a848-1be6fd702e9f
- codex-coder
- scoot-chat
- scoot-thinking-prefill
- scoot-thinking-decode

dash1: codex-thinking-decode
dash2: codex-thinking-prefill
dash3: scoot-coder
dash5: scoot-chat-5090

```bash
# chat
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl

# prefill-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_thinking_day0_t0p04_fixedcount/sampled_traces/thinking_w20260323_peak_1000.jsonl

# coder
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_coder_peak_7day_fixedcount/sampled_traces/coder_w20260311_peak_1000.jsonl

# decode-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/plans/dash0123_8gpu__qwen235b__internal__decode_only__thinking__legal11_thinking_decode_only_weekly0321_0327_peak_local8/traces/thinking_w20260321_peak_1000.jsonl
```

So in the code I will directly make the internal profile's "chunked prefill required" rule a hard constraint, instead of continuing to learn it from timeout failures.

Fig 7/8: add a semi-real trace with the same ongoing load as the real trace:

dash0: qwen235b decode-only test
dash1/2: qwen235b thinking 30min test
dash3: qwen-coder-next coder 30min test
dash5: 5090 qwen27b chat-0-32k test

4.1 Show that data from one workload cannot be used to tune a different cluster
✅ 4.2
  - ✅ synthetic/semi-real/real performance comparison data: 83.91, 98.19, 98.4; 65.22, 86.03, 98.28
  - ✅ synthetic/semi-real/real similarity comparison data
  - comparison of tuned-best under the chat/thinking/coder prefixes
  - similarity comparison under the chat/thinking/coder prefixes
4.3 agent harness summary
5 tuner vs baseline

✅ default config

```bash
# Qwen3.5-27B
# https://huggingface.co/Qwen/Qwen3.5-27B
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# Qwen3-Coder-Next
# https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html#basic-multi-gpu-setup
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching

# Qwen3-235B-A22B-FP8
# https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-FP8
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 262144

# https://github.com/aliez-ren/vllm-qwen3.5-nvfp4-sm120?utm_source=chatgpt.com
vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 \
  --max-model-len 234567 \
  --gpu-memory-utilization 0.89 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --quantization fp8 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 131072 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --tensor-parallel-size 1

# https://huggingface.co/Qwen/Qwen3.5-27B-FP8?utm_source=chatgpt.com
vllm serve Qwen/Qwen3.5-27B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
```

```bash
# run qwen27b batching
# running on dash2
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/launch_qwen35_27b_tp2dp1_epoff_batching_chat0_32k_weekly_peak.sh

# run the evaluator comparison
# running on dash1
CASE_KIND=chat TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=coder TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=thinking_prefill TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=thinking_decode TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

# [x] For qwen35_27b: align with the online trace, search the 0~4k threshold, then test
./workflow threshold-search \
  --hardware dash0123_8gpu \
  --model qwen35_27b \
  --engine internal \
  --workload chat \
  --phase prefill_decode \
  --trace-type chat-0-4k \
  --max-threshold 0.5

# qwen3-coder-next can't run with EP?
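# (sketch) The four evaluator-comparison launches above can be driven by one
# loop; CASE_KIND/TRACE_SUITE and the launcher script are the ones already
# used in this file. This is a dry run -- drop the `echo` to actually launch.
for kind in chat coder thinking_prefill thinking_decode; do
  echo CASE_KIND="$kind" TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
done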
# [x] The jobs on dash3 need to be re-run on dash2
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
bash workflow_output/plans/dash0123_8gpu__qwen35_27b__internal__prefill_decode__chat__legal10_chat0_4k/run_results_v2_trace_dash3.sh --machine-label dash2

# [x] run qwen35_27b 0~32k
./launch_qwen35_27b_chat_0_32k_after_trace_prepare.sh
```

## Runnability

qwen-235b ✅
qwen27b ✅
qwen-coder: needs switching to the matching vllm build
qwen-30b: needs switching to a container whose version supports flashinfer

```
pip install -U flashinfer-python

flashinfer >= 0.7

wjh@ds-f74814b6-1-65cd484875-256zt:~$ pip list | grep flashinfer
flashinfer-cubin 0.6.4
flashinfer-jit-cache 0.6.4
flashinfer-python 0.6.4
```

## Online performance

qwen27b (40 instances):
- qps: Mean 4.00, Max 5.67
- prefill: Mean 193k tpm, Max 339k tpm
- decode: Mean 72.4k tpm, Max 119k tpm
- first-token latency: Mean 1.59 s, Max 11.3 s
- last-token latency: Mean 23.6 s, Max 46.2 s

qwen30b-a3b: Mean 0.00267 qps, Max 0.109 qps

## Models

Name: qwen3-235b-a22b, version: 256k-0717
Name: qwen3-235b-a22b, version: 0717-eagle-0820
Name: qwen3-30b-a3b, version: 1m-instruct-0726-fp4
Name: qwen3-30b-a3b, version: 1m-thinking-0728-fp4
Name: qwen3-coder-next, version: 1m-20260129-re-mtp-fp8-torch-dtype
Name: qwen3-coder-next, version: 1m-20260129-xml-tool-parser-fix
Name: qwen3.5-27b, version: 256k-0223-internal
Name: qwen3.5-27b, version: 256k-0223-internal-nvfp4-inputscale-fp8-attn

```
"cache_volume": {
  "enabled": true,
  "scope": "application"
},
"cpfs_file_system_id": "bmcpfs-290qtyip73f85z7zt9t"
```

## dashllm_cmd serving

```
[INFO] 2026-03-27 18:42:51,933869: {"message":"vllm engine_args: {'model': '/dev/shm/dashllm_model_2', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 1, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swap_space': 1, 'max_num_seqs': 256, 'max_num_batched_tokens': 4096, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'quantization': None, 'max_model_len': 262144, 'compilation_config': {'use_inductor': False, 'custom_ops': ['all']}, 'enable_prefix_caching': True, 'distributed_executor_backend': 'mp', 'enable_chunked_prefill': True, 'max_seq_len_to_capture': 262144}","time":"2026-03-27 18:42:51.933"}
```

## Alibaba model env

qwen3.5-27b strictly requires BLADNN to support the VL attention kernel.
qwen3-30b/235b/coder can all start without BLADNN.
For the 235b/coder models, which are already FP8-quantized, enabling BLADNN raises errors.

- qwen3-coder
```
VLLM_FP8_USE_BLADNN=1
VLLM_MOE_USE_BLADNN=1
VLLM_GDN_USE_BLADNN=1
VLLM_USE_V1=1
VLLM_IS_HYBRID_MODEL=1
VLLM_ENABLE_TORCH_COMPILE=1
VLLM_ATTENTION_BACKEND=FLASH_ATTN
```

- qwen3.5-27b
```
VLLM_FP8_USE_BLADNN=1
VLLM_MOE_USE_BLADNN=1
VLLM_GDN_USE_BLADNN=0
VLLM_USE_V1=1
VLLM_IS_HYBRID_MODEL=1
VLLM_ENABLE_TORCH_COMPILE=1
VLLM_ATTENTION_BACKEND=FLASH_ATTN
```

```bash
####################################
# Qwen3.5-27B
####################################
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 1000000 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 40960 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --max-num-seqs 64 --max-num-batched-tokens 40960
#--long-prefill-token-threshold 30000
#--skip_mm_profiling --mm-processor-cache-gb 0
#--long_context_threshold 30000
# Qwen3_5ForConditionalGeneration

####################################
# Qwen3-Coder
####################################
VLLM_MOE_EXPERTS_OVERLAP=1 TORCH_CUDA_ARCH_LIST="9.0+PTX" VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2

# ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2

####################################
# Qwen3-30B-A3B
####################################
# ok
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B --tensor-parallel-size 2

####################################
# Qwen3-235B-A22B
####################################
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4

# Ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve resource/model/464482ce.qwen3-235b-a22b/128k-0426/ --tensor-parallel-size 4

'{"gpu_memory_utilization": 0.9, "max_model_len": 262144, "enable_chunked_prefill": true, "enable_think": 1, "think_mode": "auto", "tensor_parallel_size": 1, "dtype": "bfloat16", "enforce_eager": false, "enable_prefix_caching": true, "mamba_cache_mode": "light", "distributed_executor_backend": "mp", "block_size": 64, "max_num_batched_tokens": 8192, "disable_cascade_attn": true, "speculative_config": {"method": "qwen3_next_vl_mtp", "num_speculative_tokens": 3}, "mm_processor_cache_gb": 0, "limit_mm_per_prompt": {"image": 256, "video": 64}, "compilation_config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": false, "pass_config": {"fuse_norm_quant": false, "fuse_act_quant": false, "fuse_attn_quant": false}}, "mamba_cache_dtype": "float32", "skip_mm_profiling": true, "quantization": "fp8"}'
```

1 GPU: TP1DP1
2 GPU: (TP2DP1, TP1DP2) x (EPON, EPOFF)
4 GPU: (TP4DP1, TP2DP2, TP1DP4) x (EPON, EPOFF)
8 GPU: (TP8DP1, TP4DP2, TP2DP4, TP1DP8) x (EPON, EPOFF)

## E2E testing

【qwen3-coder】【0-30k】【kvs】【h20-96-d】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-nosparse-model/deployments/qwen3-coder-nosparse-model-ba4a

【qwen3-coder-flash】【0-30k】【kvs】【h20-96-d】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-flash-2025-07-28-nosparse-model/deployments/qwen3-coder-flash-2025-07-28-nosparse-model-1553

【qwen3-30b-a3b-instruct】【H20-96G-4】
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-instruct-2507-model/deployments/qwen3-30b-a3b-instruct-2507-model-a06c
   - 0.9.0

【qwen3-30b-a3b-thinking】【H20-96G-4】
1.
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-thinking-2507-model?spm=43a6e6f6.2e152c3f.0.0.6d4c103cudzmEy
   - 0.10.1rc2.dev397+g312aa870b

【qwen3-235b-a22b-thinking】【P】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-4945
- 0.11.2.dev1732+gd694e5c71.d20251208

【qwen3-235b-a22b-thinking】【D】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode-21fd
- 0.11.2.dev1732+gd694e5c71.d20251208

https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.622e103cCLyFsA

【qwen3.5-27b】【0-32k】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-e277
cuda128_cp312_test_vllm_87905ee0_20260222_202123
0.13.0rc2.dev2067+g486e99474.d20260222.cu128

【qwen3.5-27b】【0-32k】【5090-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-f462
cuda129_cp312_test_vllm_11606
0.13.0rc2.dev2111+gb44b43f43.d20260309

【qwen3-coder-next】【0-32k】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-e776?spm=43a6e6f6.5b0a3d6a.0.0.413d103cIwReWg
- 0.10.2rc2.dev168+g8f0fc60c9.d20251204

1. Hardware: 5090, H20
2. Model: Qwen3.5-27B, Qwen3-30B-A3B, Qwen3-235B-A22B-FP8, Qwen3-Coder-Next-FP8
3.
Trace: Chat, Thinking, Coder

Test matrix:

Hardware experiments
- 【qwen3.5-27b + 5090】
- 【qwen3.5-27b + H20】

Model experiments
- 【qwen3.5-27b + H20】
- 【qwen3-30b-a3b + H20】
- 【qwen3-235b-a22b + H20】

Trace experiments
- 【qwen3-30b-a3b + H20 + Chat】
- 【qwen3-30b-a3b + H20 + Thinking】
- 【qwen3-235b-a22b + H20 + Chat】
- 【qwen3-235b-a22b + H20 + Thinking】
- 【qwen3-coder-next + H20 + Coder】

---

【qwen3-235b-a22b-instruct】【P】【8-32k】【H20-96G-8】
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-5966?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
2. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-9f59?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
   - 0.13.0rc2.dev1948+g613d885a1.d20260108.cu128

## Deployments

【qwen3-max-2026-01-23-chat-aa8c】qwen3-max nonthinking
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W

【qwen3-max-2026-01-23-chat-9bf8】qwen3-max thinking
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ

First-token latency: Mean: 2.93 s; Max: 9.27 s
Last-token latency: Mean: 1.60 min; Max: 2.57 min

[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ]
- Mean: 10.7 qps
- Max: 20.0 qps
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ]
- Mean: 1.73 s
- Max: 8.83 s
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ]
- Mean: 1.42 min
- Max: 2.22 min

【qwen3-max-qwenapp-crit-50e9】

【qwen3-max-qwenapp-crit-decode】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-qwenapp-crit-decode?spm=43a6e6f6.29ced41b.0.0.4efb103cjsq29c
Input: Mean: 9131 itpr; Max: 10817 itpr
Output: Mean: 823 otpr; Max: 987 otpr
weighted tps:
- Mean: 46.6 otpsr
- Min: 44.5 otpsr
- Max: 48.2 otpsr
- Mean: 28.3k tpm
- Max: 34.8k tpm
tail: [ 2c3bc7a4 | cn-beijing ]
- Mean: 35.0 s
- Max: 1.40 min

【qwen3-max-2025-10-30-thinking-model】
Input: Mean: 6074 itpr; Max: 16790 itpr
Output: Mean: 2062 otpr; Max: 4153 otpr

1. Qwen3-Chat 【nonthinking】:
   qwen3-max-2026-01-23-chat-aa8c-info 【v1: includes input/timestamp etc.】
   qwen3-max-2026-01-23-chat-aa8c-info 【v2: can collect a full week at once; includes output_length】
   https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
2. Qwen3-Coder:
   qwen3-coder-next-model 【includes input/timestamp etc.】
   qwen3-coder-next-model-8130-info 【includes output_length】
   https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-8130
3. Qwen3-Chat 【thinking】:
   qwen3-max-2026-01-23-chat-9bf8
   https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ
   0319~0324: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-694a?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
   0324~0326: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
   0326+: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-3201?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
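The qps numbers quoted throughout these notes are summaries over collected request traces. A minimal sketch of producing such a summary from a trace jsonl, assuming each line carries an integer `timestamp` field in seconds (a hypothetical schema; the real `sampled_traces/*.jsonl` fields may differ):

```bash
# (sketch, hypothetical schema) mean/peak requests-per-second over the
# covered window of a trace jsonl; counts requests per integer second.
qps_stats() {  # usage: qps_stats <trace.jsonl>
  grep -o '"timestamp": *[0-9]*' "$1" | grep -o '[0-9]*$' |
    sort -n | uniq -c | awk '
      { total += $1; if ($1 > peak) peak = $1; if (first == "") first = $2; last = $2 }
      END { printf "mean=%.3f qps peak=%d qps\n", total / (last - first + 1), peak }'
}
```

For example, `qps_stats chat_w20260311_peak_1000.jsonl` would print one mean/peak line for that window; mean is total requests divided by the covered seconds, peak is the busiest single second.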