
TODO

done:

  • codex-chat

  • codex-chat-5090: codex resume 019d4945-4991-7331-a848-1be6fd702e9f

  • codex-coder

  • scoot-chat

  • scoot-thinking-prefill

  • scoot-thinking-decode

dash1: codex-thinking-decode
dash2: codex-thinking-prefill
dash3: scoot-coder
dash5: scoot-chat-5090

# chat
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl

# prefill-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_thinking_day0_t0p04_fixedcount/sampled_traces/thinking_w20260323_peak_1000.jsonl

# coder
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_coder_peak_7day_fixedcount/sampled_traces/coder_w20260311_peak_1000.jsonl

# decode-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/plans/dash0123_8gpu__qwen235b__internal__decode_only__thinking__legal11_thinking_decode_only_weekly0321_0327_peak_local8/traces/thinking_w20260321_peak_1000.jsonl
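Each of the four traces above is a JSONL file of sampled requests; a quick sanity check before a run looks like the following (the field names inside each record are whatever the sampler emitted, nothing is assumed about them here):

# Count sampled requests and inspect the schema of one record.
TRACE=/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl
wc -l "$TRACE"
head -n 1 "$TRACE" | python -m json.tool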

So in the code I will make the internal profile's "chunked prefill required" a hard constraint directly, instead of continuing to learn it from timeout failures (sketch below).
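A minimal sketch of what that hard constraint could look like when the launch command is assembled; ENGINE, MODEL, and EXTRA_ARGS are placeholder names, not the tuner's real variables:

# Sketch: when the profile is internal, always force chunked prefill in the
# generated command instead of letting timed-out runs rule non-chunked configs out.
if [ "$ENGINE" = "internal" ]; then
  EXTRA_ARGS="$EXTRA_ARGS --enable-chunked-prefill"
fi
vllm serve "$MODEL" $EXTRA_ARGS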

Fig 7/8: add a semi-real trace that matches the real trace.

ongoing:
dash0: qwen235b decode-only test
dash1/2: qwen235b thinking 30min test
dash3: qwen-coder-next coder 30min test
dash5: 5090 qwen27b chat-0-32k test

4.1 Data showing that different workloads cannot be used to tune different clusters

4.2 Performance comparison for synthetic/semi-real/real: 83.91, 98.19, 98.4; 65.22, 86.03, 98.28

Similarity comparison for synthetic/semi-real/real

Tuned-best comparison for chat/thinking/coder under prefix; similarity comparison for chat/thinking/coder under prefix

4.3 Agent harness summary

5 tuner vs baseline

default config

# Qwen3.5-27B
# https://huggingface.co/Qwen/Qwen3.5-27B
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# Qwen3-Coder-Next
# https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html#basic-multi-gpu-setup
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching

# Qwen3-235B-A22B-FP8
# https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-FP8
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 262144

# https://github.com/aliez-ren/vllm-qwen3.5-nvfp4-sm120?utm_source=chatgpt.com
vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 \
  --max-model-len 234567 \
  --gpu-memory-utilization 0.89 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --quantization fp8 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 131072 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --tensor-parallel-size 1

# https://huggingface.co/Qwen/Qwen3.5-27B-FP8?utm_source=chatgpt.com
vllm serve Qwen/Qwen3.5-27B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
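Once one of the servers above is up, a quick request confirms it answers; this assumes the default port 8000 and the OpenAI-compatible endpoints that vllm serve exposes, with the model name from the FP8 command above:

# Smoke test against a freshly launched vllm server on the default port.
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-27B-FP8", "prompt": "hello", "max_tokens": 8}'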

# Run qwen27b batching
# running on dash2
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/launch_qwen35_27b_tp2dp1_epoff_batching_chat0_32k_weekly_peak.sh


# Run the evaluator comparison
# running on dash1
CASE_KIND=chat TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=coder TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=thinking_prefill TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=thinking_decode TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
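The four evaluator comparisons above differ only in CASE_KIND, so they can also be driven from one loop (same script and env vars, just a convenience sketch):

for CASE_KIND in chat coder thinking_prefill thinking_decode; do
  CASE_KIND="$CASE_KIND" TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
done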



# [x] For qwen35_27b: align with the online trace, search the 0-4k threshold, then retest
./workflow threshold-search \
  --hardware dash0123_8gpu \
  --model qwen35_27b \
  --engine internal \
  --workload chat \
  --phase prefill_decode \
  --trace-type chat-0-4k \
  --max-threshold 0.5


# qwen3-coder-next can't run EP?


# [x] The dash3 jobs need to be moved to dash2 and rerun
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
bash workflow_output/plans/dash0123_8gpu__qwen35_27b__internal__prefill_decode__chat__legal10_chat0_4k/run_results_v2_trace_dash3.sh --machine-label dash2


# [x] Run qwen35_27b 0~32k
./launch_qwen35_27b_chat_0_32k_after_trace_prepare.sh

Runnability

qwen-235b / qwen27b / qwen-coder need to switch to the matching vLLM build; qwen-30b needs to switch to the matching container, a version that supports FlashInfer.

pip install -U flashinfer-python
flashinfer >= 0.7

wjh@ds-f74814b6-1-65cd484875-256zt:~$ pip list | grep flashinfer
flashinfer-cubin                         0.6.4
flashinfer-jit-cache                     0.6.4
flashinfer-python                        0.6.4
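The installed 0.6.4 is below the required 0.7, so the upgrade should pin the floor explicitly; a version check afterwards (assuming the package exposes flashinfer.__version__, as flashinfer-python normally does):

pip install -U "flashinfer-python>=0.7"
python -c "import flashinfer; print(flashinfer.__version__)"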

Online performance

qwen27b: 40 instances: Mean: 4.00 qps Max: 5.67 qps

prefill: Mean: 193k tpm Max: 339k tpm
decode: Mean: 72.4k tpm Max: 119k tpm
first-token latency: Mean: 1.59 s Max: 11.3 s
tail latency: Mean: 23.6 s Max: 46.2 s

qwen30b-a3b: Mean: 0.00267 qps Max: 0.109 qps

Models

name: qwen3-235b-a22b, version: 256k-0717
name: qwen3-235b-a22b, version: 0717-eagle-0820

name: qwen3-30b-a3b, version: 1m-instruct-0726-fp4
name: qwen3-30b-a3b, version: 1m-thinking-0728-fp4

name: qwen3-coder-next, version: 1m-20260129-re-mtp-fp8-torch-dtype
name: qwen3-coder-next, version: 1m-20260129-xml-tool-parser-fix

name: qwen3.5-27b, version: 256k-0223-internal
name: qwen3.5-27b, version: 256k-0223-internal-nvfp4-inputscale-fp8-attn

"cache_volume": {
  "enabled": true,
  "scope": "application"
},
"cpfs_file_system_id": "bmcpfs-290qtyip73f85z7zt9t"

dashllm_cmd serving

[INFO] 2026-03-27 18:42:51,933869: {"message":"vllm engine_args: {'model': '/dev/shm/dashllm_model_2', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 1, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swap_space': 1, 'max_num_seqs': 256, 'max_num_batched_tokens': 4096, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'quantization': None, 'max_model_len': 262144, 'compilation_config': {'use_inductor': False, 'custom_ops': ['all']}, 'enable_prefix_caching': True, 'distributed_executor_backend': 'mp', 'enable_chunked_prefill': True, 'max_seq_len_to_capture': 262144}","time":"2026-03-27 18:42:51.933"}

Alibaba model env

qwen3.5-27b requires BLADNN to support the VL attention kernel; qwen3-30b/235b/coder can all start without BLADNN; for the already-FP8-quantized 235b/coder models, enabling BLADNN raises an error.

  • qwen3-coder
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
  • qwen3.5-27b
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
####################################
# Qwen3.5-27B
####################################
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 1000000 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 40960 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --max-num-seqs 64 --max-num-batched-tokens 40960 #--long-prefill-token-threshold 30000 #--skip_mm_profiling --mm-processor-cache-gb 0

#--long_context_threshold 30000
#Qwen3_5ForConditionalGeneration


####################################
# Qwen3-Coder
####################################
VLLM_MOE_EXPERTS_OVERLAP=1 TORCH_CUDA_ARCH_LIST="9.0+PTX" VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
# ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2


####################################
# Qwen3-30B-A3B
####################################
# ok
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B --tensor-parallel-size 2


####################################
# Qwen3-235B-A22B
####################################
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
# Ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve resource/model/464482ce.qwen3-235b-a22b/128k-0426/ --tensor-parallel-size 4


'{"gpu_memory_utilization": 0.9, "max_model_len": 262144, "enable_chunked_prefill": true, "enable_think": 1, "think_mode": "auto", "tensor_parallel_size": 1, "dtype": "bfloat16", "enforce_eager": false, "enable_prefix_caching": true, "mamba_cache_mode": "light", "distributed_executor_backend": "mp", "block_size": 64, "max_num_batched_tokens": 8192, "disable_cascade_attn": true, "speculative_config": {"method": "qwen3_next_vl_mtp", "num_speculative_tokens": 3}, "mm_processor_cache_gb": 0, "limit_mm_per_prompt": {"image": 256, "video": 64}, "compilation_config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": false, "pass_config": {"fuse_norm_quant": false, "fuse_act_quant": false, "fuse_attn_quant": false}}, "mamba_cache_dtype": "float32", "skip_mm_profiling": true, "quantization": "fp8"}'
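The engine-args blob above is one long JSON string; pretty-printing it makes it easier to compare against the vllm engine_args line in the dashllm serving log earlier in this note (engine_args.json is a hypothetical file holding that string):

python -m json.tool engine_args.json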

1 GPU: TP1DP1
2 GPU: (TP2DP1, TP1DP2) x (EPON, EPOFF)
4 GPU: (TP4DP1, TP2DP2, TP1DP4) x (EPON, EPOFF)
8 GPU: (TP8DP1, TP4DP2, TP2DP4, TP1DP8) x (EPON, EPOFF)
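Expanded, the grid is every TP x DP split whose product equals the GPU count, each crossed with EP on/off (the single-GPU case has no EP variant); a small enumeration sketch, labels only, since flag names depend on the launcher:

# Enumerate the parallelism grid.
echo "GPUs=1 TP=1 DP=1"
for GPUS in 2 4 8; do
  for TP in 1 2 4 8; do
    DP=$((GPUS / TP))
    [ $((TP * DP)) -eq "$GPUS" ] || continue
    for EP in EPON EPOFF; do
      echo "GPUs=$GPUS TP=$TP DP=$DP $EP"
    done
  done
done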

E2E tests

【qwen3-coder】【0-30k】【kvs】【h20-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-nosparse-model/deployments/qwen3-coder-nosparse-model-ba4a

【qwen3-coder-flash】【0-30k】【kvs】【h20-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-flash-2025-07-28-nosparse-model/deployments/qwen3-coder-flash-2025-07-28-nosparse-model-1553

【qwen3-30b-a3b-instruct】【H20-96G-4】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-instruct-2507-model/deployments/qwen3-30b-a3b-instruct-2507-model-a06c
  • 0.9.0

【qwen3-30b-a3b-thinking】【H20-96G-4】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-thinking-2507-model?spm=43a6e6f6.2e152c3f.0.0.6d4c103cudzmEy
  • 0.10.1rc2.dev397+g312aa870b

【qwen3-235b-a22b-thinking】【P】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-4945

  • 0.11.2.dev1732+gd694e5c71.d20251208

【qwen3-235b-a22b-thinking】【D】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode-21fd

  • 0.11.2.dev1732+gd694e5c71.d20251208

https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.622e103cCLyFsA

【qwen3.5-27b】【0-32k】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-e277 cuda128_cp312_test_vllm_87905ee0_20260222_202123 0.13.0rc2.dev2067+g486e99474.d20260222.cu128

【qwen3.5-27b】【0-32k】【5090-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-f462 cuda129_cp312_test_vllm_11606 0.13.0rc2.dev2111+gb44b43f43.d20260309

【qwen3-coder-next】【0-32k】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-e776?spm=43a6e6f6.5b0a3d6a.0.0.413d103cIwReWg

  • 0.10.2rc2.dev168+g8f0fc60c9.d20251204
  1. Hardware: 5090, H20
  2. Model: Qwen3.5-27B, Qwen3-30B-A3B, Qwen3-235B-A22B-FP8, Qwen3-Coder-Next-FP8
  3. Trace: Chat, Thinking, Coder

Test combinations:

Hardware experiments
  • 【qwen3.5-27b + 5090】
  • 【qwen3.5-27b + H20】

Model experiments
  • 【qwen3.5-27b + H20】
  • 【qwen3-30b-a3b + H20】
  • 【qwen3-235b-a22b + H20】

Trace experiments
  • 【qwen3-30b-a3b + H20 + Chat】
  • 【qwen3-30b-a3b + H20 + Thinking】
  • 【qwen3-235b-a22b + H20 + Chat】
  • 【qwen3-235b-a22b + H20 + Thinking】
  • 【qwen3-coder-next + H20 + Coder】

【qwen3-235b-a22b-instruct】【P】【8-32k】【H20-96G-8】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-5966?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
  2. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-9f59?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
  • 0.13.0rc2.dev1948+g613d885a1.d20260108.cu128

Deployment

【qwen3-max-2026-01-23-chat-aa8c】qwen3-max nonthinking https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W

【qwen3-max-2026-01-23-chat-9bf8】qwen3-max thinking https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ

First-packet latency: Mean: 2.93 s; Max: 9.27 s
Last-packet latency: Mean: 1.60 min; Max: 2.57 min
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 10.7 qps - Max: 20.0 qps

[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.73 s - Max: 8.83 s

[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.42 min - Max: 2.22 min

【qwen3-max-qwenapp-crit-50e9】

【qwen3-max-qwenapp-crit-decode】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-qwenapp-crit-decode?spm=43a6e6f6.29ced41b.0.0.4efb103cjsq29c

Input: Mean: 9131 itpr ; Max: 10817 itpr
Output: Mean: 823 otpr ; Max: 987 otpr

weighted tps - Mean: 46.6 otpsr - Min: 44.5 otpsr - Max: 48.2 otpsr

  • Mean: 28.3k tpm - Max: 34.8k tpm

tail: [ 2c3bc7a4 | cn-beijing ] - Mean: 35.0 s - Max: 1.40 min

【qwen3-max-2025-10-30-thinking-model】
Input: Mean: 6074 itpr ; Max: 16790 itpr
Output: Mean: 2062 otpr ; Max: 4153 otpr

  1. Qwen3-Chat 【nonthinking】: qwen3-max-2026-01-23-chat-aa8c-info 【v1: includes input/timestamp, etc.】 qwen3-max-2026-01-23-chat-aa8c-info 【v2: can collect a full week at once, includes output_length】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
  2. Qwen3-Coder: qwen3-coder-next-model 【includes input/timestamp, etc.】 qwen3-coder-next-model-8130-info 【includes output_length】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-8130
  3. Qwen3-Chat 【thinking】 qwen3-max-2026-01-23-chat-9bf8 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ

0319-0324: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-694a?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0324-0326: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0326+: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-3201?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH