
TODO

done:

  • codex-chat

  • codex-chat-5090: codex resume 019d4945-4991-7331-a848-1be6fd702e9f

  • codex-coder

  • scoot-chat

  • scoot-thinking-prefill

  • scoot-thinking-decode

dash1: codex-thinking-decode
dash2: codex-thinking-prefill
dash3: scoot-coder
dash5: scoot-chat-5090

# chat
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl

# prefill-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_thinking_day0_t0p04_fixedcount/sampled_traces/thinking_w20260323_peak_1000.jsonl

# coder
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_coder_peak_7day_fixedcount/sampled_traces/coder_w20260311_peak_1000.jsonl

# decode-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/plans/dash0123_8gpu__qwen235b__internal__decode_only__thinking__legal11_thinking_decode_only_weekly0321_0327_peak_local8/traces/thinking_w20260321_peak_1000.jsonl
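Each of the four traces above is a JSONL file of sampled requests; a quick sanity check before a run looks like the following (the field names inside each record are whatever the sampler emitted, nothing is assumed about them here):

# Count sampled requests and inspect the schema of one record.
TRACE=/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl
wc -l "$TRACE"
head -n 1 "$TRACE" | python -m json.tool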

So in the code I will make the internal profile's "chunked prefill required" a hard constraint directly, instead of continuing to learn it from timeout failures (sketch below).
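A minimal sketch of what that hard constraint could look like when the launch command is assembled; ENGINE, MODEL, and EXTRA_ARGS are placeholder names, not the tuner's real variables:

# Sketch: when the profile is internal, always force chunked prefill in the
# generated command instead of letting timed-out runs rule non-chunked configs out.
if [ "$ENGINE" = "internal" ]; then
  EXTRA_ARGS="$EXTRA_ARGS --enable-chunked-prefill"
fi
vllm serve "$MODEL" $EXTRA_ARGS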

Fig 7/8: add a semi-real trace that matches the real trace.

ongoing:
dash0: qwen235b decode-only test
dash1/2: qwen235b thinking 30min test
dash3: qwen-coder-next coder 30min test
dash5: 5090 qwen27b chat-0-32k test

4.1 Data showing that different workloads cannot be used to tune different clusters

4.2 Performance comparison for synthetic/semi-real/real: 83.91, 98.19, 98.4; 65.22, 86.03, 98.28

Similarity comparison for synthetic/semi-real/real

Tuned-best comparison for chat/thinking/coder under prefix; similarity comparison for chat/thinking/coder under prefix

4.3 Agent harness summary

5 tuner vs baseline

default config

# Qwen3.5-27B
# https://huggingface.co/Qwen/Qwen3.5-27B
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

# Qwen3-Coder-Next
# https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html#basic-multi-gpu-setup
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching

# Qwen3-235B-A22B-FP8
# https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-FP8
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 262144

# https://github.com/aliez-ren/vllm-qwen3.5-nvfp4-sm120?utm_source=chatgpt.com
vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 \
  --max-model-len 234567 \
  --gpu-memory-utilization 0.89 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096

vllm serve Qwen/Qwen3.5-27B-FP8 \
  --quantization fp8 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 131072 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --tensor-parallel-size 1

# https://huggingface.co/Qwen/Qwen3.5-27B-FP8?utm_source=chatgpt.com
vllm serve Qwen/Qwen3.5-27B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
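Once one of the servers above is up, a quick request confirms it answers; this assumes the default port 8000 and the OpenAI-compatible endpoints that vllm serve exposes, with the model name from the FP8 command above:

# Smoke test against a freshly launched vllm server on the default port.
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-27B-FP8", "prompt": "hello", "max_tokens": 8}'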

# Run qwen27b batching
# running on dash2
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/launch_qwen35_27b_tp2dp1_epoff_batching_chat0_32k_weekly_peak.sh


# Run the evaluator comparison
# running on dash1
CASE_KIND=chat TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=coder TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=thinking_prefill TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh

CASE_KIND=thinking_decode TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
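The four evaluator comparisons above differ only in CASE_KIND, so they can also be driven from one loop (same script and env vars, just a convenience sketch):

for CASE_KIND in chat coder thinking_prefill thinking_decode; do
  CASE_KIND="$CASE_KIND" TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
done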



# [x] For qwen35_27b: align with the online trace, search the 0-4k threshold, then retest
./workflow threshold-search \
  --hardware dash0123_8gpu \
  --model qwen35_27b \
  --engine internal \
  --workload chat \
  --phase prefill_decode \
  --trace-type chat-0-4k \
  --max-threshold 0.5


# qwen3-coder-next can't run EP?


# [x] The dash3 jobs need to be moved to dash2 and rerun
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
bash workflow_output/plans/dash0123_8gpu__qwen35_27b__internal__prefill_decode__chat__legal10_chat0_4k/run_results_v2_trace_dash3.sh --machine-label dash2


# [x] Run qwen35_27b 0~32k
./launch_qwen35_27b_chat_0_32k_after_trace_prepare.sh

Runnability

qwen-235b / qwen27b / qwen-coder need to switch to the matching vLLM build; qwen-30b needs to switch to the matching container, a version that supports FlashInfer.

pip install -U flashinfer-python
flashinfer >= 0.7

wjh@ds-f74814b6-1-65cd484875-256zt:~$ pip list | grep flashinfer
flashinfer-cubin                         0.6.4
flashinfer-jit-cache                     0.6.4
flashinfer-python                        0.6.4
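The installed 0.6.4 is below the required 0.7, so the upgrade should pin the floor explicitly; a version check afterwards (assuming the package exposes flashinfer.__version__, as flashinfer-python normally does):

pip install -U "flashinfer-python>=0.7"
python -c "import flashinfer; print(flashinfer.__version__)"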

Online performance

qwen27b: 40 instances: Mean: 4.00 qps Max: 5.67 qps

prefill: Mean: 193k tpm Max: 339k tpm
decode: Mean: 72.4k tpm Max: 119k tpm
first-token latency: Mean: 1.59 s Max: 11.3 s
tail latency: Mean: 23.6 s Max: 46.2 s

qwen30b-a3b: Mean: 0.00267 qps Max: 0.109 qps

Models

name: qwen3-235b-a22b, version: 256k-0717
name: qwen3-235b-a22b, version: 0717-eagle-0820

name: qwen3-30b-a3b, version: 1m-instruct-0726-fp4
name: qwen3-30b-a3b, version: 1m-thinking-0728-fp4

name: qwen3-coder-next, version: 1m-20260129-re-mtp-fp8-torch-dtype
name: qwen3-coder-next, version: 1m-20260129-xml-tool-parser-fix

name: qwen3.5-27b, version: 256k-0223-internal
name: qwen3.5-27b, version: 256k-0223-internal-nvfp4-inputscale-fp8-attn

"cache_volume": {
  "enabled": true,
  "scope": "application"
},
"cpfs_file_system_id": "bmcpfs-290qtyip73f85z7zt9t"

dashllm_cmd serving

[INFO] 2026-03-27 18:42:51,933869: {"message":"vllm engine_args: {'model': '/dev/shm/dashllm_model_2', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 1, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swap_space': 1, 'max_num_seqs': 256, 'max_num_batched_tokens': 4096, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'quantization': None, 'max_model_len': 262144, 'compilation_config': {'use_inductor': False, 'custom_ops': ['all']}, 'enable_prefix_caching': True, 'distributed_executor_backend': 'mp', 'enable_chunked_prefill': True, 'max_seq_len_to_capture': 262144}","time":"2026-03-27 18:42:51.933"}

Alibaba model env

qwen3.5-27b requires BLADNN to support the VL attention kernel; qwen3-30b/235b/coder can all start without BLADNN; for the already-FP8-quantized 235b/coder models, enabling BLADNN raises an error.

  • qwen3-coder
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
  • qwen3.5-27b
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
####################################
# Qwen3.5-27B
####################################
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 1000000 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 40960 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --max-num-seqs 64 --max-num-batched-tokens 40960 #--long-prefill-token-threshold 30000 #--skip_mm_profiling --mm-processor-cache-gb 0

#--long_context_threshold 30000
#Qwen3_5ForConditionalGeneration


####################################
# Qwen3-Coder
####################################
VLLM_MOE_EXPERTS_OVERLAP=1 TORCH_CUDA_ARCH_LIST="9.0+PTX" VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
# ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2


####################################
# Qwen3-30B-A3B
####################################
# ok
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B --tensor-parallel-size 2


####################################
# Qwen3-235B-A22B
####################################
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
# Ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4

VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve resource/model/464482ce.qwen3-235b-a22b/128k-0426/ --tensor-parallel-size 4


'{"gpu_memory_utilization": 0.9, "max_model_len": 262144, "enable_chunked_prefill": true, "enable_think": 1, "think_mode": "auto", "tensor_parallel_size": 1, "dtype": "bfloat16", "enforce_eager": false, "enable_prefix_caching": true, "mamba_cache_mode": "light", "distributed_executor_backend": "mp", "block_size": 64, "max_num_batched_tokens": 8192, "disable_cascade_attn": true, "speculative_config": {"method": "qwen3_next_vl_mtp", "num_speculative_tokens": 3}, "mm_processor_cache_gb": 0, "limit_mm_per_prompt": {"image": 256, "video": 64}, "compilation_config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": false, "pass_config": {"fuse_norm_quant": false, "fuse_act_quant": false, "fuse_attn_quant": false}}, "mamba_cache_dtype": "float32", "skip_mm_profiling": true, "quantization": "fp8"}'
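The engine-args blob above is one long JSON string; pretty-printing it makes it easier to compare against the vllm engine_args line in the dashllm serving log earlier in this note (engine_args.json is a hypothetical file holding that string):

python -m json.tool engine_args.json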

1 GPU: TP1DP1
2 GPU: (TP2DP1, TP1DP2) x (EPON, EPOFF)
4 GPU: (TP4DP1, TP2DP2, TP1DP4) x (EPON, EPOFF)
8 GPU: (TP8DP1, TP4DP2, TP2DP4, TP1DP8) x (EPON, EPOFF)
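Expanded, the grid is every TP x DP split whose product equals the GPU count, each crossed with EP on/off (the single-GPU case has no EP variant); a small enumeration sketch, labels only, since flag names depend on the launcher:

# Enumerate the parallelism grid.
echo "GPUs=1 TP=1 DP=1"
for GPUS in 2 4 8; do
  for TP in 1 2 4 8; do
    DP=$((GPUS / TP))
    [ $((TP * DP)) -eq "$GPUS" ] || continue
    for EP in EPON EPOFF; do
      echo "GPUs=$GPUS TP=$TP DP=$DP $EP"
    done
  done
done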

E2E tests

【qwen3-coder】【0-30k】【kvs】【h20-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-nosparse-model/deployments/qwen3-coder-nosparse-model-ba4a

【qwen3-coder-flash】【0-30k】【kvs】【h20-96-d】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-flash-2025-07-28-nosparse-model/deployments/qwen3-coder-flash-2025-07-28-nosparse-model-1553

【qwen3-30b-a3b-instruct】【H20-96G-4】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-instruct-2507-model/deployments/qwen3-30b-a3b-instruct-2507-model-a06c
  • 0.9.0

【qwen3-30b-a3b-thinking】【H20-96G-4】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-thinking-2507-model?spm=43a6e6f6.2e152c3f.0.0.6d4c103cudzmEy
  • 0.10.1rc2.dev397+g312aa870b

【qwen3-235b-a22b-thinking】【P】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-4945

  • 0.11.2.dev1732+gd694e5c71.d20251208

【qwen3-235b-a22b-thinking】【D】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode-21fd

  • 0.11.2.dev1732+gd694e5c71.d20251208

https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.622e103cCLyFsA

【qwen3.5-27b】【0-32k】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-e277 cuda128_cp312_test_vllm_87905ee0_20260222_202123 0.13.0rc2.dev2067+g486e99474.d20260222.cu128

【qwen3.5-27b】【0-32k】【5090-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-f462 cuda129_cp312_test_vllm_11606 0.13.0rc2.dev2111+gb44b43f43.d20260309

【qwen3-coder-next】【0-32k】【H20-96G-8】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-e776?spm=43a6e6f6.5b0a3d6a.0.0.413d103cIwReWg

  • 0.10.2rc2.dev168+g8f0fc60c9.d20251204
  1. Hardware: 5090, H20
  2. Model: Qwen3.5-27B, Qwen3-30B-A3B, Qwen3-235B-A22B-FP8, Qwen3-Coder-Next-FP8
  3. Trace: Chat, Thinking, Coder

Test combinations:

Hardware experiments
  • 【qwen3.5-27b + 5090】
  • 【qwen3.5-27b + H20】

Model experiments
  • 【qwen3.5-27b + H20】
  • 【qwen3-30b-a3b + H20】
  • 【qwen3-235b-a22b + H20】

Trace experiments
  • 【qwen3-30b-a3b + H20 + Chat】
  • 【qwen3-30b-a3b + H20 + Thinking】
  • 【qwen3-235b-a22b + H20 + Chat】
  • 【qwen3-235b-a22b + H20 + Thinking】
  • 【qwen3-coder-next + H20 + Coder】

【qwen3-235b-a22b-instruct】【P】【8-32k】【H20-96G-8】

  1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-5966?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
  2. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-9f59?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
  • 0.13.0rc2.dev1948+g613d885a1.d20260108.cu128

Deployment

【qwen3-max-2026-01-23-chat-aa8c】qwen3-max nonthinking https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W

【qwen3-max-2026-01-23-chat-9bf8】qwen3-max thinking https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ

First-packet latency: Mean: 2.93 s; Max: 9.27 s
Last-packet latency: Mean: 1.60 min; Max: 2.57 min
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 10.7 qps - Max: 20.0 qps

[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.73 s - Max: 8.83 s

[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.42 min - Max: 2.22 min

【qwen3-max-qwenapp-crit-50e9】

【qwen3-max-qwenapp-crit-decode】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-qwenapp-crit-decode?spm=43a6e6f6.29ced41b.0.0.4efb103cjsq29c

Input: Mean: 9131 itpr ; Max: 10817 itpr
Output: Mean: 823 otpr ; Max: 987 otpr

weighted tps - Mean: 46.6 otpsr - Min: 44.5 otpsr - Max: 48.2 otpsr

  • Mean: 28.3k tpm - Max: 34.8k tpm

tail: [ 2c3bc7a4 | cn-beijing ] - Mean: 35.0 s - Max: 1.40 min

【qwen3-max-2025-10-30-thinking-model】
Input: Mean: 6074 itpr ; Max: 16790 itpr
Output: Mean: 2062 otpr ; Max: 4153 otpr

  1. Qwen3-Chat 【nonthinking】: qwen3-max-2026-01-23-chat-aa8c-info 【v1: includes input/timestamp, etc.】 qwen3-max-2026-01-23-chat-aa8c-info 【v2: can collect a full week at once, includes output_length】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
  2. Qwen3-Coder: qwen3-coder-next-model 【includes input/timestamp, etc.】 qwen3-coder-next-model-8130-info 【includes output_length】 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-8130
  3. Qwen3-Chat 【thinking】 qwen3-max-2026-01-23-chat-9bf8 https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ

0319-0324: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-694a?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0324-0326: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0326+: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-3201?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH