obsidian/projects/auto-tuner/ali trace.md
## TODO
done:
- codex-chat
- codex-chat-5090: codex resume 019d4945-4991-7331-a848-1be6fd702e9f
- codex-coder
- scoot-chat
- scoot-thinking-prefill
- scoot-thinking-decode
- dash1: codex-thinking-decode
- dash2: codex-thinking-prefill
- dash3: scoot-coder
- dash5: scoot-chat-5090
```bash
# chat
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_chat_day0_t0p002_fixedcount/sampled_traces/chat_w20260311_peak_1000.jsonl
# prefill-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_thinking_day0_t0p04_fixedcount/sampled_traces/thinking_w20260323_peak_1000.jsonl
# coder
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/custom_trace_windows/qwen_coder_next_internal_coder_peak_7day_fixedcount/sampled_traces/coder_w20260311_peak_1000.jsonl
# decode-only
/home/admin/cpfs/wjh/aituner/tuner-workload-principle/workflow_output/plans/dash0123_8gpu__qwen235b__internal__decode_only__thinking__legal11_thinking_decode_only_weekly0321_0327_peak_local8/traces/thinking_w20260321_peak_1000.jsonl
```
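The sampled traces above are JSONL files; before launching a run, a quick sanity check can confirm every line parses and report the request count (a hypothetical helper, not part of the workflow scripts; it shells out to python3 for JSON parsing):

```shell
# check_trace FILE: verify each line of a sampled trace is valid JSON and
# print the request count (filenames above end in "_1000", i.e. 1000
# requests per window). Hypothetical helper; assumes python3 is available.
check_trace() {
  local f="$1" n
  n=$(wc -l < "$f")
  python3 - "$f" <<'PY' || return 1
import json, sys
for line in open(sys.argv[1]):
    if line.strip():
        json.loads(line)
PY
  echo "$f: $n requests, all lines valid JSON"
}
```

Running it on one of the `*_peak_1000.jsonl` files above should report 1000 requests.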
So in the code I will make the internal profile's "must use chunked prefill" requirement a hard constraint, instead of continuing to let the tuner learn it from timeout failures.
Fig 7/8: add semi-real results matching the real trace.
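One way to express that hard constraint is a guard in the candidate generator that rejects configs up front instead of benchmarking them into a timeout. This is a hypothetical sketch, not existing tuner code; the flag name follows vllm's `--enable-chunked-prefill`:

```shell
# reject_invalid PROFILE ARGS: return non-zero when a candidate config for
# the internal profile is missing chunked prefill, so the tuner can skip
# it before launch rather than learn the failure from a timeout.
reject_invalid() {
  local profile="$1" args="$2"
  [ "$profile" = "internal" ] || return 0   # constraint applies only to internal
  case "$args" in
    *--enable-chunked-prefill*) return 0 ;;
    *) return 1 ;;
  esac
}
```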
ongoing:
- dash0: qwen235b decode-only test
- dash1/2: qwen235b thinking 30min test
- dash3: qwen-coder-next coder 30min test
- dash5: 5090 qwen27b chat-0-32k test
4.1
data proving that different workloads cannot be used to tune different clusters
✅4.2
✅ synthetic/semi-real/real performance comparison:
83.91, 98.19, 98.4
65.22, 86.03, 98.28
✅ synthetic/semi-real/real similarity comparison
tuned-best comparison under the chat/thinking/coder prefixes
similarity comparison under the chat/thinking/coder prefixes
4.3
agent harness summary
5
tuner vs baseline
✅ default config
```bash
# Qwen3.5-27B
# https://huggingface.co/Qwen/Qwen3.5-27B
vllm serve Qwen/Qwen3.5-27B \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3
# Qwen3-Coder-Next
# https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html#basic-multi-gpu-setup
vllm serve Qwen/Qwen3-Coder-Next-FP8 \
--tensor-parallel-size 4 \
--enable-prefix-caching
# Qwen3-235B-A22B-FP8
# https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-FP8
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
--tensor-parallel-size 4 \
--max-model-len 262144
# https://github.com/aliez-ren/vllm-qwen3.5-nvfp4-sm120?utm_source=chatgpt.com
vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 \
--max-model-len 234567 \
--gpu-memory-utilization 0.89 \
--max-num-seqs 4 \
--max-num-batched-tokens 4096
vllm serve Qwen/Qwen3.5-27B-FP8 \
--quantization fp8 \
--dtype auto \
--gpu-memory-utilization 0.85 \
--max-model-len 131072 \
--max-num-seqs 2 \
--max-num-batched-tokens 2048 \
--tensor-parallel-size 1
# https://huggingface.co/Qwen/Qwen3.5-27B-FP8?utm_source=chatgpt.com
vllm serve Qwen/Qwen3.5-27B-FP8 --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
```
```bash
# run qwen27b batching
# running on dash2
bash /home/admin/cpfs/wjh/aituner/tuner-workload-principle/launch_qwen35_27b_tp2dp1_epoff_batching_chat0_32k_weekly_peak.sh
# run the evaluator comparison
# running on dash1
CASE_KIND=chat TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=coder TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=thinking_prefill TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
CASE_KIND=thinking_decode TRACE_SUITE=sourcekind bash ./launch_workload_evaluator_compare.sh
# [x] for qwen35_27b: align with the online trace, then re-test after the 0-4k threshold search
./workflow threshold-search \
--hardware dash0123_8gpu \
--model qwen35_27b \
--engine internal \
--workload chat \
--phase prefill_decode \
--trace-type chat-0-4k \
--max-threshold 0.5
# qwen3-coder-next can't run with EP?
# [x] the dash3 jobs need to be re-run on dash2
cd /home/admin/cpfs/wjh/aituner/tuner-workload-principle
bash workflow_output/plans/dash0123_8gpu__qwen35_27b__internal__prefill_decode__chat__legal10_chat0_4k/run_results_v2_trace_dash3.sh --machine-label dash2
# [x] run qwen35_27b 0~32k
./launch_qwen35_27b_chat_0_32k_after_trace_prepare.sh
```
## Run status
qwen-235b ✅
qwen27b ✅
qwen-coder: needs switching to the matching vllm build
qwen-30b: needs switching to a container version with flash-infer support
```
# requires flashinfer >= 0.7
pip install -U flashinfer-python
# currently installed (too old):
wjh@ds-f74814b6-1-65cd484875-256zt:~$ pip list | grep flashinfer
flashinfer-cubin 0.6.4
flashinfer-jit-cache 0.6.4
flashinfer-python 0.6.4
```
## Online performance
qwen27b:
40 instances: Mean: 4.00 qps; Max: 5.67 qps
prefill: Mean: 193k tpm; Max: 339k tpm
decode: Mean: 72.4k tpm; Max: 119k tpm
first latency: Mean: 1.59 s; Max: 11.3 s
tail latency: Mean: 23.6 s; Max: 46.2 s
qwen30b-a3b:
Mean: 0.00267 qps; Max: 0.109 qps
## Models
- qwen3-235b-a22b, version 256k-0717
- qwen3-235b-a22b, version 0717-eagle-0820
- qwen3-30b-a3b, version 1m-instruct-0726-fp4
- qwen3-30b-a3b, version 1m-thinking-0728-fp4
- qwen3-coder-next, version 1m-20260129-re-mtp-fp8-torch-dtype
- qwen3-coder-next, version 1m-20260129-xml-tool-parser-fix
- qwen3.5-27b, version 256k-0223-internal
- qwen3.5-27b, version 256k-0223-internal-nvfp4-inputscale-fp8-attn
```
"cache_volume": {
"enabled": true,
"scope": "application"
},
"cpfs_file_system_id": "bmcpfs-290qtyip73f85z7zt9t"
```
## dashllm_cmd serving
```
[INFO] 2026-03-27 18:42:51,933869: {"message":"vllm engine_args: {'model': '/dev/shm/dashllm_model_2', 'device': 'cuda', 'dtype': 'bfloat16', 'tensor_parallel_size': 1, 'enforce_eager': False, 'gpu_memory_utilization': 0.8, 'block_size': 256, 'swap_space': 1, 'max_num_seqs': 256, 'max_num_batched_tokens': 4096, 'trust_remote_code': True, 'disable_custom_all_reduce': False, 'skip_tokenizer_init': False, 'quantization': None, 'max_model_len': 262144, 'compilation_config': {'use_inductor': False, 'custom_ops': ['all']}, 'enable_prefix_caching': True, 'distributed_executor_backend': 'mp', 'enable_chunked_prefill': True, 'max_seq_len_to_capture': 262144}","time":"2026-03-27 18:42:51.933"}
```
## Alibaba model env
qwen3.5-27b strictly requires BLADNN for its vl attn kernel
qwen3-30b/235b/coder can all start without BLADNN
for the already-FP8-quantized 235b/coder models, enabling BLADNN raises an error
- qwen3-coder
```
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
```
- qwen3.5-27b
```
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN
```
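The prose rules above can be folded into a small dispatcher so launch scripts don't hand-copy flag strings. This is a sketch with assumed model-name keys, not existing tooling; note the raw env blocks in this section also record BLADNN-on variants that were tried against those rules:

```shell
# bladnn_env MODEL: print the BLADNN env assignments for a model, per the
# prose rules above: qwen3.5-27b needs BLADNN (with GDN BLADNN off) for
# its vl attn kernel; the FP8-quantized 235b/coder and the 30b start
# without BLADNN. Model-name keys are assumptions for illustration.
bladnn_env() {
  case "$1" in
    qwen3.5-27b)
      echo "VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0" ;;
    qwen3-235b*|qwen3-coder*|qwen3-30b*)
      echo "" ;;   # starts without BLADNN
    *)
      echo "" ;;
  esac
}
```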
```bash
####################################
# Qwen3.5-27B
####################################
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 1000000 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0
VLLM_DISABLE_COMPILE_CACHE=1 VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --mamba_cache_mode light --max-num-seqs 64 --max-num-batched-tokens 40960 --long-prefill-token-threshold 30000 --skip_mm_profiling --mm-processor-cache-gb 0
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_FUSED_QKVZBA_KERNEL=0 VLLM_GDN_USE_BLADNN=0 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3.5-27B --tensor-parallel-size 1 --max-num-seqs 64 --max-num-batched-tokens 40960 #--long-prefill-token-threshold 30000 #--skip_mm_profiling --mm-processor-cache-gb 0
#--long_context_threshold 30000
#Qwen3_5ForConditionalGeneration
####################################
# Qwen3-Coder
####################################
VLLM_MOE_EXPERTS_OVERLAP=1 TORCH_CUDA_ARCH_LIST="9.0+PTX" VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
# ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_GDN_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-Next-FP8 --tensor-parallel-size 2
####################################
# Qwen3-30B-A3B
####################################
# ok
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B --tensor-parallel-size 2
####################################
# Qwen3-235B-A22B
####################################
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
# Ok
vllm serve /home/admin/cpfs/wjh/models/Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 4
VLLM_FP8_USE_BLADNN=1 VLLM_MOE_USE_BLADNN=1 VLLM_USE_V1=1 VLLM_IS_HYBRID_MODEL=1 VLLM_ENABLE_TORCH_COMPILE=1 VLLM_USE_DEEP_GEMM=0 vllm serve resource/model/464482ce.qwen3-235b-a22b/128k-0426/ --tensor-parallel-size 4
'{"gpu_memory_utilization": 0.9, "max_model_len": 262144, "enable_chunked_prefill": true, "enable_think": 1, "think_mode": "auto", "tensor_parallel_size": 1, "dtype": "bfloat16", "enforce_eager": false, "enable_prefix_caching": true, "mamba_cache_mode": "light", "distributed_executor_backend": "mp", "block_size": 64, "max_num_batched_tokens": 8192, "disable_cascade_attn": true, "speculative_config": {"method": "qwen3_next_vl_mtp", "num_speculative_tokens": 3}, "mm_processor_cache_gb": 0, "limit_mm_per_prompt": {"image": 256, "video": 64}, "compilation_config": {"cudagraph_mode": "FULL_AND_PIECEWISE", "use_inductor": false, "pass_config": {"fuse_norm_quant": false, "fuse_act_quant": false, "fuse_attn_quant": false}}, "mamba_cache_dtype": "float32", "skip_mm_profiling": true, "quantization": "fp8"}'
```
1 GPU: TP1DP1
2 GPU: (TP2DP1, TP1DP2) x (EPON, EPOFF)
4 GPU: (TP4DP1, TP2DP2, TP1DP4) x (EPON, EPOFF)
8 GPU: (TP8DP1, TP4DP2, TP2DP4, TP1DP8) x (EPON, EPOFF)
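The per-GPU-count search space above can be enumerated mechanically. A sketch (label format TPxDPy plus EPON/EPOFF, as in the notes; the 1-GPU case has no EP variant):

```shell
# enumerate_configs: print every (TP, DP, EP) combination listed above.
# TP*DP must equal the GPU count.
enumerate_configs() {
  local gpus tp dp ep
  for gpus in 1 2 4 8; do
    for tp in 1 2 4 8; do
      dp=$(( gpus / tp ))
      [ $(( tp * dp )) -eq "$gpus" ] || continue   # skip splits that don't cover all GPUs
      if [ "$gpus" -eq 1 ]; then
        echo "1GPU TP1DP1"
      else
        for ep in EPON EPOFF; do
          echo "${gpus}GPU TP${tp}DP${dp} ${ep}"
        done
      fi
    done
  done
}
```

`enumerate_configs | wc -l` gives the 19 combinations implied by the list (1 + 4 + 6 + 8).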
## E2E tests
【qwen3-coder】【0-30k】【kvs】【h20-96-d】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-nosparse-model/deployments/qwen3-coder-nosparse-model-ba4a
【qwen3-coder-flash】【0-30k】【kvs】【h20-96-d】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-flash-2025-07-28-nosparse-model/deployments/qwen3-coder-flash-2025-07-28-nosparse-model-1553
【qwen3-30b-a3b-instruct】【H20-96G-4】
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-instruct-2507-model/deployments/qwen3-30b-a3b-instruct-2507-model-a06c
- 0.9.0
【qwen3-30b-a3b-thinking】【H20-96G-4】
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-30b-a3b-thinking-2507-model?spm=43a6e6f6.2e152c3f.0.0.6d4c103cudzmEy
- 0.10.1rc2.dev397+g312aa870b
【qwen3-235b-a22b-thinking】【P】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-4945
- 0.11.2.dev1732+gd694e5c71.d20251208
【qwen3-235b-a22b-thinking】【D】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode/deployments/qwen3-235b-a22b-thinking-2507-qwenapp-crit-decode-21fd
- 0.11.2.dev1732+gd694e5c71.d20251208
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-85ed?spm=43a6e6f6.660e3d6f.0.0.622e103cCLyFsA
【qwen3.5-27b】【0-32k】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-e277
cuda128_cp312_test_vllm_87905ee0_20260222_202123
0.13.0rc2.dev2067+g486e99474.d20260222.cu128
【qwen3.5-27b】【0-32k】【5090-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3.5-27b-text-model/deployments/qwen3.5-27b-text-model-f462
cuda129_cp312_test_vllm_11606
0.13.0rc2.dev2111+gb44b43f43.d20260309
【qwen3-coder-next】【0-32k】【H20-96G-8】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-e776?spm=43a6e6f6.5b0a3d6a.0.0.413d103cIwReWg
- 0.10.2rc2.dev168+g8f0fc60c9.d20251204
1. Hardware: 5090, H20
2. Model: Qwen3.5-27B, Qwen3-30B-A3B, Qwen3-235B-A22B-FP8, Qwen3-Coder-Next-FP8
3. Trace: Chat, Thinking, Coder
Test combinations:
Hardware experiments
- 【qwen3.5-27b + 5090】
- 【qwen3.5-27b + H20】
Model experiments
- 【qwen3.5-27b + H20】
- 【qwen3-30b-a3b + H20】
- 【qwen3-235b-a22b + H20】
Trace experiments
- 【qwen3-30b-a3b + H20 + Chat】
- 【qwen3-30b-a3b + H20 + Thinking】
- 【qwen3-235b-a22b + H20 + Chat】
- 【qwen3-235b-a22b + H20 + Thinking】
- 【qwen3-coder-next + H20 + Coder】
---
【qwen3-235b-a22b-instruct】【P】【8-32k】【H20-96G-8】
1. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-5966?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
2. https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen-plus-2025-07-28-model/deployments/qwen-plus-2025-07-28-model-9f59?spm=43a6e6f6.660e3d6f.0.0.345a103cMNZMSV
- 0.13.0rc2.dev1948+g613d885a1.d20260108.cu128
## Deployments
【qwen3-max-2026-01-23-chat-aa8c】qwen3-max nonthinking
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
【qwen3-max-2026-01-23-chat-9bf8】qwen3-max thinking
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ
first-token latency: Mean: 2.93 s; Max: 9.27 s
last-token latency: Mean: 1.60 min; Max: 2.57 min
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 10.7 qps - Max: 20.0 qps
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.73 s - Max: 8.83 s
[ 363e4d99 | v-6ffe2b5b | qwen3-max | cn-beijing ] - Mean: 1.42 min - Max: 2.22 min
【qwen3-max-qwenapp-crit-50e9】
【qwen3-max-qwenapp-crit-decode】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-qwenapp-crit-decode?spm=43a6e6f6.29ced41b.0.0.4efb103cjsq29c
Input: Mean: 9131 itpr ; Max: 10817 itpr
Output: Mean: 823 otpr ; Max: 987 otpr
weighted tps - Mean: 46.6 otpsr - Min: 44.5 otpsr - Max: 48.2 otpsr
- Mean: 28.3k tpm - Max: 34.8k tpm
tail: [ 2c3bc7a4 | cn-beijing ] - Mean: 35.0 s - Max: 1.40 min
【qwen3-max-2025-10-30-thinking-model】
Input: Mean: 6074 itpr ; Max: 16790 itpr
Output: Mean: 2062 otpr Max: 4153 otpr
1. Qwen3-Chat 【nonthinking】:
qwen3-max-2026-01-23-chat-aa8c-info【v1: includes input/timestamp etc.】
qwen3-max-2026-01-23-chat-aa8c-info【v2: can collect a full week in one pass; includes output_length】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-aa8c?spm=43a6e6f6.33db9dd0.0.0.6a49103cRBNW6W
2. Qwen3-Coder
qwen3-coder-next-model【includes input/timestamp etc.】
qwen3-coder-next-model-8130-info【includes output_length】
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-coder-next-model/deployments/qwen3-coder-next-model-8130
3. Qwen3-Chat 【thinking】
qwen3-max-2026-01-23-chat-9bf8
https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.5a1b7ab3.0.0.7c9b103caxnySJ
0319~0324: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-694a?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0324~0326: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-9bf8?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH
0326+: https://dashscope-spectrum.alibaba-inc.com/console/workspaces/33e6d810/applications/qwen3-max-2026-01-23-chat/deployments/qwen3-max-2026-01-23-chat-3201?spm=43a6e6f6.33db9dd0.0.0.297a43617e9ySH