Related Work Matrix — Workload Pattern → Serving Engine Config
| Paper | Venue / Year | Primary focus | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already stated? |
|---|---|---|---|---|---|---|---|---|---|
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Multi-framework LLM serving config optimization | production serving workloads; model type; hardware platform; SLA/TTFT/TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | latency / throughput / goodput-style objectives | performance modeling + fast config search | Most direct hit | Shows the optimal engine config is clearly workload-dependent, and that the config space extends well beyond the scheduler | Yes, and most directly |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered-storage configuration for the KV cache | real-world traces; KV block access patterns; reuse; differing optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | latency / throughput / cost Pareto frontier | simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows memory-side knobs must be workload-aware, and the problem is naturally multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | fluctuating workload; instance-availability churn; preemption traces | distributed parallelization config; migration strategy | latency / throughput / monetary cost | system design + dynamic reconfiguration | Partial hit | Shows config = f(workload, resource state), not a function of workload alone | Yes, but skewed toward deployment/parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | bursty workload; real-time memory pressure; load surges | layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | runtime adaptive serving | Partial hit | Shows a statically optimal config does not fit bursty workloads; online adaptive config matters | Yes, but skewed toward runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | heterogeneous / unpredictable requests; differing priorities/SLOs | request placement, migration, instance-level scheduling | tail latency, priority acceleration, cost savings | dynamic rescheduling + live migration | Adjacent, not config tuning | Shows that under highly heterogeneous workloads, instance-level isolation/migration can matter more than local knob tuning | Proposes a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | request rate; phase contention; SLO pressure | prefill/decode partition ratio; resource controller; unified storage | latency, SLO attainment | system design + dynamic partitioning | Partial hit | Shows the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Real-world workload characterization | per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | Does not optimize config directly | benchmark realism; avoid under-provisioning | characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config pattern may be distorted | Proposes the key insight of workload realism |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real-world serving-trace dataset | burstiness; conversation patterns; response lengths; system failures | Does not optimize config directly | evaluation realism; stress realistic serving behavior | trace dataset / characterization | Foundational support, not a direct hit | Shows burstiness, conversation structure, and joint output-length statistics are all important workload features | Yes; argues for the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving-framework comparison | concurrency level; interactive vs batch use case; model size | framework choice (vLLM vs TGI) | throughput, tail latency, memory, scalability | empirical comparison | Coarse-grained hit | Shows engine choice is itself a coarse-grained config decision | Partially |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipelines | reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | stage-specific batching / HW-SW design choices | end-to-end latency; pipeline optimization | simulator / design-space exploration | Relevant to the extended boundary | A reminder that the workload→config problem will extend to multi-stage pipelines, beyond just prefill/decode | Partially; multi-stage perspective |
I suggest compressing these into four categories in the paper:
A. Workload characterization / benchmark realism
These papers don't tune config directly, but they determine whether your argument holds up.
| Paper | How to cite it |
|---|---|
| ServeGen | Use it to show that real workloads are complex and synthetic traces easily mislead configuration conclusions |
| BurstGPT | Use it to show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored |
B. Workload-aware scheduling / routing / partitioning
These papers don't tune engine knobs, but they supply the core structure of workload patterns.
| Paper | Core known insight |
|---|---|
| Llumnix | heterogeneous workloads need cross-instance rescheduling / isolation |
| EWSJF | mixed workloads should first be split into groups/regimes, then optimized |
| CascadeInfer | length heterogeneity is the core bottleneck |
| Sarathi-Serve | prefill/decode interference is the first-order bottleneck |
| semi-PD | the phase split ratio should change with workload / SLO |
C. Workload-aware configuration / adaptive resource shaping
These are closest to your work.
| Paper | Subspace hit |
|---|---|
| AIConfigurator | 通用 serving config search |
| Kareto | KV / storage configuration |
| SpotServe | distributed parallelization configuration |
| MorphServe | online adaptive resource / precision configuration |
D. Framework / pipeline-level coarse configuration
These papers show where the research boundary lies.
| Paper | Boundary significance |
|---|---|
| Comparative vLLM vs TGI | engine choice is itself a coarse-grained config decision |
| HERMES | multi-stage pipelines make the workload→config problem even more complex |
If you want to highlight the gap more in the paper, you can use this "gap matrix" directly:
| Dimension | Coverage by existing work | Your opportunity |
|---|---|---|
| Real-world workload characterization | Strong | No need to rebuild datasets, but absorb their feature definitions |
| Workload-aware scheduling | Very strong | Avoid head-on overlap with them |
| Single-subspace config adaptation (KV / parallelism / runtime precision) | Moderate | Unify the insights across these subspaces |
| General workload→engine-knob mapping | Weak | This is your main opportunity |
| Config optimization under a unified objective of length-dependent TTFT + TPOT + goodput | Very weak | This is your strongest position |
Here is the related-work positioning distilled into one sentence.
When you write the paper, you can put it like this:
Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
This sentence largely makes your relationship to the works above clear.
As a next step, I suggest building a "feature × knob" matrix
That is:
- Rows: workload features
  - arrival burstiness
  - input length mean/variance/tail
  - output length mean/variance/tail
  - mixed-workload ratio
  - prefix/KV locality
  - SLO strictness
- Columns: config knobs
  - max_num_batched_tokens
  - scheduler mode
  - prefill/decode split
  - KV memory fraction
  - block size
  - parallelism
  - disaggregation mode
  - cache/tier policy
Then mark which feature→knob edges each paper actually covers.
This will serve the project more directly than the paper matrix.
If you want, the next step is to compile that feature × knob insight matrix.
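The proposed matrix can be sketched directly in code. The feature and knob names come from the lists above; the papers and their feature→knob edge assignments below are illustrative placeholders, not a verified reading of each paper:

```python
# Sketch of the feature × knob coverage matrix. Per-paper edge
# assignments are placeholders for illustration only.
FEATURES = [
    "arrival_burstiness", "input_len_stats", "output_len_stats",
    "mixed_workload_ratio", "prefix_kv_locality", "slo_strictness",
]
KNOBS = [
    "max_num_batched_tokens", "scheduler_mode", "prefill_decode_split",
    "kv_memory_fraction", "block_size", "parallelism",
    "disaggregation_mode", "cache_tier_policy",
]

# Which feature→knob edges each paper covers (hypothetical assignments).
PAPER_EDGES = {
    "Kareto": [("prefix_kv_locality", "cache_tier_policy")],
    "SpotServe": [("arrival_burstiness", "parallelism")],
    "semi-PD": [("slo_strictness", "prefill_decode_split")],
}

def coverage_matrix():
    """Return {(feature, knob): covered?} over the full feature × knob grid."""
    covered = {edge for edges in PAPER_EDGES.values() for edge in edges}
    return {(f, k): (f, k) in covered for f in FEATURES for k in KNOBS}

matrix = coverage_matrix()
uncovered = [edge for edge, hit in matrix.items() if not hit]
```

The uncovered edges are exactly the gap cells this section argues for.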
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Vidur: A Large-Scale Simulation Framework for LLM Inference
DynamoLLM: profile (across TP degrees and workloads) + search (target: minimize energy while meeting SLOs) + dynamic adjustment of the parallelism configuration.
NanoFlow: theoretical model + automated pipeline search (splitting sequences/operations; target: maximize compute/memory/network utilization).
https://github.com/llm-d/llm-d
AIBrix experimentally introduces heterogeneous-GPU inference: requests come in different sizes, and each GPU type has a sweet spot for a particular request-size range; it profiles offline and schedules online to minimize cost when serving on heterogeneous hardware. This carries the perennial problem: the gap between offline profiles and actual serving. https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst
agent related:
- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323 https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning
Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.
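SCOOT's stress-test selection (the longest 50% of output lengths) can be sketched as follows; the record layout and field names are assumptions for illustration, not taken from the paper's artifacts:

```python
def stress_subset(requests):
    """Keep the half of the trace with the longest output lengths,
    mirroring SCOOT's stress-test selection (field names are assumed)."""
    ranked = sorted(requests, key=lambda r: r["output_len"], reverse=True)
    return ranked[: len(ranked) // 2]

# Toy trace standing in for the Ant Group service traces (SQL/BOT/CLS/REC).
trace = [
    {"input_len": 120, "output_len": 30},
    {"input_len": 80, "output_len": 500},
    {"input_len": 200, "output_len": 45},
    {"input_len": 60, "output_len": 900},
]
subset = stress_subset(trace)  # keeps the requests with output_len 900 and 500
```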
SLO-Aware Scheduling for Large Language Model Inferences https://arxiv.org/pdf/2504.14966
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving https://arxiv.org/abs/2602.22593
The "unified paradigm" you are after has in fact been described very plainly in several top-venue papers: define the search space/actions → run trials → measure the objective → use an algorithm to choose the next action.
- **ATC'18 (Cao et al.)** describes the black-box auto-tuning mechanism in highly standardized terms: iteratively try different configurations, measure the objective function, and choose the next batch of configurations based on what has been learned. https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
- **OSDI'23 (Hydro)** gives the same "workflow definition" for hyperparameter/config tuning: the user specifies a search space; an algorithm generates trials; the system orchestrates execution until the best configuration is found. https://www.usenix.org/system/files/osdi23-hu.pdf
- **ATC'18 (Metis)** positions itself as a "black-box optimization service," tuning real production systems with tail latency as the main evaluation metric. https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
- **OSDI'18 (µTune)** belongs to the "online adaptive" school, but it is still the same closed loop: monitor/estimate load → predict tail latency under each configuration with a model → switch to the predicted-best configuration. https://www.usenix.org/system/files/osdi18-sriraman.pdf
- **SOSP'21 (POP)** is the "mathematical optimization / solver" school; it explicitly frames system resource allocation as a problem "expressible as mathematical optimization" and discusses the trade-off between solve speed and SLAs. https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf
Conclusion: whether black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems-optimization work lands on the same meta-structure:
Under a budget and constraints, iterate a closed loop around some "executable trial," generating actions and updating the policy.
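That meta-structure fits in a few lines of Python. The knob names, the search strategy (plain random search), and the objective function below are all made-up placeholders standing in for a real engine trial:

```python
import random

# Generic auto-tuning closed loop: search space -> trial -> measure -> select next.
# Knob names and the objective are illustrative, not any real engine's API.
SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192],
    "kv_memory_fraction": [0.7, 0.8, 0.9],
}

def run_trial(config):
    # Stand-in for launching the engine and replaying a workload;
    # returns a goodput-style objective (higher is better).
    return (-abs(config["max_num_batched_tokens"] - 4096) / 4096
            - abs(config["kv_memory_fraction"] - 0.9))

def tune(budget, seed=0):
    rng = random.Random(seed)
    best_cfg, best_obj = None, float("-inf")
    for _ in range(budget):                                  # budget-bounded loop
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}   # generate an action
        obj = run_trial(cfg)                                 # measure the objective
        if obj > best_obj:
            best_cfg, best_obj = cfg, obj                    # update the incumbent
    return best_cfg
```

Swapping the random `rng.choice` for Bayesian optimization, a fluid-model controller, or an ILP solver reproduces the different schools above without changing the loop's shape.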
related works
DynamoLLM and NanoFlow (summarized above); MorphServe; Autocomp
Questions
- At around 10:00 in the Mooncake interview: business scenarios have become extremely diverse and need different configurations. With PD disaggregation, how many P and how many D nodes, and which parallelism mode runs inside each P and D node, all need tuning. Does Qwen in production have a clear need for different configurations? Or is the current practice still manual testing to find one overall reasonably good configuration that is then fixed per model, with no online load awareness or dynamic adjustment? I believe Alibaba faces "heterogeneous hardware × many models × diverse workloads" in production; does every dimension need a manually fine-tuned inference config, or is there a better workflow?
- Has the Qwen series deployed EP in production? If so, why not use DeepSeek's DBO (dual batch overlap) scheme? My understanding is that in EP scenarios DBO can always overlap compute and communication to some degree and thus improve performance.
- Qwen-Next shows optimizations for linear attention (GDNAttention) and EP optimizations for very large sparse expert sets; can this be regarded as a general trend?
Backup
Note: "automatic optimization" here focuses on automated system-level mechanisms (scheduling / parallelism / batching / energy-and-cost / KV-cache management), not pure model modification or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers at face value).
Quick comparison table
| Work (year / source) | One-sentence core | Key automation mechanism | Reported performance claim (per authors) | Differentiator vs. similar work | Adoption in vLLM / SGLang |
|---|---|---|---|---|---|
| DynamoLLM (2024, HPCA'25) (arXiv) | Cluster-level energy/cost optimization: automatically reconfigures the inference cluster while meeting SLOs | Load- and power-aware dynamic cluster reconfiguration with hierarchical control | ~52–53% energy savings, 38% operational carbon reduction, 61% cost reduction, SLOs maintained | Optimizes the service/cluster-level operating point, not GPU kernel or batching details | Deployed alongside vLLM/SGLang as an external orchestration layer, not embedded; usable as an upper-level resource controller |
| NanoFlow (2024→2025) (arXiv) | Splits each request into multiple nano-batches on a single GPU and auto-searches a parallel, overlapped pipeline | Auto-search over nano-batch count/size/order/resource quotas; operator co-scheduling to overlap compute/memory/communication | Up to 1.91× throughput over multiple baselines (vLLM, FastGen, TRT-LLM) | Emphasizes intra-device parallelism and pipeline overlap; fine-grained pipelining for decode-light / prefill-heavy structures | Currently a standalone runtime (open-source implementation), not merged into the vLLM/SGLang mainline; usable as an alternative backend (GitHub) |
| Sarathi & Sarathi-Serve (2023–2024) (arXiv) | Chunks long prefills and mixes them with decodes into continuous hybrid batches, reducing pipeline "bubbles" | Chunked prefill + continuous hybrid batching (decode piggyback) | Up to 1.91× end to end; up to 10× decode throughput | First to systematize the general "mix prefill and decode in one batch" strategy | vLLM and SGLang have both adopted/absorbed the idea (see "adoption" notes) (vLLM Documentation) |
| DeepSpeed-FastGen (2024) (arXiv) | Blocked KV + Dynamic SplitFuse continuous batching, balancing low latency with high throughput | Dynamic batch splitting/fusing, blocked KV reuse | Outperforms vLLM across models/hardware (authors' report) | Close to vLLM's PagedAttention route but a different implementation | Standalone backend; an alternative to vLLM/SGLang |
| POD-Attention (ASPLOS'25) (Microsoft) | Pursues full overlap of prefill and decode to reduce Token Break Time (TBT) | Fully parallelized prefill/decode scheduling with compatible kernels | Significantly reduces stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching toward near-complete overlap | Academic prototype; the ideas transfer to the vLLM/SGLang scheduling/kernel layers |
| Fluid-Guided Online Scheduling (WAIT/Nested WAIT) (2025, SSRN) (SSRN) | Abstracts LLM inference as multi-stage online scheduling under KV-memory constraints and gives near-optimal policies | Fluid-model-based online batching and memory-allocation decisions | Experimentally beats vLLM/Sarathi on throughput/latency (authors' report) | Strongly theory-driven; an online algorithm coupling memory, batching, and latency | Research-stage; not merged into mainstream implementations |
| Memory-aware dynamic batching (2025) (arXiv) | Monitors GPU memory and SLAs at runtime, adapting batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Beats fixed hyperparameters on throughput/latency for Llama-7B on A100 | A more engineering-oriented online batch-hyperparameter tuning method | Compatible with vLLM/SGLang self-scheduling; no record of a mainline merge |
| HyGen (2025) (arXiv) | Online/offline co-location: two-stage scheduling squeezes out "offline" capacity without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location while raising overall utilization | Focuses on workload co-location rather than single-engine inference throughput | Deployable alongside any backend; no vLLM/SGLang kernel changes |
| PrefillOnly (2025) (arXiv) | Minimal KV and execution-path optimization for "prefill-only" workloads that emit just one token | Keeps only the last layer's KV; lightweight execution path | Significant speedups and latency reductions for retrieval/classification workloads | Path pruning for a specific workload class | Usable as a specialized backend/path attached to mainstream engines |
| vLLM / PagedAttention (2023→) (arXiv) | Paged KV cache + preemptive scheduling: near-zero fragmentation and easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to 24× throughput over the HF baseline (early report) | First to standardize memory paging and continuous batching | Has become one of the de facto mainstream open-source serving engines |
| Throughput-Optimal Scheduling for LLM Serving (2025) (arXiv) | Gives theoretical throughput upper bounds / optimal policies for continuous batching | Token-level queueing/matching policies | Theoretical optimality results that connect to practice | Theory-leaning baseline that guides engineering | Awaiting engineering adoption |
| Learning-to-Rank Scheduler (2024) | Predicts the relative length ordering of requests to approximate SJF, reducing latency and raising throughput | LTR-trained scheduler that queues/batches in predicted order | Better latency/completion time than existing baselines | Same family of automated queueing/batch formation | Can plug into the vLLM/SGLang admission/queueing stage. (arXiv) |
| Online Scheduling with KV Constraints (2025) | Abstracts LLM inference as online scheduling under KV-memory constraints and gives near-optimal policies | Fluid/queueing models + online batching/KV decisions | Beats vLLM/Sarathi baselines (authors' report) | Same family of theory-grounded automated online scheduling | Could serve as a policy plugin for vLLM/SGLang (needs low-overhead telemetry). (arXiv) |
| Fairness-Aware Batch Formation (2025.10) | Automatically balances "compute fairness" between new and old requests against throughput under continuous/hybrid batching | In-batch quota/reordering policies | Significantly improves fairness while preserving throughput | Same family of automated batch formation (complementary to Sarathi/chunked prefill) | Could rework the vLLM/SGLang batcher. (arXiv) |
| Drift (PD-Multiplexing) (2025) | Phase-decoupled multiplexing: separates the prefill and decode phases and multiplexes them "in place," easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Improves throughput while holding SLOs across workloads | Same phase-level overlap lineage as NanoFlow/POD | Requires kernel/scheduler co-design; suited as a deep engine-modification direction. (Han Zhao) |
Qwen optimization
Catalogued how Alibaba’s Qwen-family models have been tuned across the codebase.
Optimization Matrix
| Model scope | Optimization & commits | Why it was added | Feature / hardware fit |
|---|---|---|---|
| Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput |
| Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations |
| Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements |
| Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs |
| Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads |
| Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request |
| Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving |
| Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks |
| Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) → vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 |
| Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning |
Key observations
- Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound.
- Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers.
- Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput.
Next steps
- Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads.
- Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology.
| Model | Optimization | Why it was added | Feature / hardware fit |
|---|---|---|---|
| Qwen3‑Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments |
| Qwen3‑Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP × EP + DeepEP doesn’t repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all |
| Qwen3‑Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding |
| Qwen3‑Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths |
| Qwen3‑Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks |
| Qwen3‑Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba’s published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper |
| Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers’ dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128 K) on dense Qwen3 checkpoints |
| Qwen3‑MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP × EP launches with DeepEP / allgather-reducescatter all-to-all |
| Qwen3‑MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters |
| Qwen3‑MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64–128 experts |
| Qwen3‑MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models |
Highlights & context
- The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active.
- SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner.
- Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba’s long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set.
Next steps
- Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput.
- If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect.
Evidence: when the team first tuned Qwen2’s 57 B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11 058 tok/s) to 12.47 req/s (13 089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics with hand‑picked block and warp sizes for that GPU. The newer GB200 FP8 tables you’re looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance – that’s why they use more aggressive BLOCK_SIZE_N and higher num_stages.
Runtime Optimizations
| Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes |
|---|---|---|---|
| Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. |
| Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. |
| Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. |
| Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. |
| Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. |
| Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. |
| Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. |
| Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. |
| Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. |
| Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. |
| Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. |
| Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. |
| Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. |
| Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. |
| Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. |
| Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. |
| Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. |
| Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. |
| Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. |
| Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. |
| Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. |
Kernel Config Tuning (Fused MoE / FP8)
| Commits (Date) | Model / HW Target | Optimization | Perf Metrics |
|---|---|---|---|
| 4d0f2661 (2025-10-20) | Qwen3-30B A3/A3B on H100 (FP8 & BF16) | Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes. | Not reported. |
| f96bc364 (2025-10-15) | Qwen3-Next FP8 on H100 TP=2 | Introduced TP2-specific FP8 fused_moe config. | Not reported. |
| 238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07) | Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100 | New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage. | Not reported. |
| 8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28) | Qwen3-Coder-480B-A35B on NVIDIA H20-3e | Added FP8 fused_moe configs for large coder variant. | Not reported. |
| 2d40665 (2025-06-11) | Qwen3-30B A3B on NVIDIA B200 | Introduced B200-specific FP8 fused_moe config. | Not reported. |
| 22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08) | Qwen3-235B A22B on NVIDIA H20-3e & A100 | Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100). | Not reported. |
| 8fc88d63 (2025-04-28) | Qwen3 MoE (H100/H200/H20 targets) | Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README. | Not reported. |
| dcbac4cb (2025-04-28) | Qwen3 dense FP8 | Adjusted linear layers so FP8 compatibility works with fused kernels. | Not reported. |
| 2007d4d5 & f5a3c655 (2025-05-01) | Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X | Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]). | Not reported. |
| bd439735 (2024-06-14) | Qwen2-57B-A14B (TP2/TP4) | Tuned fused_moe configs for A100/H100; benchmarks show +18–20% requests/s (10.53→12.47 @TP2, 17.77→20.20 @TP4) and +18–14% tokens/s. | Throughput gains published in commit message. |
Next steps (optional):
- Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren’t reported.
- For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.
Model optimization methods:
- Reduce copies between host and device
- Tune fused_moe configs so the Triton kernels are more efficient for a given E/N/device/quantization combination
- Increase overlap between CPU and GPU
notes
Mingxing Zhang:
Different phases of an agent may have different business requirements, and different requirements call for different parallelism.
value = function / cost. GPU utilization is already at 70–80%, so further gains within a single scenario are hard; instead, improve across multiple scenarios by giving different functions different cost, maximizing value.
When adapting Mooncake to vLLM and SGLang, the engineering approaches differed; each inference framework required its own optimization and adaptation work.
Model Optimization Summary
Kernel
fused_moe/configs/xxx: per-(expert count E, N, device, quantization) config files that guide Triton kernel generation, keeping SM utilization high while ensuring shared memory and registers do not overflow.
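The shape of such a tuning file, sketched as a Python dict: the key names mirror the shipped JSON files (e.g. E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json), but the values and the nearest-size selection rule here are illustrative assumptions, not the shipped tunings.

```python
# Illustrative shape of a fused_moe tuning table: it maps a token-batch
# size to Triton tiling parameters. Values below are made up.
FUSED_MOE_CONFIG = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8,  "num_warps": 8, "num_stages": 4},
}

def pick_config(num_tokens):
    """Pick the entry tuned for the closest batch size (selection rule assumed)."""
    sizes = sorted(int(s) for s in FUSED_MOE_CONFIG)
    best = min(sizes, key=lambda s: abs(s - num_tokens))
    return FUSED_MOE_CONFIG[str(best)]
```

The tile sizes bound shared-memory and register use per SM, which is why each device/precision combination needs its own table.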
GDNAttention for Qwen3-Next
Data Movement
Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint’s architecture while staying compatible with speculative decoding and parallelism features.
| Optimization | Match-to-model feature | Performance effect | Evidence/data |
|---|---|---|---|
| Gated DeltaNet linear-attention backend with fused gating kernels | Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs | Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance | Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo |
| Shared Fused MoE with expert parallel load balancing | Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps | Maintains Qwen3 Next’s shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost | Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present |
| NextN multi-token predictor (MTP) path | Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner | Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled | MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published |
| Mamba-style state management for speculative decode | Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default “no speculative” guard specifically for Qwen3 Next | Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding—otherwise disallowed for generic Mamba models | State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied |
No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.
Next steps (optional): 1) run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers; 2) capture profiling traces to confirm the GDN layers hit the fused kernels.
After PD disaggregation is introduced, the scheduler matters more under EP×DP; adjusting the parallelism mode online may matter less than scheduler adjustments that keep the pattern each rank receives relatively stable.