Related Work Matrix — Workload Pattern → Serving Engine Config

| Paper | Venue / Year | Main subject | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already proposed? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Config optimization for multi-framework LLM serving | Production serving workloads; model type; hardware platform; SLA / TTFT / TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | Latency / throughput / goodput-style objectives | Performance modeling + fast config search | Most direct hit | Shows that the optimal engine config is clearly workload-dependent, and that the config space goes well beyond the scheduler | Yes, and the most direct |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered-storage configuration for the KV cache | Real-world traces; KV block access patterns; reuse; differing optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | Latency / throughput / cost Pareto frontier | Simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows that memory-side knobs must be workload-aware, and the problem is naturally multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | Fluctuating workload; instance-availability changes; preemption traces | Distributed parallelization config; migration strategy | Latency / throughput / monetary cost | System design + dynamic reconfiguration | Partial hit | Shows that config = f(workload, resource state), not a function of workload alone | Yes, but focused on deployment / parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | Bursty workload; real-time memory pressure; load surges | Layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | Runtime adaptive serving | Partial hit | Shows that a single static optimal config does not fit bursty workloads; online adaptive config matters | Yes, but focused on runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | Heterogeneous / unpredictable requests; differing priorities / SLOs | Request placement, migration, instance-level scheduling | Tail latency, priority acceleration, cost saving | Dynamic rescheduling + live migration | Adjacent, not config tuning | Shows that under highly heterogeneous workloads, instance-level isolation/migration may matter more than local knob tuning | Proposes a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | Request rate; phase contention; SLO pressure | Prefill/decode partition ratio; resource controller; unified storage | Latency, SLO attainment | System design + dynamic partitioning | Partial hit | Shows that the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Real-workload characterization | Per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | Does not optimize configs directly | Benchmark realism; avoiding under-provisioning | Characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config relationship may be distorted | Yes, the workload-realism insight |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real serving-trace dataset | Burstiness; conversation patterns; response lengths; system failures | Does not optimize configs directly | Evaluation realism; stressing realistic serving behavior | Trace dataset / characterization | Foundational support, not a direct hit | Shows that burstiness, conversation structure, and joint output-length statistics are all important workload features | Yes, on the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving-framework comparison | Concurrency level; interactive vs batch use cases; model size | Framework choice (vLLM vs TGI) | Throughput, tail latency, memory, scalability | Empirical comparison | Coarse-grained hit | Shows that engine choice itself is a coarse-grained config decision | Partially |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipeline | Reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | Stage-specific batching / HW-SW design choices | End-to-end latency; pipeline optimization | Simulator / design-space exploration | Boundary extension | Reminder: the workload→config problem will extend to multi-stage pipelines, beyond prefill/decode alone | Partially, the multi-stage perspective |

In the paper, I suggest compressing these further into 4 categories:

A. Workload characterization / benchmark realism

These papers do not tune configs directly, but they determine whether your argument holds up.

| Paper | How to cite it |
| --- | --- |
| ServeGen | Use it to show that real workloads are complex and synthetic traces can easily lead to misleading config conclusions |
| BurstGPT | Use it to show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored |

B. Workload-aware scheduling / routing / partitioning

These works do not tune engine knobs, but they capture the core structure of workload patterns.

| Paper | Core known insight |
| --- | --- |
| Llumnix | Heterogeneous workloads need cross-instance rescheduling / isolation |
| EWSJF | Mixed workloads should first be grouped into regimes, then optimized |
| CascadeInfer | Length heterogeneity is the core bottleneck |
| Sarathi-Serve | Prefill/decode interference is the first-order bottleneck |
| semi-PD | The phase split ratio should change with workload / SLO |

C. Workload-aware configuration / adaptive resource shaping

These are the closest to your work.

| Paper | Covered subspace |
| --- | --- |
| AIConfigurator | General serving config search |
| Kareto | KV / storage configuration |
| SpotServe | Distributed parallelization configuration |
| MorphServe | Online adaptive resource / precision configuration |

D. Framework / pipeline-level coarse configuration

These works show where the research boundary lies.

| Paper | Boundary significance |
| --- | --- |
| Comparative vLLM vs TGI | Engine choice itself is already a coarse-grained config decision |
| HERMES | Multi-stage pipelines make the workload→config problem even more complex |

If you want to highlight the gap more clearly in the paper, you can use this "gap matrix" directly:

| Dimension | Coverage by existing work | Your space |
| --- | --- | --- |
| Realistic workload characterization | | No need to rebuild datasets, but absorb their feature definitions |
| Workload-aware scheduling | Very strong | Avoid overlapping with them head-on |
| Single-subspace config adaptation (KV / parallelism / runtime precision) | Moderate | Unify the insights across these subspaces |
| General workload→engine-knob mapping | | This is your main opportunity |
| Config optimization under a unified length-dependent TTFT + TPOT + goodput objective | Very weak | This is your strongest position |

Condensed into a one-sentence related-work positioning:

When writing the paper, you could phrase it roughly as:

Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.

This sentence basically makes the relationship between your work and the works above clear (a minimal formalization of the goodput objective it mentions is sketched below).
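For reference, a minimal sketch (my own illustration, not taken from any of the papers above) of how the SLO-constrained goodput objective could be computed from per-request measurements; the field names and thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    ttft_s: float   # time to first token (seconds)
    tpot_s: float   # mean time per output token (seconds)
    input_len: int  # prompt length in tokens


def slo_goodput(requests: list[RequestMetrics], duration_s: float,
                ttft_slo, tpot_slo_s: float) -> float:
    """Requests per second that meet both SLOs.

    ttft_slo is a callable so the TTFT budget can depend on prompt length
    (the 'length-dependent TTFT' setting above); tpot_slo_s is a flat budget.
    """
    ok = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo(r.input_len) and r.tpot_s <= tpot_slo_s
    )
    return ok / duration_s


if __name__ == "__main__":
    # Example: TTFT budget that grows with prompt length, flat 50 ms TPOT budget.
    reqs = [RequestMetrics(0.4, 0.03, 1024), RequestMetrics(2.5, 0.08, 8000)]
    print(slo_goodput(reqs, duration_s=10.0,
                      ttft_slo=lambda n: 0.2 + n / 10_000, tpot_slo_s=0.05))
```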

As a next step, I suggest adding a "feature × knob" matrix

That is:

  • workload features
    • arrival burstiness
    • input length mean/variance/tail
    • output length mean/variance/tail
    • mixed-workload ratio
    • prefix/KV locality
    • SLO strictness
  • config knobs
    • max_num_batched_tokens
    • scheduler mode
    • prefill/decode split
    • KV memory fraction
    • block size
    • parallelism
    • disaggregation mode
    • cache/tier policy

Then mark which feature→knob edges each paper actually covers (one way to record this is sketched below).

This will serve your project more directly than the paper matrix.
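A minimal sketch of one way to record these feature→knob edges so coverage gaps are easy to query; the paper names and edge assignments below are illustrative placeholders, not verified claims about each paper.

```python
from collections import defaultdict

# Illustrative edges only: (workload feature, config knob) pairs a paper is
# claimed to connect. Fill these in from the reading notes above.
COVERAGE: dict[str, set[tuple[str, str]]] = {
    "AIConfigurator": {("SLO strictness", "parallelism"),
                       ("SLO strictness", "KV memory fraction")},
    "Kareto": {("prefix/KV locality", "cache/tier policy")},
    "semi-PD": {("mixed-workload ratio", "prefill/decode split")},
}

FEATURES = ["arrival burstiness", "input length mean/variance/tail",
            "output length mean/variance/tail", "mixed-workload ratio",
            "prefix/KV locality", "SLO strictness"]
KNOBS = ["max_num_batched_tokens", "scheduler mode", "prefill/decode split",
         "KV memory fraction", "block size", "parallelism",
         "disaggregation mode", "cache/tier policy"]


def uncovered_edges() -> list[tuple[str, str]]:
    """Feature→knob edges no surveyed paper touches, i.e. the gap."""
    covered = set().union(*COVERAGE.values())
    return [(f, k) for f in FEATURES for k in KNOBS if (f, k) not in covered]


if __name__ == "__main__":
    papers_per_edge = defaultdict(list)
    for paper, edges in COVERAGE.items():
        for edge in edges:
            papers_per_edge[edge].append(paper)
    print(f"{len(uncovered_edges())} of {len(FEATURES) * len(KNOBS)} edges uncovered")
```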

If you want, I can put together the feature × knob insight matrix in the next message.

ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Vidur: A Large-Scale Simulation Framework for LLM Inference

DynamoLLM: profiles different TP and workload combinations + search (target: minimize energy while meeting SLOs) + dynamically adjusts the parallelism configuration. NanoFlow: theoretical model + automated pipeline search (splits sequences / operations; target: maximize compute/memory/network utilization).

https://github.com/llm-d/llm-d

AIBrix experimentally proposes heterogeneous-GPU inference: requests come in different sizes, and different GPUs have sweet spots for different request sizes; offline profiling + online scheduling, minimizing cost when serving on heterogeneous hardware. The usual caveat applies: the gap between offline profiles and actual serving behavior. https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst

agent related:

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288

SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323 https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning

Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).

SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.

!projects/auto-tuner/related-works.figs/260410-105227.png
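A minimal sketch of the selection rule described above (keep the requests whose output lengths fall in the top 50% for stress testing); the trace format and column name are assumptions for illustration, not SCOOT's actual code.

```python
import pandas as pd


def stress_subset(trace: pd.DataFrame, frac: float = 0.5) -> pd.DataFrame:
    """Keep the requests with the longest `frac` fraction of output lengths,
    mirroring SCOOT's stress-test selection. Assumes an `output_len` column."""
    cutoff = trace["output_len"].quantile(1.0 - frac)
    return trace[trace["output_len"] >= cutoff]


if __name__ == "__main__":
    # Toy trace; real traces would come from serving logs.
    toy = pd.DataFrame({"input_len": [120, 800, 40, 300],
                        "output_len": [32, 900, 15, 450]})
    print(stress_subset(toy))
```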

SLO-Aware Scheduling for Large Language Model Inferences https://arxiv.org/pdf/2504.14966

https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf

FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving https://arxiv.org/abs/2602.22593

The "unified paradigm" you are looking for has in fact been stated quite plainly in several top-conference papers: define a search space / action set → run a trial → measure the objective → use an algorithm to pick the next action.

Conclusion: whether it is black-box search, online control, reinforcement learning, or mathematical programming, top-conference systems-optimization work largely follows the same meta-structure:

Under a budget and constraints, run a closed loop around some "executable trial", iteratively producing actions and updating the policy (a minimal sketch follows below).
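A minimal sketch of that meta-structure applied to engine-config tuning; `run_trial`, the candidate knobs, and the use of plain random search are all my own placeholders rather than any specific paper's method.

```python
import random

# Illustrative search space: a few engine knobs and candidate values.
SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192],
    "scheduler_mode": ["fcfs", "chunked_prefill"],
    "kv_memory_fraction": [0.7, 0.85, 0.9],
}


def run_trial(config: dict) -> float:
    """Placeholder: deploy the config, replay a workload slice, return goodput.
    In practice this would drive a real benchmark or a simulator."""
    rng = random.Random(hash(tuple(sorted(config.items()))))
    return rng.uniform(0.0, 1.0)


def tune(budget: int = 20) -> tuple[dict, float]:
    """Closed loop: propose an action (a config), run a trial, measure, and
    repeat until the trial budget is spent, keeping the best observed config."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {k: random.choice(v) for k, v in SPACE.items()}  # action
        score = run_trial(cfg)                                  # measurement
        if score > best_score:                                  # policy update
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


if __name__ == "__main__":
    print(tune())
```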


related works

DynamoLLM: profiles different TP and workload combinations + search (target: minimize energy while meeting SLOs) + dynamically adjusts the parallelism configuration. NanoFlow: theoretical model + automated pipeline search (splits sequences / operations; target: maximize compute/memory/network utilization). Also: MorphServe, Autocomp.

Questions

  1. In the Mooncake interview (around 10:00): business scenarios have become extremely diverse and need different configurations; with PD disaggregation, how many P and how many D instances, and which parallelism modes run inside the P and D nodes, all need tuning. Does Qwen in production have a clear need for different configurations? Or is the current practice still to tune by manual testing, find an overall reasonably good configuration, and fix it as the per-model configuration, with no online workload awareness or dynamic adjustment? I believe Alibaba faces "heterogeneous hardware × multiple models × diverse workloads" here; in production, does every dimension need manual fine-tuning to find a good inference configuration, or is there a better workflow?
  2. Has the Qwen model family deployed EP in production? If EP is deployed, why not use DeepSeek's DBO (dual batch overlap) scheme? My understanding is that in EP scenarios DBO can always overlap computation and communication to some degree and thereby improve performance.
  3. Qwen-Next shows optimizations for linear attention (GDNAttention) and EP optimizations for very large, sparse expert sets; can this be considered a general trend?

Backup

Note: "automatic optimization" here refers to system-level mechanisms such as automated scheduling / parallelism / batching / energy-cost management / KV-cache management, not model modifications or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers at face value).

Quick comparison table

| Work (year / source) | One-sentence core | Key automation mechanism | Reported performance claim (per authors) | Difference from similar work | vLLM / SGLang adoption |
| --- | --- | --- | --- | --- | --- |
| DynamoLLM (2024, HPCA'25) (arXiv) | Cluster-level energy/cost optimization: automatically reconfigures the inference cluster while meeting SLOs | Load- and power-aware dynamic cluster reconfiguration, hierarchical control | ~52-53% energy savings, 38% less operational carbon, 61% lower cost, SLOs maintained | Optimizes the service/cluster-level operating point, not GPU kernels or batching details | Deployed alongside vLLM/SGLang as an external orchestration layer, not embedded; usable as an upper-level resource controller |
| NanoFlow (2024→2025) (arXiv) | Splits each request into multiple nano-batches on a single GPU and automatically searches for a parallel, overlapped pipeline | Automatic search over nano-batch count/size/order/resource quotas; operator co-scheduling to overlap compute/memory/communication | Up to 1.91× throughput over multiple baselines (vLLM, FastGen, TRT-LLM) | Emphasizes intra-device parallelism and pipeline overlap; fine-grained pipelining for decode-light / prefill-heavy structures | Currently a standalone runtime (open-source implementation), not merged into the vLLM/SGLang mainline; can serve as an alternative backend (GitHub) |
| Sarathi & Sarathi-Serve (2023-2024) (arXiv) | Chunks long prefills and mixes them with decodes into continuous hybrid batches, reducing pipeline "bubbles" | Chunked prefill + continuous hybrid batching (decode piggyback) | Up to 1.91× end-to-end, up to 10× decode throughput | Earliest systematic statement of the general "mix prefill and decode in one batch" strategy | The idea has been adopted/absorbed by both vLLM and SGLang (see adoption notes below) (vLLM Documentation) |
| DeepSpeed-FastGen (2024) (arXiv) | Blocked KV + Dynamic SplitFuse continuous batching, balancing low latency and high throughput | Dynamic batch splitting/fusing; blocked KV reuse | Outperforms vLLM across models/hardware (authors' report) | Similar direction to vLLM's PagedAttention but a different implementation | Standalone backend; an alternative to vLLM/SGLang |
| POD-Attention (ASPLOS'25) (Microsoft) | Pursues full overlap of prefill and decode to reduce time-between-tokens (TBT) | Fully parallelized prefill/decode scheduling plus compatible kernels | Significantly reduces stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching toward near-full overlap | Academic prototype; the ideas transfer to the vLLM/SGLang scheduler/kernel layers |
| Fluid-Guided Online Scheduling, WAIT / Nested WAIT (2025, SSRN) (SSRN) | Abstracts LLM inference as multi-stage online scheduling under KV-memory constraints and gives near-optimal policies | Fluid-model-based online batching and memory-allocation decisions | Experimentally better throughput/latency than vLLM/Sarathi (authors' report) | Strongly theory-driven; an online algorithm that couples memory, batching, and latency | Research-stage; not merged into mainstream implementations |
| Memory-aware dynamic batching (2025) (arXiv) | Monitors GPU memory and SLAs at runtime and adapts batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Better throughput/latency than fixed hyperparameters on Llama-7B + A100 | A more engineering-oriented online batch-hyperparameter tuning method | Compatible in spirit with vLLM/SGLang self-scheduling; no record of a mainline merge |
| HyGen (2025) (arXiv) | Online/offline co-location: two-stage scheduling extracts "offline" compute without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location while raising overall utilization | Focuses on workload co-location rather than single-engine inference throughput | Can be deployed alongside any backend; no vLLM/SGLang kernel changes |
| PrefillOnly (2025) (arXiv) | Minimal KV and a streamlined path for "prefill-only" workloads that emit only 1 output token | Keeps only the last layer's KV; lightweight execution path | Significant speedup and latency reduction for retrieval/classification-style workloads | Path pruning tailored to a specific workload type | Can attach to mainstream engines as a specialized backend/path |
| vLLM / PagedAttention (2023→) (arXiv) | Paged KV cache + preemptive scheduling: near-zero fragmentation and easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to 24× throughput over the HF baseline (early report) | First to standardize paged memory + continuous batching | Has become one of the de facto mainstream open-source serving engines |
| Throughput-Optimal Scheduling for LLM Serving (2025) (arXiv) | Theoretical throughput upper bounds / optimal policies for continuous batching | Token-level queueing/matching policies | Theoretical optimality results that connect to practice | A theory-oriented baseline that guides engineering | Awaiting engineering adoption |
| Learning-to-Rank Scheduler (2024) | Predicts the relative ordering of request lengths to approximate SJF, lowering latency and raising throughput | An LTR-trained scheduler that queues/batches requests in predicted order | Better latency / completion time than existing baselines | Same family as the above: automatic queueing / batch formation | Can plug into the admission/queueing stage of vLLM/SGLang. (arXiv) |
| Online Scheduling with KV Constraints (2025) | Abstracts LLM inference as online scheduling with KV-memory constraints and gives near-optimal policies | Fluid/queueing model + online batching/KV decisions | Beats vLLM/Sarathi baselines (authors' report) | Same family: theory for automated online scheduling | Could be a policy plugin for vLLM/SGLang (needs low-overhead telemetry). (arXiv) |
| Fairness-Aware Batch Formation (2025.10) | Automatically balances "compute fairness" between new and old requests against throughput under continuous/hybrid batching | In-batch quota/reordering policies | Markedly reduces unfairness while preserving throughput | Same family: batch-formation automation (complementary to Sarathi / chunked prefill) | Could be retrofitted into the vLLM/SGLang batcher. (arXiv) |
| Drift PD-Multiplexing (2025) | Phase-decoupled multiplexing: separates the prefill and decode phases and multiplexes them in place, easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Improves throughput while holding SLOs across multiple workloads | Same phase-level overlap line as NanoFlow/POD | Needs kernel/scheduler co-design; best suited as a deep engine modification direction. (Han Zhao 赵涵) |

Qwen optimization

Catalogued how Alibaba's Qwen-family models have been tuned across the codebase.

Optimization Matrix

| Model scope | Optimization & commits | Why it was added | Feature / hardware fit |
| --- | --- | --- | --- |
| Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput |
| Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations |
| Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements |
| Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs |
| Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads |
| Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request |
| Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving |
| Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks |
| Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) → vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 |
| Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning |

Key observations

  • Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound.
  • Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers.
  • Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput.

Next steps

  1. Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads.
  2. Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology.

| Model | Optimization | Why it was added | Feature / hardware fit |
| --- | --- | --- | --- |
| Qwen3Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments |
| Qwen3Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP×EP + DeepEP doesn't repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all |
| Qwen3Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding |
| Qwen3Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths |
| Qwen3Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks |
| Qwen3Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba's published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper |
| Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128K) on dense Qwen3 checkpoints |
| Qwen3MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP×EP launches with DeepEP / allgather-reducescatter all-to-all |
| Qwen3MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters |
| Qwen3MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64-128 experts |
| Qwen3MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB "prequant" checkpoints alongside dense models |

Highlights & context

  • The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active.
  • SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner.
  • Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba's long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set.

Next steps

  1. Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput.
  2. If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect.

Evidence: when the team first tuned Qwen2's 57B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11058 tok/s) to 12.47 req/s (13089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics for hand-picked block and warp sizes for that GPU. The newer GB200 FP8 tables you're looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance; that's why they use more aggressive BLOCK_SIZE_N and higher num_stages.


Runtime Optimizations

| Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes |
| --- | --- | --- | --- |
| Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. |
| Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. |
| Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. |
| Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. |
| Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. |
| Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. |
| Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. |
| Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. |
| Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. |
| Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. |
| Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. |
| Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. |
| Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. |
| Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. |
| Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. |
| Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. |
| Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. |
| Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. |
| Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. |
| Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. |
| Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. |

Kernel Config Tuning (Fused MoE / FP8)

| Commits (Date) | Model / HW Target | Optimization | Perf Metrics |
| --- | --- | --- | --- |
| 4d0f2661 (2025-10-20) | Qwen3-30B A3/A3B on H100 (FP8 & BF16) | Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes. | Not reported. |
| f96bc364 (2025-10-15) | Qwen3-Next FP8 on H100 TP=2 | Introduced TP2-specific FP8 fused_moe config. | Not reported. |
| 238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07) | Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100 | New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage. | Not reported. |
| 8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28) | Qwen3-Coder-480B-A35B on NVIDIA H20-3e | Added FP8 fused_moe configs for the large coder variant. | Not reported. |
| 2d40665 (2025-06-11) | Qwen3-30B A3B on NVIDIA B200 | Introduced B200-specific FP8 fused_moe config. | Not reported. |
| 22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08) | Qwen3-235B A22B on NVIDIA H20-3e & A100 | Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100). | Not reported. |
| 8fc88d63 (2025-04-28) | Qwen3 MoE (H100/H200/H20 targets) | Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README. | Not reported. |
| dcbac4cb (2025-04-28) | Qwen3 dense FP8 | Adjusted linear layers so FP8 compatibility works with fused kernels. | Not reported. |
| 2007d4d5 & f5a3c655 (2025-05-01) | Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X | Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]). | Not reported. |
| bd439735 (2024-06-14) | Qwen2-57B-A14B (TP2/TP4) | Tuned fused_moe configs for A100/H100; benchmarks show requests/s rising 10.53→12.47 @TP2 and 17.77→20.20 @TP4 (roughly +18% and +14%), with comparable tokens/s gains. | Throughput gains published in commit message. |

Next steps (optional):

  1. Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren't reported.
  2. For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.

Model optimization methods:

  • Reduce copies between host and device (see the sketch after this list)
  • Tune fused_moe configs so the Triton kernels are more efficient for a given E/N/device/quantization combination
  • Increase overlap between CPU and GPU work
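A minimal PyTorch sketch of the first and third patterns (pinned staging buffers plus non_blocking copies so transfers overlap with compute); it illustrates the general technique, not the exact code in the commits above.

```python
import torch


def overlapped_pipeline(batches: list[torch.Tensor]) -> list[torch.Tensor]:
    """Double-buffered host-to-device pipeline: batch i+1 is copied on a side
    stream (pinned buffer + non_blocking=True) while batch i is 'computed'
    on the default stream, so transfers overlap with compute."""
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    staging = []   # keep pinned buffers alive until their copies finish
    outputs = []

    def async_copy(t: torch.Tensor):
        pinned = t.pin_memory()                       # page-locked host buffer
        staging.append(pinned)
        with torch.cuda.stream(copy_stream):
            gpu = pinned.to(device, non_blocking=True)
            done = torch.cuda.Event()
            done.record(copy_stream)
        return gpu, done

    current = async_copy(batches[0])
    for i in range(len(batches)):
        nxt = async_copy(batches[i + 1]) if i + 1 < len(batches) else None
        gpu, done = current
        torch.cuda.current_stream().wait_event(done)  # wait only for this copy
        outputs.append(gpu * 2)                       # placeholder "compute"
        current = nxt
    torch.cuda.synchronize()
    return outputs


if __name__ == "__main__":
    if torch.cuda.is_available():
        print(len(overlapped_pipeline([torch.randn(1024, 1024) for _ in range(4)])))
```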

notes

Mingxing Zhang:

Different stages of an agent can have different business requirements; different requirements call for different parallelism.

value = function / cost. GPU utilization is already at 70-80%, so squeezing more out of a single scenario is hard; the gains come across multiple scenarios, letting different functions run at different cost so that value is maximized.

When Mooncake is adapted to vLLM and SGLang, the engineering approaches differ and need separate treatment; each inference framework requires its own adaptation and optimization.


Model Optimization Summary

Kernel

fused_moe/configs/xxx: per expert-count / N / device / quantization config files that guide Triton kernel generation, keeping SM utilization high while ensuring shared memory and registers do not overflow.
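For intuition, the rough shape of one such tuning entry written as a Python dict; the keys follow the BLOCK_SIZE_N / num_stages style fields mentioned above, but the exact schema and values here are illustrative rather than copied from a shipped config.

```python
# Illustrative only: one entry of a fused-MoE tuning table, keyed by the
# token-count bucket (M). Real vLLM configs live in JSON files named like
# E=<experts>,N=<dim>,device_name=<GPU>[,dtype=...].json.
EXAMPLE_FUSED_MOE_CONFIG = {
    "1024": {                 # M bucket: tokens routed in this launch
        "BLOCK_SIZE_M": 64,   # tile sizes for the Triton GEMM
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8,    # L2-friendly tile grouping
        "num_warps": 8,       # occupancy vs. register-pressure trade-off
        "num_stages": 4,      # software-pipelining depth (shared-memory cost)
    },
}
```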

GDNAttention for Qwen3-Next

Data Movement


Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint's architecture while staying compatible with speculative decoding and parallelism features.

| Optimization | Match-to-model feature | Performance effect | Evidence/data |
| --- | --- | --- | --- |
| Gated DeltaNet linear-attention backend with fused gating kernels | Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs | Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance | Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo |
| Shared Fused MoE with expert parallel load balancing | Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps | Maintains Qwen3 Next's shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost | Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present |
| NextN multi-token predictor (MTP) path | Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner | Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled | MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published |
| Mamba-style state management for speculative decode | Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default "no speculative" guard specifically for Qwen3 Next | Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding (otherwise disallowed for generic Mamba models) | State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied |

No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.

Next steps (optional): 1) run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers; 2) capture profiling traces to confirm the GDN layers hit the fused kernels.

After introducing PD disaggregation, the scheduler matters even more under EP×DP; online adjustment of the parallelism mode may be less effective than scheduler adjustments that keep the request pattern each rank receives relatively stable.