Related Work Matrix — Workload Pattern → Serving Engine Config
| Paper | Venue / Year | Primary focus | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already stated? |
|---|---|---|---|---|---|---|---|---|---|
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Multi-framework LLM serving config optimization | production serving workloads; model type; hardware platform; SLA/TTFT/TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | latency / throughput / goodput-style objectives | performance modeling + fast config search | Most direct hit | Shows the optimal engine config is clearly workload-dependent, and that the config space extends well beyond the scheduler | Yes, and most directly |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered-storage configuration for the KV cache | real-world traces; KV block access patterns; reuse; differing optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | latency / throughput / cost Pareto frontier | simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows memory-side knobs must be workload-aware, and the problem is naturally multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | fluctuating workload; instance-availability churn; preemption traces | distributed parallelization config; migration strategy | latency / throughput / monetary cost | system design + dynamic reconfiguration | Partial hit | Shows config = f(workload, resource state), not a function of workload alone | Yes, but skewed toward deployment/parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | bursty workload; real-time memory pressure; load surges | layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | runtime adaptive serving | Partial hit | Shows a statically optimal config does not fit bursty workloads; online adaptive config matters | Yes, but skewed toward runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | heterogeneous / unpredictable requests; differing priorities/SLOs | request placement, migration, instance-level scheduling | tail latency, priority acceleration, cost savings | dynamic rescheduling + live migration | Adjacent, not config tuning | Shows that under highly heterogeneous workloads, instance-level isolation/migration can matter more than local knob tuning | Proposes a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | request rate; phase contention; SLO pressure | prefill/decode partition ratio; resource controller; unified storage | latency, SLO attainment | system design + dynamic partitioning | Partial hit | Shows the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Real-world workload characterization | per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | Does not optimize config directly | benchmark realism; avoid under-provisioning | characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config pattern may be distorted | Proposes the key insight of workload realism |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real-world serving-trace dataset | burstiness; conversation patterns; response lengths; system failures | Does not optimize config directly | evaluation realism; stress realistic serving behavior | trace dataset / characterization | Foundational support, not a direct hit | Shows burstiness, conversation structure, and joint output-length statistics are all important workload features | Yes; argues for the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving-framework comparison | concurrency level; interactive vs batch use case; model size | framework choice (vLLM vs TGI) | throughput, tail latency, memory, scalability | empirical comparison | Coarse-grained hit | Shows engine choice is itself a coarse-grained config decision | Partially |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipelines | reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | stage-specific batching / HW-SW design choices | end-to-end latency; pipeline optimization | simulator / design-space exploration | Relevant to the extended boundary | A reminder that the workload→config problem will extend to multi-stage pipelines, beyond just prefill/decode | Partially; multi-stage perspective |
I suggest compressing these into four categories in the paper:
A. Workload characterization / benchmark realism
These papers don't tune config directly, but they determine whether your argument holds up.
| Paper | How to cite it |
|---|---|
| ServeGen | Use it to show that real workloads are complex and synthetic traces easily mislead configuration conclusions |
| BurstGPT | Use it to show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored |
B. Workload-aware scheduling / routing / partitioning
These papers don't tune engine knobs, but they supply the core structure of workload patterns.
| Paper | Core known insight |
|---|---|
| Llumnix | heterogeneous workloads need cross-instance rescheduling / isolation |
| EWSJF | mixed workloads should first be split into groups/regimes, then optimized |
| CascadeInfer | length heterogeneity is the core bottleneck |
| Sarathi-Serve | prefill/decode interference is the first-order bottleneck |
| semi-PD | the phase split ratio should change with workload / SLO |
C. Workload-aware configuration / adaptive resource shaping
These are closest to your work.
| Paper | Subspace hit |
|---|---|
| AIConfigurator | 通用 serving config search |
| Kareto | KV / storage configuration |
| SpotServe | distributed parallelization configuration |
| MorphServe | online adaptive resource / precision configuration |
D. Framework / pipeline-level coarse configuration
These papers show where the research boundary lies.
| Paper | Boundary significance |
|---|---|
| Comparative vLLM vs TGI | engine choice is itself a coarse-grained config decision |
| HERMES | multi-stage pipelines make the workload→config problem even more complex |
If you want to highlight the gap more in the paper, you can use this "gap matrix" directly:
| Dimension | Coverage by existing work | Your opportunity |
|---|---|---|
| Real-world workload characterization | Strong | No need to rebuild datasets, but absorb their feature definitions |
| Workload-aware scheduling | Very strong | Avoid head-on overlap with them |
| Single-subspace config adaptation (KV / parallelism / runtime precision) | Moderate | Unify the insights across these subspaces |
| General workload→engine-knob mapping | Weak | This is your main opportunity |
| Config optimization under a unified objective of length-dependent TTFT + TPOT + goodput | Very weak | This is your strongest position |
Here is the related-work positioning distilled into one sentence.
When you write the paper, you can put it like this:
Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
This sentence largely makes your relationship to the works above clear.
As a next step, I suggest building a "feature × knob" matrix
That is:
- Rows: workload features
  - arrival burstiness
  - input length mean/variance/tail
  - output length mean/variance/tail
  - mixed-workload ratio
  - prefix/KV locality
  - SLO strictness
- Columns: config knobs
  - max_num_batched_tokens
  - scheduler mode
  - prefill/decode split
  - KV memory fraction
  - block size
  - parallelism
  - disaggregation mode
  - cache/tier policy
Then mark which feature→knob edges each paper actually covers.
This will serve the project more directly than the paper matrix.
If you want, the next step is to compile that feature × knob insight matrix.
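The proposed matrix can be sketched directly in code. The feature and knob names come from the lists above; the papers and their feature→knob edge assignments below are illustrative placeholders, not a verified reading of each paper:

```python
# Sketch of the feature × knob coverage matrix. Per-paper edge
# assignments are placeholders for illustration only.
FEATURES = [
    "arrival_burstiness", "input_len_stats", "output_len_stats",
    "mixed_workload_ratio", "prefix_kv_locality", "slo_strictness",
]
KNOBS = [
    "max_num_batched_tokens", "scheduler_mode", "prefill_decode_split",
    "kv_memory_fraction", "block_size", "parallelism",
    "disaggregation_mode", "cache_tier_policy",
]

# Which feature→knob edges each paper covers (hypothetical assignments).
PAPER_EDGES = {
    "Kareto": [("prefix_kv_locality", "cache_tier_policy")],
    "SpotServe": [("arrival_burstiness", "parallelism")],
    "semi-PD": [("slo_strictness", "prefill_decode_split")],
}

def coverage_matrix():
    """Return {(feature, knob): covered?} over the full feature × knob grid."""
    covered = {edge for edges in PAPER_EDGES.values() for edge in edges}
    return {(f, k): (f, k) in covered for f in FEATURES for k in KNOBS}

matrix = coverage_matrix()
uncovered = [edge for edge, hit in matrix.items() if not hit]
```

The uncovered edges are exactly the gap cells this section argues for.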
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Vidur: A Large-Scale Simulation Framework for LLM Inference
DynamoLLM: profile (across TP degrees and workloads) + search (target: minimize energy while meeting SLOs) + dynamic adjustment of the parallelism configuration.
NanoFlow: theoretical model + automated pipeline search (splitting sequences/operations; target: maximize compute/memory/network utilization).
https://github.com/llm-d/llm-d
AIBrix experimentally introduces heterogeneous-GPU inference: requests come in different sizes, and each GPU type has a sweet spot for a particular request-size range; it profiles offline and schedules online to minimize cost when serving on heterogeneous hardware. This carries the perennial problem: the gap between offline profiles and actual serving. https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst
agent related:
- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323 https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning
Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.
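SCOOT's stress-test selection (the longest 50% of output lengths) can be sketched as follows; the record layout and field names are assumptions for illustration, not taken from the paper's artifacts:

```python
def stress_subset(requests):
    """Keep the half of the trace with the longest output lengths,
    mirroring SCOOT's stress-test selection (field names are assumed)."""
    ranked = sorted(requests, key=lambda r: r["output_len"], reverse=True)
    return ranked[: len(ranked) // 2]

# Toy trace standing in for the Ant Group service traces (SQL/BOT/CLS/REC).
trace = [
    {"input_len": 120, "output_len": 30},
    {"input_len": 80, "output_len": 500},
    {"input_len": 200, "output_len": 45},
    {"input_len": 60, "output_len": 900},
]
subset = stress_subset(trace)  # keeps the requests with output_len 900 and 500
```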
SLO-Aware Scheduling for Large Language Model Inferences https://arxiv.org/pdf/2504.14966
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving https://arxiv.org/abs/2602.22593
The "unified paradigm" you are after has in fact been described very plainly in several top-venue papers: define the search space/actions → run trials → measure the objective → use an algorithm to choose the next action.
- **ATC'18 (Cao et al.)** describes the black-box auto-tuning mechanism in highly standardized terms: iteratively try different configurations, measure the objective function, and choose the next batch of configurations based on what has been learned. https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
- **OSDI'23 (Hydro)** gives the same "workflow definition" for hyperparameter/config tuning: the user specifies a search space; an algorithm generates trials; the system orchestrates execution until the best configuration is found. https://www.usenix.org/system/files/osdi23-hu.pdf
- **ATC'18 (Metis)** positions itself as a "black-box optimization service," tuning real production systems with tail latency as the main evaluation metric. https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
- **OSDI'18 (µTune)** belongs to the "online adaptive" school, but it is still the same closed loop: monitor/estimate load → predict tail latency under each configuration with a model → switch to the predicted-best configuration. https://www.usenix.org/system/files/osdi18-sriraman.pdf
- **SOSP'21 (POP)** is the "mathematical optimization / solver" school; it explicitly frames system resource allocation as a problem "expressible as mathematical optimization" and discusses the trade-off between solve speed and SLAs. https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf
Conclusion: whether black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems-optimization work lands on the same meta-structure:
Under a budget and constraints, iterate a closed loop around some "executable trial," generating actions and updating the policy.
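That meta-structure fits in a few lines of Python. The knob names, the search strategy (plain random search), and the objective function below are all made-up placeholders standing in for a real engine trial:

```python
import random

# Generic auto-tuning closed loop: search space -> trial -> measure -> select next.
# Knob names and the objective are illustrative, not any real engine's API.
SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192],
    "kv_memory_fraction": [0.7, 0.8, 0.9],
}

def run_trial(config):
    # Stand-in for launching the engine and replaying a workload;
    # returns a goodput-style objective (higher is better).
    return (-abs(config["max_num_batched_tokens"] - 4096) / 4096
            - abs(config["kv_memory_fraction"] - 0.9))

def tune(budget, seed=0):
    rng = random.Random(seed)
    best_cfg, best_obj = None, float("-inf")
    for _ in range(budget):                                  # budget-bounded loop
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}   # generate an action
        obj = run_trial(cfg)                                 # measure the objective
        if obj > best_obj:
            best_cfg, best_obj = cfg, obj                    # update the incumbent
    return best_cfg
```

Swapping the random `rng.choice` for Bayesian optimization, a fluid-model controller, or an ILP solver reproduces the different schools above without changing the loop's shape.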
related works
DynamoLLM and NanoFlow (summarized above); MorphServe; Autocomp
Questions
- At around 10:00 in the Mooncake interview: business scenarios have become extremely diverse and need different configurations. With PD disaggregation, how many P and how many D nodes, and which parallelism mode runs inside each P and D node, all need tuning. Does Qwen in production have a clear need for different configurations? Or is the current practice still manual testing to find one overall reasonably good configuration that is then fixed per model, with no online load awareness or dynamic adjustment? I believe Alibaba faces "heterogeneous hardware × many models × diverse workloads" in production; does every dimension need a manually fine-tuned inference config, or is there a better workflow?
- Has the Qwen series deployed EP in production? If so, why not use DeepSeek's DBO (dual batch overlap) scheme? My understanding is that in EP scenarios DBO can always overlap compute and communication to some degree and thus improve performance.
- Qwen-Next shows optimizations for linear attention (GDNAttention) and EP optimizations for very large sparse expert sets; can this be regarded as a general trend?
Backup
Note: "automatic optimization" here focuses on automated system-level mechanisms (scheduling / parallelism / batching / energy-and-cost / KV-cache management), not pure model modification or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers at face value).
Quick comparison table
| Work (year / source) | One-sentence core | Key automation mechanism | Reported performance claim (per authors) | Differentiator vs. similar work | Adoption in vLLM / SGLang |
|---|---|---|---|---|---|
| DynamoLLM (2024, HPCA'25) (arXiv) | Cluster-level energy/cost optimization: automatically reconfigures the inference cluster while meeting SLOs | Load- and power-aware dynamic cluster reconfiguration with hierarchical control | ~52–53% energy savings, 38% operational carbon reduction, 61% cost reduction, SLOs maintained | Optimizes the service/cluster-level operating point, not GPU kernel or batching details | Deployed alongside vLLM/SGLang as an external orchestration layer, not embedded; usable as an upper-level resource controller |
| NanoFlow (2024→2025) (arXiv) | Splits each request into multiple nano-batches on a single GPU and auto-searches a parallel, overlapped pipeline | Auto-search over nano-batch count/size/order/resource quotas; operator co-scheduling to overlap compute/memory/communication | Up to 1.91× throughput over multiple baselines (vLLM, FastGen, TRT-LLM) | Emphasizes intra-device parallelism and pipeline overlap; fine-grained pipelining for decode-light / prefill-heavy structures | Currently a standalone runtime (open-source implementation), not merged into the vLLM/SGLang mainline; usable as an alternative backend (GitHub) |
| Sarathi & Sarathi-Serve (2023–2024) (arXiv) | Chunks long prefills and mixes them with decodes into continuous hybrid batches, reducing pipeline "bubbles" | Chunked prefill + continuous hybrid batching (decode piggyback) | Up to 1.91× end to end; up to 10× decode throughput | First to systematize the general "mix prefill and decode in one batch" strategy | vLLM and SGLang have both adopted/absorbed the idea (see "adoption" notes) (vLLM Documentation) |
| DeepSpeed-FastGen (2024) (arXiv) | Blocked KV + Dynamic SplitFuse continuous batching, balancing low latency with high throughput | Dynamic batch splitting/fusing, blocked KV reuse | Outperforms vLLM across models/hardware (authors' report) | Close to vLLM's PagedAttention route but a different implementation | Standalone backend; an alternative to vLLM/SGLang |
| POD-Attention (ASPLOS'25) (Microsoft) | Pursues full overlap of prefill and decode to reduce Token Break Time (TBT) | Fully parallelized prefill/decode scheduling with compatible kernels | Significantly reduces stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching toward near-complete overlap | Academic prototype; the ideas transfer to the vLLM/SGLang scheduling/kernel layers |
| Fluid-Guided Online Scheduling (WAIT/Nested WAIT) (2025, SSRN) (SSRN) | Abstracts LLM inference as multi-stage online scheduling under KV-memory constraints and gives near-optimal policies | Fluid-model-based online batching and memory-allocation decisions | Experimentally beats vLLM/Sarathi on throughput/latency (authors' report) | Strongly theory-driven; an online algorithm coupling memory, batching, and latency | Research-stage; not merged into mainstream implementations |
| Memory-aware dynamic batching (2025) (arXiv) | Monitors GPU memory and SLAs at runtime, adapting batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Beats fixed hyperparameters on throughput/latency for Llama-7B on A100 | A more engineering-oriented online batch-hyperparameter tuning method | Compatible with vLLM/SGLang self-scheduling; no record of a mainline merge |
| HyGen (2025) (arXiv) | Online/offline co-location: two-stage scheduling squeezes out "offline" capacity without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location while raising overall utilization | Focuses on workload co-location rather than single-engine inference throughput | Deployable alongside any backend; no vLLM/SGLang kernel changes |
| PrefillOnly (2025) (arXiv) | Minimal KV and execution-path optimization for "prefill-only" workloads that emit just one token | Keeps only the last layer's KV; lightweight execution path | Significant speedups and latency reductions for retrieval/classification workloads | Path pruning for a specific workload class | Usable as a specialized backend/path attached to mainstream engines |
| vLLM / PagedAttention (2023→) (arXiv) | Paged KV cache + preemptive scheduling: near-zero fragmentation and easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to 24× throughput over the HF baseline (early report) | First to standardize memory paging and continuous batching | Has become one of the de facto mainstream open-source serving engines |
| Throughput-Optimal Scheduling for LLM Serving (2025) (arXiv) | Gives theoretical throughput upper bounds / optimal policies for continuous batching | Token-level queueing/matching policies | Theoretical optimality results that connect to practice | Theory-leaning baseline that guides engineering | Awaiting engineering adoption |
| Learning-to-Rank Scheduler (2024) | Predicts the relative length ordering of requests to approximate SJF, reducing latency and raising throughput | LTR-trained scheduler that queues/batches in predicted order | Better latency/completion time than existing baselines | Same family of automated queueing/batch formation | Can plug into the vLLM/SGLang admission/queueing stage. (arXiv) |
| Online Scheduling with KV Constraints (2025) | Abstracts LLM inference as online scheduling under KV-memory constraints and gives near-optimal policies | Fluid/queueing models + online batching/KV decisions | Beats vLLM/Sarathi baselines (authors' report) | Same family of theory-grounded automated online scheduling | Could serve as a policy plugin for vLLM/SGLang (needs low-overhead telemetry). (arXiv) |
| Fairness-Aware Batch Formation (2025.10) | Automatically balances "compute fairness" between new and old requests against throughput under continuous/hybrid batching | In-batch quota/reordering policies | Significantly improves fairness while preserving throughput | Same family of automated batch formation (complementary to Sarathi/chunked prefill) | Could rework the vLLM/SGLang batcher. (arXiv) |
| Drift (PD-Multiplexing) (2025) | Phase-decoupled multiplexing: separates the prefill and decode phases and multiplexes them "in place," easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Improves throughput while holding SLOs across workloads | Same phase-level overlap lineage as NanoFlow/POD | Requires kernel/scheduler co-design; suited as a deep engine-modification direction. (Han Zhao) |
Qwen optimization
Catalogued how Alibaba’s Qwen-family models have been tuned across the codebase.
Optimization Matrix
| Model scope | Optimization & commits | Why it was added | Feature / hardware fit |
|---|---|---|---|
| Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput |
| Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations |
| Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements |
| Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs |
| Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads |
| Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request |
| Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving |
| Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks |
| Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) → vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 |
| Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning |
Key observations
- Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound.
- Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers.
- Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput.
Next steps
- Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads.
- Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology.
| Model | Optimization | Why it was added | Feature / hardware fit |
|---|---|---|---|
| Qwen3‑Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments |
| Qwen3‑Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP × EP + DeepEP doesn’t repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all |
| Qwen3‑Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding |
| Qwen3‑Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths |
| Qwen3‑Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks |
| Qwen3‑Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba’s published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper |
| Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers’ dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128 K) on dense Qwen3 checkpoints |
| Qwen3‑MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP × EP launches with DeepEP / allgather-reducescatter all-to-all |
| Qwen3‑MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters |
| Qwen3‑MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64–128 experts |
| Qwen3‑MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models |
Highlights & context
- The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active.
- SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner.
- Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba’s long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set.
Next steps
- Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput.
- If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect.
Evidence: when the team first tuned Qwen2’s 57 B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11 058 tok/s) to 12.47 req/s (13 089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics with hand‑picked block and warp sizes for that GPU. The newer GB200 FP8 tables you’re looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance – that’s why they use more aggressive BLOCK_SIZE_N and higher num_stages.
Runtime Optimizations
| Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes |
|---|---|---|---|
| Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. |
| Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. |
| Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. |
| Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. |
| Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. |
| Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. |
| Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. |
| Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. |
| Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. |
| Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. |
| Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. |
| Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. |
| Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. |
| Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. |
| Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. |
| Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. |
| Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. |
| Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. |
| Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. |
| Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. |
| Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. |
Kernel Config Tuning (Fused MoE / FP8)
| Commits (Date) | Model / HW Target | Optimization | Perf Metrics |
|---|---|---|---|
| 4d0f2661 (2025-10-20) | Qwen3-30B A3/A3B on H100 (FP8 & BF16) | Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes. | Not reported. |
| f96bc364 (2025-10-15) | Qwen3-Next FP8 on H100 TP=2 | Introduced TP2-specific FP8 fused_moe config. | Not reported. |
| 238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07) | Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100 | New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage. | Not reported. |
| 8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28) | Qwen3-Coder-480B-A35B on NVIDIA H20-3e | Added FP8 fused_moe configs for large coder variant. | Not reported. |
| 2d40665 (2025-06-11) | Qwen3-30B A3B on NVIDIA B200 | Introduced B200-specific FP8 fused_moe config. | Not reported. |
| 22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08) | Qwen3-235B A22B on NVIDIA H20-3e & A100 | Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100). | Not reported. |
| 8fc88d63 (2025-04-28) | Qwen3 MoE (H100/H200/H20 targets) | Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README. | Not reported. |
| dcbac4cb (2025-04-28) | Qwen3 dense FP8 | Adjusted linear layers so FP8 compatibility works with fused kernels. | Not reported. |
| 2007d4d5 & f5a3c655 (2025-05-01) | Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X | Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]). | Not reported. |
| bd439735 (2024-06-14) | Qwen2-57B-A14B (TP2/TP4) | Tuned fused_moe configs for A100/H100; benchmarks show +18–20% requests/s (10.53→12.47 @TP2, 17.77→20.20 @TP4) and +18–14% tokens/s. | Throughput gains published in commit message. |
Next steps (optional):
- Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren’t reported.
- For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.
Model optimization methods:
- Reduce copies between host and device
- Tune fused_moe configs so the Triton kernels are more efficient for a given E/N/device/quantization combination
- Increase overlap between CPU and GPU
notes
Mingxing Zhang:
Different phases of an agent may have different business requirements, and different requirements call for different parallelism.
value = function / cost. GPU utilization is already at 70–80%, so further gains within a single scenario are hard; instead, improve across multiple scenarios by giving different functions different cost, maximizing value.
When adapting Mooncake to vLLM and SGLang, the engineering approaches differed; each inference framework required its own optimization and adaptation work.
Model Optimization Summary
Kernel
fused_moe/configs/xxx: per-(expert count E, N, device, quantization) config files that guide Triton kernel generation, keeping SM utilization high while ensuring shared memory and registers do not overflow.
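The shape of such a tuning file, sketched as a Python dict: the key names mirror the shipped JSON files (e.g. E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json), but the values and the nearest-size selection rule here are illustrative assumptions, not the shipped tunings.

```python
# Illustrative shape of a fused_moe tuning table: it maps a token-batch
# size to Triton tiling parameters. Values below are made up.
FUSED_MOE_CONFIG = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8,  "num_warps": 8, "num_stages": 4},
}

def pick_config(num_tokens):
    """Pick the entry tuned for the closest batch size (selection rule assumed)."""
    sizes = sorted(int(s) for s in FUSED_MOE_CONFIG)
    best = min(sizes, key=lambda s: abs(s - num_tokens))
    return FUSED_MOE_CONFIG[str(best)]
```

The tile sizes bound shared-memory and register use per SM, which is why each device/precision combination needs its own table.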
GDNAttention for Qwen3-Next
Data Movement
Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint’s architecture while staying compatible with speculative decoding and parallelism features.
| Optimization | Match-to-model feature | Performance effect | Evidence/data |
|---|---|---|---|
| Gated DeltaNet linear-attention backend with fused gating kernels | Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs | Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance | Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo |
| Shared Fused MoE with expert parallel load balancing | Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps | Maintains Qwen3 Next’s shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost | Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present |
| NextN multi-token predictor (MTP) path | Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner | Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled | MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published |
| Mamba-style state management for speculative decode | Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default “no speculative” guard specifically for Qwen3 Next | Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding—otherwise disallowed for generic Mamba models | State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied |
No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.
Next steps (optional): 1) run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers; 2) capture profiling traces to confirm the GDN layers hit the fused kernels.
After PD disaggregation is introduced, the scheduler matters more under EP×DP; adjusting the parallelism mode online may matter less than scheduler adjustments that keep the pattern each rank receives relatively stable.