# Related Work Matrix — Workload Pattern → Serving Engine Config

| Paper | Venue / Year | Primary focus | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already stated? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Multi-framework LLM serving config optimization | production serving workloads; model type; hardware platform; SLA / TTFT / TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | latency / throughput / goodput-style objectives | performance modeling + fast config search | Most direct hit | Shows the optimal engine config is clearly workload-dependent, and the config space goes well beyond the scheduler | Yes, and most directly |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered storage configuration for the KV cache | real-world traces; KV block access pattern; reuse; different optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | latency / throughput / cost Pareto frontier | simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows memory-side knobs must be workload-aware and are inherently multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | fluctuating workload; instance-availability changes; preemption traces | distributed parallelization config; migration strategy | latency / throughput / monetary cost | system design + dynamic reconfiguration | Partial hit | Shows config = f(workload, resource state), not a function of the workload alone | Yes, but focused on deployment / parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | bursty workload; real-time memory pressure; load surges | layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | runtime adaptive serving | Partial hit | Shows a static optimum config does not fit bursty workloads; online adaptive configuration matters | Yes, but focused on runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | heterogeneous / unpredictable requests; differing priorities / SLOs | request placement, migration, instance-level scheduling | tail latency, priority acceleration, cost saving | dynamic rescheduling + live migration | Adjacent, not config tuning | Suggests that under highly heterogeneous workloads, instance-level isolation/migration may matter more than local knob tuning | States a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | request rate; phase contention; SLO pressure | prefill/decode partition ratio; resource controller; unified storage | latency, SLO attainment | system design + dynamic partitioning | Partial hit | Shows the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Characterization of real production workloads | per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | does not optimize configs directly | benchmark realism; avoid under-provisioning | characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config rule can be distorted | Yes, the key insight on workload realism |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real serving-trace dataset | burstiness; conversation patterns; response lengths; system failures | does not optimize configs directly | evaluation realism; stress realistic serving behavior | trace dataset / characterization | Foundational support, not a direct hit | Shows burstiness, conversation structure, and joint output-length statistics are important workload features | Yes, states the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving-framework comparison | concurrency level; interactive vs. batch use case; model size | framework choice (vLLM vs. TGI) | throughput, tail latency, memory, scalability | empirical comparison | Coarse-grained hit | Shows that engine choice is itself a coarse-grained config decision | Partially stated |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipelines | reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | stage-specific batching / HW-SW design choices | end-to-end latency; pipeline optimization | simulator / design-space exploration | Relevant to the future scope | Reminder that the workload→config problem will extend to multi-stage pipelines, beyond prefill/decode | Partially stated, from a multi-stage perspective |

I suggest compressing these into four categories in the paper.

A. Workload characterization / benchmark realism — these papers do not tune configs directly, but they determine whether your argument holds up.

| Paper | How to cite it |
| --- | --- |
| ServeGen | To show that real workloads are complex and synthetic traces easily mislead configuration conclusions |
| BurstGPT | To show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored |

B. Workload-aware scheduling / routing / partitioning — these do not tune engine knobs, but they provide the core structure of workload patterns.

| Paper | Known core insight |
| --- | --- |
| Llumnix | heterogeneous workloads need cross-instance rescheduling / isolation |
| EWSJF | mixed workloads should first be split into groups / regimes, then optimized |
| CascadeInfer | length heterogeneity is the core bottleneck |
| Sarathi-Serve | prefill/decode interference is the first-order bottleneck |
| semi-PD | the phase split ratio should change with workload / SLO |

C. Workload-aware configuration / adaptive resource shaping — these are the closest to your work.

| Paper | Subspace covered |
| --- | --- |
| AIConfigurator | general serving config search |
| Kareto | KV / storage configuration |
| SpotServe | distributed parallelization configuration |
| MorphServe | online adaptive resource / precision configuration |

D. Framework / pipeline-level coarse configuration — these tell you where the boundary of the problem lies.

| Paper | Boundary implication |
| --- | --- |
| Comparative vLLM vs TGI | engine choice is itself a coarse-grained config decision |
| HERMES | multi-stage pipelines make the workload→config problem more complex |

If you want to highlight the gap more explicitly in the paper, you can use this gap matrix directly:

| Dimension | Coverage by existing work | Your opportunity |
| --- | --- | --- |
| Realistic workload characterization | Strong | No need to build another dataset, but absorb their feature definitions |
| Workload-aware scheduling | Very strong | Avoid overlapping with them head-on |
| Single-subspace config adaptation (KV / parallelism / runtime precision) | Moderate | You can unify the insights scattered across these subspaces |
| General workload→engine-knob mapping | Weak | This is your main opportunity |
| Config optimization under a unified objective of length-dependent TTFT + TPOT + goodput | Very weak | This is your strongest position |

Distilled into a one-sentence related-work positioning you can use when writing the paper:

> Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
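To make that optimization target concrete, here is a minimal sketch (Python, with hypothetical names and purely illustrative SLO parameters, not any system's actual metrics API) of SLO-constrained goodput under length-dependent TTFT and a flat TPOT budget: a request's tokens count toward goodput only if its TTFT stays under a budget that grows with prompt length and its TPOT stays under the per-token budget.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    input_len: int    # prompt tokens
    output_len: int   # generated tokens
    ttft_ms: float    # time to first token
    tpot_ms: float    # mean time per output token

def ttft_slo_ms(input_len: int, base_ms: float = 200.0, per_token_ms: float = 0.5) -> float:
    # Length-dependent TTFT budget: longer prompts get a larger prefill allowance.
    # base_ms / per_token_ms are illustrative placeholders, not calibrated values.
    return base_ms + per_token_ms * input_len

def meets_slo(r: RequestMetrics, tpot_slo_ms: float = 50.0) -> bool:
    return r.ttft_ms <= ttft_slo_ms(r.input_len) and r.tpot_ms <= tpot_slo_ms

def goodput(requests: list[RequestMetrics], window_s: float) -> float:
    # SLO-constrained goodput: only tokens from SLO-compliant requests count.
    good_tokens = sum(r.output_len for r in requests if meets_slo(r))
    return good_tokens / window_s

# Example: two requests over a 10-second window; only the first meets its SLOs.
reqs = [
    RequestMetrics(input_len=1024, output_len=256, ttft_ms=500.0, tpot_ms=40.0),
    RequestMetrics(input_len=8192, output_len=512, ttft_ms=9000.0, tpot_ms=80.0),
]
print(goodput(reqs, window_s=10.0))  # -> 25.6 good tokens/s
```

Whether the TTFT budget should scale linearly with prompt length is itself a modeling choice; the point is only that this objective couples workload features (length distributions, arrival rate) with the knobs being tuned.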
This positioning statement basically captures how your work relates to the ones above.

I suggest your next step is a "feature × knob" matrix, i.e.:

- Rows: workload features
	- arrival burstiness
	- input length mean/variance/tail
	- output length mean/variance/tail
	- mixed-workload ratio
	- prefix/KV locality
	- SLO strictness
- Columns: config knobs
	- max_num_batched_tokens
	- scheduler mode
	- prefill/decode split
	- KV memory fraction
	- block size
	- parallelism
	- disaggregation mode
	- cache/tier policy

Then mark which feature→knob edges each paper actually covers. This will serve the project more directly than the paper matrix above; the next step is to compile that feature × knob insight matrix.

[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)

[Glia: A Human-Inspired AI for Automated Systems Design and Optimization](https://arxiv.org/pdf/2510.27176)

[VIDUR: A LARGE-SCALE SIMULATION FRAMEWORK FOR LLM INFERENCE](https://arxiv.org/pdf/2405.05465)

[DynamoLLM](https://arxiv.org/pdf/2408.00741): profile (across TP settings and workloads) + search (target: minimize energy while meeting SLOs) + dynamic adjustment of the parallelism configuration

[NanoFlow](https://arxiv.org/pdf/2408.12757): theoretical model + automated pipeline search (splitting sequences / operations; target: maximize compute/memory/network utilization)

https://github.com/llm-d/llm-d

AIBrix experimentally introduces heterogeneous-GPU inference: requests come in different sizes, and different GPUs have sweet spots for different request sizes. It combines offline profiling with online scheduling to minimize cost when serving on heterogeneous hardware. The familiar problem remains: the gap between offline profiles and actual serving behavior.
https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst

agent related:

- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288

SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323
https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning

> Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
>
> SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.
![[projects/auto-tuner/related-works.figs/260410-105227.png]]

SLO-Aware Scheduling for Large Language Model Inferences https://arxiv.org/pdf/2504.14966
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf

FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving https://arxiv.org/abs/2602.22593

The "unified paradigm" you are after has in fact been described very explicitly in several top-venue papers: **define the search space / actions → run trials → measure the objective → use an algorithm to pick the next action**.

- **ATC'18 (Cao et al.)** spells out the mechanics of black-box auto-tuning in a very standardized way: _"iteratively try different configurations, measure the objective function, and pick the next batch of configurations based on what has been learned."_ https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
- **OSDI'23 (Hydro)** likewise gives a workflow definition for hyperparameter/configuration tuning: the user specifies a search space, an algorithm generates trials, and the system orchestrates their execution until the best configuration is found. https://www.usenix.org/system/files/osdi23-hu.pdf
- **ATC'18 (Metis)** positions itself as a "black-box optimization service," tuning real production systems with tail latency as the primary evaluation metric. https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
- **OSDI'18 (µTune)** belongs to the "online adaptive" school, yet it is still the same closed loop: monitor/estimate the load → use a model to predict tail latency under each configuration → switch to the predicted-best configuration. https://www.usenix.org/system/files/osdi18-sriraman.pdf
- **SOSP'21 (POP)** represents the "mathematical optimization / solver" school; it explicitly frames system resource allocation as a problem "that can be written as a mathematical optimization" and discusses the trade-off between solver speed and SLAs. https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf

**Conclusion**: whether it is black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems optimization work essentially follows the same meta-structure:

> **Under a budget and constraints, iterate a closed loop around some "executable trial": generate actions and update the policy.**

---

# related works

[MorphServe](https://arxiv.org/pdf/2506.02006v1)

[Autocomp](https://arxiv.org/pdf/2505.18574v3)

# Questions

1. [Mooncake interview](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK), around 10:00 — business scenarios have become extremely diverse and call for different configurations; with PD disaggregation, how many P versus D instances, and which parallelism mode to run inside the P and D nodes, all need tuning. Does Qwen in production have a clear need for different configurations? Or is the current practice to benchmark manually, pick one overall reasonably good configuration, and fix it per model, with no online workload awareness or dynamic adjustment? Alibaba faces "heterogeneous hardware × many models × diverse workloads" online — does every dimension require a manually fine-tuned serving configuration, or is there a better workflow?
2. Has the Qwen series gone live with EP? If so, why not adopt DeepSeek's DBO (dual batch overlap) scheme? My understanding is that under EP, DBO can always overlap computation and communication to some degree and thus improve performance.
3. Qwen-Next shows optimizations targeting linear attention (GDNAttention) and EP optimizations for very large sparse expert sets. Can this be regarded as a general trend?
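Before the backup material, here is a minimal closed-loop sketch of the meta-structure summarized in the unified-paradigm notes above (search space → trial → measurement → action selection). The knob names, the random-search policy, and the synthetic objective are illustrative placeholders, not any specific system's API; in a real tuner, `run_trial` would replay a workload under the configuration and report the measured objective.

```python
import random

# Illustrative search space: a few engine-level knobs and candidate values.
SEARCH_SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192, 16384],
    "kv_memory_fraction": [0.7, 0.8, 0.9],
    "scheduler_mode": ["fcfs", "chunked_prefill"],
}

def sample_config(space):
    """Action generation: plain random search here; could be BO, RL, or a solver."""
    return {knob: random.choice(values) for knob, values in space.items()}

def run_trial(config) -> float:
    """Executable trial: in a real tuner this replays a workload under `config`
    and measures the objective (e.g. SLO-constrained goodput). The score below
    is synthetic, purely so the loop runs end to end."""
    score = -abs(config["max_num_batched_tokens"] - 8192) / 1024
    score += 5 * config["kv_memory_fraction"]
    score += 1.0 if config["scheduler_mode"] == "chunked_prefill" else 0.0
    return score

def tune(space, budget: int):
    """The shared meta-structure: under a trial budget, iterate
    propose -> run -> measure -> keep the best (and update the policy)."""
    best_cfg, best_obj = None, float("-inf")
    for _ in range(budget):
        cfg = sample_config(space)
        obj = run_trial(cfg)          # measure the objective for this action
        if obj > best_obj:
            best_cfg, best_obj = cfg, obj
        # A learning-based tuner would also update its surrogate/policy here.
    return best_cfg, best_obj

print(tune(SEARCH_SPACE, budget=20))
```

µTune-style online adaptation fits the same skeleton if `run_trial` is replaced by a latency model evaluated against the currently observed load, and `tune` is re-run whenever the load estimate changes.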
---

# Backup

> Note: "automatic optimization" here means system-level mechanisms — **automated scheduling / parallelism / batching / energy-cost management / KV-cache management** — rather than pure model changes or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers at face value).

### Quick comparison table

| Work (year / source) | One-sentence core idea | Key automation mechanism | Reported performance claim (per authors) | Difference from related work | Adoption in vLLM / SGLang |
| --- | --- | --- | --- | --- | --- |
| **DynamoLLM** (2024, HPCA'25) ([arXiv][1]) | **Cluster-level** energy/cost optimization: **automatically reconfigures** the inference cluster while meeting SLOs | Load- and power-aware **dynamic cluster reconfiguration**; hierarchical control | ~52–53% energy savings, 38% operational-carbon reduction, 61% cost reduction, SLOs maintained | Optimizes the **service/cluster-level** operating point, not GPU kernels or batching details | Deployed **alongside** vLLM/SGLang as an external orchestration layer, not embedded; usable as an upper-level resource controller |
| **NanoFlow** (2024→2025) ([arXiv][2]) | Splits a request into multiple **nano-batches** on a **single GPU** and automatically searches for an overlapped parallel pipeline | **Automatic search** over nano-batch count/size/order/resource quotas; operator co-scheduling to overlap compute, memory, and network | Up to **1.91×** throughput over several baselines (vLLM, FastGen, TRT-LLM) | Emphasizes **intra-device** parallelism and pipelined overlap; fine-grained pipelining for the decode-light / prefill-heavy structure | Currently a **standalone runtime** (open-source implementation), not merged into the vLLM/SGLang mainline; can serve as an alternative backend ([GitHub][3]) |
| **Sarathi & Sarathi-Serve** (2023–2024) ([arXiv][4]) | Chunks long **prefills** and mixes them with **decodes** into continuous hybrid batches, reducing pipeline "bubbles" | **Chunked prefill** + **continuous hybrid batching** (decode piggyback) | Up to **1.91×** end to end; up to **10×** decode throughput | First to systematize the general strategy of **mixing prefill and decode** in one batch | Both vLLM and SGLang have adopted/absorbed the idea ([VLLM Documentation][5]) |
| **DeepSpeed-FastGen** (2024) ([arXiv][6]) | **Blocked KV** + **Dynamic SplitFuse continuous batching**, balancing low latency with high throughput | Dynamic batch splitting/fusing, blocked KV reuse | Outperforms vLLM across models and hardware (authors' report) | Similar direction to vLLM's PagedAttention, different implementation | A **standalone backend**; an alternative to vLLM/SGLang |
| **POD-Attention** (ASPLOS'25) ([Microsoft][7]) | Pursues **full overlap of prefill and decode** to lower Token Break Time (TBT) | **Fully parallelized** prefill/decode scheduling with compatible kernels | Significantly reduces stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching toward **near-full overlap** | Academic prototype; the ideas transfer to the scheduler/kernel layers of vLLM/SGLang |
| **Fluid-Guided Online Scheduling (WAIT / Nested WAIT)** (2025, SSRN) ([SSRN][8]) | Abstracts LLM inference as multi-stage **online scheduling under KV-memory constraints** and derives near-optimal policies | Fluid-model-guided **online batching and memory-allotment** decisions | Better throughput/latency than vLLM/Sarathi in experiments (authors' report) | Strongly theory-driven; an online algorithm coupling **memory, batching, and latency** | Research work, not yet merged into mainstream implementations |
| **Memory-aware dynamic batching** (2025) ([arXiv][9]) | Monitors **GPU memory and SLAs** at runtime and **adapts** batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Beats fixed hyperparameters on Llama-7B + A100 in throughput/latency | A more engineering-oriented method for **online batching hyperparameter** adjustment | Compatible in spirit with vLLM/SGLang self-scheduling; no record of an upstream merge |
| **HyGen** (2025) ([arXiv][10]) | **Online/offline co-location**: two-stage scheduling squeezes out "offline" compute without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location and raises overall utilization | Targets **workload co-location** rather than single-workload inference throughput | Can be deployed alongside any backend; no vLLM/SGLang kernel changes |
| **PrefillOnly** (2025) ([arXiv][11]) | Targets "prefill-only" workloads that emit **only one output token**, with a **minimal KV** footprint and a trimmed execution path | Keeps only the **last layer's** KV; lightweight runtime path | Significant speedup and latency reduction for retrieval/classification-style workloads | **Path specialization** for a specific workload type | Can attach to mainstream engines as a specialized backend/path |
| **vLLM / PagedAttention** (2023→) ([arXiv][12]) | **Paged KV cache** + preemptive scheduling: near-zero fragmentation and easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to **24×** throughput over the HF baseline (early report) | First to standardize **memory paging** plus **continuous batching** | Has become one of the de facto mainstream open-source serving engines |
| **Throughput-Optimal Scheduling for LLM Serving** (2025) ([arXiv][13]) | Gives theoretical throughput bounds / optimal policies for **continuous batching** | Token-level queueing/matching policies | Optimality results that connect to practice | A theory-first baseline that guides engineering | Awaiting engineering adoption |
| **Learning-to-Rank Scheduler** (2024) | Predicts the **relative length ordering** of requests to approximate SJF, lowering latency and raising throughput | LTR-trained scheduler that queues/batches in predicted order | Better latency / completion time than existing baselines | Same family of **automatic queueing / batch formation** | Can plug into the admission/queueing stage of vLLM/SGLang. ([arXiv][5]) |
| **Online Scheduling with KV Constraints** (2025) | Abstracts LLM inference as online scheduling **under KV-memory constraints** and gives near-optimal policies | Fluid/queueing models + online batching and KV decisions | Beats vLLM/Sarathi baselines (authors' report) | A theoretical treatment in the same family of **automatic online scheduling** | Could act as a **policy plugin** for vLLM/SGLang (needs low-overhead telemetry). ([arXiv][8]) |
| **Fairness-Aware Batch Formation** (2025.10) | Automatically balances "compute fairness" between new and old requests against throughput under **continuous/hybrid batching** | Intra-batch quotas / reordering policies | Markedly reduces unfairness while preserving throughput | Also batch-formation automation (complementary to Sarathi / chunked prefill) | Could be implemented by modifying the vLLM/SGLang batcher. ([arXiv][10]) |
| **Drift (PD-Multiplexing)** (2025) | **Phase-decoupled multiplexing**: separates the prefill and decode phases and multiplexes them "in place," easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Improves throughput while holding SLOs across multiple workloads | Same **phase-level overlap** line of work as NanoFlow/POD | Needs coordinated kernel/scheduler changes; a candidate direction for deep engine modification. ([Han Zhao 赵涵][11]) |

[1]: https://arxiv.org/html/2408.00741v1?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."
[2]: https://arxiv.org/abs/2408.12757?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
[3]: https://github.com/efeslab/Nanoflow?utm_source=chatgpt.com "efeslab/Nanoflow: A throughput-oriented high-performance ..."
[4]: https://arxiv.org/abs/2308.16369?utm_source=chatgpt.com "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"
[5]: https://docs.vllm.ai/en/v0.4.2/models/performance.html?utm_source=chatgpt.com "Performance and Tuning - vLLM"
[6]: https://arxiv.org/pdf/2401.08671?utm_source=chatgpt.com "DeepSpeed-FastGen: High-throughput Text Generation for ..."
[7]: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/POD-Attention-ASPLOS25.pdf?utm_source=chatgpt.com "POD-Attention: Unlocking Full Prefill-Decode Overlap for ..."
[8]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5195463&utm_source=chatgpt.com "Optimizing LLM Inference: Fluid-Guided Online Scheduling ..."
[9]: https://arxiv.org/pdf/2503.05248?utm_source=chatgpt.com "Optimizing LLM Inference Throughput via Memory-aware ..."
[10]: https://arxiv.org/html/2501.14808v2?utm_source=chatgpt.com "1 Introduction"
[11]: https://arxiv.org/html/2505.07203v1?utm_source=chatgpt.com "PrefillOnly: An Inference Engine for Prefill-only Workloads ..."
[12]: https://arxiv.org/pdf/2309.06180?utm_source=chatgpt.com "Efficient Memory Management for Large Language Model ..."
[13]: https://arxiv.org/html/2504.07347v1?utm_source=chatgpt.com "Throughput-Optimal Scheduling Algorithms for LLM ..."
[14]: https://homes.cs.washington.edu/~arvind/papers/nanoflow.pdf?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
[15]: https://discuss.vllm.ai/t/does-the-vllm-v1-support-speculative-decoding-now/191?utm_source=chatgpt.com "Does the vLLM v1 support Speculative Decoding now?"
[16]: https://github.com/sgl-project/sglang/issues/2273?utm_source=chatgpt.com "[Kernel] Launch two kernels for mixed chunked prefill #2273"
[17]: https://github.com/sgl-project/sglang/issues/6553?utm_source=chatgpt.com "[PD] Support Multi-Process for TokenizerManager #6553"
[18]: https://arxiv.org/html/2312.07104v1?utm_source=chatgpt.com "Efficiently Programming Large Language Models using ..."
[19]: https://iacoma.cs.uiuc.edu/iacoma-papers/hpca25_2.pdf?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."

## Qwen optimization

Catalogued how Alibaba's Qwen-family models have been tuned across the codebase.
**Optimization Matrix** | Model scope | Optimization & commits | Why it was added | Feature / hardware fit | | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | | Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput | | Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations | | Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements | | Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs | | Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads | | Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request | | Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving | | Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks | | Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) 
→ vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 | | Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning | Key observations - Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound. - Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers. - Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput. Next steps 1. Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads. 2. Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology. | Model | Optimization | Why it was added | Feature / hardware fit | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- | | Qwen3‑Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments | | Qwen3‑Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP × EP + DeepEP doesn’t repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all | | Qwen3‑Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding | | Qwen3‑Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported 
merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths | | Qwen3‑Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks | | Qwen3‑Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba’s published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper | | Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers’ dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128 K) on dense Qwen3 checkpoints | | Qwen3‑MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP × EP launches with DeepEP / allgather-reducescatter all-to-all | | Qwen3‑MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters | | Qwen3‑MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64–128 experts | | Qwen3‑MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models | Highlights & context - The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active. - SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner. - Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba’s long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set. Next steps 1. Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput. 2. If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect. Evidence: when the team first tuned Qwen2’s 57 B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11 058 tok/s) to 12.47 req/s (13 089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics with hand‑picked block and warp sizes for that GPU. 
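For reference, the tuned fused-MoE files mentioned above map a token-batch-size bucket to a Triton tile configuration. The snippet below is a hypothetical example in that style, not a copy of any shipped file; the field names follow the usual Triton fused-MoE parameters (BLOCK_SIZE_M/N/K, GROUP_SIZE_M, num_warps, num_stages), while the concrete values are illustrative only.

```python
# Hypothetical contents of a tuned fused-MoE config, in the style of
# vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json.
# Keys are batch-size buckets (number of tokens); values are Triton launch parameters.
# All numbers here are made up for illustration, not the values shipped in the repo.
example_fused_moe_config = {
    "1":    {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4},
    "1024": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}
```

The point of keeping one file per (E, N, device, dtype) combination is that the best tile shape shifts with expert count, intermediate size, precision, and SM layout, which is why each new SKU keeps needing a new table.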
The newer GB200 FP8 tables you’re looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance – that’s why they use more aggressive BLOCK_SIZE_N and higher num_stages. --- **Runtime Optimizations** | Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes | | --------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | | Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. | | Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. | | Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. | | Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. | | Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. | | Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. | | Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. | | Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. | | Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. | | Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. | | Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. 
| | Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. | | Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. | | Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. | | Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. | | Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. | | Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. | | Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. | | Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. | | Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. | | Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. 
**Kernel Config Tuning (Fused MoE / FP8)**

|Commits (Date)|Model / HW Target|Optimization|Perf Metrics|
|---|---|---|---|
|4d0f2661 (2025-10-20)|Qwen3-30B A3/A3B on H100 (FP8 & BF16)|Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes.|Not reported.|
|f96bc364 (2025-10-15)|Qwen3-Next FP8 on H100 TP=2|Introduced TP2-specific FP8 fused_moe config.|Not reported.|
|238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07)|Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100|New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage.|Not reported.|
|8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28)|Qwen3-Coder-480B-A35B on NVIDIA H20-3e|Added FP8 fused_moe configs for large coder variant.|Not reported.|
|2d40665 (2025-06-11)|Qwen3-30B A3B on NVIDIA B200|Introduced B200-specific FP8 fused_moe config.|Not reported.|
|22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08)|Qwen3-235B A22B on NVIDIA H20-3e & A100|Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100).|Not reported.|
|8fc88d63 (2025-04-28)|Qwen3 MoE (H100/H200/H20 targets)|Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README.|Not reported.|
|dcbac4cb (2025-04-28)|Qwen3 dense FP8|Adjusted linear layers so FP8 compatibility works with fused kernels.|Not reported.|
|2007d4d5 & f5a3c655 (2025-05-01)|Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X|Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]).|Not reported.|
|bd439735 (2024-06-14)|Qwen2-57B-A14B (TP2/TP4)|Tuned fused_moe configs for A100/H100; benchmarks show +18–20% requests/s (10.53→12.47 @TP2, 17.77→20.20 @TP4) and +18–14% tokens/s.|Throughput gains published in commit message.|

Next steps (optional):

1. Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren't reported.
2. For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.

Model optimization methods:

- reduce host/device copies
- tune fused_moe configs so the Triton kernels are more efficient for a given E / N / device / quantization combination
- increase CPU/GPU overlap

---

## notes

Mingxing Zhang: different stages of an agent can have different business requirements, and different requirements call for different parallelism. value = function / cost. GPU utilization is already at 70–80%, so further gains within a single scenario are hard; the remaining headroom is across scenarios — let different functions use different cost so that overall value is maximized. When adapting Mooncake to vLLM and SGLang, the engineering approaches differ; each inference framework needs its own adaptation and optimization work.

---

## Model Optimization Summary

### Kernel

fused_moe/configs/xxx: per-combination (expert count E, N, device, quantization) config files guide how Triton generates the kernels, keeping SM utilization high while ensuring shared memory and registers do not overflow.

GDNAttention for Qwen3-Next

### Data Movement

---

Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint's architecture while staying compatible with speculative decoding and parallelism features.
|Optimization|Match-to-model feature|Performance effect|Evidence/data|
|---|---|---|---|
|Gated DeltaNet linear-attention backend with fused gating kernels|Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs|Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance|Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo|
|Shared Fused MoE with expert parallel load balancing|Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps|Maintains Qwen3 Next's shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost|Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present|
|NextN multi-token predictor (MTP) path|Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner|Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled|MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published|
|Mamba-style state management for speculative decode|Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default "no speculative" guard specifically for Qwen3 Next|Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding, otherwise disallowed for generic Mamba models|State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied|

No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.

Next steps (optional):

1. Run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers.
2. Capture profiling traces to confirm the GDN layers hit the fused kernels.

After introducing PD disaggregation, the scheduler has a larger impact under EP×DP; adjusting the parallelism mode online may matter less than adjusting the scheduler so that the pattern each rank receives stays relatively fixed.
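To make that last point concrete, here is a minimal sketch (illustrative only, not tied to any engine's API) of a router that stabilizes the pattern each rank sees by bucketing requests on prompt length and pinning each bucket to a fixed subset of ranks; the bucket boundaries and the rank assignment are placeholder choices.

```python
from collections import defaultdict
from itertools import cycle

# Placeholder length buckets (prompt tokens) and a static bucket -> ranks assignment.
# Idea: each rank keeps receiving a narrow, stable slice of the workload, so a
# per-rank static configuration can stay near-optimal even as the global mix shifts.
BUCKETS = [(0, 1024), (1024, 8192), (8192, 1 << 30)]
BUCKET_RANKS = {0: [0, 1], 1: [2, 3, 4], 2: [5]}   # hypothetical 6-rank deployment

_rank_iters = {b: cycle(ranks) for b, ranks in BUCKET_RANKS.items()}

def bucket_of(prompt_len: int) -> int:
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= prompt_len < hi:
            return i
    return len(BUCKETS) - 1

def route(prompt_len: int) -> int:
    """Pick a rank: round-robin within the bucket's fixed rank set."""
    return next(_rank_iters[bucket_of(prompt_len)])

# Example: short prompts cycle over ranks {0, 1}; very long prompts always hit rank 5.
assignments = defaultdict(list)
for plen in [128, 300, 2048, 120000, 4096, 64]:
    assignments[route(plen)].append(plen)
print(dict(assignments))
```

Whether such static bucketing actually beats online parallelism switching is exactly the open question the note raises; the sketch only illustrates the "keep per-rank patterns stable via the scheduler" alternative.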