Related Work Matrix — Workload Pattern → Serving Engine Config
| Paper | Venue / Year | Primary focus | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already proposed? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Multi-framework LLM serving config optimization | Production serving workloads; model type; hardware platform; SLA / TTFT / TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | Latency / throughput / goodput-style objectives | Performance modeling + fast config search | Most direct hit | Shows the optimal engine config is clearly workload-dependent, and the config space goes beyond the scheduler | Yes, and most directly |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered KV-cache storage configuration | Real-world traces; KV block access patterns; reuse; different optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | Latency / throughput / cost Pareto frontier | Simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows memory-side knobs must be workload-aware and the problem is inherently multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | Fluctuating workload; instance availability changes; preemption traces | Distributed parallelization config; migration strategy | Latency / throughput / monetary cost | System design + dynamic reconfiguration | Partial hit | Shows config = f(workload, resource state), not determined by the workload alone | Yes, but focused on deployment / parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | Bursty workload; real-time memory pressure; load surges | Layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | Runtime adaptive serving | Partial hit | Shows a static optimal config does not suit bursty workloads; online adaptive config matters | Yes, but focused on runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | Heterogeneous / unpredictable requests; differing priorities / SLOs | Request placement, migration, instance-level scheduling | Tail latency, priority acceleration, cost saving | Dynamic rescheduling + live migration | Adjacent, not config tuning | Shows that under highly heterogeneous workloads, instance-level isolation/migration may matter more than local knob tuning | Proposed a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | Request rate; phase contention; SLO pressure | Prefill/decode partition ratio; resource controller; unified storage | Latency, SLO attainment | System design + dynamic partitioning | Partial hit | Shows the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Real-world workload characterization | Per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | Does not optimize configs directly | Benchmark realism; avoiding under-provisioning | Characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config regularity may be distorted | Yes, the key insight on workload realism |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real-world serving trace dataset | Burstiness; conversation patterns; response lengths; system failures | Does not optimize configs directly | Evaluation realism; stressing realistic serving behavior | Trace dataset / characterization | Foundational support, not a direct hit | Shows burstiness, conversation structure, and joint output-length statistics are all important workload features | Yes, the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving framework comparison | Concurrency level; interactive vs. batch use cases; model size | Framework choice (vLLM vs. TGI) | Throughput, tail latency, memory, scalability | Empirical comparison | Coarse-grained hit | Shows engine choice is itself a coarse-grained config decision | Partially |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipelines | Reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | Stage-specific batching / HW-SW design choices | End-to-end latency; pipeline optimization | Simulator / design-space exploration | Relevant to extending the boundary | Reminder that the workload→config problem will extend to multi-stage pipelines, beyond just prefill/decode | Partially, from a multi-stage perspective |
I suggest compressing these into 4 categories in the paper.
A. Workload characterization / benchmark realism
These papers do not tune configs directly, but they determine whether your argument holds up.
|Paper|How to cite it|
|---|---|
|ServeGen|Use it to show that real workloads are complex and that synthetic traces easily mislead configuration conclusions|
|BurstGPT|Use it to show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored|
B. Workload-aware scheduling / routing / partitioning
These do not tune engine knobs, but they give you the core structure of workload patterns.
|Paper|Core known insight|
|---|---|
|Llumnix|Heterogeneous workloads need cross-instance rescheduling / isolation|
|EWSJF|Mixed workloads should first be grouped / split into regimes, then optimized|
|CascadeInfer|Length heterogeneity is the core bottleneck|
|Sarathi-Serve|Prefill/decode interference is the first-order bottleneck|
|semi-PD|The phase split ratio should change with the workload / SLO|
C. Workload-aware configuration / adaptive resource shaping
These are closest to your work.
|Paper|Config subspace covered|
|---|---|
|AIConfigurator|General serving config search|
|Kareto|KV / storage configuration|
|SpotServe|Distributed parallelization configuration|
|MorphServe|Online adaptive resource / precision configuration|
D. Framework / pipeline-level coarse configuration
These tell you where the research boundary lies.
|Paper|Boundary implication|
|---|---|
|Comparative vLLM vs TGI|Engine choice is itself a coarse-grained config decision|
|HERMES|Multi-stage pipelines make the workload→config problem even more complex|
If you want to highlight the gap more clearly in the paper, you can use this gap matrix directly:
|Dimension|Coverage by existing work|Your opportunity|
|---|---|---|
|Realistic workload characterization|Strong|No need to build yet another dataset, but absorb their feature definitions|
|Workload-aware scheduling|Very strong|Do not overlap with them head-on|
|Single-subspace config adaptation (KV / parallelism / runtime precision)|Medium|You can unify the insights from these subspaces|
|General workload→engine-knob mapping|Weak|This is your main opportunity|
|Config optimization under a unified objective of length-dependent TTFT + TPOT + goodput|Very weak|This is your strongest position|
Distilled into a one-sentence related-work positioning.
When you write the paper later, you can phrase it like this:
Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
This sentence basically makes clear how your work relates to the papers above.
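To make the objective concrete, here is a minimal sketch of SLO-constrained goodput under length-dependent TTFT/TPOT SLOs (the field names and SLO callables are assumptions for illustration, not taken from any of the papers above):

```python
def goodput(requests, duration_s, ttft_slo, tpot_slo):
    """Completed requests per second that meet BOTH their TTFT and TPOT SLOs.

    ttft_slo / tpot_slo are callables of input/output length, so the SLO itself
    can be length-dependent (e.g. longer prompts get a larger TTFT budget).
    Each request is assumed to be a dict with input_len, output_len, ttft, tpot.
    """
    ok = [
        r for r in requests
        if r["ttft"] <= ttft_slo(r["input_len"]) and r["tpot"] <= tpot_slo(r["output_len"])
    ]
    return len(ok) / duration_s
```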
My suggested next step is to build a "feature × knob" matrix
That is:
- workload features
- arrival burstiness
- input length mean/variance/tail
- output length mean/variance/tail
- mixed-workload ratio
- prefix/KV locality
- SLO strictness
- config knobs
- max_num_batched_tokens
- scheduler mode
- prefill/decode split
- KV memory fraction
- block size
- parallelism
- disaggregation mode
- cache/tier policy
Then mark which feature→knob edges each paper actually covers (a minimal sketch follows below).
This will serve your project more directly than the paper matrix.
If you want, I can put together the feature × knob insight matrix next.
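A minimal sketch of that matrix as data (feature and knob names follow the two lists above; the per-paper edges shown are illustrative placeholders, not verified conclusions):

```python
# Workload features and config knobs from the lists above.
FEATURES = [
    "arrival_burstiness", "input_len_stats", "output_len_stats",
    "mixed_workload_ratio", "prefix_kv_locality", "slo_strictness",
]
KNOBS = [
    "max_num_batched_tokens", "scheduler_mode", "prefill_decode_split",
    "kv_memory_fraction", "block_size", "parallelism",
    "disaggregation_mode", "cache_tier_policy",
]

# Each paper contributes the set of feature -> knob edges it actually studies.
# Placeholder edges, only to show the shape of the matrix.
COVERAGE: dict[str, set[tuple[str, str]]] = {
    "Kareto":    {("prefix_kv_locality", "cache_tier_policy")},
    "semi-PD":   {("mixed_workload_ratio", "prefill_decode_split")},
    "SpotServe": {("arrival_burstiness", "parallelism")},
}

def papers_covering(feature: str, knob: str) -> list[str]:
    """Which papers cover a given feature -> knob edge (empty list = a gap)."""
    return [p for p, edges in COVERAGE.items() if (feature, knob) in edges]
```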
[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)
[Glia: A Human-Inspired AI for Automated Systems Design and Optimization](https://arxiv.org/pdf/2510.27176)
[Vidur: A Large-Scale Simulation Framework for LLM Inference](https://arxiv.org/pdf/2405.05465)
[DynamoLLM](https://arxiv.org/pdf/2408.00741): profiles different TP settings and workloads + search (target: minimize energy while meeting SLOs) + dynamically adjusts the parallelism configuration
[NanoFlow](https://arxiv.org/pdf/2408.12757): theoretical model + automated pipeline search (splitting sequences / operations; target: maximize compute/memory/network utilization)
https://github.com/llm-d/llm-d
AIBrix experimentally proposes heterogeneous-GPU inference: workloads come in different sizes, and different GPUs have sweet spots for different request sizes; offline profiling + online scheduling to minimize cost when serving on heterogeneous hardware.
This has the perennial problem: the gap between offline profiles and actual serving.
https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst
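A toy sketch of that offline-profile + online-routing idea (GPU types, length limits, and costs are invented numbers, not AIBrix's actual profiles):

```python
# Offline profiling: for each GPU type, the request-size range it serves within SLO
# and its relative cost. Online: route each request to the cheapest eligible GPU type.
OFFLINE_PROFILE = {
    "A10":  {"max_len": 2048,  "relative_cost": 1.0},
    "A100": {"max_len": 8192,  "relative_cost": 2.2},
    "H100": {"max_len": 32768, "relative_cost": 3.5},
}

def route(request_len: int) -> str:
    eligible = {g: p for g, p in OFFLINE_PROFILE.items() if request_len <= p["max_len"]}
    if not eligible:
        return "H100"  # fall back to the largest GPU type
    return min(eligible, key=lambda g: eligible[g]["relative_cost"])
```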
agent related:
- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288
SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323
https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning
> Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
>
> SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are shown in the figure below.
![[projects/auto-tuner/related-works.figs/260410-105227.png]]
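A minimal sketch of that selection rule (request/field names are assumptions, not SCOOT's code):

```python
def stress_subset(requests):
    """Keep the requests whose output length falls in the longest 50%."""
    lengths = sorted(r["output_len"] for r in requests)
    median = lengths[len(lengths) // 2]  # upper median as the cutoff
    return [r for r in requests if r["output_len"] >= median]
```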
SLO-Aware Scheduling for Large Language Model Inferences
https://arxiv.org/pdf/2504.14966
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving
https://arxiv.org/abs/2602.22593
The "unified paradigm" you are looking for has been described very plainly in several top-venue papers: **define the search space / actions → run a trial → measure the objective → use an algorithm to choose the next action**.
- **ATC'18 (Cao et al.)** describes the black-box auto-tuning mechanism in fully standard terms: _"iteratively try different configurations, measure the objective function, and pick the next batch of configurations based on what has been learned."_
https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
- **OSDI'23 (Hydro)** gives a similar workflow definition for hyperparameter/config tuning: the user specifies a search space, an algorithm generates trials, and the system orchestrates their execution until the best configuration is found.
https://www.usenix.org/system/files/osdi23-hu.pdf
- **ATC'18 (Metis)** positions itself as a "black-box optimization service," tuning parameters of real production systems with tail latency as the primary evaluation metric.
https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
- **OSDI'18 (µTune)** belongs to the "online adaptive" camp, but it is still the same closed loop: monitor / estimate the load → use a model to predict tail latency under each configuration → switch to the predicted-best configuration.
https://www.usenix.org/system/files/osdi18-sriraman.pdf
- **SOSP'21 (POP)** is in the "mathematical optimization / solver" camp; it explicitly frames system resource allocation as a problem that can be written as a mathematical optimization and discusses the trade-off between solve time and SLAs.
https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf
**Conclusion**: whether black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems-optimization work essentially follows the same meta-structure:
> **Under a budget and constraints, run a closed loop around some executable trial, iteratively generating actions and updating the policy.**
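A minimal sketch of that meta-structure (knob names, values, and the trial measurement are placeholders, not any specific system's API; random search stands in for whatever search/learning algorithm is used):

```python
import random

# Hypothetical search space over serving-engine knobs.
SEARCH_SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192, 16384],
    "kv_memory_fraction": [0.7, 0.8, 0.9],
    "tensor_parallel_size": [1, 2, 4],
}

def sample_config():
    # "Choose the next action": random here; BO / RL / a solver would plug in at this step.
    return {knob: random.choice(values) for knob, values in SEARCH_SPACE.items()}

def run_trial(config, workload_trace):
    # Placeholder measurement: in a real loop this deploys the config, replays the
    # trace, and returns the measured objective (e.g. SLO-constrained goodput).
    return config["kv_memory_fraction"] - abs(config["max_num_batched_tokens"] - 8192) / 1e5

def tune(workload_trace, budget=20):
    best_config, best_score, history = None, float("-inf"), []
    for _ in range(budget):                        # closed loop under a trial budget
        config = sample_config()                   # generate an action
        score = run_trial(config, workload_trace)  # run the trial, measure the objective
        history.append((config, score))            # update what has been learned
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```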
---
# related works
[DynamoLLM](https://arxiv.org/pdf/2408.00741): profiles different TP settings and workloads + search (target: minimize energy while meeting SLOs) + dynamically adjusts the parallelism configuration
[NanoFlow](https://arxiv.org/pdf/2408.12757): theoretical model + automated pipeline search (splitting sequences / operations; target: maximize compute/memory/network utilization)
[MorphServe](https://arxiv.org/pdf/2506.02006v1)
[Autocomp](https://arxiv.org/pdf/2505.18574v3)
# Questions
1. [Mooncake interview](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK), around 10:00: business scenarios have become extremely diverse and need different configurations; with PD disaggregation, how many P and how many D instances to run, and which parallelism modes to use inside the P and D nodes, all need tuning. Does Qwen have a clear online need for different configurations? Or is the current Qwen practice still manual testing and tuning: find one overall reasonably good configuration, fix it as the configuration for that model, with no online load awareness or dynamic adjustment? I believe Alibaba faces "heterogeneous hardware × multiple models × diverse workloads" online; does every dimension really need a manually fine-tuned inference configuration, or is there a better workflow?
2. Has the Qwen series deployed EP in production? If so, why not use DeepSeek's DBO (dual batch overlap) scheme? My understanding is that in EP scenarios DBO can always overlap computation and communication to some degree and thereby improve performance.
3. Qwen-Next shows optimizations for linear attention (GDNAttention) and EP optimizations for very large sparse expert layers; can this be considered a general trend?
---
# Backup
> Note: "automatic optimization" here focuses on system-level mechanisms such as **automated scheduling / parallelism / batching / energy-cost control / KV-cache management**, not pure model changes or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers as given).
### Quick comparison table
| Work (year / source) | One-sentence core idea | Key automation mechanism | Reported performance claim (authors') | Difference from related work | Adoption in vLLM / SGLang |
| --- | --- | --- | --- | --- | --- |
| **DynamoLLM** (2024, HPCA'25) ([arXiv][1]) | **Cluster-level** energy/cost optimization: **automatically reconfigures** the inference cluster while meeting SLOs | Load- and power-aware **dynamic cluster reconfiguration**, hierarchical control | ~52-53% energy savings, 38% less operational carbon, 61% lower cost, SLOs maintained | Targets **service/cluster-level** operating points, not GPU kernel / batching details | Deployed **alongside** vLLM/SGLang (external orchestration layer), not embedded; usable as an upper-level resource controller |
| **NanoFlow** (2024→2025) ([arXiv][2]) | On a **single GPU**, splits a batch into multiple **nano-batches** and automatically searches for a **parallel, overlapped pipeline** | **Automated search** over nano-batch count/size/order/resource quotas; co-schedules operators to overlap compute/memory/network | Up to **1.91×** throughput over multiple baselines (vLLM, FastGen, TRT-LLM) | Emphasizes **intra-device** parallelism and pipeline overlap; fine-grained pipelining for light-decode / heavy-prefill structures | Currently a **standalone runtime** (open-source implementation), not merged into vLLM/SGLang mainline; can serve as an alternative backend ([GitHub][3]) |
| **Sarathi & Sarathi-Serve** (2023-2024) ([arXiv][4]) | Chunks long **prefills** and mixes them with **decodes** into continuous hybrid batches, reducing pipeline "bubbles" | **Chunked prefill** + **continuous hybrid batching** (decode piggybacking) | Up to **1.91×** end-to-end, up to **10×** decode throughput | Earliest systematic formulation of the general "**mix prefill and decode** in one batch" strategy | vLLM and SGLang have both adopted/absorbed this idea (see "adoption" below) ([VLLM Documentation][5]) |
| **DeepSpeed-FastGen** (2024) ([arXiv][6]) | **Blocked KV** + **Dynamic SplitFuse continuous batching**, balancing low latency and high throughput | Dynamic batch splitting/fusing; blocked KV reuse | Outperforms vLLM across models/hardware (authors' report) | Approach is close to vLLM's PagedAttention line but with a different implementation | A **standalone backend**; an alternative to vLLM/SGLang |
| **POD-Attention** (ASPLOS'25) ([Microsoft][7]) | Pursues **full prefill-decode overlap** to lower time between tokens (TBT) | **Fully parallelized prefill/decode scheduling** plus compatible kernels | Markedly fewer stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching further toward **near-full overlap** | Academic prototype; the ideas can transfer to the scheduling/kernel layers of vLLM/SGLang |
| **Fluid-Guided Online Scheduling (WAIT / Nested WAIT)** (2025, SSRN) ([SSRN][8]) | Abstracts LLM inference as multi-stage **online scheduling with KV-memory constraints** and derives near-optimal policies | Fluid-model-based **online batching and memory-allocation** decisions | Better throughput/latency than vLLM/Sarathi in experiments (authors' report) | Strongly theory-driven; an online algorithm coupling **memory, batch size, and latency** | Research-stage; not yet merged into mainstream implementations |
| **Memory-aware dynamic batching** (2025) ([arXiv][9]) | Monitors **GPU memory and SLAs** at runtime and **adaptively** adjusts batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Better throughput/latency than fixed hyperparameters on Llama-7B + A100 | A more engineering-oriented approach to **online batching-hyperparameter** tuning | Compatible with vLLM/SGLang scheduling in spirit; no record of a mainline merge |
| **HyGen** (2025) ([arXiv][10]) | **Online/offline** co-location: two-stage scheduling squeezes out "offline" compute without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location while raising overall utilization | Focuses on **workload co-location** rather than single-workload inference throughput | Deployable alongside any backend; no vLLM/SGLang kernel changes |
| **PrefillOnly** (2025) ([arXiv][11]) | Minimal **KV** and execution-path optimization for "prefill-only" workloads that need **only one output token** | Keeps only the **last layer's** KV; lightweight execution path | Clear speedups and latency reductions for retrieval/classification-style workloads | **Path pruning** targeted at a specific workload type | Can be attached to mainstream engines as a specialized backend/path |
| **vLLM / PagedAttention** (2023→) ([arXiv][12]) | **Paged KV cache** + preemptive scheduling: near-zero fragmentation, easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to **24×** throughput over the HF baseline (early report) | First to standardize **memory paging** and **continuous batching** | Has become one of the de facto mainstream open-source serving engines |
| **Throughput-Optimal Scheduling for LLM Serving** (2025) ([arXiv][13]) | Gives theoretical throughput upper bounds / optimal policies for **continuous batching** | Token-level queueing/matching policies | Optimality results that connect with practice | A theoretical baseline that guides engineering implementations | Awaiting engineering adoption |
| **Learning-to-Rank Scheduler** (2024) | Predicts the **relative length ordering** of requests to approximate SJF, cutting latency and raising throughput | An LTR-trained scheduler that queues/batches requests in predicted order | Better latency / completion time than existing baselines | Also in the **automatic queueing / batch formation** family | Can plug into the admission/queueing stage of vLLM/SGLang. ([arXiv][5]) |
| **Online Scheduling with KV Constraints** (2025) | Abstracts LLM inference as online scheduling **with KV-memory constraints** and derives near-optimal policies | Fluid/queueing models + online batching/KV decisions | Better than vLLM/Sarathi baselines (authors' report) | Also a theoretical treatment of **automatic online scheduling** | Could be a **policy plugin** for vLLM/SGLang (needs low-overhead telemetry). ([arXiv][8]) |
| **Fairness-Aware Batch Formation** (2025.10) | Automatically balances the "compute fairness" of new vs. old requests against throughput under **continuous/hybrid batching** | In-batch quota / reordering policies | Markedly better fairness while maintaining throughput | Also batch-formation automation (complementary to Sarathi / chunked prefill) | Could be retrofitted into the vLLM/SGLang batcher. ([arXiv][10]) |
| **Drift (PD-Multiplexing)** (2025) | **Phase-decoupled multiplexing**: separates prefill and decode phases and multiplexes them "in place," easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Raises throughput while keeping SLOs under diverse loads | Same **phase-level overlap** line as NanoFlow/POD | Needs coordinated kernel/scheduler changes; suited as a deep engine-modification direction. ([Han Zhao 赵涵][11]) |
[1]: https://arxiv.org/html/2408.00741v1 "DynamoLLM: Designing LLM Inference Clusters for ..."
[2]: https://arxiv.org/abs/2408.12757 "NanoFlow: Towards Optimal Large Language Model ..."
[3]: https://github.com/efeslab/Nanoflow "efeslab/Nanoflow: A throughput-oriented high-performance ..."
[4]: https://arxiv.org/abs/2308.16369 "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"
[5]: https://docs.vllm.ai/en/v0.4.2/models/performance.html "Performance and Tuning - vLLM"
[6]: https://arxiv.org/pdf/2401.08671 "DeepSpeed-FastGen: High-throughput Text Generation for ..."
[7]: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/POD-Attention-ASPLOS25.pdf "POD-Attention: Unlocking Full Prefill-Decode Overlap for ..."
[8]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5195463 "Optimizing LLM Inference: Fluid-Guided Online Scheduling ..."
[9]: https://arxiv.org/pdf/2503.05248 "Optimizing LLM Inference Throughput via Memory-aware ..."
[10]: https://arxiv.org/html/2501.14808v2 "1 Introduction"
[11]: https://arxiv.org/html/2505.07203v1 "PrefillOnly: An Inference Engine for Prefill-only Workloads ..."
[12]: https://arxiv.org/pdf/2309.06180 "Efficient Memory Management for Large Language Model ..."
[13]: https://arxiv.org/html/2504.07347v1 "Throughput-Optimal Scheduling Algorithms for LLM ..."
[14]: https://homes.cs.washington.edu/~arvind/papers/nanoflow.pdf "NanoFlow: Towards Optimal Large Language Model ..."
[15]: https://discuss.vllm.ai/t/does-the-vllm-v1-support-speculative-decoding-now/191 "Does the vLLM v1 support Speculative Decoding now?"
[16]: https://github.com/sgl-project/sglang/issues/2273 "[Kernel] Launch two kernels for mixed chunked prefill #2273"
[17]: https://github.com/sgl-project/sglang/issues/6553 "[PD] Support Multi-Process for TokenizerManager #6553"
[18]: https://arxiv.org/html/2312.07104v1 "Efficiently Programming Large Language Models using ..."
[19]: https://iacoma.cs.uiuc.edu/iacoma-papers/hpca25_2.pdf "DynamoLLM: Designing LLM Inference Clusters for ..."
## Qwen optimization
Catalogued how Alibaba's Qwen-family models have been tuned across the codebase.
**Optimization Matrix**
| Model scope | Optimization & commits | Why it was added | Feature / hardware fit |
| ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput |
| Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations |
| Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements |
| Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs |
| Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads |
| Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request |
| Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving |
| Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks |
| Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) → vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 |
| Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning |
Key observations
- Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound.
- Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers.
- Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput.
Next steps
1. Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads.
2. Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology.
| Model | Optimization | Why it was added | Feature / hardware fit |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- |
| Qwen3Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments |
| Qwen3Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP×EP + DeepEP doesn't repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all |
| Qwen3Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding |
| Qwen3Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths |
| Qwen3Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks |
| Qwen3Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba's published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper |
| Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128K) on dense Qwen3 checkpoints |
| Qwen3MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP×EP launches with DeepEP / allgather-reducescatter all-to-all |
| Qwen3MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters |
| Qwen3MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64-128 experts |
| Qwen3MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models |
Highlights & context
- The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active.
- SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner.
- Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba's long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set.
Next steps
1. Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput.
2. If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect.
Evidence: when the team first tuned Qwen2's 57B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11058 tok/s) to 12.47 req/s (13089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics for handpicked block and warp sizes for that GPU. The newer GB200 FP8 tables you're looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance; that's why they use more aggressive BLOCK_SIZE_N and higher num_stages.
---
**Runtime Optimizations**
| Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes |
| --------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. |
| Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. |
| Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. |
| Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. |
| Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. |
| Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. |
| Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. |
| Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. |
| Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. |
| Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. |
| Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. |
| Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. |
| Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. |
| Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. |
| Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. |
| Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. |
| Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. |
| Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. |
| Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. |
| Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. |
| Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. |
**Kernel Config Tuning (Fused MoE / FP8)**
|Commits (Date)|Model / HW Target|Optimization|Perf Metrics|
|---|---|---|---|
|4d0f2661 (2025-10-20)|Qwen3-30B A3/A3B on H100 (FP8 & BF16)|Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes.|Not reported.|
|f96bc364 (2025-10-15)|Qwen3-Next FP8 on H100 TP=2|Introduced TP2-specific FP8 fused_moe config.|Not reported.|
|238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07)|Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100|New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage.|Not reported.|
|8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28)|Qwen3-Coder-480B-A35B on NVIDIA H20-3e|Added FP8 fused_moe configs for large coder variant.|Not reported.|
|2d40665 (2025-06-11)|Qwen3-30B A3B on NVIDIA B200|Introduced B200-specific FP8 fused_moe config.|Not reported.|
|22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08)|Qwen3-235B A22B on NVIDIA H20-3e & A100|Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100).|Not reported.|
|8fc88d63 (2025-04-28)|Qwen3 MoE (H100/H200/H20 targets)|Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README.|Not reported.|
|dcbac4cb (2025-04-28)|Qwen3 dense FP8|Adjusted linear layers so FP8 compatibility works with fused kernels.|Not reported.|
|2007d4d5 & f5a3c655 (2025-05-01)|Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X|Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]).|Not reported.|
|bd439735 (2024-06-14)|Qwen2-57B-A14B (TP2/TP4)|Tuned fused_moe configs for A100/H100; benchmarks show +18% / +14% requests/s (10.53→12.47 @ TP2, 17.77→20.20 @ TP4) with matching tokens/s gains.|Throughput gains published in commit message.|
Next steps (optional):
1. Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren't reported.
2. For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.
Model optimization methods:
- Reduce copies between host and device (a minimal sketch follows below)
- Tune fused_moe configs so the Triton kernels are more efficient for a given expert count (E), N, device, and quantization
- Increase overlap between CPU and GPU work
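A minimal sketch of the first point in PyTorch style (the tensor is illustrative, e.g. the cu_seqlens metadata mentioned above; requires a CUDA device):

```python
import torch

# Metadata built on the CPU (e.g. cu_seqlens for vision attention).
cu_seqlens_cpu = torch.tensor([0, 128, 384, 512], dtype=torch.int32)

# Pinning the host buffer lets the copy be issued asynchronously, and
# non_blocking=True overlaps the H2D transfer with other GPU work instead of
# adding a synchronization point before every kernel launch.
cu_seqlens_pinned = cu_seqlens_cpu.pin_memory()
cu_seqlens_gpu = cu_seqlens_pinned.to("cuda", non_blocking=True)
```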
---
## notes
Mingxing Zhang:
Different stages of an agent can have different business requirements; different requirements call for different parallelism.
value = function / cost
GPU utilization is already at 70-80%, so further gains within a single scenario are hard; the headroom is in multi-scenario settings, letting different functions use different cost to maximize value.
When adapting Mooncake to vLLM and SGLang, the engineering approaches differ; each inference framework needs its own adaptation and optimization work.
---
## Model Optimization Summary
### Kernel
fused_moe/configs/xxx: per-combination config files (expert count E, N, device, quantization) guide Triton kernel generation, keeping SM utilization high without blowing up shared memory or registers.
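As a rough illustration of their shape (keys are the usual Triton tiling/launch parameters; the numbers below are made up, not copied from any shipped config), each file maps a batch-size bucket to one tile configuration:

```python
# Hypothetical excerpt in the shape of a fused_moe config file
# (e.g. E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json).
example_fused_moe_config = {
    "64": {                    # key: batch-size bucket M for this routing step
        "BLOCK_SIZE_M": 32,    # tile sizes for the Triton GEMM
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8,     # L2-friendly grouping of M tiles
        "num_warps": 4,        # occupancy vs. register pressure trade-off
        "num_stages": 3,       # software-pipelining depth (shared-memory usage)
    },
}
```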
GDNAttention for Qwen3-Next
### Data Movement
---
Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction, so the runtime matches the checkpoint's architecture while staying compatible with speculative decoding and parallelism features.
|Optimization|Match-to-model feature|Performance effect|Evidence/data|
|---|---|---|---|
|Gated DeltaNet linear-attention backend with fused gating kernels|Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs|Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance|Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo|
|Shared Fused MoE with expert parallel load balancing|Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps|Maintains Qwen3 Next's shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost|Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present|
|NextN multi-token predictor (MTP) path|Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner|Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled|MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published|
|Mamba-style state management for speculative decode|Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default “no speculative” guard specifically for Qwen3 Next|Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding—otherwise disallowed for generic Mamba models|State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied|
No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.
Next steps (optional): 1) run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers; 2) capture profiling traces to confirm the GDN layers hit the fused kernels.
After introducing PD disaggregation, with EP×DP the scheduler has an even larger impact; adjusting the parallelism mode online may matter less than adjusting the scheduler so that the request pattern each rank receives stays relatively stable.