# Related Work Matrix — Workload Pattern → Serving Engine Config

| Paper | Venue / Year | Primary focus | Workload signals / patterns considered | Optimized knobs / decisions | Objective | Method type | Relation to "workload→config" | Direct takeaway for AITuner | Key insight already stated? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | arXiv 2026 | Multi-framework LLM serving config optimization | production serving workloads; model type; hardware platform; SLA / TTFT / TPOT requirements | TP/PP/EP, CUDA graphs, KV-cache memory fraction, max token capacity, framework-specific flags | latency / throughput / goodput-style objectives | performance modeling + fast config search | Most direct hit | Shows the optimal engine config is clearly workload-dependent, and the config space goes well beyond the scheduler | Yes, and most directly |
| Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service (Kareto) | arXiv 2026 | Tiered storage configuration for the KV cache | real-world traces; KV block access pattern; reuse; different optimization objectives | HBM/DRAM/disk tier config; eviction / group-specific cache management | latency / throughput / cost Pareto frontier | simulator + pruning + adaptive tuner | Direct hit, but only in the KV/storage subspace | Shows memory-side knobs must be workload-aware and are inherently multi-objective | Yes, but limited to KV tiering |
| Serving Generative Large Language Models on Preemptible Instances (SpotServe) | ASPLOS 2024 | LLM serving on preemptible cloud instances | fluctuating workload; instance-availability changes; preemption traces | distributed parallelization config; migration strategy | latency / throughput / monetary cost | system design + dynamic reconfiguration | Partial hit | Shows config = f(workload, resource state), not a function of the workload alone | Yes, but focused on deployment / parallelism config |
| Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing (MorphServe) | arXiv 2025 | Workload-aware runtime adaptation | bursty workload; real-time memory pressure; load surges | layer quantization/swapping; KV cache resizing | SLO violations, TTFT, quality-preserving efficiency | runtime adaptive serving | Partial hit | Shows a static optimum config does not fit bursty workloads; online adaptive configuration matters | Yes, but focused on runtime morphology |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI 2024 | Dynamic scheduling of heterogeneous requests | heterogeneous / unpredictable requests; differing priorities / SLOs | request placement, migration, instance-level scheduling | tail latency, priority acceleration, cost saving | dynamic rescheduling + live migration | Adjacent, not config tuning | Suggests that under highly heterogeneous workloads, instance-level isolation/migration may matter more than local knob tuning | States a workload→architecture/scheduling insight |
| Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage (semi-PD) | arXiv 2025 | Phase-wise resource partitioning | request rate; phase contention; SLO pressure | prefill/decode partition ratio; resource controller; unified storage | latency, SLO attainment | system design + dynamic partitioning | Partial hit | Shows the phase split ratio is itself an important, workload-dependent config knob | Yes, but focused on phase partitioning |
| Workload Characterization and Generation of Large Language Model Serving in Production (ServeGen) | arXiv 2025 | Characterization of real production workloads | per-client composition; temporal structure; model mix; multimodal/reasoning; large-scale production traces | does not optimize configs directly | benchmark realism; avoid under-provisioning | characterization + workload generator | Foundational support, not a direct hit | Shows that if the workload characterization is unrealistic, any workload→config rule can be distorted | Yes, the key insight on workload realism |
| BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems | arXiv / KDD | Real serving-trace dataset | burstiness; conversation patterns; response lengths; system failures | does not optimize configs directly | evaluation realism; stress realistic serving behavior | trace dataset / characterization | Foundational support, not a direct hit | Shows burstiness, conversation structure, and joint output-length statistics are important workload features | Yes, states the importance of realistic workloads |
| Comparative Analysis of Large Language Model Inference Serving Systems: vLLM and HuggingFace TGI | arXiv 2025 | Serving-framework comparison | concurrency level; interactive vs. batch use case; model size | framework choice (vLLM vs. TGI) | throughput, tail latency, memory, scalability | empirical comparison | Coarse-grained hit | Shows that engine choice is itself a coarse-grained config decision | Partially stated |
| Understanding and Optimizing Multi-Stage AI Inference Pipelines (HERMES) | arXiv 2025 | Multi-stage inference pipelines | reasoning, RAG, KV retrieval, routing, heterogeneous hardware, stage diversity | stage-specific batching / HW-SW design choices | end-to-end latency; pipeline optimization | simulator / design-space exploration | Relevant to the future scope | Reminder that the workload→config problem will extend to multi-stage pipelines, beyond prefill/decode | Partially stated, from a multi-stage perspective |

I suggest compressing these into four categories in the paper.

A. Workload characterization / benchmark realism — these papers do not tune configs directly, but they determine whether your argument holds up.

| Paper | How to cite it |
| --- | --- |
| ServeGen | To show that real workloads are complex and synthetic traces easily mislead configuration conclusions |
| BurstGPT | To show that workload features such as burstiness, conversation patterns, and response lengths cannot be ignored |

B. Workload-aware scheduling / routing / partitioning — these do not tune engine knobs, but they provide the core structure of workload patterns.

| Paper | Known core insight |
| --- | --- |
| Llumnix | heterogeneous workloads need cross-instance rescheduling / isolation |
| EWSJF | mixed workloads should first be split into groups / regimes, then optimized |
| CascadeInfer | length heterogeneity is the core bottleneck |
| Sarathi-Serve | prefill/decode interference is the first-order bottleneck |
| semi-PD | the phase split ratio should change with workload / SLO |

C. Workload-aware configuration / adaptive resource shaping — these are the closest to your work.

| Paper | Subspace covered |
| --- | --- |
| AIConfigurator | general serving config search |
| Kareto | KV / storage configuration |
| SpotServe | distributed parallelization configuration |
| MorphServe | online adaptive resource / precision configuration |

D. Framework / pipeline-level coarse configuration — these tell you where the boundary of the problem lies.

| Paper | Boundary implication |
| --- | --- |
| Comparative vLLM vs TGI | engine choice is itself a coarse-grained config decision |
| HERMES | multi-stage pipelines make the workload→config problem more complex |

If you want to highlight the gap more explicitly in the paper, you can use this gap matrix directly:

| Dimension | Coverage by existing work | Your opportunity |
| --- | --- | --- |
| Realistic workload characterization | Strong | No need to build another dataset, but absorb their feature definitions |
| Workload-aware scheduling | Very strong | Avoid overlapping with them head-on |
| Single-subspace config adaptation (KV / parallelism / runtime precision) | Moderate | You can unify the insights scattered across these subspaces |
| General workload→engine-knob mapping | Weak | This is your main opportunity |
| Config optimization under a unified objective of length-dependent TTFT + TPOT + goodput | Very weak | This is your strongest position |

Distilled into a one-sentence related-work positioning you can use when writing the paper:

> Prior work has extensively studied realistic workload characterization and workload-aware scheduling for LLM serving, and several recent systems have explored workload-dependent adaptation in specific configuration subspaces such as parallelization, KV-cache tiering, and runtime precision scaling. However, a general and interpretable mapping from workload patterns to serving-engine configurations remains underexplored, especially under length-dependent TTFT and TPOT constraints with SLO-constrained goodput as the optimization target.
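To make that optimization target concrete, here is a minimal sketch (Python, with hypothetical names and purely illustrative SLO parameters, not any system's actual metrics API) of SLO-constrained goodput under length-dependent TTFT and a flat TPOT budget: a request's tokens count toward goodput only if its TTFT stays under a budget that grows with prompt length and its TPOT stays under the per-token budget.

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    input_len: int    # prompt tokens
    output_len: int   # generated tokens
    ttft_ms: float    # time to first token
    tpot_ms: float    # mean time per output token

def ttft_slo_ms(input_len: int, base_ms: float = 200.0, per_token_ms: float = 0.5) -> float:
    # Length-dependent TTFT budget: longer prompts get a larger prefill allowance.
    # base_ms / per_token_ms are illustrative placeholders, not calibrated values.
    return base_ms + per_token_ms * input_len

def meets_slo(r: RequestMetrics, tpot_slo_ms: float = 50.0) -> bool:
    return r.ttft_ms <= ttft_slo_ms(r.input_len) and r.tpot_ms <= tpot_slo_ms

def goodput(requests: list[RequestMetrics], window_s: float) -> float:
    # SLO-constrained goodput: only tokens from SLO-compliant requests count.
    good_tokens = sum(r.output_len for r in requests if meets_slo(r))
    return good_tokens / window_s

# Example: two requests over a 10-second window; only the first meets its SLOs.
reqs = [
    RequestMetrics(input_len=1024, output_len=256, ttft_ms=500.0, tpot_ms=40.0),
    RequestMetrics(input_len=8192, output_len=512, ttft_ms=9000.0, tpot_ms=80.0),
]
print(goodput(reqs, window_s=10.0))  # -> 25.6 good tokens/s
```

Whether the TTFT budget should scale linearly with prompt length is itself a modeling choice; the point is only that this objective couples workload features (length distributions, arrival rate) with the knobs being tuned.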
This positioning statement basically captures how your work relates to the ones above.

I suggest your next step is a "feature × knob" matrix, i.e.:

- Rows: workload features
	- arrival burstiness
	- input length mean/variance/tail
	- output length mean/variance/tail
	- mixed-workload ratio
	- prefix/KV locality
	- SLO strictness
- Columns: config knobs
	- max_num_batched_tokens
	- scheduler mode
	- prefill/decode split
	- KV memory fraction
	- block size
	- parallelism
	- disaggregation mode
	- cache/tier policy

Then mark which feature→knob edges each paper actually covers. This will serve the project more directly than the paper matrix above; the next step is to compile that feature × knob insight matrix.

[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)

[Glia: A Human-Inspired AI for Automated Systems Design and Optimization](https://arxiv.org/pdf/2510.27176)

[VIDUR: A LARGE-SCALE SIMULATION FRAMEWORK FOR LLM INFERENCE](https://arxiv.org/pdf/2405.05465)

[DynamoLLM](https://arxiv.org/pdf/2408.00741): profile (across TP settings and workloads) + search (target: minimize energy while meeting SLOs) + dynamic adjustment of the parallelism configuration

[NanoFlow](https://arxiv.org/pdf/2408.12757): theoretical model + automated pipeline search (splitting sequences / operations; target: maximize compute/memory/network utilization)

https://github.com/llm-d/llm-d

AIBrix experimentally introduces heterogeneous-GPU inference: requests come in different sizes, and different GPUs have sweet spots for different request sizes. It combines offline profiling with online scheduling to minimize cost when serving on heterogeneous hardware. The familiar problem remains: the gap between offline profiles and actual serving behavior.
https://github.com/vllm-project/aibrix/blob/main/docs/source/features/heterogeneous-gpu.rst

agent related:

- StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems https://arxiv.org/pdf/2510.25017
- ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training https://arxiv.org/pdf/2511.03844
- STELLAR: Storage Tuning Engine Leveraging LLM Autonomous Reasoning for High Performance Parallel File Systems https://dl.acm.org/doi/epdf/10.1145/3712285.3759887

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving https://arxiv.org/pdf/2601.06288

SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines https://arxiv.org/pdf/2408.04323
https://github.com/antgroup/SCOOT-SLO-Oriented-Performance-Tuning

> Request traces are collected from four LLM inference services at Ant Group, including applications of text-to-SQL (SQL), chatbot (BOT), classification (CLS), and recommendation (REC).
>
> SCOOT uses requests with the longest 50% of output lengths for stress testing. The input and output lengths of these requests are presented in Fig.
![[projects/auto-tuner/related-works.figs/260410-105227.png]]

SLO-Aware Scheduling for Large Language Model Inferences https://arxiv.org/pdf/2504.14966
https://ut-sysml.ece.utexas.edu/publications/prints/sc2025_liakopoulos.pdf

FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving https://arxiv.org/abs/2602.22593

The "unified paradigm" you are after has in fact been described very explicitly in several top-venue papers: **define the search space / actions → run trials → measure the objective → use an algorithm to pick the next action**.

- **ATC'18 (Cao et al.)** spells out the mechanics of black-box auto-tuning in a very standardized way: _"iteratively try different configurations, measure the objective function, and pick the next batch of configurations based on what has been learned."_ https://www.usenix.org/system/files/conference/atc18/atc18-cao.pdf
- **OSDI'23 (Hydro)** likewise gives a workflow definition for hyperparameter/configuration tuning: the user specifies a search space, an algorithm generates trials, and the system orchestrates their execution until the best configuration is found. https://www.usenix.org/system/files/osdi23-hu.pdf
- **ATC'18 (Metis)** positions itself as a "black-box optimization service," tuning real production systems with tail latency as the primary evaluation metric. https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf
- **OSDI'18 (µTune)** belongs to the "online adaptive" school, yet it is still the same closed loop: monitor/estimate the load → use a model to predict tail latency under each configuration → switch to the predicted-best configuration. https://www.usenix.org/system/files/osdi18-sriraman.pdf
- **SOSP'21 (POP)** represents the "mathematical optimization / solver" school; it explicitly frames system resource allocation as a problem "that can be written as a mathematical optimization" and discusses the trade-off between solver speed and SLAs. https://people.eecs.berkeley.edu/~matei/papers/2021/sosp_pop.pdf

**Conclusion**: whether it is black-box search, online control, reinforcement learning, or mathematical programming, top-venue systems optimization work essentially follows the same meta-structure:

> **Under a budget and constraints, iterate a closed loop around some "executable trial": generate actions and update the policy.**

---

# related works

[MorphServe](https://arxiv.org/pdf/2506.02006v1)

[Autocomp](https://arxiv.org/pdf/2505.18574v3)

# Questions

1. [Mooncake interview](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK), around 10:00 — business scenarios have become extremely diverse and call for different configurations; with PD disaggregation, how many P versus D instances, and which parallelism mode to run inside the P and D nodes, all need tuning. Does Qwen in production have a clear need for different configurations? Or is the current practice to benchmark manually, pick one overall reasonably good configuration, and fix it per model, with no online workload awareness or dynamic adjustment? Alibaba faces "heterogeneous hardware × many models × diverse workloads" online — does every dimension require a manually fine-tuned serving configuration, or is there a better workflow?
2. Has the Qwen series gone live with EP? If so, why not adopt DeepSeek's DBO (dual batch overlap) scheme? My understanding is that under EP, DBO can always overlap computation and communication to some degree and thus improve performance.
3. Qwen-Next shows optimizations targeting linear attention (GDNAttention) and EP optimizations for very large sparse expert sets. Can this be regarded as a general trend?
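Before the backup material, here is a minimal closed-loop sketch of the meta-structure summarized in the unified-paradigm notes above (search space → trial → measurement → action selection). The knob names, the random-search policy, and the synthetic objective are illustrative placeholders, not any specific system's API; in a real tuner, `run_trial` would replay a workload under the configuration and report the measured objective.

```python
import random

# Illustrative search space: a few engine-level knobs and candidate values.
SEARCH_SPACE = {
    "max_num_batched_tokens": [2048, 4096, 8192, 16384],
    "kv_memory_fraction": [0.7, 0.8, 0.9],
    "scheduler_mode": ["fcfs", "chunked_prefill"],
}

def sample_config(space):
    """Action generation: plain random search here; could be BO, RL, or a solver."""
    return {knob: random.choice(values) for knob, values in space.items()}

def run_trial(config) -> float:
    """Executable trial: in a real tuner this replays a workload under `config`
    and measures the objective (e.g. SLO-constrained goodput). The score below
    is synthetic, purely so the loop runs end to end."""
    score = -abs(config["max_num_batched_tokens"] - 8192) / 1024
    score += 5 * config["kv_memory_fraction"]
    score += 1.0 if config["scheduler_mode"] == "chunked_prefill" else 0.0
    return score

def tune(space, budget: int):
    """The shared meta-structure: under a trial budget, iterate
    propose -> run -> measure -> keep the best (and update the policy)."""
    best_cfg, best_obj = None, float("-inf")
    for _ in range(budget):
        cfg = sample_config(space)
        obj = run_trial(cfg)          # measure the objective for this action
        if obj > best_obj:
            best_cfg, best_obj = cfg, obj
        # A learning-based tuner would also update its surrogate/policy here.
    return best_cfg, best_obj

print(tune(SEARCH_SPACE, budget=20))
```

µTune-style online adaptation fits the same skeleton if `run_trial` is replaced by a latency model evaluated against the currently observed load, and `tune` is re-run whenever the load estimate changes.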
---

# Backup

> Note: "automatic optimization" here means system-level mechanisms — **automated scheduling / parallelism / batching / energy-cost management / KV-cache management** — rather than pure model changes or manual parameter tuning. Each entry gives a one-sentence core idea and the publicly reported performance claim (taking the authors' headline numbers at face value).

### Quick comparison table

| Work (year / source) | One-sentence core idea | Key automation mechanism | Reported performance claim (per authors) | Difference from related work | Adoption in vLLM / SGLang |
| --- | --- | --- | --- | --- | --- |
| **DynamoLLM** (2024, HPCA'25) ([arXiv][1]) | **Cluster-level** energy/cost optimization: **automatically reconfigures** the inference cluster while meeting SLOs | Load- and power-aware **dynamic cluster reconfiguration**; hierarchical control | ~52–53% energy savings, 38% operational-carbon reduction, 61% cost reduction, SLOs maintained | Optimizes the **service/cluster-level** operating point, not GPU kernels or batching details | Deployed **alongside** vLLM/SGLang as an external orchestration layer, not embedded; usable as an upper-level resource controller |
| **NanoFlow** (2024→2025) ([arXiv][2]) | Splits a request into multiple **nano-batches** on a **single GPU** and automatically searches for an overlapped parallel pipeline | **Automatic search** over nano-batch count/size/order/resource quotas; operator co-scheduling to overlap compute, memory, and network | Up to **1.91×** throughput over several baselines (vLLM, FastGen, TRT-LLM) | Emphasizes **intra-device** parallelism and pipelined overlap; fine-grained pipelining for the decode-light / prefill-heavy structure | Currently a **standalone runtime** (open-source implementation), not merged into the vLLM/SGLang mainline; can serve as an alternative backend ([GitHub][3]) |
| **Sarathi & Sarathi-Serve** (2023–2024) ([arXiv][4]) | Chunks long **prefills** and mixes them with **decodes** into continuous hybrid batches, reducing pipeline "bubbles" | **Chunked prefill** + **continuous hybrid batching** (decode piggyback) | Up to **1.91×** end to end; up to **10×** decode throughput | First to systematize the general strategy of **mixing prefill and decode** in one batch | Both vLLM and SGLang have adopted/absorbed the idea ([VLLM Documentation][5]) |
| **DeepSpeed-FastGen** (2024) ([arXiv][6]) | **Blocked KV** + **Dynamic SplitFuse continuous batching**, balancing low latency with high throughput | Dynamic batch splitting/fusing, blocked KV reuse | Outperforms vLLM across models and hardware (authors' report) | Similar direction to vLLM's PagedAttention, different implementation | A **standalone backend**; an alternative to vLLM/SGLang |
| **POD-Attention** (ASPLOS'25) ([Microsoft][7]) | Pursues **full overlap of prefill and decode** to lower Token Break Time (TBT) | **Fully parallelized** prefill/decode scheduling with compatible kernels | Significantly reduces stalls in long-context / high-TBT scenarios (qualitative + quantitative) | Pushes Sarathi's hybrid batching toward **near-full overlap** | Academic prototype; the ideas transfer to the scheduler/kernel layers of vLLM/SGLang |
| **Fluid-Guided Online Scheduling (WAIT / Nested WAIT)** (2025, SSRN) ([SSRN][8]) | Abstracts LLM inference as multi-stage **online scheduling under KV-memory constraints** and derives near-optimal policies | Fluid-model-guided **online batching and memory-allotment** decisions | Better throughput/latency than vLLM/Sarathi in experiments (authors' report) | Strongly theory-driven; an online algorithm coupling **memory, batching, and latency** | Research work, not yet merged into mainstream implementations |
| **Memory-aware dynamic batching** (2025) ([arXiv][9]) | Monitors **GPU memory and SLAs** at runtime and **adapts** batch size and the decoding process | Memory-aware batch scheduling + latency feedback loop | Beats fixed hyperparameters on Llama-7B + A100 in throughput/latency | A more engineering-oriented method for **online batching hyperparameter** adjustment | Compatible in spirit with vLLM/SGLang self-scheduling; no record of an upstream merge |
| **HyGen** (2025) ([arXiv][10]) | **Online/offline co-location**: two-stage scheduling squeezes out "offline" compute without breaking online SLOs | Two-stage SLO-aware scheduling and isolation | Maintains online latency under offline co-location and raises overall utilization | Targets **workload co-location** rather than single-workload inference throughput | Can be deployed alongside any backend; no vLLM/SGLang kernel changes |
| **PrefillOnly** (2025) ([arXiv][11]) | Targets "prefill-only" workloads that emit **only one output token**, with a **minimal KV** footprint and a trimmed execution path | Keeps only the **last layer's** KV; lightweight runtime path | Significant speedup and latency reduction for retrieval/classification-style workloads | **Path specialization** for a specific workload type | Can attach to mainstream engines as a specialized backend/path |
| **vLLM / PagedAttention** (2023→) ([arXiv][12]) | **Paged KV cache** + preemptive scheduling: near-zero fragmentation and easy continuous batching | Block-level memory management, continuous batching, request preemption | Up to **24×** throughput over the HF baseline (early report) | First to standardize **memory paging** plus **continuous batching** | Has become one of the de facto mainstream open-source serving engines |
| **Throughput-Optimal Scheduling for LLM Serving** (2025) ([arXiv][13]) | Gives theoretical throughput bounds / optimal policies for **continuous batching** | Token-level queueing/matching policies | Optimality results that connect to practice | A theory-first baseline that guides engineering | Awaiting engineering adoption |
| **Learning-to-Rank Scheduler** (2024) | Predicts the **relative length ordering** of requests to approximate SJF, lowering latency and raising throughput | LTR-trained scheduler that queues/batches in predicted order | Better latency / completion time than existing baselines | Same family of **automatic queueing / batch formation** | Can plug into the admission/queueing stage of vLLM/SGLang. ([arXiv][5]) |
| **Online Scheduling with KV Constraints** (2025) | Abstracts LLM inference as online scheduling **under KV-memory constraints** and gives near-optimal policies | Fluid/queueing models + online batching and KV decisions | Beats vLLM/Sarathi baselines (authors' report) | A theoretical treatment in the same family of **automatic online scheduling** | Could act as a **policy plugin** for vLLM/SGLang (needs low-overhead telemetry). ([arXiv][8]) |
| **Fairness-Aware Batch Formation** (2025.10) | Automatically balances "compute fairness" between new and old requests against throughput under **continuous/hybrid batching** | Intra-batch quotas / reordering policies | Markedly reduces unfairness while preserving throughput | Also batch-formation automation (complementary to Sarathi / chunked prefill) | Could be implemented by modifying the vLLM/SGLang batcher. ([arXiv][10]) |
| **Drift (PD-Multiplexing)** (2025) | **Phase-decoupled multiplexing**: separates the prefill and decode phases and multiplexes them "in place," easing the throughput-latency tension | Phase decoupling + in-place compute multiplexing | Improves throughput while holding SLOs across multiple workloads | Same **phase-level overlap** line of work as NanoFlow/POD | Needs coordinated kernel/scheduler changes; a candidate direction for deep engine modification. ([Han Zhao 赵涵][11]) |

[1]: https://arxiv.org/html/2408.00741v1?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."
[2]: https://arxiv.org/abs/2408.12757?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
[3]: https://github.com/efeslab/Nanoflow?utm_source=chatgpt.com "efeslab/Nanoflow: A throughput-oriented high-performance ..."
[4]: https://arxiv.org/abs/2308.16369?utm_source=chatgpt.com "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"
[5]: https://docs.vllm.ai/en/v0.4.2/models/performance.html?utm_source=chatgpt.com "Performance and Tuning - vLLM"
[6]: https://arxiv.org/pdf/2401.08671?utm_source=chatgpt.com "DeepSpeed-FastGen: High-throughput Text Generation for ..."
[7]: https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/POD-Attention-ASPLOS25.pdf?utm_source=chatgpt.com "POD-Attention: Unlocking Full Prefill-Decode Overlap for ..."
[8]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5195463&utm_source=chatgpt.com "Optimizing LLM Inference: Fluid-Guided Online Scheduling ..."
[9]: https://arxiv.org/pdf/2503.05248?utm_source=chatgpt.com "Optimizing LLM Inference Throughput via Memory-aware ..."
[10]: https://arxiv.org/html/2501.14808v2?utm_source=chatgpt.com "1 Introduction"
[11]: https://arxiv.org/html/2505.07203v1?utm_source=chatgpt.com "PrefillOnly: An Inference Engine for Prefill-only Workloads ..."
[12]: https://arxiv.org/pdf/2309.06180?utm_source=chatgpt.com "Efficient Memory Management for Large Language Model ..."
[13]: https://arxiv.org/html/2504.07347v1?utm_source=chatgpt.com "Throughput-Optimal Scheduling Algorithms for LLM ..."
[14]: https://homes.cs.washington.edu/~arvind/papers/nanoflow.pdf?utm_source=chatgpt.com "NanoFlow: Towards Optimal Large Language Model ..."
[15]: https://discuss.vllm.ai/t/does-the-vllm-v1-support-speculative-decoding-now/191?utm_source=chatgpt.com "Does the vLLM v1 support Speculative Decoding now?"
[16]: https://github.com/sgl-project/sglang/issues/2273?utm_source=chatgpt.com "[Kernel] Launch two kernels for mixed chunked prefill #2273"
[17]: https://github.com/sgl-project/sglang/issues/6553?utm_source=chatgpt.com "[PD] Support Multi-Process for TokenizerManager #6553"
[18]: https://arxiv.org/html/2312.07104v1?utm_source=chatgpt.com "Efficiently Programming Large Language Models using ..."
[19]: https://iacoma.cs.uiuc.edu/iacoma-papers/hpca25_2.pdf?utm_source=chatgpt.com "DynamoLLM: Designing LLM Inference Clusters for ..."

## Qwen optimization

Catalogued how Alibaba's Qwen-family models have been tuned across the codebase.
**Optimization Matrix** | Model scope | Optimization & commits | Why it was added | Feature / hardware fit | | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | | Qwen3-VL | Vectorized fast_pos_embed_interpolate (30d08911f, af7dfb0d1) → vllm/model_executor/models/qwen3_vl.py:443 | Removes nested Python loops and reuses tensor meshgrids to cut interpolation overhead on large vision grids | Handles arbitrary grid_thw shapes for image/video batches, improving deepstack throughput | | Qwen3-VL | Non-blocking data flow & CPU-built cu_seqlens (b2155ed31, 2c1c7dfb3, 0426e3c5e) → vllm/model_executor/models/qwen3_vl.py:492, 529-549, 1296 | Keeps host/device copies async, avoids redundant concatenate, flattens 3D MM tensors without reallocation | Sustains high-frame video pipelines and DP sharding without CUDA synchronizations | | Qwen3-VL | Triton interleaved MRoPE + ViT data-parallel/back-end gating (cea91a32f, 3127274d0, c242c9803) → vllm/model_executor/layers/rotary_embedding/mrope.py:19, vllm/model_executor/models/qwen3_vl.py:316, 1420 | Custom kernel accelerates interleaved 3D RoPE; DP path removes TP-only constraint while forcing SDPA on Blackwell for stability | Matches Qwen3-VL vision tower on GB200/B200 with MoE+DP and interleaved rotary requirements | | Qwen2.5-VL | Flash/xFormers friendly attention staging (02ed8a1fb, 70b808fe1, 47c712621, c242c9803) → vllm/model_executor/models/qwen2_5_vl.py:274, 627, 814 | Precomputes seqlens once, routes through backend-specific fast paths, and supports upstream FlashAttn fallback | Smooths attention across FlashAttn, TORCH_SDPA, xFormers on long video contexts and new Blackwell GPUs | | Qwen2.5-VL | O(n) inverse permutation & pinned staging (67da5720d4, e283976f3, b0d1213ac) → vllm/model_executor/models/qwen2_5_vl.py:769-897, 829-831, 892-897 | Replaces argsort with in-place inverse, pins buffers, and streams seqlens to GPU asynchronously | Large-window adapters and deepstack vision merges stay latency-neutral under heavy video loads | | Qwen2-VL | Vision data-parallel mode & memoized seqlens (c98be0a23, 70b808fe1) → vllm/model_executor/models/qwen2_vl.py:244-716, 767-805 | Allows disabling TP in favor of DP for the encoder and caches attention metadata | Optimizes multi-GPU deployments handling many images per request | | Qwen2/2.5-VL startup | Max-token heuristics replace dummy image probing (2c5302fad) → vllm/model_executor/models/qwen2_vl.py:931-940, vllm/multimodal/profiling.py:353-360 | Avoids generating huge fake inputs to find token limits, slashing initialization time | Keeps autoregressive limits aligned with true max resolution / frame counts before serving | | Qwen series (CUDA & ROCm) | Rotary dispatch abstraction (5e4a8223c) → vllm/model_executor/layers/rotary_embedding/common.py:74, vllm/model_executor/models/qwen2_vl.py:274 | Dynamically selects FlashAttn kernels (CUDA/ROCm) instead of Python fallback | Maintains high-throughput rotary embedding on both CUDA and ROCm stacks | | Qwen2 text & MoE | FP8 KV-scale remap + tuned A100 configs (da971ec7a, bd4397352) 
→ vllm/model_executor/models/qwen2.py:436-451, vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json#L1 | Loads FP8 cache scales correctly and ships empirically tuned fused-MoE kernels | Targets FP8-capable GPUs and boosts Qwen2-57B-A14B throughput on A100 TP=2/4 | | Qwen3 MoE family (Next / Coder / Thinking) | Blackwell / H100 / H200 FP8 fused configs (238c4c170, 482e52f56, 75334956c, 9f04d9d55, 12a8414d8, f82f7a899, 7a70a7189, 569bf1c9c, c733bd5e8) → e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json#L1 | Provides block-size / warp tuning for each TP×EP layout and precision mix | Lets Qwen3 MoE variants saturate GB200/H200/H100 SKUs at FP8 W8A8 without manual retuning | Key observations - Multiple commits converge on eliminating host/device syncs via pinned buffers and asynchronous transfers (vllm/model_executor/models/qwen2_5_vl.py:831, 892-897), indicating a broad push to keep multimodal pipelines GPU-bound. - Data-parallel encoder paths (vllm/model_executor/models/qwen3_vl.py:316, vllm/model_executor/models/qwen2_vl.py:664-716) mirror each other, suggesting a reusable pattern for future Qwen multimodal towers. - Fused-MoE JSON configs proliferate across hardware tiers; keeping them in sync with upstream kernel tuning is now central to sustaining Qwen3-* throughput. Next steps 1. Spot-check lm-eval / throughput on GB200 and B200 with the new MoE configs to confirm expected gains on production workloads. 2. Document the DP vs TP trade-offs for operators now that encoder data-parallel is available in both Qwen2-VL and Qwen3-VL, so deployment teams pick the right topology. | Model | Optimization | Why it was added | Feature / hardware fit | | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- | | Qwen3‑Next | SharedFusedMoE wraps shared experts and only overlaps them when it pays off (vllm/model_executor/models/qwen3_next.py:161, vllm/model_executor/layers/fused_moe/shared_fused_moe.py:31) | Avoids double-computing shared experts when Expert Parallel (EP) or FlashInfer overlap would be wasted, keeping the block torch.compile friendly | Hybrid MoE layers with shared experts on DP/EP deployments | | Qwen3‑Next | Sequence-parallel MoE routing (vllm/model_executor/models/qwen3_next.py:183, vllm/model_executor/models/utils.py:784, vllm/config/parallel.py:342) | Guarantees tokens are sharded across TP ranks before expert dispatch so TP × EP + DeepEP doesn’t repeat work | Large-TG throughput setups using TP+EP or DeepEP all-to-all | | Qwen3‑Next | Passes GDN causal-conv metadata through the attention path (vllm/model_executor/models/qwen3_next.py:528, vllm/v1/attention/backends/gdn_attn.py:59) | Supplies Triton conv kernels with precomputed offsets so gated delta nets can reuse cached state cheaply during prefills | Keeps the hybrid Mamba/GDN stack fast on long prefills and speculative decoding | | Qwen3‑Next | Splits the in-projection so FP8 checkpoints load cleanly (vllm/model_executor/models/qwen3_next.py:284, 471) | Separates QKVZ vs BA projections, matching blockwise FP8 quant requirements and avoiding unsupported 
merges | Enables FP8 W8A8 checkpoints across FlashInfer / Triton paths | | Qwen3‑Next | Removes CUDA-only device assumptions (vllm/model_executor/models/qwen3_next.py:306) | Lets the gated delta net initialize via the current platform hook instead of torch.cuda.current_device() | Unlocks ROCm and future device backends without code forks | | Qwen3‑Next MoE configs | Adds FP8 tuning files for GB200 / H100 / H200 (e.g. vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json:1) | Ships per-topology block sizes and warp counts that match Alibaba’s published sweeps | Drop-in configs for Qwen3-Next MoE on Blackwell, Hopper, and Grace Hopper | | Qwen3 dense | Dual chunk attention plumbing (vllm/model_executor/models/qwen3.py:118, 199) | Exposes transformers’ dual-chunk cache so Qwen3 can serve ultra-long prompts without KV blow-up | Long-context serving (>128 K) on dense Qwen3 checkpoints | | Qwen3‑MoE | Built-in sequence-parallel dispatch (vllm/model_executor/models/qwen3_moe.py:139, 192) | Uses the same chunk op to keep EP workloads strictly partitioned, avoiding duplicate expert calls | TP × EP launches with DeepEP / allgather-reducescatter all-to-all | | Qwen3‑MoE | Lets FusedMoE return reduced tensors directly (vllm/model_executor/models/qwen3_moe.py:167) | Eliminates a second tensor-parallel all-reduce at block exit, shaving latency on every layer | Any MoE run with TP>1, especially bandwidth-bound clusters | | Qwen3‑MoE | FP8 batched expert kernels (vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:937, 991; vllm/model_executor/layers/quantization/fp8.py:1021) | Chooses BatchedTriton experts and Cutlass FP8 paths automatically when FP8 W8A8 checkpoints are loaded | Delivers FP8 throughput on Triton / Cutlass backends for 64–128 experts | | Qwen3‑MoE | Enables 4-bit bitsandbytes pre-quant loading (vllm/model_executor/model_loader/bitsandbytes_loader.py:467) | Lifts the restriction on pre-quantized BNB weights for MoE blocks, keeping only the 8-bit ban | Lets Qwen3-MoE serve 4-bit BNB “prequant” checkpoints alongside dense models | Highlights & context - The TP×EP fix (sequence parallel chunk) now applies uniformly to Qwen3 Next and Qwen3 MoE, so hybrid attention and experts both avoid redundant dispatch when DeepEP or naive A2A is active. - SharedFusedMoE carved the shared expert overlap logic out of the MoE block; if your deployment disables FlashInfer or EP, the shared MLP simply executes once, which also makes torch.compile tracing cleaner. - Dual-chunk attention landed in the base Qwen3 dense model, aligning it with Alibaba’s long-context releases; no config change is needed beyond picking a checkpoint with dual_chunk_attention_config set. Next steps 1. Benchmark FP8 and 4-bit BNB variants on the intended GPU SKU (GB200/H100/H200) to confirm the shipped configs hit expected throughput. 2. If you rely on custom parallel settings, double-check parallel_config.use_sequence_parallel_moe so the sequence-parallel path activates where you expect. Evidence: when the team first tuned Qwen2’s 57 B MoE on A100 they landed on the JSON E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json, and the commit that introduced it measured a move from 10.53 req/s (11 058 tok/s) to 12.47 req/s (13 089 tok/s) at TP=2 (see commit bd43973522ea17be50e10fbb222a22f673c8067e). That improvement came solely from swapping the default heuristics with hand‑picked block and warp sizes for that GPU. 
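For reference, the tuned fused-MoE files mentioned above map a token-batch-size bucket to a Triton tile configuration. The snippet below is a hypothetical example in that style, not a copy of any shipped file; the field names follow the usual Triton fused-MoE parameters (BLOCK_SIZE_M/N/K, GROUP_SIZE_M, num_warps, num_stages), while the concrete values are illustrative only.

```python
# Hypothetical contents of a tuned fused-MoE config, in the style of
# vllm/model_executor/layers/fused_moe/configs/E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json.
# Keys are batch-size buckets (number of tokens); values are Triton launch parameters.
# All numbers here are made up for illustration, not the values shipped in the repo.
example_fused_moe_config = {
    "1":    {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4},
    "1024": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}
```

The point of keeping one file per (E, N, device, dtype) combination is that the best tile shape shifts with expert count, intermediate size, precision, and SM layout, which is why each new SKU keeps needing a new table.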
The newer GB200 FP8 tables you’re looking at (E=512,N=128,...) were produced with the same profiling workflow, just targeting the Blackwell SM layout and FP8 balance – that’s why they use more aggressive BLOCK_SIZE_N and higher num_stages. --- **Runtime Optimizations** | Feature / Area | Commits (Date) | Optimization Highlights | Perf Notes | | --------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | | Qwen2/2.5-VL vision attention rearrange | 5c2acb27 (2025-10-18) | Dropped redundant .contiguous() after rearrange so Q/K/V stay view-backed before rotary embeddings (qwen2_vl.py, qwen2_5_vl.py). | Not reported; removes extra tensor copies. | | Qwen3-next gated RMSNorm | 82e64c7 (2025-10-12) | Reworked Triton LayerNorm to tile multiple rows, cache SM count, and keep ops in fp32 while avoiding host transfers; added coverage tests (layernorm_guard.py). | No metric; fewer kernel launches and better occupancy. | | Qwen3-next MTP bool-mask handling | 785d8b6 (2025-10-16) | Replaced boolean tensor indexing with index_select/index_copy in qwen3_next.py and GDN attention utils to eliminate device→host copies during multi-token prediction. | Not stated; expected higher MTP throughput. | | Qwen3-VL fast_pos_embed_interpolate | 30d0891 (2025-09-21); af7dfb0 (2025-09-22); a6049be7 (2025-10-12) | Successive vectorizations of the bilinear weight/index build, switching to meshgrids, tensor stacking, and summed reductions instead of Python loops in qwen3_vl.py. | No metric; reduces Python overhead for large image/video grids. | | Qwen3-VL multimodal tensor prep | 0426e3c5 & 2c1c7dfb (2025-10-09) | _validate_and_reshape_mm_tensor now reshapes instead of concatenating lists; cu_seqlens is built via torch.cat to avoid an extra padding kernel. | Not reported; fewer allocations/kernels. | | Qwen3-VL deepstack splitting | 1dfea5f4 (2025-09-19) | Caches vision dims (visual_dim, multiscale_dim) and removes repeated .contiguous() when splitting deepstack embeddings, trimming bookkeeping in qwen3_vl.py/qwen3_vl_moe.py. | No metric; lighter host-side work. | | Qwen3-VL interleaved MRoPE | cea91a32 (2025-09-19) | Added Triton kernel that handles interleaved temporal/spatial rotary sections, keeping rotations GPU-bound (mrope.py). | Not reported; avoids slow native fallback. | | Qwen3Moe fused output | 4f510bc2 (2025-08-20) | Lets experts perform the tensor-parallel all-reduce internally (reduce_results=True), removing a redundant explicit all-reduce in qwen3_moe.py. | Not stated; less collective traffic. | | Qwen3-next FP8 checkpoints | ef7eefe1 (2025-09-18) | Split merged linear projection into FP8-compatible pieces (in_proj_qkvz/in_proj_ba) so blockwise FP8 weights load correctly in qwen3_next.py. | No metric; enables FP8 inference path. | | Qwen3 FP8 accuracy guard | a258ad8b (2025-08-17) | Adjusted FP8 quantization helpers to address accuracy drift for Qwen3 MoE (quantization/fp8.py). | Not reported; quality fix to keep FP8 speed viable. | | Qwen3 fused RMSNorm | f80ae5bd (2025-05-07) | Swapped attention Q/K norm calls to the fused RMSNorm custom op across Qwen3 dense and MoE models. | No metric; fewer RMSNorm kernels. 
| | Qwen3 reasoning parser | 015069b0 (2025-05-01) | Streamlined reasoning text extraction to avoid repeated string splits in qwen3_reasoning_parser.py. | Not measured; faster host parsing. | | Qwen2.5-VL rotary/window pipeline | 67da5720 (2025-05-16); e283976f (2025-09-09) | Recomputed rotary embeddings per (t,h,w) on device, pre-built window indices, and replaced torch.argsort with an O(n) inverse permutation in qwen2_5_vl.py. | No metrics; eliminates repeated cudaMemcpy and sorting. | | Qwen2/2.5-VL CUDA sync avoidance | 6772bb0f (2025-08-13); 60f0843e (2025-09-08) | Convert grid products to Python lists before splitting embeddings to avoid implicit GPU syncs in qwen2_vl.py/qwen2_5_vl.py. | Not reported; prevents blocking host waits. | | Qwen2.5-VL normalization & SDPA | 02ed8a1f (2025-02-13); cbc8457b (2025-08-07) | Adopted shared RMSNorm, rewrote SDPA path to operate per-window to cut VRAM, and routed to fused RMSNorm in qwen2_5_vl.py. | No metric; more efficient norm/attention. | | Qwen2/2.5-VL init & DP | 1298c677 (2025-08-19); 2c5302fa (2025-06-21); d49adea1 (2025-06-18) | Enabled vision-tower data parallel mode, cached profiling inputs, and switched to fast HF processors to shrink startup latency. | Not reported; faster init and lower memory. | | Qwen2/2.5-VL attention masks | 47c7126 (2025-03-21) | Restored attention mask precomputation for mixed window/full attention paths. | No metrics; reduces per-forward recompute. | | Qwen2-VL cudaMemcpy reduction | 70b808fe (2025-03-11) | Builds seqlens/window metadata on CPU and reuses them to limit GPU copies during vision attention. | Not reported; lowers cudaMemcpyAsync volume. | | Qwen2 FP8 KV cache | da971ec7 (2024-06-19) | Added FP8 KV-cache storage option in qwen2.py to shrink cache footprint/bandwidth. | No metrics; expected memory savings. | | Qwen2 pipeline parallel | 1d2e7fb7 (2024-08-01) | Implemented stage splits and related plumbing so Qwen2/Qwen2Moe can pipeline across GPUs. | No metric; enables scale-out throughput. | | Qwen LoRA punica specialization | 1f567421 (2024-06-21); 8435b207 (2024-05-17) | Extended Punica BGMV configs for Qwen2 and Qwen1.5 LoRA ranks to avoid generic kernels. | No metrics; better LoRA throughput. 
**Kernel Config Tuning (Fused MoE / FP8)**

|Commits (Date)|Model / HW Target|Optimization|Perf Metrics|
|---|---|---|---|
|4d0f2661 (2025-10-20)|Qwen3-30B A3/A3B on H100 (FP8 & BF16)|Added tuned Triton fused_moe configs (block shape [128,128]) for both dtypes.|Not reported.|
|f96bc364 (2025-10-15)|Qwen3-Next FP8 on H100 TP=2|Introduced TP2-specific FP8 fused_moe config.|Not reported.|
|238c4c17 (2025-09-15); 7533495 (2025-09-06); 482e52f (2025-09-04); 62f66be (2025-09-07)|Qwen3 Next / Thinking / Coder across NVIDIA GB200, B200, H200, H100|New and corrected fused_moe tuning tables for multiple expert counts; fixed Qwen3-coder block shapes and benchmarking coverage.|Not reported.|
|8ed01e32 (2025-07-25); 04ff4be3 (2025-07-28)|Qwen3-Coder-480B-A35B on NVIDIA H20-3e|Added FP8 fused_moe configs for large coder variant.|Not reported.|
|2d40665 (2025-06-11)|Qwen3-30B A3B on NVIDIA B200|Introduced B200-specific FP8 fused_moe config.|Not reported.|
|22c3c0aa (2025-06-11); 989dcee (2025-06-08); 597051e (2025-05-08)|Qwen3-235B A22B on NVIDIA H20-3e & A100|Added FP8/BF16 fused_moe configs for new hardware (H20-3e, A100).|Not reported.|
|8fc88d63 (2025-04-28)|Qwen3 MoE (H100/H200/H20 targets)|Bulk addition of tuned fused_moe configs for E=128 across multiple expert counts; updated benchmarks README.|Not reported.|
|dcbac4cb (2025-04-28)|Qwen3 dense FP8|Adjusted linear layers so FP8 compatibility works with fused kernels.|Not reported.|
|2007d4d5 & f5a3c655 (2025-05-01)|Qwen3-30B A3B / Qwen3-235B A22B FP8 on AMD MI300X|Added ROCm-specific fused_moe configs for MI300X (block shape [128,128]).|Not reported.|
|bd439735 (2024-06-14)|Qwen2-57B-A14B (TP2/TP4)|Tuned fused_moe configs for A100/H100; benchmarks show +18–20% requests/s (10.53→12.47 @TP2, 17.77→20.20 @TP4) and +18–14% tokens/s.|Throughput gains published in commit message.|

Next steps (optional):

1. Run targeted throughput tests on the most recent Qwen3-next and Qwen3-VL changes to quantify wins that weren't reported.
2. For hardware configs, validate them on the intended devices (H100/B200/H20-3e/MI300X) to confirm stability before baking into release builds.

Model optimization methods:

- reduce host/device copies
- tune fused_moe configs so the Triton kernels are more efficient for a given E / N / device / quantization combination
- increase CPU/GPU overlap

---

## notes

Mingxing Zhang: different stages of an agent can have different business requirements, and different requirements call for different parallelism. value = function / cost. GPU utilization is already at 70–80%, so further gains within a single scenario are hard; the remaining headroom is across scenarios — let different functions use different cost so that overall value is maximized. When adapting Mooncake to vLLM and SGLang, the engineering approaches differ; each inference framework needs its own adaptation and optimization work.

---

## Model Optimization Summary

### Kernel

fused_moe/configs/xxx: per-combination (expert count E, N, device, quantization) config files guide how Triton generates the kernels, keeping SM utilization high while ensuring shared memory and registers do not overflow.

GDNAttention for Qwen3-Next

### Data Movement

---

Hybrid Qwen3-Next support in vLLM centers on a Gated DeltaNet (GDN) linear-attention stack plus shared MoE and multi-token prediction so the runtime matches the checkpoint's architecture while staying compatible with speculative decoding and parallelism features.
|Optimization|Match-to-model feature|Performance effect|Evidence/data|
|---|---|---|---|
|Gated DeltaNet linear-attention backend with fused gating kernels|Custom Qwen3NextGatedDeltaNet layer drives causal-conv + recurrent updates, registered as a native op and backed by a dedicated metadata builder to interop with speculative decoding and CUDA graphs|Converts most layers to linear-time recurrence while keeping full-attention layers where needed; fused gating reduces extra projections; speculative metadata avoids recompute during draft acceptance|Implementation in vllm/model_executor/models/qwen3_next.py:206, custom op + kernel at vllm/model_executor/models/qwen3_next.py:1266 and 1292, backend/metadata in vllm/v1/attention/backends/gdn_attn.py:22 and 61, runtime wiring for spec decode in vllm/v1/worker/gpu_model_runner.py:1374; no benchmark numbers checked into repo|
|Shared Fused MoE with expert parallel load balancing|Qwen3NextSparseMoeBlock wires shared experts, redundant experts, and EPLB-aware routing via SharedFusedMoE and runtime hooks for updating expert maps|Maintains Qwen3 Next's shared + routed expert mix while overlapping communication and balancing per-expert load across EP/SP, reducing straggler cost|Implementation in vllm/model_executor/models/qwen3_next.py:99; EPLB state management at vllm/model_executor/models/qwen3_next.py:1168; no MoE benchmark data present|
|NextN multi-token predictor (MTP) path|Qwen3NextMultiTokenPredictor reuses decoder layers for draft tokens and config overrides remap the HF config into the MTP runner|Enables draft-token speculation without loading a separate model, improving decode throughput when speculative decoding is enabled|MTP module in vllm/model_executor/models/qwen3_next_mtp.py:43; config remapping in vllm/config/speculative.py:202; no throughput figures published|
|Mamba-style state management for speculative decode|Inherits from MambaBase, provides dtype/shape calculators, and relaxes the default "no speculative" guard specifically for Qwen3 Next|Keeps convolution + SSM state caches in an efficient layout and allows speculative decoding, otherwise disallowed for generic Mamba models|State helpers in vllm/model_executor/models/qwen3_next.py:1218; speculative allowance in vllm/model_executor/layers/mamba/abstract.py:50; no direct measurements supplied|

No repository benchmarks quantify the above: benchmarks/ scripts cover generic MoE kernels but do not include recorded Qwen3-Next runs.

Next steps (optional):

1. Run benchmarks/kernels/benchmark_moe.py or a decode throughput script with your Qwen3-Next checkpoint to obtain empirical numbers.
2. Capture profiling traces to confirm the GDN layers hit the fused kernels.

After introducing PD disaggregation, the scheduler has a larger impact under EP×DP; adjusting the parallelism mode online may matter less than adjusting the scheduler so that the pattern each rank receives stays relatively fixed.
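To make that last point concrete, here is a minimal sketch (illustrative only, not tied to any engine's API) of a router that stabilizes the pattern each rank sees by bucketing requests on prompt length and pinning each bucket to a fixed subset of ranks; the bucket boundaries and the rank assignment are placeholder choices.

```python
from collections import defaultdict
from itertools import cycle

# Placeholder length buckets (prompt tokens) and a static bucket -> ranks assignment.
# Idea: each rank keeps receiving a narrow, stable slice of the workload, so a
# per-rank static configuration can stay near-optimal even as the global mix shifts.
BUCKETS = [(0, 1024), (1024, 8192), (8192, 1 << 30)]
BUCKET_RANKS = {0: [0, 1], 1: [2, 3, 4], 2: [5]}   # hypothetical 6-rank deployment

_rank_iters = {b: cycle(ranks) for b, ranks in BUCKET_RANKS.items()}

def bucket_of(prompt_len: int) -> int:
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= prompt_len < hi:
            return i
    return len(BUCKETS) - 1

def route(prompt_len: int) -> int:
    """Pick a rank: round-robin within the bucket's fixed rank set."""
    return next(_rank_iters[bucket_of(prompt_len)])

# Example: short prompts cycle over ranks {0, 1}; very long prompts always hit rank 5.
assignments = defaultdict(list)
for plen in [128, 300, 2048, 120000, 4096, 64]:
    assignments[route(plen)].append(plen)
print(dict(assignments))
```

Whether such static bucketing actually beats online parallelism switching is exactly the open question the note raises; the sketch only illustrates the "keep per-rank patterns stable via the scheduler" alternative.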