obsidian/projects/auto-tuner/Sync.md

## 0805

在 vLLM latest 上测试不同的 parallelism configuration，使用 (input_length, output_length) = (1024, 128), (128, 1024) 两种组合，使用 Qwen3-30B-A3B。

可以观察到：
- 输入输出的长度比例不同，确实会体现出并行模式的亲和性
- 即使 vLLM 支持了 DeepEP，小 EP 下的效果确实相比于不使用 EP 没什么提升，甚至可能变差。[perplexity blog](https://www.perplexity.ai/hub/blog/lower-latency-and-higher-throughput-with-multi-node-deepseek-deployment)
- 在长输入短输出的情况下，DP 开大之后，性能明显变差
- 在短输入长输出的情况下，PP 开大之后，性能明显变差
- vLLM 每次换一个 configuration 重启，即使是现在 30B 的模型，时间开销也在 6min 左右，如果需要 configuration 的动态切换，简单重启的方式可能开销过大

(input_length, output_length) = (128, 1024)

![[projects/auto-tuner/Sync.figs/260410-105227.png]]
![[projects/auto-tuner/Sync.figs/260410-105227-1.png]]

(input_length, output_length) = (1024, 128)

![[projects/auto-tuner/Sync.figs/260410-105227-2.png]]
![[projects/auto-tuner/Sync.figs/260410-105227-3.png]]


存在的问题：
- 大模型下的结论如何验证？
- vLLM latest 自带 bug，PP+DP 会炸掉


 Feedback
- 详细看一下 vLLM 对不同 parallelism 的实现，性能区别的原因
- 思考小模型如何迁移到大模型，结论如何 scale (hardest)
- 测试真实 trace 和 micro benchmark，比较区别和关联
- 控制变量，定性的给出更多结论
- 理论建模/实验分析一些 trade-off
	- e.g. EP 下大 EP 下 bubble 变小，但需要更大的通信量，需要更大的 batch size，降低了 latency。通信量和计算 bubble，throughput 和 latency


---
## 0812

1. 了解 vLLM 对不同 parallelism 的实现
	结论：vLLM 的很多实现完全看不出道理
	例如：
	- PP 在目前的实现上，维护 PP 个 virtual engine，每个调度一个 micro batch 塞给 executor 执行，即使不考虑不同 micro batch 之间的 balance 问题，也有大量 bubble。比如 PP=4，则每个 step:
		```
		| b0 | b1 | b2 | b3 | xx | xx | xx |
		| xx | b0 | b1 | b2 | b3 | xx | xx |
		| xx | xx | b0 | b1 | b2 | b3 | xx |
		| xx | xx | xx | b0 | b1 | b2 | b3 |
		```
	- DP 的 JSQ 里，使用的 cost 定义为一个 pair `(num_waiting, num_running)`，为什么不是 `num_waiting + num_running`
2. 不同 parallelism configuration 的 search 与控制变量的测试脚本与测试，尚未做数据分析 [suspended]
3. 对于大模型的不同 parallelism config 来说，没有那么多硬件来 profile，希望有一个 fidelity 好的 simulator，最简单的想法是对 kernel 做 profile，根据 profile 结果做 search => 发现现有工作 [Vidur](https://arxiv.org/pdf/2405.05465)。
	Vidur 现存的问题：没有考虑 EP；需要离线对一组 trace profile 出一组 config
	![[projects/auto-tuner/Sync.figs/260410-105227-4.png]]
4. 并行模式的新探索
	之前的视角是，线上的 workload 是多样的，我们可能需要使用不同的并行部署模式来分别服务。
	新的视角是，百炼平台上的 model 也是多样的，多样化的模型（Qwen-30b/235b/480b, Wanx, DeepSeek-671b, Kimi-K2, ...）、多样的硬件（A800, H800, H20, ...），是否存在可能将不同的模型的不同部分（attention/MoE/...），在不同硬件上做并行。这个视角下，我们对下将硬件抽象为不同的资源能力实现资源池的概念，对上为模型提供更多灵活组合的可能性。
	存在的问题：为什么非要混？在 PD/AF-disaggregation 的情况下，每个模型继续物理意义隔离有什么问题？
	共性：需要一个对 kernel/component 的 profiler，根据 profile 的结果，至少能够理论上计算得到 overlap 的机会，不同模块使用不同硬件的能够带来的理论提升空间。

Feedback
- 批流系统的 idea => LLM 计算图的编排
- 粒度：model -> layer -> component -> kernel


---
## 0819

### vLLM profile

1. Python sucks!
	`<built-in method __getitem__ of dict object at 0x7faf7c36a1c0>` 导致 GPU bubbles
	 ![[projects/auto-tuner/Sync.figs/260410-105227-5.png]]
	 ![[projects/auto-tuner/Sync.figs/260410-105227-6.png]]

2. 目前 vLLM 推理的 kernel 已经比较高度的 fused，每一层 layer 的 forward 大概只有一次 flash attention 的调用，和一个大的 FFN 的 CUDA graph 调用


### streaming system v.s. LLM inference system

[FlexFlow](https://arxiv.org/pdf/1807.05358)
- DNN 训练场景的 parallelism search with space: Sample, Operation, Attribute, and Parameter.
- uses a MCMC search algorithm that proposes a new parallelization strategy by changing the parallelization configuration of a single operation in the previous strategy

[APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving](https://arxiv.org/pdf/2411.17651)
- 抽象出 transformer IR
- 相比 Vidur 支持 EP
- 三个组件：IR Converter and Transformer IR, Parallel Templates and Parallel Schemes Generator, Device Mapper


关于计算图排布，可以看到大家目前在 search 时采用的方式都会采用 template，现在不同的 parallelism 本质也是一种专家发现的 template，用于减少 search space。


|     | streaming system                                                                                | LLM inference system                                       |
| --- | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
|     | 持续更新计算结果，需要保存中间状态 [1]                                                                           | 需要保存 KVCache                                               |
|     | 重建中间状态需要重新处理整个 stream history [1]                                                               | 如果丢失需要重算 KVCache                                           |
|     | dataflow streaming mode: V 表示 operator，E 表示 data stream，E 中有 forward, broadcast, shuffle, keyBy | V 表示 kernel，E 表示 communication，那么 E 同样有 forward, broadcast |
|     | 通过 top-down 将整个计算图 split，得到 operator fusion [2]                                                 | pre-compiled cuda graph 也是 operator fusion                 |
|     |                                                                                                 |                                                            |
|     | 根据 workload 动态调整计算图 [3]                                                                         | 根据 workload/输入 batch 的 pattern 可能也需要动态调整计算图，如何实现？          |
|     | 计算图是固定的，更多考虑的是如何 fuse，如何分区                                                                      | 计算图本身不固定，可选的算子是多样的，并行模式是多样的，输入的 pattern 是多样的               |
|     | 通信是一个相对更简单的 dataflow，可以明确划分出 layers，通信只在 layers 之间进行                                            | 通信的模式更多样，layer 内的 AR, A2A，layer 之间的通信                      |
|     |                                                                                                 |                                                            |
1) [A Survey on the Evolution of Stream Processing Systems](https://arxiv.org/pdf/2008.00842)
2) [COLA: Optimizing Stream Processing Applications Via Graph Partitioning](https://dl.ifip.org/db/conf/middleware/middleware2009/KhandekarHPRWWAG09.pdf)
3) [Adaptive Distributed Partitioning in Apache Flink](https://datalab.csd.auth.gr/static/people/gounaris/2020SMDB.pdf)
4) [StreamScope: Continuous Reliable Distributed Processing of Big Data Streams](https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-lin-wei.pdf)
5) [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://www.usenix.org/system/files/osdi22-zheng-lianmin.pdf)


问题：
- 计算图的自动编排与基于模拟器的 search 有什么区别
- 如何选择合适的 search space，例如 micro batch 拆分之类的如何考虑
- 输入的高度 variety，不确定在拆分到不同长度的 batch 组合时是否会具有相同的计算性质

### Thinking

本质来说，我们的工作是去提出更好的抽象从而做更好的 search？

LLM structure（如 transformer）作为一个程序的话，GPU kernel 就是一种 asm instruction。我们要做的事情是：
- choose the best kernel (asm instruction) with lowest cost
- run kernel (asm instruction) with best overlap
Input: LLM structure, all usable kernels
Output: LLM compute flow

### Feedback

- trace vLLM CUDA compute flow => native (Rust) framework


---
## 0826

工程实现与学习，跑通 demo 可以将 triton -> ptx -> Rust load and launch CUDA kernel

vLLM 中大多数为 triton/flashinfer 等 lib，都支持 jit，可以使用 jit_module 进行 build & dump

如何通过 vLLM 做 inference 的计算图，duplicate 一份在一个 native 的 Rust framework 上做？

挑战：
- 如何捕获一个计算图的 kernel flow 和 data flow
- 对于不同的硬件、应对不同 input 的不同 size 的 kernel，如何自动化


Feedback

(硬件，model，workload) ---人工优化---> setup
总结 search parallelism 的 work
一个 model 的不同量化版本之间的优化方案是否通用？
matrix 384 的优化能不能搜出来


---
## 0922

1. 基于 Rust+candle 跑通了最基本的 LLM 推理，但是存在精度问题（decode 长了之后吐词会崩溃），即使是使用 candle-example 的官方示例也是如此

现有框架的问题：
- 缺少统一的抽象，适配新算子、新模型都需要大量的人力成本
- 分布式的推理优化不能提供统一指导（分布式拆分、不同硬件之间的任务分发、不同并行模式）
- 冷启动慢：PyTorch、CUDA graph
- Python 自带的问题：信息的缺失引入的二次开发的困难


实现 native 框架的研究挑战：
- 对现有大模型推理算子图的提取，一比一的复刻
- 如何给一个更好的统一的抽象
- 如何自动化切分任务适配不同硬件、如何自动化并行

工程挑战：
- Rust 下适配各类通信库可能缺乏生态
- 不同架构下可能存在的数据精度等细节问题


---
## Ongoing

### vLLM parallelism 实现细节

vLLM handles PP bubbles through several key mechanisms:

  1. Virtual Engines: Uses virtual engine IDs to manage micro-batches and enable overlapping computations across pipeline stages
  2. Efficient Communication:
    - Uses IntermediateTensors to pass only necessary data between stages
    - Implements send_tensor_dict/recv_tensor_dict for optimized communication
    - Overlaps communication with computation to reduce idle time
  3. Pipeline Coordination:
    - Carefully manages first/last pipeline stages differently
    - Initializes pipeline groups for optimal communication patterns
    - Scheduler handles token passing between stages to prevent round-trips
  4. Key Files:
    - vllm/distributed/parallel_state.py: Core pipeline group coordination
    - vllm/worker/worker_base.py: Pipeline execution flow implementation
    - vllm/worker/model_runner.py: Pipeline-aware model execution
    - vllm/v1/core/sched/scheduler.py: Pipeline-parallel scheduling logic

  The system minimizes bubbles through smart scheduling, efficient tensor communication, and overlapping computation/communication, though there's still a TODO for supporting overlapping micro-batches.

vLLM-v0

self.scheduler 维护了 virtual engines 的 scheduler 列表，为什么新请求能被加到 min_cost_scheduler 而不是第一个，然后按顺序做？
> By having equal cache engines that are separate we create the conditions for the pipeline stages to be even (especially if we enable chunked prefill) provided the cost function for adding requests to schedulers is defined appropriately.

PP 的 scheduler 和 executor 是独立的两个东西。有 pp 个 virtual engines，每个有一个 scheduler，新请求向 cost 最小的 scheduler 添加。AsyncLLM 每一次 step 时，会对每个 virtual engine 调用 schedule，得到即将执行的请求喂给整个 AsyncLLM 维护的唯一一个 executor，也就是说这一个 executor 会在一次 step 中做 pp 次 execute_model，从而保证一次 step 确实有 token 吐出来。这个角度来说，min_cost_scheduler 的含义十分不明。
![[projects/auto-tuner/Sync.figs/260410-105227-7.png]]

vLLM-v1

![[projects/auto-tuner/Sync.figs/260410-105227-8.png]]

https://github.com/vllm-project/vllm/issues/4461
https://github.com/vllm-project/vllm/issues/11945


### 测试规划

```python
from tabulate import tabulate

lst = [
    # 4 cards
    "1 4 1 800",
    "2 2 1 800",
    "4 1 1 800",
    "2 1 2 800",
    # 8 cards
    "1 8 1 1600",
    "2 4 1 1600",
    "4 2 1 1600",
    "8 1 1 1600",
    "4 1 2 1600",
    "2 1 4 1600",

    # illegal config
    # "1 2 2 800",
    # "2 2 2 1600",
    # "1 4 2 1600",
    # "1 2 4 1600",
    # "1 1 8 1600",
]

headers = ["IO", "TP", "DP", "PP", "EP", "qps", "num_requests"]

data = []
for input_len, output_len in [(1024, 128), (128, 1024)]:
    for enable_ep in [True, False]:
        for cfg in lst:
            tp, dp, pp, num_requests = map(int, cfg.split())
            num_requests = num_requests // 800 * 1000
            ep = tp * dp * pp

            if (input_len, output_len) == (1024, 128):
                if ep == 4:
                    qps_list = [10, 11, 12, 13, 14, 15, 16]
                if ep == 8:
                    qps_list = [20, 22, 24, 26, 28, 30, 32]
            if (input_len, output_len) == (128, 1024):
                if ep == 4:
                    qps_list = [2, 3, 4, 5, 6, 7, 8]
                if ep == 8:
                    qps_list = [4, 6, 8, 10, 12, 14, 16]

            for qps in qps_list:
                data.append([f"{input_len}+{output_len}", tp, dp, pp, ep if enable_ep else 0, qps, num_requests])


print(tabulate(data, headers=headers, tablefmt="github"))
```

 TP=2 DP=1 PP=4 EP=8 下每次启动两组测试后就会挂掉
 TP=2 DP=1 PP=4 EP=0 同样的问题


测试脚本的问题：
Popen 启动的 vllm server 会缓冲区阻塞，导致无法继续执行
方法 1:
```python
server_process = subprocess.Popen(
	["./run.sh"],
	stdout=subprocess.DEVNULL,
	stderr=subprocess.DEVNULL,
)
```
方法 2:
```python
with open("server.log", "w") as log_file:
    server_process = subprocess.Popen(
        ["./run.sh"],
        stdout=log_file,
        stderr=log_file
    )
```


### TBD

[Vidur: A Large-Scale Simulation Framework For LLM Inference](https://arxiv.org/abs/2405.05465)
[Llumnix: Dynamic Scheduling for Large Language Model Serving](https://arxiv.org/abs/2406.03243)
[Frontier: Simulating the Next Generation of LLM Inference Systems](https://arxiv.org/abs/2508.03148)
SimAI 也有在支持对 inference 的 simulator，差异化是什么


---
> Tao: 主要现在没有一个能驱动资源调度，同时考虑不同配置的请求调度的框架


- nanoflow
- fork in the road
- 咨询涛老师，线上扩缩容实际的 bottleneck


1. 快速扩容的工业界落地（先明确百炼是否需要快速扩容，有数据吗？）
	参考 paper fork in the road: https://www.usenix.org/system/files/osdi25-chai-xiaohu.pdf

	目标：兼容主流推理框架（如vLLM）的基础上实现极速扩容
	目前已知的技术点：
	- kvcache初始化加速：docker pause & unpause
	- cuda graph 初始化加速：Medusa: Accelerating Serverless LLM Inference with Materialization.
	- 参数加载加速：BlitzScale

2. 分析并行模式
	- 首先我们得先回答，一个系统的一种并行模式是否是最优的？这里可以参考：nanoflow https://arxiv.org/pdf/2408.12757v2
		- nanoflow 没有考虑的点：inter-node 的优化、EP 下的复杂通信状况、KVCache 引起的复杂 context 环境
	- 在这个基础上，我们是否可以发现，1）现有框架是否做的不够好； 2）在最优性能下，是否会改变之前的一些比较？

3. 使用小模型的 GPU bubble 运行部分大模型
	AlpaServe: https://www.usenix.org/system/files/osdi23-li-zhuohan.pdf
	为什么不直接在一张卡上运行多个小模型，实现大小模型的物理隔离，还能减少通信量
	![[projects/auto-tuner/Sync.figs/260410-105227-9.png]]


### 补充信息

30B 模型，load weights：15s (cached)，torch compile：180s，Dynamo bytecode transform 30s


### 0820 ~

> Candle's core goal is to _make serverless inference possible_. Full machine learning frameworks like PyTorch are very large, which makes creating instances on a cluster slow. Candle allows deployment of lightweight binaries.

native framework 的优势：可能还能优化冷启动时间