[fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
# TL;DR

Because MoE models activate experts sparsely and edge devices typically have very limited GPU memory, a common approach is to load only the currently activated experts onto the GPU on demand. This on-demand loading adds significant inference latency, and a line of work mitigates it with expert prefetching.

This work proposes a fine-grained expert prefetching and caching strategy. It introduces a new data structure, the expert map, to record the expert activation patterns of historical requests. For a new request, an expert similarity match based on both **semantic** and **trajectory** information is run against the expert map store, and experts are prefetched according to the matched historical records.

![[250629-172314.png]]
# Motivation

Existing expert offloading schemes cannot achieve an optimal latency-memory trade-off.

![[250629-222203.png]]

- Most MoE-based LLMs adopt a decoder-only architecture; compared with encoder-decoder LLMs, their expert activation patterns are more uniform and expert access is less skewed.
- MoE LLMs are trained with a distinctive load-balancing loss that forces the gating network to distribute tokens evenly across the experts within each MoE layer, so that no expert sits idle during training. This balanced routing weakens the predictability of expert activation patterns and makes existing solutions less effective.

Coarse-grained expert offloading leads to:

- low expert hit rates
- overlooking the diversity of MoE models and prompts

![[250629-221659.png]]

The three core questions this work therefore addresses:

- How to maximize expert hit rate when prefetching and offloading experts?
- How to adapt to different MoE models and prompts?
- How to avoid additional system overheads when managing experts?
# Design

1. expert map information collection

![[250629-153138.png]]
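
Based on the description above, a minimal sketch of what one expert map record might hold: a semantic embedding of the request plus the per-layer gate probabilities over the $J$ experts. The class and field names (`ExpertMap`, `sem`, `probs`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative expert map record; field names and shapes are assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class ExpertMap:
    sem: np.ndarray    # semantic embedding of the request, shape (H,)
    probs: np.ndarray  # gate probabilities per layer, shape (L, J)


def new_expert_map(sem: np.ndarray, num_layers: int, num_experts: int) -> ExpertMap:
    """Create an empty map; probabilities are filled in layer by layer as the
    gating outputs of the request become available."""
    return ExpertMap(sem=sem, probs=np.zeros((num_layers, num_experts)))


def record_layer(m: ExpertMap, layer: int, gate_probs: np.ndarray) -> None:
    """Store the gating distribution over the J experts produced at `layer`."""
    m.probs[layer] = gate_probs
```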
2. expert map search

![[250629-161518.png]]

- Semantic-based expert map search

$$\text{score}_{x, y}^{\text{sem}} = \frac{\text{sem}_x^{\text{new}} \cdot \text{sem}_y}{\|\text{sem}_x^{\text{new}}\| \cdot \|\text{sem}_y\|}$$

where $\text{sem} \in \mathbb{R}^{1 \times H}$ is the request's semantic embedding; $x$ denotes the new request and $y$ a stored expert map.

- Trajectory-based expert map search

$$\text{score}_{x, y}^{\text{map}} = \frac{\text{map}_x^{\text{new}} \cdot \text{map}_y}{\|\text{map}_x^{\text{new}}\| \cdot \|\text{map}_y\|}$$

When computing layer $l$, $\text{map} \in \mathbb{R}^{1 \times (l - 1)J}$, where $J$ is the number of experts per layer; a sketch of both similarity scores follows below.
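
A minimal numpy sketch of the two scores above, assuming the semantic embedding and per-layer gate probabilities are stored as in the `ExpertMap` sketch earlier; the small epsilon is only a guard against division by zero and is not part of the paper's formulas.

```python
import numpy as np


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity, as used by both score formulas above.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def semantic_score(sem_x: np.ndarray, sem_y: np.ndarray) -> float:
    """score^sem: compare the semantic embedding of the new request x with
    that of a stored expert map y (both of shape (H,))."""
    return _cosine(sem_x, sem_y)


def trajectory_score(probs_x: np.ndarray, probs_y: np.ndarray, l: int) -> float:
    """score^map at layer l: compare the flattened activation trajectories of
    the first l-1 layers, i.e. vectors of length (l-1)*J."""
    return _cosine(probs_x[: l - 1].ravel(), probs_y[: l - 1].ravel())
```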
This work also analyzes the correlation between similarity and expert cache hit rate:

![[250629-163825.png]]

3. expert prefetch

$$\delta_l = \text{Clip}(1 - \text{score}, 0, 1)$$

where $\text{score}$ is the similarity score of the matched expert map. Select the Top-N experts whose probabilities $p_{l, j}$ sum to at least $\delta_l$:

$$\sum_{E_{l, j} \in E_{\text{prefetch}}} p_{l, j} \geq \delta_l$$

subject to $K \leq N \leq J$; a selection sketch follows below.
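
A sketch of the Top-N selection described above; `p_l` is assumed to hold the probabilities of the $J$ experts at layer $l$ taken from the matched expert map, and the function name is an assumption. Intuitively, a high match score shrinks $\delta_l$ so only about the top-$K$ experts are prefetched, while a poor match pushes $\delta_l$ toward 1 and prefetches more experts defensively.

```python
import numpy as np


def select_prefetch_experts(p_l: np.ndarray, score: float, K: int) -> np.ndarray:
    """Return the indices of the smallest Top-N expert set (K <= N <= J) whose
    probabilities sum to at least delta_l = Clip(1 - score, 0, 1)."""
    delta_l = float(np.clip(1.0 - score, 0.0, 1.0))
    order = np.argsort(p_l)[::-1]                       # experts by probability, descending
    cumulative = np.cumsum(p_l[order])
    n = int(np.searchsorted(cumulative, delta_l)) + 1   # smallest N whose mass covers delta_l
    n = min(max(n, K), len(p_l))                        # enforce K <= N <= J
    return order[:n]
```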
4. expert map storage deduplication

For each $x$ in the new batch and each $y$ currently in the expert map storage, compute the redundancy (here $d$ is the prefetch distance and $L$ the number of MoE layers):

$$\text{RDY}_{x, y} = \frac{d}{L} \text{score}^{\text{sem}}_{x, y} + \frac{L - d}{L} \text{score}^{\text{map}}_{x, y}$$

Expert map deduplication can be formalized as a Minimum Sphere Covering problem.

Existing research shows that maintaining at least $2LJ$ expert maps guarantees a new input finds an expert map with at least 75% similarity, and $\frac{1}{2} LJ\ln(LJ)$ maps raise this to 98%. With $L \leq 128$ and $J \leq 96$, no more than ~50K expert maps need to be maintained, about 200MB of CPU memory. A sketch of the redundancy computation follows below.
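
A sketch of the redundancy check, reusing the `ExpertMap` fields and cosine helper from the earlier sketches. The keep-or-skip rule and the threshold value are assumptions for illustration only; the paper derives the actual storage bound from the minimum sphere covering formulation.

```python
import numpy as np


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def redundancy(x, y, d: int, L: int) -> float:
    """RDY_{x,y}: semantic and trajectory similarity mixed by the prefetch
    distance d relative to the total number of layers L."""
    score_sem = _cosine(x.sem, y.sem)
    score_map = _cosine(x.probs.ravel(), y.probs.ravel())
    return (d / L) * score_sem + ((L - d) / L) * score_map


def maybe_store(x, store: list, d: int, L: int, threshold: float = 0.9) -> None:
    """Illustrative dedup policy (threshold is an assumption): only keep the
    new expert map x if no stored map y is already redundant with it."""
    if all(redundancy(x, y, d, L) < threshold for y in store):
        store.append(x)
```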
5. expert cache and eviction

- expert prefetching priority

For each $E_{l, j} \in E_{\text{prefetch}}$, where $l_{\text{now}}$ is the layer currently being computed:

$$\text{Pri}_{l, j}^{\text{prefetch}} = \frac{p_{l, j}}{l - l_{\text{now}}}$$

- expert eviction priority

$$\text{Pri}_{l, j}^{\text{evict}} = \frac{1}{p_{l, j} \cdot \text{freq}_{l, j}}$$

Eviction does not follow LRU, because LRU conflicts with the inherently forward, layer-by-layer order of computation. A sketch of both priorities follows below.
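
A sketch of the two priorities above, assuming `p_lj` is the predicted probability of expert $E_{l,j}$ and `freq_lj` its historical activation frequency; function names and the zero-guard epsilon are assumptions. Prefetching favors likely experts in layers just ahead of the current one, while eviction removes cached experts that are both unlikely to be used next and historically cold.

```python
def prefetch_priority(p_lj: float, l: int, l_now: int) -> float:
    """Higher for probable experts in layers close to the layer l_now being computed."""
    return p_lj / (l - l_now)


def evict_priority(p_lj: float, freq_lj: float) -> float:
    """Higher for cached experts that are unlikely and rarely activated."""
    return 1.0 / (p_lj * freq_lj + 1e-12)


def pick_victim(cached: dict) -> tuple:
    """cached maps (layer, expert) -> (p_lj, freq_lj); evict the entry with
    the highest eviction priority."""
    return max(cached, key=lambda key: evict_priority(*cached[key]))
```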
# Evaluation

Testbed

- 6 × RTX 3090 with 24GB GPU memory
- PCIe 4.0 with 32GB/s

Models

- Mixtral 8x7B
- Qwen1.5-MoE
- Phi3.5-MoE

Traces

- LMSYS-Chat-1M
- ShareGPT
- Online serving: Microsoft Azure

Baselines

1. MoE-Infinity
2. ProMoE
3. Mixtral-Offloading
4. DeepSpeed-Inference

- Overall performance results

![[250629-173919.png]]

Compared with DeepSpeed-Inference, Mixtral-Offloading, ProMoE, and MoE-Infinity, fMoE reduces average TTFT by 44%, 35%, 33%, and 30%, reduces average TPOT by 70%, 61%, 55%, and 48%, and improves the average expert hit rate by 147%, 11%, 34%, and 63%, respectively.
- Online serving results

![[250629-174619.png]]

- Ablation study:
    - speculate: Mixtral-Offloading & ProMoE
    - hit count: MoE-Infinity
    - Map (T): only trajectory
    - Map (T+S): both trajectory and semantic
    - Map (T+S+$\delta$): all features

![[250629-175351.png]]

- Sensitivity: prefetch distance (d)

When the prefetch distance is small (< 3), fMoE cannot fully hide its system delays, such as map matching and expert prefetching, behind inference, which increases inference latency. With larger prefetch distances (> 3), fMoE's expert hit rate worsens, which also degrades performance. The prefetch distance d is therefore set to 3 in the evaluation.

![[250629-175536.png]]

- Overhead analysis: expert prefetching, map matching, and map update run asynchronously in the background; the actual impact on end-to-end latency is under 30ms (about 5% of each iteration).

![[250629-175858.png]]