[fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
# TL;DR
Because of MoE's sparse activation pattern, and because GPU memory on end devices is usually tight, a common approach is to load only the currently activated experts onto the GPU; this on-demand loading introduces substantial inference latency, and a line of work mitigates it through expert prefetching.
This work proposes a fine-grained expert prefetching and caching strategy. It introduces a new data structure, the **expert map**, to store the expert activation patterns of historical requests. For a new request, an expert similarity match based on both **semantic** and **trajectory** information is performed against the historical records in the expert map store, and the matched map drives expert prefetching.
![[250629-172314.png]]
# Motivation
Existing expert offloading schemes cannot achieve the optimal latency-memory trade-off.
![[250629-222203.png]]
- Most MoE-based LLMs adopt a decoder-only architecture. Compared with encoder-decoder LLMs, their expert activation patterns are more uniform and expert access is less skewed.
- MoE LLMs are trained with a dedicated load-balancing loss that forces the gating network to spread tokens evenly across the experts within each MoE layer, so that no expert stays idle during training. This balanced routing weakens the predictability of expert activation patterns and makes existing solutions less effective.
Coarse-grained expert offloading leads to:
- low expert hit rate
- ignoring the diversity of MoE models and prompts
![[250629-221659.png]]
This work therefore addresses three core questions:
- How to maximize expert hit rate when prefetching and offloading experts?
- How to adapt to different MoE models and prompts?
- How to avoid additional system overheads when managing experts?
# Design
1. expert map information collection
![[250629-153138.png]]
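A minimal sketch of what a collected expert map could look like, assuming it records, per request, the gate's routing probabilities at every MoE layer plus a semantic embedding of the prompt; the class and field names below are illustrative, not the paper's implementation:
```python
import numpy as np

class ExpertMap:
    """Per-request record of expert activations (illustrative sketch).

    For a model with L MoE layers and J experts per layer, the map stores
    the routing probabilities observed at each layer, plus the prompt's
    semantic embedding used when matching future requests.
    """

    def __init__(self, num_layers: int, num_experts: int, hidden_dim: int):
        self.L, self.J = num_layers, num_experts
        # probs[l, j]: routing probability of expert j at layer l for this request
        self.probs = np.zeros((num_layers, num_experts), dtype=np.float32)
        # semantic embedding of the prompt, shape (H,)
        self.sem = np.zeros(hidden_dim, dtype=np.float32)

    def record_layer(self, layer: int, gate_probs: np.ndarray) -> None:
        """Store the gate's probability vector for one MoE layer."""
        self.probs[layer] = gate_probs

    def trajectory(self, upto_layer: int) -> np.ndarray:
        """Flatten layers 0..upto_layer-1 into a 1 x (upto_layer * J) vector."""
        return self.probs[:upto_layer].reshape(-1)
```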
2. expert map search
![[250629-161518.png]]
- Semantic-based expert map search
$$\text{score}_{x, y}^{\text{sem}} = \frac{\text{sem}_x^{\text{new}} \cdot \text{sem}_y^{\text{old}}}{||\text{sem}_x^{\text{new}}|| \cdot ||\text{sem}_y^{\text{old}}||}$$
where $x$ is the new request, $y$ is a stored expert map, and $\text{sem} \in \mathbb{R}^{1 \times H}$ is the semantic embedding.
- Trajectory-based expert map search
$$\text{score}_{x, y}^{\text{map}} = \frac{\text{map}_x^{\text{new}} \cdot \text{map}_y^{\text{old}}}{||\text{map}_x^{\text{new}}|| \cdot ||\text{map}_y^{\text{old}}||}$$
When computing layer $l$, $\text{map} \in \mathbb{R}^{1 \times (l - 1)J}$, where $J$ is the number of experts per layer.
The work analyzes the correlation between similarity and expert cache hit rate:
![[250629-163825.png]]
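A minimal sketch of the two similarity scores as cosine similarities, reusing the `ExpertMap` sketch above; how the two scores are combined into a single match decision (equal weighting here) is an assumption:
```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_expert_map(new_map: "ExpertMap", stored_maps: list, layer: int):
    """Find the stored expert map most similar to the new request (sketch)."""
    best_map, best_score = None, -1.0
    for old in stored_maps:
        # score^sem: compare the prompts' semantic embeddings
        sem_score = cosine(new_map.sem, old.sem)
        if layer > 1:
            # score^map: compare routing trajectories over the first l-1 layers
            map_score = cosine(new_map.trajectory(layer - 1), old.trajectory(layer - 1))
            score = 0.5 * (sem_score + map_score)   # equal weighting is an assumption
        else:
            score = sem_score                       # no trajectory yet before the first MoE layer
        if score > best_score:
            best_map, best_score = old, score
    return best_map, best_score
```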
3. expert prefetch
$$\delta_l = \text{Clip}(1 - \text{score}, 0, 1)$$
Select the Top-N experts such that the sum of their probabilities $p_{l, j}$ covers $\delta_l$:
$$\sum_{E_{l, j} \in E_{\text{prefetch}}} p_{l, j} \geq \delta_l$$
subject to $K \leq N \leq J$.
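A minimal sketch of this selection rule: the matched map's layer-$l$ probabilities are sorted in descending order and accumulated until both constraints hold; function and parameter names are illustrative:
```python
import numpy as np

def select_prefetch_experts(p_l: np.ndarray, score: float, K: int) -> list[int]:
    """Pick the experts of layer l to prefetch from the matched map (sketch).

    p_l   : matched map's probabilities for layer l, shape (J,)
    score : similarity score of the matched expert map
    K     : number of experts the gate actually activates per token
    """
    delta_l = float(np.clip(1.0 - score, 0.0, 1.0))   # poor match -> prefetch more
    order = np.argsort(p_l)[::-1]                     # experts by probability, descending
    prefetch, cum = [], 0.0
    for j in order:
        prefetch.append(int(j))
        cum += float(p_l[j])
        # stop once at least K experts are chosen and their mass covers delta_l
        if len(prefetch) >= K and cum >= delta_l:
            break
    return prefetch                                   # K <= N <= J by construction
```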
4. expert map storage deduplication
For each $x$ in the new batch and each $y$ currently in the expert map storage, compute the redundancy
$$\text{RDY}_{x, y} = \frac{d}{L} \text{score}^{\text{sem}}_{x, y} + \frac{L - d}{L} \text{score}^{\text{map}}_{x, y}$$
Expert map deduplication can be formalized as a Minimum Sphere Covering problem.
Prior results show that maintaining at least $2LJ$ expert maps guarantees a new input can find an expert map with at least 75% similarity, and $\frac{1}{2} LJ\ln(LJ)$ maps raise this to 98%. With $L \leq 128$ and $J \leq 96$, at most 50K expert maps need to be kept, about 200 MB of CPU memory.
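A minimal sketch of redundancy-based deduplication, reusing `cosine` and `ExpertMap` from the sketches above; the redundancy threshold and the drop-if-redundant policy are assumptions, since the paper itself only bounds how many maps must be kept:
```python
def redundancy(sem_score: float, map_score: float, d: int, L: int) -> float:
    """RDY_{x,y}: layer-weighted blend of semantic and trajectory similarity."""
    return (d / L) * sem_score + ((L - d) / L) * map_score

def maybe_store(new_map: "ExpertMap", store: list, d: int, L: int,
                threshold: float = 0.95, capacity: int = 50_000) -> None:
    """Add a finished request's expert map to the store unless it is redundant (sketch)."""
    for old in store:
        sem_s = cosine(new_map.sem, old.sem)
        map_s = cosine(new_map.trajectory(L), old.trajectory(L))
        if redundancy(sem_s, map_s, d, L) >= threshold:
            return                      # a sufficiently similar map is already stored
    if len(store) < capacity:
        store.append(new_map)
```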
5. expert cache and eviction
- expert prefetching priority
For each $E_{l, j} \in E_{\text{prefetch}}$:
$$\text{Pri}_{l, j}^{\text{prefetch}} = \frac{p_{l, j}}{l - l_{\text{now}}}$$
- expert eviction priority
$$\text{Pri}_{l, j}^{\text{evict}} = \frac{1}{p_{l, j} \cdot \text{freq}_{l, j}}$$
LRU is not used for eviction because it conflicts with the inherently sequential, layer-by-layer order of the forward pass.
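A minimal sketch of both priorities plus a victim-selection helper; the cache layout (a dict keyed by (layer, expert) holding the predicted probability and hit count) is an assumption for illustration:
```python
def prefetch_priority(p_lj: float, layer: int, current_layer: int) -> float:
    """Experts with high probability in layers close to the current one go first."""
    return p_lj / (layer - current_layer)

def eviction_priority(p_lj: float, freq_lj: int) -> float:
    """Higher value = evicted sooner: low-probability, rarely hit experts."""
    return 1.0 / (p_lj * freq_lj + 1e-8)

# cache maps (layer, expert) -> {"p": predicted probability, "freq": hit count};
# this layout is an assumption for illustration.
def pick_victim(cache: dict) -> tuple:
    """Choose the cached expert to evict when GPU memory runs out."""
    return max(cache, key=lambda k: eviction_priority(cache[k]["p"], cache[k]["freq"]))
```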
# Evaluation
Testbed
- 6 * RTX 3090 with 24GB GPU memory
- PCIe 4.0 with 32GB/s
Models
- Mixtral 8x7B
- Qwen1.5-MoE
- Phi3.5-MoE
Traces
- LMSYS-Chat-1M
- ShareGPT
- Online serving: Microsoft Azure
Baselines
1. MoE-Infinity
2. ProMoE
3. Mixtral-Offloading
4. DeepSpeed-Inference
- Overall performance results
![[250629-173919.png]]
Compared with DeepSpeed-Inference, Mixtral-Offloading, ProMoE, and MoE-Infinity, fMoE reduces average TTFT by 44%, 35%, 33%, and 30%, reduces average TPOT by 70%, 61%, 55%, and 48%, and improves the average expert hit rate by 147%, 11%, 34%, and 63%, respectively.
- Online serving results
![[250629-174619.png]]
- ablation study
- speculate: Mixtral-Offloading & ProMoE
- hit count: MoE Infinity
- Map (T): only trajectory
- Map (T+S): both trajectory and semantic
- Map (T+S+$\delta$): all features
![[250629-175351.png]]
- Sensitivity: prefetch distance (d)
When the prefetch distance is small (< 3), fMoE cannot fully hide its system overheads, such as map matching and expert prefetching, from the inference process, which increases inference latency. With larger prefetch distances (> 3), fMoE suffers lower expert hit rates, which also degrade performance. The prefetch distance d is therefore set to 3 in the evaluation.
![[250629-175536.png]]
- Overhead analysis: expert prefetching, map matching, and map updates run asynchronously in the background; the measured impact on end-to-end latency is < 30 ms (about 5% per iteration).
![[250629-175858.png]]