Gahow Wang a57afa86b4 Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00


fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving

TL;DR

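Because MoE models are sparsely activated and edge devices usually have very limited GPU memory, a common approach is to load only the currently activated experts onto the GPU. This on-demand loading adds substantial inference latency, and a line of work uses expert prefetching to mitigate it. This work proposes fine-grained expert prefetching and caching policies built on a new data structure, the expert map, which stores the expert activation patterns of historical requests. For a new request, an expert similarity match based on both semantics and trajectory searches the expert map store for matching historical records, and the matches drive expert prefetching.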

Motivation

Existing expert offloading solutions cannot achieve an optimal latency-memory trade-off.

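Most MoE-based LLMs adopt a decoder-only architecture. Compared with encoder-decoder LLMs, their expert activation patterns are more uniform and expert access is less skewed.
MoE LLMs are trained with a dedicated load-balancing loss that forces the gating network to balance the number of tokens assigned to each expert within the same MoE layer, so that no expert sits idle during training. This balanced routing weakens the predictability of expert activation patterns, which makes existing solutions less effective.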

Coarse-grained expert offloading leads to:

a low expert hit rate
ignoring the diversity of MoE models and prompts

This work therefore addresses three core questions:

How to maximize expert hit rate when prefetching and offloading experts?
How to adapt to different MoE models and prompts?
How to avoid additional system overheads when managing experts?

Design

Collecting information for the expert map
Expert map search

Semantic-based expert map search

$$\text{score}_{x, y}^{\text{sem}} = \frac{\text{sem}_x^{\text{new}} \cdot \text{sem}_y^{\text{old}}}{\|\text{sem}_x^{\text{new}}\| \cdot \|\text{sem}_y^{\text{old}}\|}$$

$\text{sem} \in \mathbb{R}^{1 \times H}$

Trajectory-based expert map search

$$\text{score}_{x, y}^{\text{map}} = \frac{\text{map}_x^{\text{new}} \cdot \text{map}_y^{\text{old}}}{\|\text{map}_x^{\text{new}}\| \cdot \|\text{map}_y^{\text{old}}\|}$$

When computing layer $l$, $\text{map} \in \mathbb{R}^{1 \times (l - 1)J}$, where $J$ is the number of experts per layer.
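Both search modes reduce to a cosine similarity between a vector from the new request and each stored vector. The block below is a minimal pure-Python sketch of that lookup, not the paper's implementation. The names `cosine_score` and `best_match` are hypothetical. Truncating each stored map to a prefix of length $(l - 1)J$ is an assumption about how a partially observed trajectory is compared, and so is returning 0.0 for an all-zero vector.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float],
    store: Sequence[Sequence[float]],
    prefix_len: Optional[int] = None,
) -> Tuple[int, float]:
    """Return (index, score) of the stored vector most similar to `query`.

    If `prefix_len` is given, each stored vector is truncated to that length
    first, e.g. (l - 1) * J entries for trajectory search at layer l.
    """
    if not store:
        raise ValueError("store is empty")
    best_idx, best_score = -1, -math.inf
    for idx, stored in enumerate(store):
        candidate = stored if prefix_len is None else stored[:prefix_len]
        score = cosine_score(query, candidate)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Trajectory search at layer l = 3 with J = 2 experts per layer (toy numbers).
J, l = 2, 3
stored_maps = [
    [0.9, 0.1, 0.2, 0.8, 0.5, 0.5],
    [0.1, 0.9, 0.7, 0.3, 0.4, 0.6],
]
observed = [0.8, 0.2, 0.3, 0.7]  # gate probabilities of layers 1 and 2
match_idx, match_score = best_match(observed, stored_maps, prefix_len=(l - 1) * J)
```

Semantic search uses the same `best_match` call without `prefix_len`, with the query and stored vectors both of length $H$.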

This work analyzes the correlation between the similarity score and the expert cache hit rate.

expert prefetch

$$\delta_l = \text{Clip}(1 - \text{score}, 0, 1)$$

Select the Top-N experts such that the sum of their probabilities $p_{l, j}$ is at least $\delta_l$:

$$\sum_{E_{l, j} \in E_{\text{prefetch}}} p_{l, j} \geq \delta_l$$

subject to $K \leq N \leq J$.
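The selection rule above can be sketched as a greedy pass over the experts of one layer in descending probability order. This is an illustration under assumptions rather than the paper's code: `select_prefetch` is a hypothetical name, `probs` stands for the $p_{l, j}$ values read from the matched expert map, and `k` is taken to be the lower bound $K$ on the number of prefetched experts.

```python
from typing import List, Sequence


def clip(value: float, low: float, high: float) -> float:
    """Clamp `value` into the closed interval [low, high]."""
    return max(low, min(high, value))


def select_prefetch(probs: Sequence[float], score: float, k: int) -> List[int]:
    """Choose expert indices to prefetch for one layer.

    probs: p_{l,j} for j = 0..J-1 (from the matched expert map).
    score: similarity score of the matched map.
    k:     lower bound K on the number of selected experts.
    """
    j_total = len(probs)
    if not 1 <= k <= j_total:
        raise ValueError("k must satisfy 1 <= k <= J")
    delta = clip(1.0 - score, 0.0, 1.0)
    # Visit experts from most to least probable.
    order = sorted(range(j_total), key=lambda j: probs[j], reverse=True)
    selected: List[int] = []
    cumulative = 0.0
    for j in order:
        # Stop once both constraints hold: N >= K and sum of p >= delta.
        if len(selected) >= k and cumulative >= delta:
            break
        selected.append(j)
        cumulative += probs[j]
    return selected  # N <= J holds because there are only J candidates


layer_probs = [0.5, 0.3, 0.15, 0.05]
confident = select_prefetch(layer_probs, score=0.95, k=2)  # small delta
uncertain = select_prefetch(layer_probs, score=0.10, k=2)  # large delta
```

A high similarity score gives a small $\delta_l$, so only the $K$ most probable experts are fetched. A low score widens the selection to hedge against a poor match.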

expert map storage deduplication

For each $x$ in a new batch and each $y$ already in the expert map storage, a redundancy score can be computed:

$$\text{RDY}_{x, y} = \frac{d}{L} \text{score}^{\text{sem}}_{x, y} + \frac{L - d}{L} \text{score}^{\text{map}}_{x, y}$$
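The redundancy score is a weighted average of the two similarity scores, with weights set by the prefetch distance $d$ and the layer count $L$. The sketch below evaluates it and picks the stored map most redundant with a new one. The function names are hypothetical, and using the most redundant entry as the replacement candidate is an assumption made here for illustration.

```python
from typing import Sequence, Tuple


def redundancy(score_sem: float, score_map: float, d: int, num_layers: int) -> float:
    """RDY = (d / L) * score_sem + ((L - d) / L) * score_map."""
    if num_layers <= 0 or not 0 <= d <= num_layers:
        raise ValueError("require L > 0 and 0 <= d <= L")
    weight_sem = d / num_layers
    return weight_sem * score_sem + (1.0 - weight_sem) * score_map


def most_redundant(
    scores: Sequence[Tuple[float, float]], d: int, num_layers: int
) -> Tuple[int, float]:
    """Given (score_sem, score_map) pairs of one new map against every stored
    map, return (index, RDY) of the stored map with the highest redundancy."""
    if not scores:
        raise ValueError("scores is empty")
    rdy = [redundancy(s, m, d, num_layers) for s, m in scores]
    idx = max(range(len(rdy)), key=lambda i: rdy[i])
    return idx, rdy[idx]


# Toy example: L = 32 layers, prefetch distance d = 3.
example_rdy = redundancy(score_sem=0.8, score_map=0.6, d=3, num_layers=32)
victim_idx, victim_rdy = most_redundant(
    [(0.8, 0.6), (0.2, 0.9), (0.9, 0.1)], d=3, num_layers=32
)
```

Because $d$ is much smaller than $L$ in this example, the trajectory score dominates the result.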

Expert map deduplication can be formalized as a Minimum Sphere Covering problem.

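Existing studies show that maintaining at least $2LJ$ expert maps guarantees that a new input finds a stored expert map with at least 75% similarity, and that $\frac{1}{2} LJ\ln(LJ)$ expert maps raise this to 98%. When $L \leq 128$ and $J \leq 96$, it suffices to maintain no more than 50K expert maps, which takes about 200MB of CPU memory.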

expert cache and eviction

Expert prefetching priority, for each $E_{l, j} \in E_{\text{prefetch}}$:

$$\text{Pri}_{l, j}^{\text{prefetch}} = \frac{p_{l, j}}{l - l_{\text{now}}}$$

expert eviction priority

$$\text{Pri}_{l, j}^{\text{evict}} = \frac{1}{p_{l, j} \cdot \text{freq}_{l, j}}$$

Eviction does not consider LRU, because LRU conflicts with the inherently sequential, layer-by-layer order of forward computation.
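The two priorities above can be tried out with a small sketch. It assumes each expert is keyed by a `(layer, index)` tuple. The `eps` floor that guards against division by zero and all function names are assumptions introduced for illustration, not part of the paper.

```python
from typing import Dict, List, Tuple

ExpertId = Tuple[int, int]  # (layer l, expert index j)


def prefetch_priority(p: float, layer: int, layer_now: int) -> float:
    """Pri_prefetch = p / (l - l_now); closer layers and likelier experts first."""
    if layer <= layer_now:
        raise ValueError("can only prefetch for layers after the current one")
    return p / (layer - layer_now)


def eviction_priority(p: float, freq: float, eps: float = 1e-9) -> float:
    """Pri_evict = 1 / (p * freq); `eps` is an assumed guard against p * freq == 0."""
    return 1.0 / max(p * freq, eps)


def prefetch_order(candidates: Dict[ExpertId, float], layer_now: int) -> List[ExpertId]:
    """Sort candidate experts (id -> p) by descending prefetch priority."""
    return sorted(
        candidates,
        key=lambda e: prefetch_priority(candidates[e], e[0], layer_now),
        reverse=True,
    )


def pick_victim(cached: Dict[ExpertId, Tuple[float, float]]) -> ExpertId:
    """Return the cached expert (id -> (p, freq)) with the highest eviction priority."""
    if not cached:
        raise ValueError("cache is empty")
    return max(cached, key=lambda e: eviction_priority(*cached[e]))


# Toy example with the model currently computing layer 3.
order = prefetch_order({(4, 1): 0.3, (5, 0): 0.8, (6, 2): 0.6}, layer_now=3)
victim = pick_victim({(1, 0): (0.9, 10.0), (2, 3): (0.1, 2.0), (4, 1): (0.5, 5.0)})
```

Neither priority contains a recency term, which matches the note above that LRU is left out of eviction.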

Evaluation

Testbed

6 × RTX 3090, each with 24GB of GPU memory
PCIe 4.0 with 32GB/s

Models

Mixtral 8x7B
Qwen1.5-MoE
Phi3.5-MoE

Traces

LMSYS-Chat-1M
ShareGPT
Online serving: Microsoft Azure

Baselines

MoE-Infinity
ProMoE
Mixtral-Offloading
DeepSpeed-Inference

Overall performance results

Compared with DeepSpeed-Inference, Mixtral-Offloading, ProMoE, and MoE-Infinity, fMoE reduces average TTFT by 44%, 35%, 33%, and 30%, reduces average TPOT by 70%, 61%, 55%, and 48%, and improves the average expert hit rate by 147%, 11%, 34%, and 63%, respectively.

Online serving results
Ablation study:
- speculate: Mixtral-Offloading & ProMoE
- hit count: MoE-Infinity
- Map (T): only trajectory
- Map (T+S): both trajectory and semantic
- Map (T+S+$\delta$): all features
Sensitivity: prefetch distance ($d$). When the prefetch distance is small (< 3), fMoE cannot fully hide its system delay, such as map matching and expert prefetching, behind the inference process, which increases inference latency. With larger prefetch distances (> 3), fMoE has worse expert hit rates, which also degrades performance. Therefore, the prefetch distance $d$ is set to 3 for evaluating fMoE.
Overhead analysis: expert prefetching, map matching, and map update run asynchronously in the background. The part that actually affects E2E latency is < 30ms (5% of each iteration).
