# Summary of Existing Work
## Insights
- Experts within a layer exhibit load imbalance
- Expert activations across layers are predictable
- Expert activations of a single request during decoding are skewed: a small number of experts are selected with high frequency throughout decoding
## Expert prefetch and cache
![[250624-194444.png]]
- Iteration-wise (token): compare the similarity of input tokens and predict from historical records (rationale: similar input tokens have similar embeddings, so after the gate they should yield similar expert selections)
- Layer-wise (skip): embeddings change little between adjacent layers, so the experts activated at layer $i$ can be used to predict those of layer $i+1$ (a minimal sketch follows at the end of this section)
![[250624-195519.png]]
- A single request shows skewness; statistics aggregated over multiple requests are relatively balanced
- Decoding of a single sequence is skewed: fewer than 5% of experts are activated at high frequency
![[250627-111237.png]]
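A minimal sketch of the layer-wise prediction idea above, assuming a linear top-$k$ router (the function name and interface are mine, not from any specific paper):

```python
import torch

@torch.no_grad()
def predict_next_layer_experts(hidden_i: torch.Tensor,
                               gate_next: torch.nn.Linear,
                               top_k: int = 2) -> torch.Tensor:
    """Predict which experts layer i+1 will activate by applying layer i+1's
    gate to the hidden state entering layer i's MoE block. Rationale (from the
    notes above): adjacent layers see similar embeddings, so the predicted
    experts can be prefetched into GPU memory while layer i is still running."""
    logits = gate_next(hidden_i)                  # [batch, seq, n_experts]
    topk = logits.topk(top_k, dim=-1).indices     # per-token top-k expert ids
    return torch.unique(topk)                     # distinct experts to prefetch

# Toy usage: 1 sequence of 8 tokens, d_model = 16, 8 experts, top-2 routing.
hidden = torch.randn(1, 8, 16)
gate_of_next_layer = torch.nn.Linear(16, 8, bias=False)
print(predict_next_layer_experts(hidden, gate_of_next_layer))
```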
## Expert placement optimization
$x_{l, e, g} \in \{0, 1\}$ indicates whether expert $e$ of layer $l$ is placed on GPU $g$. Objectives:
1. Load balance: for every layer $l$, minimize the standard deviation over GPUs of $L(g) = \sum_{e} x_{l, e, g} \cdot N_{e}$, where $N_e$ is the number of tokens routed to expert $e$
2. Communication: minimize $\sum_{g_1 \neq g_2} x_{l, e_1, g_1} \cdot x_{l+1, e_2, g_2} \cdot N_{e_1, e_2}$, i.e., the number of tokens that must cross GPUs between consecutive layers, where $N_{e_1, e_2}$ counts tokens routed to expert $e_1$ at layer $l$ and to expert $e_2$ at layer $l+1$
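A small sketch of evaluating a candidate placement against these two objectives (the data layout is hypothetical: `placement[l][e]` is the GPU holding expert $e$ of layer $l$, `load[l][e]` plays the role of $N_e$, and `pair_traffic[l][(e1, e2)]` the role of $N_{e_1, e_2}$ between layers $l$ and $l+1$):

```python
import numpy as np

def load_imbalance(placement, load, n_gpus):
    """Per-layer std of GPU load L(g) = sum_e x_{l,e,g} * N_e; mean over layers."""
    stds = []
    for l, experts in enumerate(placement):
        per_gpu = np.zeros(n_gpus)
        for e, g in enumerate(experts):
            per_gpu[g] += load[l][e]
        stds.append(per_gpu.std())
    return float(np.mean(stds))

def cross_gpu_traffic(placement, pair_traffic):
    """Sum of N_{e1,e2} over consecutive-layer expert pairs on different GPUs."""
    total = 0
    for l, pairs in enumerate(pair_traffic):            # pairs for layers (l, l+1)
        for (e1, e2), n in pairs.items():
            if placement[l][e1] != placement[l + 1][e2]:
                total += n
    return total

# Toy example: 2 layers, 4 experts, 2 GPUs.
placement = [[0, 0, 1, 1], [0, 1, 0, 1]]
load = [[10, 30, 20, 40], [25, 25, 25, 25]]
pair_traffic = [{(1, 0): 12, (1, 3): 18, (3, 2): 7}]
print(load_imbalance(placement, load, n_gpus=2), cross_gpu_traffic(placement, pair_traffic))
```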
## Collective communication optimization
---
# Paper list
#### [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
#### [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](https://arxiv.org/abs/2405.16444)
- Recomputing only the 10%-20% of tokens that are HKVD is enough to bring the deviation from $A_{full}$ down to a very low level
- HKVD tokens are similar across layers -> fully recompute layer 1 and use its deviation to identify the HKVD tokens; recompute only those tokens at layer 2, whose deviation in turn yields the HKVD tokens for the next layer, and so on (a sketch follows the figure below)
![[250611-091821.jpeg]]
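A rough sketch of the cascading HKVD selection, assuming an L2 deviation over per-token KV vectors (the metric and tensor shapes are my assumptions, not CacheBlend's exact definitions):

```python
import torch

def select_hkvd(recomputed_kv: torch.Tensor, cached_kv: torch.Tensor,
                ratio: float = 0.15) -> torch.Tensor:
    """Pick the tokens whose recomputed KV deviates most from the cached KV.
    recomputed_kv / cached_kv: [seq, d] per-token KV of one layer.
    Returns indices of the top `ratio` fraction of tokens: the HKVD tokens."""
    deviation = (recomputed_kv - cached_kv).norm(dim=-1)   # per-token deviation
    k = max(1, int(ratio * deviation.numel()))
    return deviation.topk(k).indices

# Cascade: layer 0 is recomputed for all tokens; its deviation gives the HKVD
# tokens, which are the only tokens recomputed at layer 1, whose deviation
# gives the HKVD tokens for layer 2, and so on.
seq, d = 32, 64
recomputed = torch.randn(seq, d)
cached = recomputed + 0.01 * torch.randn(seq, d)
cached[5] += 2.0                                           # token 5 is clearly HKVD
print(select_hkvd(recomputed, cached))                     # should include index 5
```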
- The MoE router computes expert probabilities from the attention output embedding; if adjacent layers' embeddings are similar, would that make adjacent layers' activation patterns similar as well?
#### [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066)
#### [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134)
- Employs a small neural network to learn correlations between layer inputs and expert selections
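A minimal sketch of what such a learned predictor could look like (architecture, loss, and sizes are illustrative assumptions, not ProMoE's actual design):

```python
import torch
import torch.nn as nn

class ExpertPredictor(nn.Module):
    """Small MLP mapping a layer's input hidden state to logits over experts,
    trained against the experts the real gate actually selected."""
    def __init__(self, d_model: int, n_experts: int, d_hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, n_experts),
        )

    def forward(self, hidden):                      # hidden: [batch, d_model]
        return self.net(hidden)                     # logits over experts

# Training-step sketch: multi-label target = experts chosen by the true router.
model = ExpertPredictor(d_model=16, n_experts=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
hidden = torch.randn(4, 16)
target = torch.zeros(4, 8).scatter_(1, torch.randint(0, 8, (4, 2)), 1.0)
loss = nn.functional.binary_cross_entropy_with_logits(model(hidden), target)
loss.backward()
opt.step()
```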
#### [Accelerating Distributed MoE Training and Inference with Lina](https://arxiv.org/abs/2210.17223)
#### [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)
#### [DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency](https://arxiv.org/abs/2408.00741)
#### [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433)
#### [Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing](https://arxiv.org/abs/2501.05313)
Assume $f$ is the feature vector of a token, where $f_1$ is the token ID, $f_2$ is the position ID, and $f_3$ is the attention ID.
![[250701-102443.png]]
#### SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
#### [MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache](https://arxiv.org/abs/2401.14361v3)
#### [MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing](https://arxiv.org/abs/2502.06643v1)
- token paths across layers are not random but instead follow structured and predictable patterns
- The primary goal of MOETUNER is to develop an expert placement strategy that minimizes two critical factors: the imbalance of token processing load across GPUs and the inter-GPU communication overhead.
- Collects statistics on pairs $\langle E_{l, i}, E_{l+1, j} \rangle$ of an expert $E_{l, i}$ activated at layer $l$ and an expert $E_{l+1, j}$ activated at layer $l+1$ (a counting sketch follows this list)
- This work does not consider dynamic load changes
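A counting sketch for these consecutive-layer expert pairs, assuming a top-1 routing trace (the `trace` format is hypothetical):

```python
from collections import Counter

def pair_counts(trace):
    """trace: list over layers; trace[l][t] = expert id chosen for token t at layer l.
    Returns, for each pair of consecutive layers, a Counter of
    (expert at layer l, expert at layer l+1) co-occurrences, i.e. the
    N_{e1,e2} statistics used by the placement objective above."""
    counts = []
    for l in range(len(trace) - 1):
        counts.append(Counter(zip(trace[l], trace[l + 1])))
    return counts

# Toy trace: 3 layers, 6 tokens.
trace = [
    [0, 1, 1, 2, 0, 1],
    [3, 3, 2, 2, 3, 3],
    [1, 0, 1, 1, 0, 1],
]
print(pair_counts(trace)[0].most_common(2))
```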
#### [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https://arxiv.org/abs/2407.04656)
- Adaptive Expert Allocation
- Expert Placement Algorithm
- Flexible Token Dispatcher
#### [Accelerating Mixture-of-Experts Training with Adaptive Expert Replication](https://arxiv.org/abs/2504.19925)
#### [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling]
- allows the communication and computation tasks in training MoE models to be scheduled in an optimal way
- all-to-all collective which better utilizes intra- and inter-connect bandwidths
- supports easy extensions of customized all-to-all collectives and data compression approaches
#### [Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing](https://arxiv.org/abs/2404.16914)
- defined the transient state with “obvious load fluctuation” and the stable state with “temporal locality” (the loads of each expert are similar in adjacent iterations)
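A rough sketch of exploiting that temporal locality; the window size and stability threshold are arbitrary choices for illustration:

```python
import numpy as np

def predict_expert_load(history, window=8, stable_threshold=0.05):
    """history: [iterations, n_experts] token counts per expert per iteration.
    If the recent relative change between adjacent iterations is small (stable
    state with temporal locality), predict the next iteration's load as the
    mean of the last `window` iterations; otherwise (transient state with
    obvious load fluctuation) fall back to the last observation."""
    recent = history[-window:]
    rel_change = np.abs(np.diff(recent, axis=0)).sum() / (recent[:-1].sum() + 1e-9)
    if rel_change < stable_threshold:
        return recent.mean(axis=0)          # stable: temporal locality holds
    return history[-1]                      # transient: just use the latest load

# Toy example: 16 iterations, 4 experts, near-constant loads -> stable prediction.
history = np.tile([100, 50, 30, 20], (16, 1)) + np.random.randint(0, 3, (16, 4))
print(predict_expert_load(history))
```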
- [ ] Summarize Bailian's architecture and analyze what could potentially be done in the overall system
- [ ] expert scaling
Premises: a traffic spike, model switching, the number of instances, and the model to be launched being an MoE model
- [ ] How does all-to-all (A2A) differ between training and inference? How does it differ from DeepEP?
- [ ] Fault tolerance? Fundamentally, with more GPUs failures become more likely; is EP the easiest setup to make fault-tolerant? KV cache fault tolerance
- [ ] What is the probability of a failure?
- [ ] Besides fault tolerance, what else can replication be used for?
https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs