Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions
--- a/projects/moe-autoscaling/Survey.md
+++ b/projects/moe-autoscaling/Survey.md
@@ -0,0 +1,109 @@
+# Summary of Existing Work
+
+## Insights
+
+- Experts within a layer exhibit load imbalance
+- Expert activation across layers is predictable
+- Within a single request, expert activation during decoding is skewed: a small subset of experts is selected with high frequency
+
+## Expert prefetch and cache
+
+![[250624-194444.png]]
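+- iteration-wise (token): compare the similarity of input tokens and predict from history (rationale: similar input tokens have similar embeddings, so they should also yield similar expert selections after the gate)
+- layer-wise (skip): embeddings change little between adjacent layers, so the experts of layer i are used to predict the experts of layer i+1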
+![[250624-195519.png]]
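
A minimal sketch of the iteration-wise idea above, assuming the history is a list of `(embedding, expert_ids)` records and that cosine similarity is the similarity measure. The names `cosine` and `predict_experts` and the record layout are hypothetical; they are not taken from any of the surveyed systems.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_experts(embedding, history):
    """Return the expert set of the most similar past embedding.

    history: list of (embedding, set_of_expert_ids) records (assumed layout).
    """
    if not history:
        return set()
    best = max(history, key=lambda record: cosine(embedding, record[0]))
    return set(best[1])


# Two past tokens with their observed expert selections (toy data).
history = [([1.0, 0.0], {0, 3}), ([0.0, 1.0], {2, 5})]
predicted = predict_experts([0.9, 0.1], history)
print(predicted)
```

The predicted set would be the prefetch candidate list; a real system would likely keep several nearest neighbours rather than one.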
+
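+- A single request is skewed, while statistics aggregated over many requests are comparatively balanced
+- The decoding process of a single sequence is skewed: < 5% of the experts are activated with high frequency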
+![[250627-111237.png]]
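
One way to check the skewness claim on a routing trace is to measure what share of all activations goes to the hottest 5% of experts. The sketch below assumes the trace is a list of per-step expert-ID lists; `hot_expert_share` and its parameters are hypothetical names, and the data is a toy example.

```python
from collections import Counter


def hot_expert_share(trace, num_experts, hot_fraction=0.05):
    """Return (hot_expert_ids, share of activations they receive).

    trace: list of decoding steps, each a list of activated expert IDs (assumed layout).
    """
    counts = Counter(expert for step in trace for expert in step)
    total = sum(counts.values())
    k = max(1, round(num_experts * hot_fraction))  # size of the "hot" set
    hot = counts.most_common(k)
    share = sum(count for _, count in hot) / total if total else 0.0
    return [expert for expert, _ in hot], share


# Four decoding steps of one sequence, top-2 routing, 40 experts in the layer.
trace = [[0, 1], [0, 2], [0, 1], [0, 3]]
hot, share = hot_expert_share(trace, num_experts=40)
print(hot, share)
```

Running the same function on a trace merged from many requests would show whether the aggregate distribution is flatter, as the first bullet suggests.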
+
+## Expert placement optimization
+
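+$x_{l, e, g} \in \{0, 1\}$ indicates whether expert $e$ of layer $l$ is placed on GPU $g$. Objectives:
+1. load balance: for any layer $l_0$, minimize the standard deviation across GPUs of $L(g) = \sum_{e} x_{l_0, e, g} \cdot N_{e}$
+2. communication: minimize $\sum x_{l, e_1, g_1} \cdot x_{l+1, e_2, g_2} \cdot N_{e_1, e_2}$, summed over $g_1 \neq g_2$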
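
The two objectives can be evaluated for a candidate placement as in the sketch below. It rests on assumptions not stated in the note: $N_e$ is read as the token load of expert $e$, $N_{e_1, e_2}$ as the number of tokens routed from $e_1$ in one layer to $e_2$ in the next, only pairs on different GPUs count as communication, and each expert sits on exactly one GPU, so $x$ is encoded as a dict from expert to GPU. All function names are hypothetical.

```python
from statistics import pstdev


def gpu_loads(placement, expert_load, num_gpus):
    """Per-GPU load L(g) for one layer.

    placement: dict expert -> gpu (assumes no replication).
    expert_load: dict expert -> N_e.
    """
    loads = [0] * num_gpus
    for expert, gpu in placement.items():
        loads[gpu] += expert_load[expert]
    return loads


def load_imbalance(placement, expert_load, num_gpus):
    """Objective 1: population standard deviation of L(g) over GPUs."""
    return pstdev(gpu_loads(placement, expert_load, num_gpus))


def cross_gpu_traffic(place_l, place_next, pair_counts):
    """Objective 2: tokens whose consecutive experts live on different GPUs.

    pair_counts: dict (e1, e2) -> N_{e1,e2}.
    """
    return sum(
        n for (e1, e2), n in pair_counts.items() if place_l[e1] != place_next[e2]
    )


expert_load = {0: 6, 1: 2, 2: 3, 3: 1}
place_a = {0: 0, 1: 0, 2: 1, 3: 1}  # loads [8, 4]
place_b = {0: 0, 3: 0, 1: 1, 2: 1}  # loads [7, 5]
print(load_imbalance(place_a, expert_load, 2), load_imbalance(place_b, expert_load, 2))

place_next = {0: 0, 1: 1}
pair_counts = {(0, 0): 5, (0, 1): 1, (2, 1): 4, (2, 0): 2}
print(cross_gpu_traffic(place_a, place_next, pair_counts))
```

A solver would search over placements to trade these two numbers off; the sketch only scores a given placement.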
+
+
+## Collective communication optimization
+
+
+
+
+
+
+
+
+---
+# Paper list
+
+#### [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
+
+#### [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](https://arxiv.org/abs/2405.16444)
+
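+- Recomputing 10%-20% of the tokens (the HKVD tokens) is enough to bring the deviation from $A_{full}$ down to a very low level
+- HKVD tokens are similar across layers -> fully recompute the first layer and compute the deviation to obtain the HKVD tokens; recompute only those tokens in the second layer, where the deviation can again be computed to obtain the HKVD tokens for the next layer, and so on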
+![[250611-091821.jpeg]]
+- The MoE router computes expert probabilities from the embedding produced by attention. If the embeddings of adjacent layers are (nearly) the same, would the activation patterns of adjacent layers also be similar?
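
The layer-by-layer HKVD token selection in the CacheBlend notes above can be sketched as follows. This is an illustration of the idea as summarized here, not CacheBlend's actual code: the per-layer deviations are given as plain dicts, and `first_ratio` and `keep_ratio` are hypothetical parameters for how many tokens survive each layer.

```python
def select_hkvd(deviation, candidates, ratio):
    """Keep the `ratio` fraction of candidate tokens with the largest deviation."""
    k = max(1, round(len(candidates) * ratio))
    ranked = sorted(candidates, key=lambda token: deviation[token], reverse=True)
    return sorted(ranked[:k])


def hkvd_schedule(per_layer_deviation, num_tokens, first_ratio=0.15, keep_ratio=1.0):
    """Return, for each layer, the token indices that are recomputed there.

    per_layer_deviation[l]: dict token -> deviation measured at layer l,
    defined only for the tokens recomputed at layer l (assumed layout).
    """
    recompute = list(range(num_tokens))  # the first layer is fully recomputed
    schedule = []
    for layer, deviation in enumerate(per_layer_deviation):
        schedule.append(recompute)
        ratio = first_ratio if layer == 0 else keep_ratio
        # Tokens selected here are the ones recomputed in the next layer.
        recompute = select_hkvd(deviation, recompute, ratio)
    return schedule


# Toy deviations for 4 tokens over 3 layers.
deviations = [
    {0: 0.1, 1: 0.9, 2: 0.8, 3: 0.2},
    {1: 0.7, 2: 0.5},
    {1: 0.3},
]
schedule = hkvd_schedule(deviations, num_tokens=4, first_ratio=0.5, keep_ratio=0.5)
print(schedule)
```

In the toy run, layer 0 recomputes every token, and each later layer recomputes only the tokens that deviated most in the layer before it.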
+
+#### [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066)
+
+#### [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134)
+
+- Employs a small neural network to learn correlations between layer inputs and expert selections
+
+#### [Accelerating Distributed MoE Training and Inference with Lina](https://arxiv.org/abs/2210.17223)
+
+#### [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)
+
+#### [DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency](https://arxiv.org/abs/2408.00741)
+
+#### [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433)
+
+#### [Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing](https://arxiv.org/abs/2501.05313)
+
+Assume $f$ is the token feature vector of a token, in which $f_1$ is the token ID, $f_2$ is the position ID and $f_3$ is the attention ID.
+![[250701-102443.png]]
+
+#### [SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models]
+
+#### [MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache](https://arxiv.org/abs/2401.14361v3)
+
+#### [MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing](https://arxiv.org/abs/2502.06643v1)
+
+- token paths across layers are not random but instead follow structured and predictable patterns
+- The primary goal of MOETUNER is to develop an expert placement strategy that minimizes two critical factors: the imbalance of token processing load across GPUs and the inter-GPU communication overhead.
+- Counts pairs $\langle E_{l, i}, E_{l+1, j} \rangle$ of an expert $E_{l, i}$ activated in one layer and an expert $E_{l+1, j}$ activated in the next layer
+- This work does not account for dynamic load changes
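
The pair statistics in the MoETuner notes above could be collected from a routing trace roughly as below. The sketch assumes top-1 routing, so each token has exactly one expert per layer, and the trace layout and the name `count_pairs` are hypothetical.

```python
from collections import Counter


def count_pairs(token_paths):
    """Count (layer, expert at layer l, expert at layer l+1) transitions.

    token_paths: one list per token giving its expert ID at each layer
    (assumes top-1 routing).
    """
    pairs = Counter()
    for path in token_paths:
        for layer in range(len(path) - 1):
            pairs[(layer, path[layer], path[layer + 1])] += 1
    return pairs


# Three tokens routed through a 3-layer model (toy data).
paths = [[0, 2, 1], [0, 2, 3], [1, 2, 1]]
pairs = count_pairs(paths)
print(pairs.most_common(2))
```

Because the counts come from a fixed trace, they inherit the limitation noted above: they would need to be refreshed when the load distribution shifts.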
+
+#### [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https://arxiv.org/abs/2407.04656)
+
+- Adaptive Expert Allocation
+- Expert Placement Algorithm
+- Flexible Token Dispatcher
+
+#### [Accelerating Mixture-of-Experts Training with Adaptive Expert Replication](https://arxiv.org/abs/2504.19925)
+
+
+#### [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling]
+
+- allows the communication and computation tasks in training MoE models to be scheduled in an optimal way
+- all-to-all collective which better utilizes intra- and inter-connect bandwidths
+- supports easy extensions of customized all-to-all collectives and data compression approaches
+
+#### [Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing](https://arxiv.org/abs/2404.16914)
+
+- defined the transient state with “obvious load fluctuation” and the stable state with “temporal locality” (the loads of each expert are similar in adjacent iterations)
+
+
+
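+- [ ] Summarize the architecture of Bailian (百炼) and analyze what could be done at the level of the whole system
+- [ ] expert scaling
+	Premise: traffic spikes; model switching (how many models?) requires startup; the models are MoE models
+- [ ] How does A2A differ between training and inference? How does it differ from DeepEP?
+- [ ] Fault tolerance? Fundamentally, more GPUs means failures are more likely. Is EP the easiest to make fault-tolerant? Fault tolerance for the KV cache
+	- [ ] Probability of failure?
+	- [ ] Besides fault tolerance, what else can replication serve?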
+
+
+
+https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs