Summary of existing work

Insights

  • Experts within a layer exhibit load imbalance
  • Expert activations across layers are predictable
  • Expert activations of a single request during decoding are skewed: a small subset of experts is selected with high frequency throughout decoding (sketch below)
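
A minimal sketch of how this skew could be measured (the data layout and function names are assumptions for illustration, not taken from any surveyed system): count per-expert activations over one request's decode trace and report how much traffic the hottest ~5% of experts capture.

```python
from collections import Counter

def expert_activation_stats(selected_experts, num_experts):
    """selected_experts: per-token lists of expert indices chosen by the router (top-k)."""
    counts = Counter(e for token_choice in selected_experts for e in token_choice)
    total = sum(counts.values())
    top_n = max(1, num_experts // 20)          # hottest ~5% of experts
    top_share = sum(c for _, c in counts.most_common(top_n)) / total
    return counts, top_share

# Toy decode trace: 8 experts, top-2 routing; expert 3 dominates.
trace = [[3, 1], [3, 5], [3, 1], [3, 2], [3, 1]]
counts, top_share = expert_activation_stats(trace, num_experts=8)
print(counts, top_share)                       # expert 3 accounts for 50% of activations
```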

Expert prefetch and cache

!250624-194444.png

  • iteration-wise (token): compare the similarity of input tokens and predict from historical records (rationale: similar input tokens have similar embeddings, so after passing through the gate they should yield similar expert selections)

  • layer-wise (skip): embeddings change little between adjacent layers, so the experts of layer i can be used to predict the experts of layer i+1 (see the prefetch sketch after this list) !250624-195519.png

  • A single request is skewed; statistics aggregated over multiple requests are relatively balanced

  • The decoding process of a single sequence is skewed: < 5% of the experts are activated with high frequency !250627-111237.png
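
A minimal sketch of the layer-wise prefetch idea, assuming experts are offloaded to host memory and gates are cheap to evaluate (all names, the pinned-tensor layout, and the side stream are illustrative assumptions): while layer i computes, run layer i+1's gate on layer i's hidden states and start copying the predicted experts to the GPU.

```python
import torch

@torch.no_grad()
def predict_next_layer_experts(hidden_states, next_gate, top_k=2):
    # hidden_states: [num_tokens, d_model] output of layer i, used as a proxy
    # for layer i+1's input since adjacent-layer embeddings change little.
    scores = next_gate(hidden_states)                     # [num_tokens, num_experts]
    return torch.topk(scores, top_k, dim=-1).indices.unique()

def prefetch_experts(expert_ids, cpu_experts, gpu_cache, copy_stream):
    # cpu_experts: expert_id -> pinned host tensor of that expert's weights.
    # Issue host->device copies on a side stream so they overlap layer i's compute.
    with torch.cuda.stream(copy_stream):
        for e in expert_ids.tolist():
            if e not in gpu_cache:
                gpu_cache[e] = cpu_experts[e].to("cuda", non_blocking=True)
```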

Expert placement optimization

Let $x_{l, e, g} \in \{0, 1\}$ denote whether expert $e$ of layer $l$ is placed on GPU $g$. Objectives:

  1. Load balance: for every layer $l$, minimize the standard deviation across GPUs of $L(g) = \sum_{e} x_{l, e, g} \cdot N_{e}$, where $N_e$ is the load (token count) of expert $e$
  2. Communication: minimize $\sum_{g_1 \neq g_2} x_{l, e_1, g_1} \cdot x_{l+1, e_2, g_2} \cdot N_{e_1, e_2}$, where $N_{e_1, e_2}$ is the number of tokens routed from expert $e_1$ at layer $l$ to expert $e_2$ at layer $l+1$ (only cross-GPU pairs incur communication)
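
A small sketch (illustrative only, not any paper's algorithm; the array names are assumptions) that evaluates a candidate placement against these two objectives, counting only cross-GPU expert-to-expert traffic as communication:

```python
import numpy as np

def placement_cost(placement, load, pair_load, num_gpus):
    # placement[l, e]: int, GPU holding expert e of layer l
    # load[l, e]:      N_e, tokens routed to expert e of layer l
    # pair_load[l, e1, e2]: N_{e1,e2}, tokens going from expert e1 (layer l) to e2 (layer l+1)
    num_layers, num_experts = load.shape
    imbalance = sum(
        np.bincount(placement[l], weights=load[l], minlength=num_gpus).std()
        for l in range(num_layers)
    )                                                          # objective 1: per-layer load balance
    comm = 0.0
    for l in range(num_layers - 1):
        for e1 in range(num_experts):
            for e2 in range(num_experts):
                if placement[l, e1] != placement[l + 1, e2]:   # only cross-GPU traffic costs
                    comm += pair_load[l, e1, e2]               # objective 2: communication
    return imbalance, comm
```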

Collective communication optimization


Paper list

fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

  • Recomputing 10%-20% of the HKVD tokens is enough to bring the deviation from $A_{full}$ down to a very low level
  • HKVD tokens are similar across layers -> fully recompute the first layer and use its deviation to identify HKVD tokens; recompute only those tokens at the second layer, whose deviation in turn yields the HKVD tokens for the next layer, and so on (see the selection sketch after this list) !250611-091821.jpeg
  • The MoE router computes expert probabilities from the attention embedding as input; if adjacent layers' embeddings are similar, would that make adjacent layers' activation patterns similar as well?
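
A minimal sketch of the per-layer selection step (the array names and the L2 deviation metric are assumptions; CacheBlend's actual deviation measure may differ). Feeding the returned indices into the next layer's partial recomputation repeats the chain described above.

```python
import numpy as np

def hkvd_tokens(recomputed_kv, cached_kv, ratio=0.15):
    """Pick the tokens whose recomputed KV deviates most from the cached KV.

    recomputed_kv, cached_kv: [num_tokens, kv_dim] arrays for one layer.
    Returns the token indices to recompute at the next layer.
    """
    deviation = np.linalg.norm(recomputed_kv - cached_kv, axis=-1)  # per-token KV deviation
    k = max(1, int(ratio * len(deviation)))
    return np.argsort(deviation)[::-1][:k]                          # top-deviation (HKVD) tokens
```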

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

  • employs a small neural network to learn correlations between layer inputs and expert selections
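
A minimal sketch of that idea (not ProMoE's actual predictor architecture or training setup; all sizes and names are assumptions): a small MLP maps a layer's input hidden states to logits over that layer's experts, trained against the router's observed top-k choices so predicted experts can be prefetched ahead of time.

```python
import torch
import torch.nn as nn

class ExpertPredictor(nn.Module):
    def __init__(self, d_model, num_experts, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, num_experts)
        )

    def forward(self, h):                      # h: [num_tokens, d_model] layer inputs
        return self.net(h)                     # logits over that layer's experts

# Training signal: multi-hot labels built from the router's actual top-k selections.
pred = ExpertPredictor(d_model=1024, num_experts=64)
loss_fn = nn.BCEWithLogitsLoss()
h = torch.randn(8, 1024)                       # toy layer inputs
labels = torch.zeros(8, 64)
labels[torch.arange(8).repeat_interleave(2), torch.randint(0, 64, (16,))] = 1.0  # toy top-2 labels
loss = loss_fn(pred(h), labels)
loss.backward()
```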

Accelerating Distributed MoE Training and Inference with Lina

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference

Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing

Assume $f$ is the feature vector of a token, in which $f_1$ is the token ID, $f_2$ is the position ID and $f_3$ is the attention ID. !250701-102443.png

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing

  • token paths across layers are not random but instead follow structured and predictable patterns
  • The primary goal of MOETUNER is to develop an expert placement strategy that minimizes two critical factors: the imbalance of token processing load across GPUs and the inter-GPU communication overhead.
  • Collect statistics of pairs $\langle E_{l, i}, E_{l+1, j} \rangle$ between expert $E_{l, i}$ activated at one layer and expert $E_{l+1, j}$ activated at the next layer (see the counting sketch after this list)
  • This work does not consider dynamic load changes
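
A minimal sketch of collecting those pair statistics from per-token routing traces (illustrative only; MoETuner's profiling pipeline differs in detail). The resulting counts play the role of $N_{e_1, e_2}$ in the placement objective above.

```python
from collections import defaultdict

def count_expert_pairs(routing_trace):
    """routing_trace[l][t]: list of experts chosen for token t at layer l."""
    pair_counts = defaultdict(int)              # (layer l, e1, e2) -> token count
    for l in range(len(routing_trace) - 1):
        for experts_l, experts_next in zip(routing_trace[l], routing_trace[l + 1]):
            for e1 in experts_l:
                for e2 in experts_next:
                    pair_counts[(l, e1, e2)] += 1
    return pair_counts

# Toy trace: 2 layers, 3 tokens, top-2 routing per token.
trace = [[[0, 3], [0, 1], [3, 2]],
         [[1, 2], [1, 0], [2, 3]]]
print(count_expert_pairs(trace))
```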

Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

  • Adaptive Expert Allocation
  • Expert Placement Algorithm
  • Flexible Token Dispatcher

Accelerating Mixture-of-Experts Training with Adaptive Expert Replication

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

  • allows the communication and computation tasks in training MoE models to be scheduled in an optimal way
  • all-to-all collective which better utilizes intra- and inter-connect bandwidths
  • supports easy extensions of customized all-to-all collectives and data compression approaches

Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

  • defined the transient state with “obvious load fluctuation” and the stable state with “temporal locality” (the loads of each expert are similar in adjacent iterations)
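
A toy sketch of exploiting that temporal locality (a plain moving-average predictor; the paper's actual prediction method may differ): estimate next iteration's per-expert load from the last few iterations, which is what an expert replication or placement controller would consume.

```python
import numpy as np

def predict_next_load(load_history, window=4):
    """load_history: [num_iterations, num_experts] observed per-expert token counts."""
    recent = np.asarray(load_history[-window:])
    return recent.mean(axis=0)                  # predicted per-expert load for the next iteration

history = [[120, 30, 80, 20], [118, 35, 75, 22], [125, 28, 82, 19]]
print(predict_next_load(history, window=3))
```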

  • Summarize the Bailian architecture and analyze what could potentially be done in the overall system

  • Preconditions for expert scaling: a traffic spike, the number of instances that need to be launched when switching models, and the launched model being an MoE model

  • How does all-to-all (A2A) differ between training and inference? How does it differ from DeepEP?

  • Fault tolerance? Fundamentally, with more GPUs failures become more likely; is EP the easiest form of parallelism to make fault tolerant? Fault tolerance for the KV cache

    • What is the probability of failure?
    • Besides fault tolerance, what else can replication serve?

https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs