Initial commit: Obsidian to Gitea
BIN projects/moe-autoscaling/Bailian Arch.figs/250715-135101.png (new file, 1.1 MiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250715-192813.png (new file, 587 KiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250715-202224.png (new file, 702 KiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250715-202246.png (new file, 561 KiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250715-202304.png (new file, 458 KiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250716-011140.png (new file, 1.7 MiB)
BIN projects/moe-autoscaling/Bailian Arch.figs/250716-012653.png (new file, 268 KiB)
130
projects/moe-autoscaling/Bailian Arch.md
Normal file
@@ -0,0 +1,130 @@
# Glossary

- **CPFS**: Cloud Parallel File Storage. CPFS provides a unified namespace and supports concurrent access from hundreds of machines. It also delivers tens of GB/s of I/O throughput and millions of IOPS with sub-millisecond latency.
- **ODPS (MaxCompute)**: MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data-import solutions and distributed computing models, enabling users to query massive datasets efficiently, reduce production costs, and ensure data security.
- **CEN**: Cloud Enterprise Network. CEN uses transit routers deployed in different regions to build a full-mesh network on top of the Alibaba Cloud global transmission network.
- **BladeLLM**: BladeLLM is an inference engine tailored for large language model (LLM) optimization and high-performance model deployment. It offers an advanced technical architecture, a user-friendly interface, and strong performance, making it a suitable choice for enterprises that want to deploy LLMs for inference.

# Arch

![[250716-011140.png]]

The Bailian resource pool runs four tiers of compute workloads:

- Real-time inference jobs
- Batch inference jobs
- Off-peak training jobs
- Off-peak data-processing jobs

## Gateway

At the protocol level, the gateway exposes standard HTTP, HTTP SSE, and WebSocket endpoints.
It handles user API access, authentication, routing, rate limiting, protocol conversion, metering, (partial) billing, content safety, and all supporting business logic.
The gateway layer also acts as the platform's foundational service layer, providing API keys, files, user spaces, resource permissions, and other base services.

## model serving

Before the refactor: each model was deployed as an all-in-one unit; roughly, one model bundled with all of its agents/plugins formed the smallest deployable unit.

![[250715-192813.png]]

After the refactor: each agent/plugin is deployed separately (an inevitable trend), so a repeated communication pattern of gateway -> agent1 -> gateway -> agent2 -> ... does appear. Especially as model outputs grow longer and more modalities (images, video) get involved, this communication overhead becomes non-negligible.

Q: How is data passed between different agents? Does everything have to go back through the gateway before reaching the other side? Is there a global storage that holds each agent's output for sharing (which would then also require isolation inside that global storage)?

![[250715-135101.png]]

## VL model

A series of integration efforts reduces image/video transfers; otherwise all image/video data would have to be re-transmitted at every hop, causing substantial overhead.

![[250715-202224.png]]

![[250715-202246.png]]

![[250715-202304.png]]

## global batching

Approach:

> For Global Dynamic Batching we chose a pull-based scheduling model. In one sentence: the API Server first pushes requests into a global Redis queue, and the inference nodes (Model Serving) pull requests from that queue.
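A minimal sketch of this pull-based dispatch, assuming redis-py, JSON-serialized requests, and hypothetical names (`bailian:global_queue`, `handle_request`):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)
QUEUE = "bailian:global_queue"  # hypothetical key name

# API Server side: push the request into the global queue.
def submit(request: dict) -> None:
    r.rpush(QUEUE, json.dumps(request))

# Model Serving side: each inference node pulls work only when it has capacity,
# so load balancing falls out of the pull model instead of a push/RR policy.
def serve_loop(handle_request) -> None:
    while True:
        _, payload = r.blpop(QUEUE)          # blocks until a request is available
        handle_request(json.loads(payload))  # run prefill/decode on this node
```

Note that this sketch ignores the KVCache-affinity question raised below.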

Claim:

> Load imbalance under RR (push) mode:
> Under RR, all requests look identical to the scheduling layer, so the inherent variance across LLM requests leads to load imbalance.
>
> Model-service capacity is hard to estimate:
> The generation length of each LLM request is unknown, so the capacity of a model service at the next moment cannot be estimated accurately.
>
> Tail latency:
> Even when overall cluster utilization is low, unexpected long-tail requests can still occur (prefill clustering).

> Low latency: Redis operations from the API Server and Turbo complete within milliseconds.

But when each inference node actively pulls from the global queue, how does it decide which request to pull? Pull-based scheduling does give load balance and reliability, but how is KVCache affinity accounted for?

With $n$ queued requests and $k$ instances ($k < n$), having every node independently score KVCache affinity costs $O(nk)$, approaching $O(n^2)$ as $k$ grows.

Only pulling the request at a designated queue position:
![[250716-012653.png]]

Moreover, each node is unaware of the other nodes' state, so the system easily settles into a non-globally-optimal configuration.

## batch

> This fiscal year, batch API calls account for 20% of daily commercial-model traffic and drove a roughly 10% improvement in resource utilization.

Two main technical challenges:

1. Processing user tasks quickly and reliably
   Long-request QPS must exceed 25,000+
   Stateful, with ACID-like requirements: no loss, no duplication, consistent state, retry on failure

2. No extra cost, and no impact on the online API's SLO
   No GPU resources can be reserved for batch; the question is how much idle online capacity exists and how to hand it to batch jobs

- Concurrency moves from per-user tasks to per-user-per-model tasks, solving queueing across a user's multi-model tasks

Scheduling policy upgrade: queue by batchId instead of requestId
- Fixes the slow SQL on the request table, raising overall QPS from 1,000 to 6,000+
- Multiple batches are scheduled together with long and short requests mixed, reducing abrupt scale-up/scale-down of the model service
- Flexible scheduling policy, making it easy to prioritize specific users

"Fixes the slow SQL on the request table, raising overall QPS from 1,000 to 6,000+" is hard to understand! Does it mean the SQL lookup had become a bigger bottleneck than batch inference itself?

Q: In ToB scenarios, how much of the demand is online vs. batch?

## serverless

TBD

## 无影 AgentBay

TBD

---
## Misc

> "Today, onboarding, updating, or retiring a base-model service on Bailian requires many teams to coordinate." Each model launch needs: commercialization engineering to set up model metering and billing configuration; QA to verify model billing output and configuration.

This sounds very cumbersome; is there a unified configuration scheme?

> PAI provides modular capabilities for Bailian, e.g., topology-aware pod allocation policies. PAI is not the only partner team; lower-level Alibaba Cloud teams also participate in work such as model startup acceleration.

# Thinking

How does handling batch jobs differ from handling online jobs, and what different requirements does each place on MoE?
15
projects/moe-autoscaling/Meta Analysis.md
Normal file
@@ -0,0 +1,15 @@

- [ ] Get Qwen3-32B running on vLLM and use the existing traces to test expert activation patterns

Continuing existing work:
- Do expert activation patterns differ noticeably across workloads?

Other:
- How load-balanced are current models' experts under real traces?

- [ ] EP scaling, EP32 -> EP320: efficiency of the intermediate states, and parameter migration during scaling
- [ ] On the edge, experts are commonly loaded to the GPU on demand (equivalently: offloaded to the CPU for compute), while the cloud is moving toward large EP. How necessary is large EP? Large EP needs very large instances; what challenges does that pose for scaling?
- [ ] Compare multiple MoE models
- [ ]

---
### Background

From 何涛: the current Qwen models still have a relatively "tall and thin" structure, which is EP-unfriendly (EP only pays off when enough experts participate in the communication). Production currently still runs TP; EP mainly serves the next generation of models.
38
projects/moe-autoscaling/Ongoing.md
Normal file
@@ -0,0 +1,38 @@

if sent_method not in [
    "determine_num_available_blocks",
    "initialize_cache",
]:

ray 2.46.0 -> 2.47.1

ifconfig -a
`VLLM_USE_PRECOMPILED=1 pip install --editable .`
```
[Credentials]
language=EN
endpoint=oss-cn-hangzhou.aliyuncs.com
accessKeyID=LTAIJO7wLG9y8KJH
accessKeySecret=nbx8fIu9B94JoICuKRBhxfSQsMgYeY
```

---
Based on tests with Qwen3-30B (128 experts, 48 layers, 8 experts activated):
- Expert activation within each layer is not load-balanced; std/mean is close to 1 (see the metric sketch below)
- The std of the last few layers is clearly larger than that of the earlier layers
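
A minimal sketch of the per-layer imbalance metric used above, assuming the trace has been reduced to a `[num_layers, num_experts]` array of activation counts (file name hypothetical):

```python
import numpy as np

# counts[l, e] = how many tokens were routed to expert e at layer l (hypothetical trace summary)
counts = np.load("qwen3_30b_expert_counts.npy")  # shape: (48, 128)

# Coefficient of variation per layer: std/mean of the per-expert activation counts.
# A value near 1 means heavily skewed load; 0 would mean perfectly balanced experts.
cv = counts.std(axis=1) / counts.mean(axis=1)
for layer, v in enumerate(cv):
    print(f"layer {layer:2d}: std/mean = {v:.2f}")
```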

TBD
- [ ] Do different workloads show clearly different expert activation?
- [ ] Is expert activation correlated between adjacent layers?
- [ ] How do temporal patterns relate to the global pattern?
- [ ] Understand the EP "bathtub curve"
- [ ] Build a table comparing the points made by existing work against our measurements
- [ ] Reasoning and non-reasoning requests mixed within the same session

BIN projects/moe-autoscaling/Survey.figs/250611-091821.jpeg (new file, 101 KiB)
BIN projects/moe-autoscaling/Survey.figs/250624-194444.png (new file, 391 KiB)
BIN projects/moe-autoscaling/Survey.figs/250624-195519.png (new file, 304 KiB)
BIN projects/moe-autoscaling/Survey.figs/250627-111237.png (new file, 462 KiB)
BIN projects/moe-autoscaling/Survey.figs/250701-102443.png (new file, 120 KiB)
109
projects/moe-autoscaling/Survey.md
Normal file
@@ -0,0 +1,109 @@

# Summary of existing work

## Insights

- Experts within a layer exhibit load imbalance
- Expert activation across layers is predictable
- Expert activation during a single request's decoding is skewed: a small subset of experts is selected at high frequency

## expert prefetch and cache

![[250624-194444.png]]
- Iteration-wise (token): compare the similarity of input tokens and predict from history (rationale: similar input tokens have similar embeddings, so the gate should produce similar expert selections)
- Layer-wise (skip): embeddings change little between adjacent layers, so layer i's experts can be used to predict layer i+1's experts (see the sketch after the figures below)
![[250624-195519.png]]

- A single request is skewed, whereas statistics over many requests are relatively balanced
- A single sequence's decoding is skewed: fewer than 5% of experts are activated at high frequency
![[250627-111237.png]]
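
A minimal sketch of measuring how well the layer-wise "skip" predictor works on a routing trace, assuming `trace[t][l]` holds the set of expert IDs selected for token t at layer l (structure hypothetical):

```python
def layerwise_hit_rate(trace):
    """Fraction of layer-(l+1) expert selections already present in layer l's selection."""
    hits, total = 0, 0
    for per_layer in trace:                      # per_layer: list of sets, one per layer
        for l in range(len(per_layer) - 1):
            predicted = per_layer[l]             # naive prediction: reuse layer l's experts
            actual = per_layer[l + 1]
            hits += len(actual & predicted)
            total += len(actual)
    return hits / total if total else 0.0

# Example: 1 token, 3 layers, top-2 routing
print(layerwise_hit_rate([[{3, 7}, {3, 9}, {9, 12}]]))  # 0.5
```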

## expert placement optimization

$x_{l, e, g} \in \{0, 1\}$ indicates whether expert $e$ of layer $l$ is placed on GPU $g$. Objectives (see the sketch below):
1. Load balance: for every layer $l$, minimize the standard deviation of $L(g) = \sum_{e} x_{l, e, g} \cdot N_{e}$
2. Communication: minimize the cross-GPU routing volume $\sum_{g_1 \neq g_2} x_{l, e_1, g_1} \cdot x_{l+1, e_2, g_2} \cdot N_{e_1, e_2}$
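
A minimal sketch evaluating both objectives for one candidate placement, assuming `placement[l][e]` gives the GPU of expert e at layer l, `load[l][e]` its token load, and `pair_traffic[l][(e1, e2)]` the adjacent-layer routing counts (all names and structures hypothetical):

```python
import numpy as np

def placement_cost(placement, load, pair_traffic, num_gpus):
    """Return (mean per-layer load std, inter-GPU traffic) for a placement."""
    stds = []
    for l, gpu_of in enumerate(placement):
        per_gpu = np.zeros(num_gpus)
        for e, g in enumerate(gpu_of):
            per_gpu[g] += load[l][e]           # L(g) = total load of experts placed on g
        stds.append(per_gpu.std())             # objective 1: balance within the layer
    cross = 0
    for l, pairs in enumerate(pair_traffic):   # one dict per adjacent layer pair
        for (e1, e2), n in pairs.items():
            if placement[l][e1] != placement[l + 1][e2]:
                cross += n                     # objective 2: tokens that must cross GPUs
    return float(np.mean(stds)), cross
```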

## collective communication optimization

---
# Paper list

#### [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)

#### [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](https://arxiv.org/abs/2405.16444)

- Recomputing only 10%-20% of the HKVD tokens is enough to drive the deviation from $A_{full}$ very low
- HKVD tokens are similar across layers -> fully recompute the first layer, use its deviation to pick the HKVD tokens, recompute those tokens in the second layer, whose deviation in turn yields the HKVD tokens for the next layer, and so on (see the sketch below)
![[250611-091821.jpeg]]
- The MoE router computes expert probabilities from the attention output embedding, so if adjacent layers' embeddings are similar, would adjacent layers' activation patterns also be similar?
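
A toy sketch of that layer-by-layer HKVD selection, assuming a callback `kv_dev_fn(layer, token_ids)` that recomputes the given tokens at a layer and returns their per-token KV deviation (entirely hypothetical, standing in for the real attention/KV computation):

```python
import numpy as np

def select_hkvd_layerwise(kv_dev_fn, num_layers, num_tokens, ratio=0.15):
    """Pick which tokens to recompute per layer, CacheBlend-style (simplified)."""
    k = max(1, int(ratio * num_tokens))
    recomputed = {0: np.arange(num_tokens)}        # layer 0: recompute every token
    dev = np.asarray(kv_dev_fn(0, recomputed[0]))  # deviation of every token at layer 0
    for layer in range(1, num_layers):
        hkvd = np.argsort(dev)[-k:]                # highest-deviation tokens from the previous layer
        recomputed[layer] = hkvd                   # only these get recomputed at this layer
        dev = np.zeros(num_tokens)
        dev[hkvd] = kv_dev_fn(layer, hkvd)         # their deviation guides the next layer's choice
    return recomputed
```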

#### [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066)

#### [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134)

- Employs a small neural network to learn correlations between layer inputs and expert selections (see the sketch below)
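
A minimal sketch of such a learned predictor, here a per-layer linear probe from the hidden state to expert logits (PyTorch; shapes, loss choice, and names are assumptions, not ProMoE's actual design):

```python
import torch
import torch.nn as nn

class ExpertPredictor(nn.Module):
    """Predicts which experts a layer will select, given that layer's input hidden state."""
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)                 # logits over experts

    @torch.no_grad()
    def predict_topk(self, hidden: torch.Tensor, k: int = 8) -> torch.Tensor:
        return self.forward(hidden).topk(k, dim=-1).indices  # experts to prefetch

# Training target: the experts the real router actually picked (multi-hot labels),
# so a BCE-with-logits loss over expert logits is one simple choice.
model = ExpertPredictor(hidden_size=4096, num_experts=128)
loss_fn = nn.BCEWithLogitsLoss()
```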

#### [Accelerating Distributed MoE Training and Inference with Lina](https://arxiv.org/abs/2210.17223)

#### [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)

#### [DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency](https://arxiv.org/abs/2408.00741)

#### [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433)

#### [Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing](https://arxiv.org/abs/2501.05313)

Assume $f$ is the feature vector of a token, in which $f_1$ is the token ID, $f_2$ is the position ID, and $f_3$ is the attention ID.
![[250701-102443.png]]

#### SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

#### [MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache](https://arxiv.org/abs/2401.14361v3)

#### [MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing](https://arxiv.org/abs/2502.06643v1)

- Token paths across layers are not random but instead follow structured and predictable patterns
- The primary goal of MoETuner is an expert placement strategy that minimizes two critical factors: the imbalance of token processing load across GPUs and the inter-GPU communication overhead
- Counts, for each pair of adjacent layers, how often expert $E_{l, i}$ of layer $l$ and expert $E_{l+1, j}$ of layer $l+1$ are activated together, i.e. the pair $\langle E_{l, i}, E_{l+1, j} \rangle$ (see the sketch below)
- The work does not consider dynamic load changes
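
A minimal sketch of that adjacent-layer pair statistic, assuming `selections[t][l]` is the list of expert IDs picked for token t at layer l (structure hypothetical):

```python
from collections import Counter

def adjacent_expert_pairs(selections):
    """Count co-activations <(l, e_i), (l+1, e_j)> across adjacent layers."""
    pairs = Counter()
    for per_layer in selections:                  # one entry per token
        for l in range(len(per_layer) - 1):
            for e_i in per_layer[l]:
                for e_j in per_layer[l + 1]:
                    pairs[(l, e_i, e_j)] += 1
    return pairs

# Example: one token, 2 layers, top-2 routing
print(adjacent_expert_pairs([[[0, 5], [3, 5]]]).most_common(2))
```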

#### [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https://arxiv.org/abs/2407.04656)

- Adaptive Expert Allocation
- Expert Placement Algorithm
- Flexible Token Dispatcher

#### [Accelerating Mixture-of-Experts Training with Adaptive Expert Replication](https://arxiv.org/abs/2504.19925)

#### ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

- Allows the communication and computation tasks in training MoE models to be scheduled in an optimal way
- All-to-all collective that better utilizes intra- and inter-connect bandwidths
- Supports easy extensions of customized all-to-all collectives and data compression approaches

#### [Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing](https://arxiv.org/abs/2404.16914)

- Defines a transient state with "obvious load fluctuation" and a stable state with "temporal locality" (each expert's load is similar across adjacent iterations)

- [ ] Summarize the Bailian architecture and analyze what could be done in the overall system
- [ ] Expert scaling
Premise: traffic spikes, model switching (how many models?) requires startup, and the model is an MoE model
- [ ] What is the difference between A2A in training vs. inference? And vs. DeepEP?
- [ ] Fault tolerance: with more GPUs, failures become more likely; is EP the easiest setup to make fault-tolerant? KVCache fault tolerance
- [ ] What is the failure probability?
- [ ] Besides fault tolerance, what else can replication serve?

https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs
BIN projects/moe-autoscaling/Sync.figs/250624-105430.png (new file, 759 KiB)
BIN projects/moe-autoscaling/Sync.figs/250624-105431.png (new file, 749 KiB)
BIN projects/moe-autoscaling/Sync.figs/250624-152355.png (new file, 813 KiB)
BIN projects/moe-autoscaling/Sync.figs/250624-152444.png (new file, 821 KiB)
105
projects/moe-autoscaling/Sync.md
Normal file
@@ -0,0 +1,105 @@

# OKR

## Objectives

1. MoE pattern feature
2. EP design for inference performance

## Key results

- [ ] Activation patterns reported by existing work vs. conclusions from real-trace tests [O1]
- [ ] Temporal locality of load imbalance; workload (reasoning/...)
- [ ] Correlation of load imbalance between layers
- [ ] Whether results are model-sensitive: Qwen3-30B/235B, DeepSeek-671B [O1]
- [ ] Whether EP needs to be dynamic, and the performance impact of dynamic EP scaling [O2]

---
# 0617

- Got Qwen3-235B running on 8 * H800 and traced expert activation
- Code work: in the current vLLM version, distributed mode falls back to v0; the earlier 30B tests (num_expert=48) ran single-process with v1 enabled by default, so the tracing code was rewritten for the distributed path
- Obtained one expert activation trace of the 235B model; data analysis not done yet
- Running DeepSeek-671B
- Finished the vLLM code for expert activation tracing under Ray and for the DeepSeek model
- Learning Ray
- Problems running full DeepSeek with Ray
- Plain vLLM does not run correctly yet (suspected bug on the latest main branch); checked out v0.9.1, rebuilding wheel...
- Basic analysis of expert activation temporal patterns
- Using 5-minute bins, Jaccard similarity of each bin's TopK expert IDs against the global 1-hour TopK (K=8)
![[250624-105430.png]]
- Using 5-minute bins, Jaccard similarity between each 5-minute window and the next
![[250624-105431.png]]

- [ ] Finish the 671B model tests
- [ ] Test ShareGPT
- [ ] Expert part: synchronous or asynchronous
- Impact of request deferral on GPU memory (GPU coroutine)
- [ ] Starting from **asynchrony**/load balance/colocation, what observations can we offer?
- [ ] Cost of dynamically rearranging experts?

---
# 0624

- DeepSeek-671B is running and its expert activation pattern can be traced
- Current issue: on 16 * H800, DeepSeek tests only work with a limited prompt length (< 1024)
- Analyzing the expert pattern of Qwen-235B
- Jaccard similarity of each bin's TopK expert IDs against the global 1-hour TopK (K=8); see the sketch at the end of this entry

$$
\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

![[250624-152355.png]]
- Jaccard similarity between the previous 5 minutes and the next 5 minutes
![[250624-152444.png]]
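
A minimal sketch of this binned TopK Jaccard comparison, assuming the trace is a list of `(timestamp_seconds, expert_id)` activation records (structure hypothetical):

```python
from collections import Counter

def topk_experts(records, k=8):
    return {e for e, _ in Counter(e for _, e in records).most_common(k)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def binned_vs_global(trace, bin_seconds=300, k=8):
    """Jaccard of each 5-minute bin's TopK expert IDs against the global TopK."""
    global_top = topk_experts(trace, k)
    bins = {}
    for ts, e in trace:
        bins.setdefault(int(ts // bin_seconds), []).append((ts, e))
    return {b: jaccard(topk_experts(recs, k), global_top) for b, recs in sorted(bins.items())}
```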

---
# 0701

[[Survey]]

- [ ] Summarize the Bailian architecture and analyze what could be done in the overall system
- [ ] Expert scaling
Premise: traffic spikes, model switching (how many models?) requires startup, and the model is an MoE model
- [ ] What is the difference between A2A in training vs. inference? And vs. DeepEP?
- [ ] Fault tolerance: with more GPUs, failures become more likely; is EP the easiest setup to make fault-tolerant? KVCache fault tolerance
- [ ] What is the failure probability?
- [ ] Besides fault tolerance, what else can replication serve?

GPU serverless?

---
# 0715

[[Bailian Arch]]

---
# 0722

- Old traces mix ToB and ToC
- [[Bailian Arch]]
- Scheduling across different parallelism modes
The same model may have several parallel setups; a global scheduler can route by the characteristics of the current request and pick the instance with the most suitable parallel setup, $P = f(req, queue, state)$ (see the sketch after this list)
- Where exactly do different parallel setups affect different request types/lengths?
- When batching is done globally, how should a batch's affinity to a parallel setup be handled? How to form batches? How to dispatch them?
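
A minimal sketch of such a routing function, choosing an instance by scoring its parallel setup against the request; all names, features, and weights are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    parallel: str        # e.g. "TP8", "EP32", "PD-disaggregated"
    queue_len: int       # requests already waiting on this instance
    good_for_long: bool  # whether this setup favors long prompts / high throughput

def route(prompt_len: int, is_batch: bool, instances: list[Instance]) -> Instance:
    """P = f(req, queue, state): pick the instance whose setup best fits the request."""
    def score(inst: Instance) -> float:
        s = -inst.queue_len                      # prefer lightly loaded instances
        if (prompt_len > 4096 or is_batch) == inst.good_for_long:
            s += 10                              # match long/batch requests to throughput-oriented setups
        return s
    return max(instances, key=score)
```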

Status quo: a single model serving usually has multiple deployments. Within one deployment the parallelism is generally uniform, but different deployments may use different parallelism. Today the main distinction in production is online vs. batch requests: because batch requests differ in type and SLO, they may use a different parallelism mode to meet their own SLO target (throughput vs. latency).

---
# 0729

[[Trace-Qwen3]]
[[Heterogenous Parallelism Cluster]]

Transfer to project: heterogenous parallelism

---
# TBD

Why does MoE make it possible to reduce the number of attention heads?

In agent scenarios the master's KVCache changes dynamically; how do we make it fault-tolerant, with a simple replica?
How does MoE tolerate faults, expert re-routing?