Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions
--- a/projects/moe-autoscaling/Bailian
+++ b/projects/moe-autoscaling/Bailian
@@ -0,0 +1,130 @@
+# Glossary
+
+- **CPFS**: Cloud Parallel File Storage. CPFS provides a unified namespace and supports concurrent access of hundreds of machines. CPFS also provides an I/O throughput of tens of GB/s and millions of IOPS to ensure a sub-millisecond latency.
+- **ODPS (MaxCompute)**: MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
+- **CEN**: Cloud Enterprise Network. CEN uses transit routers deployed in different regions to build a full-mesh network on top of the Alibaba Cloud global transmission network.
+- **BladeLLM**: BladeLLM is an inference engine tailored for large language model (LLM) optimization and high-performance model deployment. BladeLLM has an advanced technical architecture and provides a user-friendly interface and outstanding performance to address new opportunities and challenges in LLM fields. This makes BladeLLM a suitable choice for enterprises that want to deploy LLMs and use the LLMs to perform inference.
+
+# Arch
+
+![[250716-011140.png]]
+
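+Compute tasks in the Bailian resource pool fall into 4 tiers:
+- Real-time inference tasks
+- Batch inference tasks
+- Off-peak training tasks
+- Off-peak data-processing tasks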
+
+
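+## Gateway
+
+At the interface-protocol level, the gateway offers standard HTTP, HTTP SSE, and WebSocket access.
+It handles user API access, authentication, routing, rate limiting, protocol conversion, metering, billing (partially), content safety, and all the supporting business logic.
+The gateway layer also serves as the platform's foundational service layer, providing basic services such as API keys, files, user workspaces, and resource permissions.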
+
+
+## model serving
+
+Before the refactor: each model was deployed as an all-in-one unit; put simply, a model together with all of its agents/plugins was deployed as the smallest unit.
+![[250715-192813.png]]
+
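+After the refactor: each agent/plugin is deployed separately (an inevitable trend), so there is indeed a chatty communication pattern of gateway -> agent1 -> gateway -> agent2 -> ... Especially as model outputs grow longer and modalities multiply (images, video), the communication overhead is not negligible.
+
+Q: How is data passed between agents? Does everything have to go through the gateway before reaching the other side? Is there a global storage that holds each agent's output for sharing (in which case isolation within the global storage also needs to be considered)?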
+
+![[250715-135101.png]]
+
+## VL model
+
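+A series of integration efforts reduces image/video transfers; otherwise all image/video data would have to be transferred at every node it passes through, incurring large overhead.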
+
+![[250715-202224.png]]
+
+![[250715-202246.png]]
+
+![[250715-202304.png]]
+
+
+
+## global batching
+
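+Approach:
+
+> For the design of Global Dynamic Batching, we chose a pull model for scheduling. In one sentence: the API Server first sends requests to a global Redis queue, and inference nodes (Model Serving) pull requests from that global queue.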
+
+Claim:
+
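+> RR mode (push) leads to load imbalance:
+>     In RR mode all requests look identical to the scheduling layer, so the variability among LLM requests causes load imbalance
+>
+> Model-serving capacity is hard to estimate:
+>     The generation length of each LLM request is unknown, so the capacity of the model service at the next moment cannot be estimated accurately
+>
+> Long-tail latency:
+>     Even when overall cluster utilization is low, unexpected long-tail-latency requests still occur (prefill clustering)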
+
+
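+> Low latency: API Server and Turbo operations on Redis take on the order of milliseconds
+
+But when each inference node actively pulls from the global queue, how does it choose which request to pull? Active pulling does ensure load balancing and reliability, but how is KVCache affinity taken into account?
+
+With $n$ requests and $k$ instances ($k < n$), the cost is $O(nk) \to O(n^2)$ if every node has to evaluate KVCache affinity independently.
+
+Only pull requests at designated positions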
+![[250716-012653.png]]
+
+Moreover, no node knows the state of the others, so the system can easily get stuck in a state that is not globally optimal.
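The "only pull requests at designated positions" idea can be sketched as a bounded scan: each node looks at the first `window` queued requests and takes the one sharing the longest token prefix with its own cache. This is a minimal in-memory sketch, not Bailian's scheduler; the deque stands in for the Redis global queue, and `pull_with_affinity`, `shared_prefix_len`, and `window` are hypothetical names.

```python
from collections import deque


def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pull_with_affinity(queue, cached_prefixes, window=4):
    """Pop the request in the first `window` slots that best matches the node's cache.

    Scanning a bounded window keeps each pull O(window) instead of O(n);
    ties fall back to FIFO order, so the head is served when nothing matches.
    """
    if not queue:
        return None
    best_idx, best_score = 0, -1
    for idx in range(min(window, len(queue))):
        tokens = queue[idx]
        score = max((shared_prefix_len(tokens, p) for p in cached_prefixes), default=0)
        if score > best_score:  # strict '>' keeps the earliest request on ties
            best_idx, best_score = idx, score
    request = queue[best_idx]
    del queue[best_idx]
    return request


# Hypothetical global queue of tokenized prompts (stand-in for Redis).
global_queue = deque([[9, 9, 9], [1, 2, 3, 4], [1, 2, 7], [5, 6]])
node_cache = [[1, 2, 3]]  # prefixes this node already holds KV cache for
picked = pull_with_affinity(global_queue, node_cache, window=3)
```

The window bounds per-pull work, but the sketch has no global view: a request at the head can be skipped repeatedly, so a real scheduler would likely need an aging rule on top of it.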
+
+## batch
+
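+> For the fiscal year, daily batch API calls account for 20% of overall commercial-model traffic, lifting resource utilization by 10%
+
+Two major technical challenges:
+
+1. How to process user tasks quickly and reliably
+	QPS for long requests must exceed 25,000
+	Stateful, so ACID must be considered: no loss, no duplication, consistent state, retry on failure
+
+2. Add no extra cost and do not affect the SLO of the online API
+	No GPU resources can be reserved for batch; how much idle online capacity is there, and how can batch use it?
+
+- Change per-user task concurrency to per-user-per-model task concurrency, fixing the queuing problem for users with tasks on multiple models
+
+Scheduling strategy upgrade: queue by batchId instead of requestId
+- Fixes the slow-SQL problem when querying the request table and raises overall QPS from 1,000 to 6,000+
+- Schedules multiple batches at once and mixes long and short requests, reducing the number of sudden scale-up/scale-down events for the model service
+- Flexible scheduling policy that makes it easy to designate high-priority users
+
+"Fixes the slow-SQL problem when querying the request table and raises overall QPS from 1,000 to 6,000+": I do not understand this! Does it imply that querying the SQL table had become a bigger bottleneck than batch inference itself?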
+
+
+Q: In ToB scenarios, what are the respective shares of online and batch demand?
+
+## serverless
+
+TBD
+
+
+## Wuying (无影) AgentBay
+
+TBD
+
+
+---
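+## Misc
+
+> "Currently, listing, updating, or delisting a foundation model service on Bailian requires many teams to cooperate." Each model launch requires: commercialization R&D: configure metering and billing for the model; QA: verify billing output and configuration
+
+This sounds very cumbersome; is there a unified configuration scheme?
+
+
+> PAI provides modular capabilities for Bailian, for example a topology-aware pod allocation policy. PAI is not the only partner team; for example, model startup acceleration also involves lower-level Alibaba Cloud teams.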
+
+
+
+
+# Thinking
+
+How batch tasks and online tasks differ in processing, and what different requirements they place on MoE
+
+
--- a/projects/moe-autoscaling/Meta
+++ b/projects/moe-autoscaling/Meta
@@ -0,0 +1,15 @@
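+- [ ] Get Qwen3-32B running on vllm and use existing traces to test the expert activation pattern
+	Continuing existing work:
+	- Whether expert activation patterns differ significantly across workloads
+	Other:
+	- How well balanced the expert load of current models is under real traces
+- [ ] EP scaling, EP32 -> EP320: efficiency of the intermediate states and the parameter-migration problem during scaling
+- [ ] On the edge, the common approach is to dynamically load experts onto the GPU for computation (put another way: offload computation to the CPU), whereas the cloud is moving toward large EP. Is large EP necessary? Large EP needs very large instances; what challenges does that pose for scaling?
+- [ ] Comparison of multiple MoE models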
+- [ ] 
+
+
+---
+### Background
+
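+From 何涛: Qwen's current models still have a fairly tall-and-thin architecture that is not EP-friendly (EP only pays off when communication spans a large number of experts); production still runs TP, and EP mainly serves the next generation of models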
--- a/projects/moe-autoscaling/Ongoing.md
+++ b/projects/moe-autoscaling/Ongoing.md
@@ -0,0 +1,38 @@
+if sent_method not in [
+            "determine_num_available_blocks",
+            "initialize_cache",
+        ]:
+
+
+
+ray 2.46.0 -> 2.47.1
+ifconfig -a
+
+`VLLM_USE_PRECOMPILED=1 pip install --editable .`
+
+
+```
+[Credentials]
+language=EN
+endpoint=oss-cn-hangzhou.aliyuncs.com
+accessKeyID=REDACTED
+accessKeySecret=REDACTED
+```
+
+
+---
+
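+Based on tests with Qwen3-30B (128 experts, 48 layers, 8 experts activated):
+- Expert activation within each layer is not load-balanced; std/mean is close to 1 for every layer
+- The std of the last few layers is noticeably larger than that of earlier layers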
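The std/mean metric above is the coefficient of variation of per-expert activation counts within one layer. A minimal sketch, assuming population standard deviation (the notes do not say which variant was used); `load_cv` and the toy counts are hypothetical, not measured data.

```python
from statistics import mean, pstdev


def load_cv(counts):
    """Coefficient of variation (std / mean) of per-expert activation counts in one layer."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    return pstdev(counts) / mu


# Hypothetical toy counts for one layer with 8 experts (not measured data).
balanced = [100] * 8
skewed = [400, 200, 100, 50, 25, 15, 5, 5]

cv_balanced = load_cv(balanced)
cv_skewed = load_cv(skewed)
```

A perfectly balanced layer gives 0, while a value near 1 means the spread of expert loads is as large as the average load itself.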
+
+
+
+TBD
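+- [ ] Whether expert activation differs significantly across workloads
+- [ ] Whether expert activation in adjacent layers is correlated
+- [ ] Relationship between the temporal pattern and the global pattern
+- [ ] Understand the EP bathtub curve
+- [ ] Make a table surveying the key points of existing work and compare them with our tests
+- [ ] Mixing reasoning and non-reasoning within the same session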
+
--- a/projects/moe-autoscaling/Survey.figs/250611-091821.jpeg
+++ b/projects/moe-autoscaling/Survey.figs/250611-091821.jpeg
--- a/projects/moe-autoscaling/Survey.figs/250624-194444.png
+++ b/projects/moe-autoscaling/Survey.figs/250624-194444.png
--- a/projects/moe-autoscaling/Survey.figs/250624-195519.png
+++ b/projects/moe-autoscaling/Survey.figs/250624-195519.png
--- a/projects/moe-autoscaling/Survey.figs/250627-111237.png
+++ b/projects/moe-autoscaling/Survey.figs/250627-111237.png
--- a/projects/moe-autoscaling/Survey.figs/250701-102443.png
+++ b/projects/moe-autoscaling/Survey.figs/250701-102443.png
--- a/projects/moe-autoscaling/Survey.md
+++ b/projects/moe-autoscaling/Survey.md
@@ -0,0 +1,109 @@
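+# Summary of existing work
+
+## Insights
+
+- Experts within a layer exhibit load imbalance
+- Expert activation across layers is predictable
+- Within a single request, expert activation during decoding is skewed: a small subset of experts is selected at high frequency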
+
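+## expert prefetch and cache
+
+![[250624-194444.png]]
+- iteration-wise (token): compare the similarity of input tokens and predict from history (rationale: similar input tokens have similar embeddings, so passing them through the gate should yield similar expert selections)
+- layer-wise (skip): embeddings change little between adjacent layers, so use the experts of layer i to predict the experts of layer i+1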
+![[250624-195519.png]]
+
+- A single request is skewed, while statistics aggregated over many requests are comparatively balanced
+- The decoding process of a single sequence is skewed: < 5% of experts are activated at high frequency
+![[250627-111237.png]]
+
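+## expert placement optimization
+
+$x_{l, e, g} \in \{0, 1\}$ indicates whether expert $e$ of layer $l$ is placed on GPU $g$. Objectives:
+1. load balance: for any layer $l_0$, minimize the standard deviation over GPUs $g$ of $L(g) = \sum_{e} x_{l_0, e, g} \cdot N_{e}$, where $N_e$ is the token load of expert $e$
+2. communication: minimize $\sum_{l, e_1, e_2} \sum_{g_1 \neq g_2} x_{l, e_1, g_1} \cdot x_{l+1, e_2, g_2} \cdot N_{e_1, e_2}$, where $N_{e_1, e_2}$ counts tokens routed to $e_1$ in layer $l$ and to $e_2$ in layer $l+1$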
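To make the two objectives concrete, the following sketch evaluates them for a given placement on a toy instance. It assumes $N_e$ is a per-expert token count and that the communication term only counts expert pairs placed on different GPUs; the function names and all numbers are hypothetical.

```python
from statistics import pstdev


def gpu_loads(placement, expert_load, num_gpus):
    """L(g): total token load of the experts placed on each GPU for one layer.

    placement[e] is the GPU holding expert e (x[l][e][g] == 1 iff placement[e] == g).
    """
    loads = [0] * num_gpus
    for expert, gpu in enumerate(placement):
        loads[gpu] += expert_load[expert]
    return loads


def cross_gpu_traffic(placement_l, placement_next, transitions):
    """Tokens that move between GPUs when going from layer l to layer l+1.

    transitions[e1][e2] is N_{e1,e2}: tokens routed to e1 in layer l and e2 in layer l+1.
    """
    total = 0
    for e1, row in enumerate(transitions):
        for e2, n_tokens in enumerate(row):
            if placement_l[e1] != placement_next[e2]:
                total += n_tokens
    return total


# Hypothetical toy instance: 4 experts per layer, 2 GPUs.
expert_load = [40, 30, 20, 10]
placement_l = [0, 1, 1, 0]
placement_next = [0, 0, 1, 1]
transitions = [
    [30, 10, 0, 0],
    [0, 20, 10, 0],
    [0, 0, 15, 5],
    [5, 0, 0, 5],
]

loads = gpu_loads(placement_l, expert_load, num_gpus=2)
imbalance = pstdev(loads)  # objective 1: spread of L(g) across GPUs
traffic = cross_gpu_traffic(placement_l, placement_next, transitions)  # objective 2
```

An actual optimizer would search over placements to trade `imbalance` against `traffic`; this sketch only scores one candidate.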
+
+
+## Collective communication optimization
+
+
+
+
+
+
+
+
+---
+# Paper list
+
+#### [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
+
+#### [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](https://arxiv.org/abs/2405.16444)
+
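+- Recomputing the 10%-20% HKVD tokens is enough to bring the deviation from $A_{full}$ down to a very low level
+- HKVD tokens are similar across layers -> fully recompute the first layer and compute the deviation to obtain the HKVD tokens; recompute those tokens in the second layer, where the deviation can again be computed to obtain the HKVD tokens for the next layer, and so on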
+![[250611-091821.jpeg]]
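+- The input the MoE router uses to compute expert probabilities is the embedding output by attention, so if the embeddings of adjacent layers are the same, would the activation patterns of adjacent layers also be similar?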
+
+#### [DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models](https://arxiv.org/abs/2401.06066)
+
+#### [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134)
+
+- employs a small neural network to learn correlations between layer inputs and expert selections
+
+#### [Accelerating Distributed MoE Training and Inference with Lina](https://arxiv.org/abs/2210.17223)
+
+#### [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)
+
+#### [DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency](https://arxiv.org/abs/2408.00741)
+
+#### [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433)
+
+#### [Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing](https://arxiv.org/abs/2501.05313)
+
+Assume $f$ is the token feature vector of a token, in which $f_1$ is the token ID, $f_2$ is the position ID and $f_3$ is the attention ID.
+![[250701-102443.png]]
+
+#### [SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models]
+
+#### [MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache](https://arxiv.org/abs/2401.14361v3)
+
+#### [MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing](https://arxiv.org/abs/2502.06643v1)
+
+- token paths across layers are not random but instead follow structured and predictable patterns
+- The primary goal of MOETUNER is to develop an expert placement strategy that minimizes two critical factors: the imbalance of token processing load across GPUs and the inter-GPU communication overhead.
+- Count pairs $\langle E_{l, i}, E_{l+1, j} \rangle$ of an expert $E_{l, i}$ activated in one layer and an expert $E_{l+1, j}$ activated in the next layer
+- This work does not account for dynamic load changes
+
+#### [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https://arxiv.org/abs/2407.04656)
+
+- Adaptive Expert Allocation
+- Expert Placement Algorithm
+- Flexible Token Dispatcher
+
+#### [Accelerating Mixture-of-Experts Training with Adaptive Expert Replication](https://arxiv.org/abs/2504.19925)
+
+
+#### [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling]
+
+- allows the communication and computation tasks in training MoE models to be scheduled in an optimal way
+- all-to-all collective which better utilizes intra- and inter-connect bandwidths
+- supports easy extensions of customized all-to-all collectives and data compression approaches
+
+#### [Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing](https://arxiv.org/abs/2404.16914)
+
+- defined the transient state with “obvious load fluctuation” and the stable state with “temporal locality” (the loads of each expert are similar in adjacent iterations)
+
+
+
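+- [ ] Summarize Bailian's architecture and analyze what could be done in the overall system
+- [ ] expert scaling
+	Premise: traffic spikes; model switching (how many?) requires startup; the models are MoE models
+- [ ] How does A2A differ between training and inference? How does it differ from deepep?
+- [ ] Fault tolerance? Fundamentally, more GPUs means failures are more likely; is EP the easiest to make fault-tolerant? Fault tolerance for the kvcache
+	- [ ] Probability of failure?
+	- [ ] What can replication serve besides fault tolerance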
+
+
+
+https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs
--- a/projects/moe-autoscaling/Sync.figs/250624-105430.png
+++ b/projects/moe-autoscaling/Sync.figs/250624-105430.png
--- a/projects/moe-autoscaling/Sync.figs/250624-105431.png
+++ b/projects/moe-autoscaling/Sync.figs/250624-105431.png
--- a/projects/moe-autoscaling/Sync.figs/250624-152355.png
+++ b/projects/moe-autoscaling/Sync.figs/250624-152355.png
--- a/projects/moe-autoscaling/Sync.figs/250624-152444.png
+++ b/projects/moe-autoscaling/Sync.figs/250624-152444.png
--- a/projects/moe-autoscaling/Sync.md
+++ b/projects/moe-autoscaling/Sync.md
@@ -0,0 +1,105 @@
+# OKR
+
+## Objectives
+
+1. MoE pattern feature
+2. EP design for inference performance
+
+## Key results
+
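+- [ ] Compare the activation patterns reported in existing work with conclusions from real-trace tests [O1]
+	- [ ] temporal locality of load imbalance, workload (reasoning/...)
+	- [ ] correlation of load imbalance across layers
+- [ ] Whether the results are sensitive to the model: Qwen3-30B/235B, DeepSeek-671B [O1]
+- [ ] Whether EP needs to be dynamic, and the performance impact of dynamic EP scale-up/scale-down [O2]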
+
+---
+# 0617
+
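+- Got Qwen3-235B running on 8 * H800 and traced expert activation
+	- Code development: in the current vllm version, distributed execution falls back to v0; the earlier test of the 30B model (num_expert=48) was single-process with v1 enabled by default; finished rewriting the code for distributed execution
+	- Obtained an expert activation trace for the 235B model; data analysis not yet done
+- Running DeepSeek-671B
+	- Finished the vllm code for the Ray version & the expert activation tracing for the DeepSeek model
+	- Learning Ray
+	- Problems when running the full DeepSeek with Ray
+		- Stock vllm does not run properly yet (suspect a bug on the latest main branch); checked out version 0.9.1, rebuilding wheel...
+- Basic analysis of the expert activation temporal pattern
+	- With 5-min bins, compare each bin's TopK expert IDs against those of the global 1h window using Jaccard similarity (K=8)
+		![[250624-105430.png]]
+	- With 5-min bins, compare each 5-min bin with the following one using Jaccard similarity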
+		![[250624-105431.png]]
+
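+- [ ] Finish testing the 671B model
+- [ ] Test ShareGPT
+- [ ] Expert part: synchronous or asynchronous
+	- Impact of request deferral on GPU memory (GPU coroutine)
+- [ ] Starting from **asynchrony**/load balance/colocation, what observations can we offer
+- [ ] Cost of dynamically rearranging experts?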
+
+---
+# 0624
+
+- Got DeepSeek-671B running and can trace the expert activation pattern
+	- Current problem: when testing DeepSeek on 16 * H800, only limited prompt lengths (< 1024) can be used
+- Analyzed the expert pattern of Qwen-235B
+	- Jaccard similarity of TopK expert IDs against the global 1h window (K=8)
+	$$
+	\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}
+	$$
+	![[250624-152355.png]]
+	- Jaccard similarity between each 5-min bin and the following one
+	![[250624-152444.png]]
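The Jaccard comparison of TopK expert sets above can be sketched as follows. This is only an illustration of the metric, not the analysis script behind the figures; `top_k_experts`, `jaccard`, and the toy counts are hypothetical.

```python
from collections import Counter


def top_k_experts(counts, k=8):
    """Set of the k most frequently activated expert IDs."""
    return {expert for expert, _ in Counter(counts).most_common(k)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical activation counts {expert_id: count} for two 5-min bins (not measured data).
bin_a = {0: 90, 1: 80, 2: 70, 3: 10, 4: 5}
bin_b = {0: 85, 2: 60, 4: 50, 1: 8, 3: 2}

similarity = jaccard(top_k_experts(bin_a, k=3), top_k_experts(bin_b, k=3))
```

Comparing each bin against the global 1h window uses the same two functions, with the global counts passed in place of `bin_b`.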
+
+
+---
+# 0701
+
+[[Survey]]
+
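+- [ ] Summarize Bailian's architecture and analyze what could be done in the overall system
+- [ ] expert scaling
+	Premise: traffic spikes; model switching (how many?) requires startup; the models are MoE models
+- [ ] How does A2A differ between training and inference? How does it differ from deepep?
+- [ ] Fault tolerance? Fundamentally, more GPUs means failures are more likely; is EP the easiest to make fault-tolerant? Fault tolerance for the kvcache
+	- [ ] Probability of failure?
+	- [ ] What can replication serve besides fault tolerance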
+
+GPU serverless?
+
+---
+# 0715
+
+[[Bailian Arch]]
+
+
+---
+# 0722
+
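+- Old trace: mixed ToB/ToC
+- [[Bailian Arch]]
+- Scheduling across different parallelism modes
+	The same model may have different parallel setups; the global scheduler can route based on the characteristics of the current request, choosing an instance with the most suitable parallel setup, $P = f(req, queue, state)$
+	- Where does the effect of different parallel setups on request type/length show up?
+	- When batching globally, how should a batch's affinity to a parallel setup be considered? How to form batches? How to dispatch them?
+	
+	Status quo: a model serving has multiple deployments; within each deployment the parallelism is generally the same, but different deployments may adopt different parallelism. The main difference in production today comes from online versus batch requests: because batch requests differ in type and SLO requirements, they may use different parallelism modes to meet their respective SLO requirements (thpt/latency)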
+
+
+---
+# 0729
+
+[[Trace-Qwen3]]
+[[Heterogenous Parallelism Cluster]]
+
+Transfer to project: heterogeneous parallelism
+
+---
+# TBD
+
+Why can MoE reduce attention heads?
+
+In agent scenarios, the master's KVCache changes dynamically; how to make it fault-tolerant, a simple replica?
+How to make MoE fault-tolerant, expert re-route?
+
+