Initial commit: obsidian to gitea
BIN phd/papers/IMPRESS.figs/250228-095457.png (new file, 316 KiB)
BIN phd/papers/IMPRESS.figs/250228-164104.png (new file, 412 KiB)
BIN phd/papers/IMPRESS.figs/250228-170902.png (new file, 359 KiB)
BIN phd/papers/IMPRESS.figs/250228-175702.png (new file, 317 KiB)
BIN phd/papers/IMPRESS.figs/250228-175831.png (new file, 216 KiB)
BIN phd/papers/IMPRESS.figs/250301-134529.png (new file, 307 KiB)
BIN phd/papers/IMPRESS.figs/250301-151846.png (new file, 314 KiB)
BIN phd/papers/IMPRESS.figs/250301-155139.png (new file, 886 KiB)
BIN phd/papers/IMPRESS.figs/250301-155325.png (new file, 378 KiB)
BIN phd/papers/IMPRESS.figs/250301-155822.png (new file, 179 KiB)
BIN phd/papers/IMPRESS.figs/250301-155959.png (new file, 181 KiB)
BIN phd/papers/IMPRESS.figs/250301-160458.png (new file, 247 KiB)
BIN phd/papers/IMPRESS.figs/250301-162611.png (new file, 451 KiB)
phd/papers/IMPRESS.md (new file)
@@ -0,0 +1,139 @@
# TL;DR

This work observes that the cost of loading KV cache from SSD to GPU is hard to overlap: load time accounts for 51%-98% of TTFT. Its starting point is to exploit attention sparsity and load only the important KV cache. This raises two challenges. C1: how to decide which KV cache on the SSD is important. C2: loading only the important KV cache produces scattered reads that use bandwidth inefficiently, and, given KV-cache sparsity, existing cache policies are also inefficient. The work solves C1 by exploiting the similarity of important tokens across heads, and C2 with on-SSD KV reordering plus an importance-score-based cache policy.

# Motivations

The KV of different tokens matters to different degrees, so we need not load the KV of all prefix tokens; loading only the important KV reduces the IO-induced overhead.

# Challenges

## Existing ways to identify important KV still incur huge IO overhead

The most naive approach: load all KV, compute attention, and see which entries are important. But once all tokens have been computed, there is nothing left to gain from keeping only the important ones ✖️

The sparsity of $\mathrm{attention}(Q, K, V) = \mathrm{softmax}(QK^T)V$ comes from the sparsity of $\mathrm{softmax}(QK^T)$, so existing work loads only the keys, halving the cost; it is still large ✖️

Record which tokens of a prefix are important and reuse the record the next time the prefix recurs. Accuracy suffers: in RAG, the same prefix tokens (docs) followed by different queries have different important tokens ✖️

![[250228-095457.png]]

## Existing systems cannot use important KV efficiently

For efficient disk reads and PCIe utilization, existing systems read at chunk granularity. A chunk mixes important and unimportant KVs, causing read amplification and hurting performance.

Existing KV cache managers all use LRU/LFU-style policies that ignore the differing importance of KV cache entries. Unimportant KV cache may occupy GPU memory, lowering the hit ratio of the important KV cache or incurring extra transfer overhead.

# Key Observation

## Important tokens are similar across heads within the same layer

For two sets $A, B$, the Jaccard score defines similarity: $S = \frac{|A \cap B|}{|A \cup B|}$. In figure (a) below, the similarity is $\frac{1}{3}$. The paper measures all pairs among 32 heads and finds cross-head similarities above 0.9.

![[250228-164104.png]]
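A minimal sketch of this observation, assuming per-head top-k selection by query-key scores; the shapes and names here (`topk_token_sets`, `H`, `N`, `D`) are illustrative, not the paper's code:

```python
import torch

def topk_token_sets(q, keys, k):
    # q:    (H, D)    current query per head
    # keys: (H, N, D) cached keys of the N prefix tokens
    scores = torch.einsum("hd,hnd->hn", q, keys)  # attention logits per head
    idx = scores.topk(k, dim=-1).indices          # top-k token ids per head
    return [set(row.tolist()) for row in idx]

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Cross-head Jaccard matrix of important-token sets (O1: mostly > 0.9)
H, N, D, k = 32, 1024, 128, 102
sets_ = topk_token_sets(torch.randn(H, D), torch.randn(H, N, D), k)
sim = [[jaccard(sets_[i], sets_[j]) for j in range(H)] for i in range(H)]
```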
## The similarity holds across important-KV ratios and model sizes

The important-KV ratio means: for a ratio $r$, the KV of $r \cdot \text{num\_tokens}$ tokens is treated as important.

![[250228-170902.png]]

The paper does not explain why, when only the top 10% of KV is selected, the similarity drops noticeably in the second half of the layers.

> Although smaller models and deeper transformer layers tend to exhibit lower similarities, they are still significantly higher than the expected value from random selection in most cases.

# Solution

## For C1, from O1

Since important KV is similar across heads within a layer, we can pick a few heads as probes (probe heads), compute their important KV, and let it stand for the whole layer's important KV.

Concretely, the first three heads are taken as probe heads (the paper claims "the choice of which three heads to use has no impact on accuracy due to the similarity"). Load their keys, compute each head's important-token id set, and compute the average pairwise similarity among the three sets.

If the similarity exceeds a threshold, one of the three token id sets is returned as the layer-wide important-token id set (the paper does not say how that one is chosen; perhaps at random, or the head with the highest average similarity to the other two).

If the similarity is insufficient, fall back to the default policy (load all). Experiments show that no more than 20% of layers need the fallback.

![[250228-175702.png]]
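A sketch of this probe-head procedure under the threshold $t = j^{\alpha}$ derived in the next subsection; `identify_important_tokens` and the tie-breaking rule are assumptions, since the paper leaves the final set choice unspecified:

```python
def identify_important_tokens(probe_sets, k_over_n, alpha=0.6):
    # probe_sets: important-token id sets of the 3 probe heads
    # k_over_n:   important-token ratio k/n, used for the threshold below
    j = k_over_n / (2.0 - k_over_n)   # expected Jaccard of random k-of-n sets
    t = j ** alpha                    # similarity threshold

    pairs = [(0, 1), (0, 2), (1, 2)]
    avg = sum(jaccard(probe_sets[a], probe_sets[b]) for a, b in pairs) / 3
    if avg < t:
        return None, True             # fallback: load all KV of this layer
    return probe_sets[0], False       # arbitrary pick among the three sets
```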
### The mysterious hyperparameter choice

> Selecting only one probe head to determine the most important token index may introduce bias, affecting model accuracy. Using two probe heads might fail to identify the most important index through voting when disagreements arise. Therefore, we choose to use three probe heads. Increasing the number of probe heads offers minimal improvements in accuracy but increases the keys loading time, thereby extending the TTFT.
>
> ![[250228-175831.png]]
>
> Picking $k$ important tokens out of $n$ prefix tokens: for two arbitrary $n$-choose-$k$ token id sets $A$ and $B$ (each token lands in both sets with probability $\frac{k^2}{n^2}$),
> $$j = E[\mathrm{Jaccard}(A, B)] = \frac{E[|A \cap B|]}{E[|A \cup B|]} = \frac{E[|A \cap B|]}{E[|A|] + E[|B|] - E[|A \cap B|]} = \frac{n \cdot \frac{k^2}{n^2}}{2k - n \cdot \frac{k^2}{n^2}} = \frac{\frac{k}{n}}{2 - \frac{k}{n}}$$
> The threshold is $t = j^{\alpha}$; experiments show $\alpha = 0.6$ works best.
## For C2

### KV cache reorder

![[250301-134529.png]]

- A background thread periodically (every 10 minutes) reorders the KV within one block (radix node), based on the average token importance (see the sketch below)
- 🤔 Why should past token importance predict the future?
- Reason for not reordering across nodes: prefix matching would then work at a coarser granularity, loading more useless KV cache and adding IO overhead
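A minimal sketch of the within-block reorder, assuming importance is tracked as a running average per token; the names are illustrative:

```python
import numpy as np

def reorder_block(kv_block, importance):
    # kv_block:   (N, ...) KV entries of one radix node, one row per token
    # importance: (N,) running-average importance score per token
    perm = np.argsort(-importance)    # most important tokens first
    return kv_block[perm], perm       # keep perm to map token id -> offset

# After reordering, loading the top-r% tokens touches a contiguous prefix
# of chunks instead of scattered ones, cutting read amplification.
```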
### Score-based Cache Management

Each chunk's score is determined jointly by its access frequency and its fraction of important tokens.

![[250301-151846.png]]

A min-heap on the GPU and another on the CPU maintain the scores, while guaranteeing that GPU and CPU hold no overlapping KV; the disk keeps a full replica of all KV, which removes the IO of writing back when evicting from CPU to disk.
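A sketch of score-based eviction with a min-heap and lazy deletion; the score formula here (frequency times important-token ratio) is a simplification of the paper's:

```python
import heapq

class ScoreCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = {}   # chunk_id -> (score, data)
        self.heap = []     # (score, chunk_id); may contain stale entries

    def put(self, chunk_id, data, freq, important_ratio):
        score = freq * important_ratio
        while len(self.chunks) >= self.capacity:
            s, victim = heapq.heappop(self.heap)
            if victim in self.chunks and self.chunks[victim][0] == s:
                del self.chunks[victim]          # evict lowest-score chunk
        self.chunks[chunk_id] = (score, data)
        heapq.heappush(self.heap, (score, chunk_id))
```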
# Evaluation

## Setup

- models
    OPT-6.7B, OPT-13B, and OPT-30B
- hardware
    2 * AMD EPYC 7763 CPUs (64 cores)
    1 * 128 GB DRAM
    1 * NVIDIA A100 GPU with 80GB HBM
    1 * 2TB Intel SSD whose measured read throughput is around 5GB/s
    PCIe 4.0 x16 for the GPU-CPU connection
    During tests, the prefix cache is capped at 10GB on the GPU and 32GB on the CPU
- dataset
    PIQA, RTE, COPA, and OpenBookQA

## Baselines

1. ReComp: recompute all prefix KV cache
2. AS-like: the authors' own implementation of AttentionStore
3. AS+H2O+LRU
4. AS+H2O+LFU

## Results

### Model generation quality

![[250301-155139.png]]

### TTFT

1.2x to 2.8x improvement

![[250301-155325.png]]

1.5x to 3.8x reduction in I/O time

![[250301-155822.png]]

### Ablation experiment

ITF: similarity-guided important token identification

RO: reorder

All: ITF + RO + score-based cache management

![[250301-155959.png]]

![[250301-160458.png]]

### Sensitivity Analysis

1. alpha
2. chunk size
3. dataset size: why does the improvement shrink as the dataset grows? Only up to 400GB is tested; if the dataset kept growing, would the advantage disappear entirely?
4. model type

![[250301-162611.png]]

# Takeaway

- In my view, the strengths of this work: the paper reads smoothly, with many simple examples that help readers follow the idea, and the experiments are fairly thorough
BIN (image, new file, 69 KiB)
@@ -0,0 +1,47 @@
## Ghost in the Android Shell: Pragmatic Test-oracle Specification of a Production Hypervisor

> Kayvan Memarian, Ben Simner, David Kaloper-Meršinjak, Thibaut Pérami, Peter Sewell (University of Cambridge)

### 1. Background

In modern operating systems and virtualization platforms, the **hypervisor** is the core component for security isolation. The subject of this paper, **pKVM** (protected KVM), is a production hypervisor deployed by Google on Android to protect the isolation between the Android host and guest VMs. Traditional hypervisor development relies on **testing plus manual reasoning**; for strict security guarantees, the established approach is **formal verification**, but efforts like seL4, CertiKOS, and IronFleet are too costly for most production teams. This paper therefore proposes a lighter-weight middle ground: building trust in a production hypervisor through an **executable specification plus runtime testing**.

### 2. Problem and challenges

Using the "specification + runtime testing" approach on a real production hypervisor (pKVM) raises a series of challenges:

- pKVM's behavior is tightly bound to the underlying Arm hardware architecture (page tables, exception handling); the specification must be able to describe this implicit behavior (e.g. hardware page-table walks).
- Multiple hardware threads may enter the hypervisor concurrently; the specification must handle locks and state ownership correctly.
- The specification must be loose: for example, pKVM maps host memory on demand, so per-step behavior cannot be pinned down exactly; unnecessary implementation details must be abstracted away in the spec.
- The runtime environment is constrained: pKVM runs at EL2, so the usual testing tools (coverage, debuggers) are unavailable, and random ill-formed inputs can crash the whole system.
- The specification must be written directly in C (pKVM's language) rather than in a formal language like Coq or Lean; C lacks algebraic data types, a pure-function subset, pattern matching, generics, and other high-level features.

### 3. Design

The core design is an executable "test-oracle specification", which makes it possible to: 1. generate random test inputs that respect the runtime environment, so that purely random inputs do not crash the whole system; 2. use the oracle to compute pKVM's expected post-test state and compare it against pKVM's actual state after execution.

1. Define an abstract state (ghost state) as a high-level mathematical model of the hypervisor's real state; the ghost state contains the information below. The abstraction filters out implementation details irrelevant to externally observable behavior, such as memory allocation and page-table internals.

    - pKVM's own stage-1 mappings
    - the host's stage-2 mappings
    - each VM's stage-2 mappings and metadata
    - global constants (number of physical CPUs, address offsets, etc.)
    - per-CPU local state

    ![[251014-144626.png]]

2. Generate the ghost state at runtime from pKVM's real state via a set of abstraction functions, while keeping the abstraction strictly tied to lock ownership (state may only be abstracted while the corresponding lock is held).

3. Implement "pure function" specifications of the hypercalls in C; record the pre/post ghost states around each hypercall and compare the actual result with the spec's result (see the sketch after this list). For example, `host_share_hyp`:

    - input: abstract pre-state + call arguments + environment information
    - output: abstract post-state
    - depends only on the ghost state, never on the real implementation

4. Loose specification and nondeterminism:

    - behavior that does not affect the core semantics, such as out-of-memory errors (`ENOMEM`), is not forced by the spec;
    - the spec is parameterized on the implementation's return value, to support loose matching;
    - for interactions with memory shared between the host and pKVM, nondeterminism is eliminated by recording the values actually read.

5. Check that the abstract state suffers no "outside-the-lock" interference between hypercalls, and that page-table footprints are not modified out of bounds, giving a separation-logic-style isolation guarantee.
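The actual spec is written in C inside the kernel; purely as a conceptual sketch, here is the oracle check in Python with a toy ghost state and an invented `host_share_hyp` transition (the real precondition and state are much richer):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GhostState:
    host_s2: tuple  # toy model: sorted (pfn, owner) pairs of host stage-2

def spec_host_share_hyp(pre, pfn):
    # Pure-function spec: the page must be host-owned, and afterwards it
    # is shared with the hypervisor. Illustrative, not the paper's C code.
    owners = dict(pre.host_s2)
    assert owners.get(pfn) == "host", "spec precondition violated"
    owners[pfn] = "shared"
    return GhostState(host_s2=tuple(sorted(owners.items())))

def oracle_check(pre, post_actual, pfn):
    # pre/post_actual come from the abstraction functions applied to the
    # real pKVM state before and after the hypercall (elided here).
    assert post_actual == spec_host_share_hyp(pre, pfn), \
        "implementation diverges from the specification"
```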
### 4. Testing and evaluation

This work modifies the Linux kernel to add a "hyp-proxy" interface so that pKVM's hypercalls can be invoked directly from user space, and pairs it with a home-grown coverage tool that tracks line, branch, and function coverage of both the implementation and the spec. The tests fall into three classes:

1. hand-written tests, 41 cases in total, covering the normal and error branches of every hypercall
2. random tests, generating high-quality random inputs from the ghost state to avoid crashing the system
3. synthetic-bug tests, injecting faults to validate the effectiveness of the test framework

Testing found 5 real pKVM bugs, including allocator memory alignment, a memcache out-of-bounds access, a race condition, and overlapping IO mappings; the fourth was found through the spec comparison. The spec is about 8.4K lines, close to the implementation's 11K lines; under QEMU it uses about 18MB of memory, increases boot time 3.2x and test time 11.5x, and sustains roughly 200K hypercalls per hour. In summary, this work proposes and validates a lightweight verification method that dynamically checks an implementation against an executable specification, achieving practical, low-cost, high-assurance guarantees for a production hypervisor (pKVM) without heavyweight formal tools.
BIN phd/papers/fMoE.figs/250629-153138.png (new file, 110 KiB)
BIN phd/papers/fMoE.figs/250629-161518.png (new file, 136 KiB)
BIN phd/papers/fMoE.figs/250629-163825.png (new file, 98 KiB)
BIN phd/papers/fMoE.figs/250629-172314.png (new file, 168 KiB)
BIN phd/papers/fMoE.figs/250629-173919.png (new file, 185 KiB)
BIN phd/papers/fMoE.figs/250629-174619.png (new file, 112 KiB)
BIN phd/papers/fMoE.figs/250629-175351.png (new file, 116 KiB)
BIN phd/papers/fMoE.figs/250629-175536.png (new file, 113 KiB)
BIN phd/papers/fMoE.figs/250629-175858.png (new file, 126 KiB)
BIN phd/papers/fMoE.figs/250629-221659.png (new file, 76 KiB)
BIN phd/papers/fMoE.figs/250629-222203.png (new file, 38 KiB)
phd/papers/fMoE.md (new file)
@@ -0,0 +1,119 @@
[fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)

# TL;DR

Because of MoE's sparse activation and the tight GPU memory of edge devices, a common choice is to load only the currently activated experts onto the GPU, which adds large inference latency; a line of work mitigates this with expert prefetching.
This work proposes fine-grained expert prefetch and cache policies. It introduces a new data structure, the expert map, to record the expert activations of past requests. For a new request, an expert similarity match based on both **semantics** and **trajectory** is run against the records in the expert map store, and experts are prefetched accordingly.

![[250629-172314.png]]

# Motivation

Existing expert offloading schemes cannot achieve the optimal latency-memory trade-off.

![[250629-222203.png]]

- Most MoE-based LLMs adopt a decoder-only architecture; compared with encoder-decoder LLMs, their expert activation pattern is more uniform and expert access is less skewed.
- MoE LLMs are trained with a distinctive load-balancing loss that forces the gating network to spread tokens evenly over the experts within an MoE layer, ensuring no expert sits idle during training. This balanced routing weakens the predictability of expert activation patterns and makes existing solutions less effective.

Coarse-grained expert offloading leads to:
- a low expert hit rate
- ignoring the diversity of MoE models and prompts

![[250629-221659.png]]

So this work targets three core questions:
- How to maximize expert hit rate when prefetching and offloading experts?
- How to adapt to different MoE models and prompts?
- How to avoid additional system overheads when managing experts?

# Design

1. Collecting the information in the expert map

![[250629-153138.png]]

2. Expert map search (see the sketch after this item)

![[250629-161518.png]]

- Semantic-based expert map search
$$\text{score}_{x, y}^{\text{sem}} = \frac{\text{sem}_x \cdot \text{sem}_y}{\lVert\text{sem}_x\rVert \cdot \lVert\text{sem}_y\rVert}$$
$\text{sem} \in \mathbb{R}^{1 \times H}$

- Trajectory-based expert map search
$$\text{score}_{x, y}^{\text{map}} = \frac{\text{map}_x \cdot \text{map}_y}{\lVert\text{map}_x\rVert \cdot \lVert\text{map}_y\rVert}$$
When computing layer $l$, $\text{map} \in \mathbb{R}^{1 \times (l - 1)J}$, where $J$ is the number of experts per layer

The work analyzes the correlation between similarity and expert cache hit rate:

![[250629-163825.png]]
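A sketch of the two-score match, assuming cosine similarity over a stored list of (semantic embedding, trajectory, full map) records; how fMoE actually combines the two scores during search is simplified here to a fixed weight `w_sem`, which is an assumption:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_expert_map(sem_x, traj_x, store, w_sem=0.5):
    # sem_x:  (H,) prompt embedding of the new request
    # traj_x: flattened expert-activation trajectory of layers 1..l-1
    # store:  list of (sem_y, traj_y, full_map_y) records
    def score(rec):
        sem_y, traj_y, _ = rec
        return (w_sem * cosine(sem_x, sem_y)
                + (1 - w_sem) * cosine(traj_x, traj_y[:len(traj_x)]))
    return max(store, key=score)
```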
3. expert prefetch

$$\delta_l = \text{Clip}(1 - \text{score}, 0, 1)$$

Select the top-N experts such that the sum of their probabilities $p_{l, j}$ exceeds $\delta_l$:

$$\sum_{E_{l, j} \in E_{\text{prefetch}}} p_{l, j} \geq \delta_l$$

subject to $K \leq N \leq J$ (a sketch follows below)
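A sketch of the top-N selection: a high match score shrinks $\delta_l$, so fewer experts get prefetched; `select_prefetch` and its arguments are illustrative:

```python
def select_prefetch(probs, score, k):
    # probs: predicted gate probabilities of the J experts in layer l
    # score: similarity score of the matched expert map
    # k:     top-K experts the router activates anyway (lower bound on N)
    delta = min(max(1.0 - score, 0.0), 1.0)
    order = sorted(range(len(probs)), key=lambda j: -probs[j])
    chosen, acc = [], 0.0
    for j in order:
        chosen.append(j)
        acc += probs[j]
        if len(chosen) >= k and acc >= delta:
            break          # probability mass >= delta_l: stop prefetching
    return chosen
```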
4. expert map storage deduplication

For each $x$ in the new batch and each $y$ currently in the expert map storage, compute the redundancy:

$$\text{RDY}_{x, y} = \frac{d}{L} \text{score}^{\text{sem}}_{x, y} + \frac{L - d}{L} \text{score}^{\text{map}}_{x, y}$$

Expert map deduplication can be formalized as a Minimum Sphere Covering problem.

Existing results show that keeping at least $2LJ$ expert maps guarantees that a new input finds an expert map at least 75% similar, and $\frac{1}{2} LJ\ln(LJ)$ maps reach 98%. With $L \leq 128$ and $J \leq 96$, no more than 50K expert maps are needed, about 200MB of CPU memory.
5. expert cache and eviction

- expert prefetching priority
    For each $E_{l, j} \in E_{\text{prefetch}}$:
    $$\text{Pri}_{l, j}^{\text{prefetch}} = \frac{p_{l, j}}{l - l_{\text{now}}}$$
- expert eviction priority
    $$\text{Pri}_{l, j}^{\text{evict}} = \frac{1}{p_{l, j} \cdot freq_{l, j}}$$
    Eviction does not use LRU, since LRU contradicts the forward, layer-by-layer order of computation (see the sketch below).
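Both priorities in a few lines; the function names are illustrative, and the formulas are exactly the two above:

```python
def prefetch_priority(p, layer, layer_now):
    # Higher gate probability and nearer layers are fetched first.
    return p / (layer - layer_now)

def evict_priority(p, freq):
    # Rarely used, low-probability experts are evicted first
    # (the largest value is the first victim).
    return 1.0 / (p * freq)
```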
# Evaluation

Testbed
- 6 * RTX 3090 with 24GB GPU memory
- PCIe 4.0 with 32GB/s

Models
- Mixtral 8x7B
- Qwen1.5-MoE
- Phi3.5-MoE

Traces
- LMSYS-Chat-1M
- ShareGPT
- Online serving: Microsoft Azure

Baselines
1. MoE-Infinity
2. ProMoE
3. Mixtral-Offloading
4. DeepSpeed-Inference

- Overall performance results

![[250629-173919.png]]

Compared with DeepSpeed-Inference, Mixtral-Offloading, ProMoE, and MoE-Infinity, fMoE reduces average TTFT by 44%, 35%, 33%, 30%, reduces average TPOT by 70%, 61%, 55%, 48%, and improves the average expert hit rate by 147%, 11%, 34%, 63%.

- Online serving results

![[250629-174619.png]]

- Ablation study:
    - speculate: Mixtral-Offloading & ProMoE
    - hit count: MoE-Infinity
    - Map (T): only trajectory
    - Map (T+S): both trajectory and semantic
    - Map (T+S+$\delta$): all features

![[250629-175351.png]]

- Sensitivity: prefetch distance (d)

When the prefetch distance is small (< 3), fMoE cannot perfectly hide its system delay from the inference process, such as the map matching and expert prefetching, leading to the increase of inference latency. With larger prefetch distances (> 3), fMoE has worse expert hit rates that also degrade the performance. Therefore, we set the prefetch distance d to 3 for evaluating fMoE.

![[250629-175536.png]]

- Overhead analysis: expert prefetching, map matching, and map update run asynchronously in the background; the actual impact on E2E latency is < 30ms (5% per iteration).

![[250629-175858.png]]
phd/research/MoE/Base.md (new file)
@@ -0,0 +1,2 @@
Directions MoE offers for systems work (taxonomy of existing work):
- expert offloading: keep only the experts that need to be activated in GPU memory, greatly reducing memory requirements and letting consumer GPUs run large models.
BIN phd/research/MoE/Papers.figs/250216-215720.png (new file, 565 KiB)
phd/research/MoE/Papers.md (new file)
@@ -0,0 +1,30 @@
#### [MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing](https://arxiv.org/pdf/2502.06643v1)

Two problems: 1. load imbalance across experts; 2. in distributed serving, imbalanced token routing across experts and high communication latency.

1. Integer linear programming (ILP) optimization

MoETuner optimizes expert placement via ILP, jointly considering token load, communication, and computation cost. ILP finds an optimal solution under the given constraints, ensuring the placement maximizes GPU utilization, minimizes idle time, and reduces cross-GPU communication.

2. Exploiting cross-layer dependency

MoETuner exploits cross-layer token-routing dependency: once a token is routed to a particular expert in one layer, it is more likely to be routed to certain experts in the next layer. Using this dependency, MoETuner reduces cross-GPU communication and keeps the token-routing load balanced across GPUs.

3. Two-stage ILP optimization (a sketch of stage 1 follows below)

MoETuner's ILP optimization has two stages:

Stage 1: Load-Balanced Expert Clustering

Goal: within each layer, cluster the experts so that every cluster's token-processing load is as balanced as possible.

Method: via ILP, assign experts to clusters so that each cluster's token load is close to the average.

Stage 2: Cluster-to-GPU Assignment

Goal: assign clusters to GPUs to minimize the cross-GPU token-routing cost.

Method: via ILP, assign clusters to GPUs while accounting for communication cost and GPU capacity limits.
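A PuLP sketch of stage 1 under the description above; the variable names, the deviation-based objective, and `cluster_experts` itself are illustrative, not MoETuner's formulation:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

def cluster_experts(load, n_clusters):
    # load[e]: token load of expert e in this layer
    E = len(load)
    avg = sum(load) / float(n_clusters)
    prob = LpProblem("expert_clustering", LpMinimize)
    x = LpVariable.dicts("x", (range(E), range(n_clusters)), cat="Binary")
    dev = LpVariable.dicts("dev", range(n_clusters), lowBound=0)

    for e in range(E):                       # each expert in exactly one cluster
        prob += lpSum(x[e][c] for c in range(n_clusters)) == 1
    for c in range(n_clusters):              # |cluster load - avg| <= dev[c]
        cl = lpSum(load[e] * x[e][c] for e in range(E))
        prob += cl - avg <= dev[c]
        prob += avg - cl <= dev[c]
    prob += lpSum(dev[c] for c in range(n_clusters))  # minimize total imbalance
    prob.solve()
    return [[e for e in range(E) if x[e][c].value() == 1]
            for c in range(n_clusters)]
```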
#### [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/pdf/2201.05596)

Proposes a new MoE architecture, Pyramid-Residual MoE (PR-MoE), which uses more experts in the later layers of the model and adds residual connections, shrinking the parameter count while preserving model quality.

![[250216-215720.png]]

- [ ] How does expert parallelism work?

#### [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/pdf/2410.22134)

#### [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://www.arxiv.org/pdf/2502.05370)
phd/weekly-report/24/241027.md (new file)
@@ -0,0 +1,17 @@
Objectives
- Analysis of Qwen trace
- Customize vLLM (Ali ver) with new features
- Port XPURemoting to PhOS

Key Results
- Enhance Qwen trace's workload separation
- Get vLLM KVCache hit rate for different open-source workloads
- Build a unified docker image for XPURemoting and PhOS

Last Week
- Get a unified workload taxonomy for the Qwen trace on both Web and App ends.
- Run vLLM (Ali ver) and start to customize it for some features (e.g. KVCache hit rate for different workloads).
- Build a new docker image to satisfy PhOS's base-image requirement with the XPURemoting env (statically linked PyTorch 1.13.1).

Next Week
- Customize vLLM to support new features like KVCache schedule policy comparison.

phd/weekly-report/24/241103.md (new file)
@@ -0,0 +1,16 @@
Objectives
- Analysis of Qwen trace
- Customize vLLM (Ali ver) with new features

Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools [60%]
- Modify vLLM to support different KV cache block numbers
- Profile open-source datasets with different cache blocks

Last Week
- Use Qwen-agent to handle workloads with files, getting a more precise token length for these workloads.
- Modify vLLM's cache manager to support a specific number of KVCache blocks, then measure the KV cache hit-rate trend by block number in different workloads.

Next Week
- Tokenize the whole Qwen trace, especially multimodal (image) workloads, and measure with these traces.
- Profile the KVCache hit rate on the actual trace and compare with other open-source traces to find differences.

phd/weekly-report/24/241110.md (new file)
@@ -0,0 +1,15 @@
Objectives
- Analysis of Qwen trace
- Customize vLLM (Ali ver) with new features

Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools
- Profile Qwen trace with different cache blocks

Last Week
- Use Qwen-agent to handle all workloads in the Qwen trace and get a precise token stream that simulates the actual online environment.
- Measure the performance and KVCache hit rate for different cache blocks using the real Qwen trace running for one hour.

Next Week
- Check the tokenization results from the Qwen trace; they may need modification.
- Measure KV cache performance with CPU memory.

phd/weekly-report/24/241117.md (new file)
@@ -0,0 +1,14 @@
Objective
- Customize vLLM (Ali ver) with new features

Key Results
- Test the modified vLLM which supports CPU KV cache
- Profile and break down the modified vLLM on synthetic data and the real Qwen trace

Last Week
- Merge the vLLM that supports CPU KV cache and use synthetic data and the real Qwen trace to measure performance and find bugs.
- Add breakdown measurement support on the vLLM server side to measure the time spent copying KV blocks.

Next Week
- Run more tests for the vLLM that supports CPU KV cache.
- Try to optimize the current implementation.
phd/weekly-report/24/241124.md (new file)
@@ -0,0 +1,17 @@
Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS

Key Results
- Refactor vLLM benchmark tools to get more precise metrics
- Simulate different token lengths and hit rates to quantify the hit rate's effect
- Modify XPURemoting to support the new architecture

Last Week
- Implement a unified vLLM benchmark tool to get more precise metric results and provide a unified request builder.
- Measure the effect of the cache hit rate and try to define a good hit rate for real performance improvement.
- Merge XPURemoting with new features and support for PhOS.

Next Week
- Define a `good hit rate` for KV cache scheduling.
- Finish the XPURemoting adaptation.

phd/weekly-report/24/241201.md (new file)
@@ -0,0 +1,16 @@
Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS

Key Results
- Define the good KVCache hit rate under different conditions [6/10]
- Prove the interference between different workloads in current vLLM
- Modify XPURemoting to support PhOS (v1)

Last Week
- Search different KVCache schedule algorithms and summarize commonalities toward a definition of the good KVCache hit rate.
- Profile the Ali trace in vLLM and group workloads to prove interference.
- Adapt XPURemoting to support the current PhOS API, and fully test the implementation on PhOS's open-source examples. [MR](https://ipads.se.sjtu.edu.cn:1312/scaleaisys/xpuremoting/-/merge_requests/25) for XPURemoting and [e80bf94](https://github.com/Gahow/PhoenixOS/commit/e80bf94075fcd6f53c97406dadfbe7f13fc16092) for PhOS.

Next Week
- Finish the definition of the good KVCache hit rate.

phd/weekly-report/24/241208.md (new file)
@@ -0,0 +1,15 @@
Objectives
- Serverless KVCache cache
- PhOS profile

Key Results
- Implement a workload-aware KVCache scheduler. [3/10]
- Provide test apps for PhOS

Last Week
- Implement a simulator for the KVCache scheduler to quickly test different policies.
- Prepare and give a paper sharing at Ali.
- Provide StableDiffusion single-GPU training, Llama2-13b multi-GPU training, and Llama2-70b multi-GPU inference scripts for PhOS profiling.

Next Week
- Implement a solution to reduce the KVCache memory need.

phd/weekly-report/24/241215.md (new file)
@@ -0,0 +1,13 @@
Objective
- Serverless KVCache cache

Key Results
- Test a workload-aware KVCache scheduler
- Implement the workload-aware policy in vLLM

Last Week
- Design a workload-aware schedule policy in the simulator and profile the KVCache reuse rate.
- Implement the designed policy in vLLM.

Next Week
- Profile the real performance of the new policy under vLLM and do some enhancement.
phd/weekly-report/24/241222.md (new file)
@@ -0,0 +1,16 @@
Objective
- Serverless KVCache cache

Key Results
- Implement the workload-aware policy in vLLM [8/10]
- Profile the workload-aware policy [3/10]
- Supplement workload differences in the Qwen trace

Last Week
- Add a new design point to the cache policy, making it consider cache memory size and predicted reuse distance together. To do this, add a new monitor for workloads' reuse time intervals and average number of tokens.
- Set an offline (i.e. best) scheduling policy; profile the default policy, our workload-aware policy, and the offline policy to show the performance difference in the CDF of TTFT.
- Implement a cache-block source tracker in vLLM to show where KVCache reuse comes from. Prove that 90% of KVCache reuse comes from multi-turn chat.

Next Week
- Improve the performance of our policy.
- Plot some formal figures.

phd/weekly-report/24/241229.md (new file)
@@ -0,0 +1,14 @@
Objective
- Serverless KVCache cache

Key Results
- Implement the workload-aware policy in vLLM
- Profile the workload-aware policy [3/10]

Last Week
- Implement a priority-based (calculated by our policy) evictor for both the GPU and CPU sides.
- Test our policy under relatively small cache memory: a 30% cache hit ratio and 10% performance improvement, showing our policy fits limited cache memory. For larger cache memory, the policy still needs some fine-tuning.

Next Week
- Improve our policy for larger cache memory.
- Analyze the new trace.

phd/weekly-report/25/250105.md (new file)
@@ -0,0 +1,12 @@
Objective
- Serverless KVCache cache

Key Results
- Analyze the new trace

Last Week
- Get the qwen-plus trace and analyze its features: 96% of requests come from scripts, and many long system prompts are reused many times (more than 100k times in 4h).
- Confirm trace A and trace B for the paper. Draw figures for them.

Next Week
- Profile our policy.

phd/weekly-report/25/250112.md (new file)
@@ -0,0 +1,11 @@
Objective
- Serverless KVCache cache

Key Results
- Run tests and get numbers for the paper

Last Week
- Do all things about the paper.

Next Week
- Finish the paper.

phd/weekly-report/25/250119.md (new file)
@@ -0,0 +1,11 @@
Objective
- Serverless KVCache cache

Key Results
- Finish the paper for ATC

Last Week
- Do all things about the paper.

Next Week
- Go for vacation
phd/weekly-report/25/250223.md (new file)
@@ -0,0 +1,17 @@
Objective
- Serverless KVCache cache
- MoE study

Key Results
- Check the trace from Ali and fix problems
- Define a formatted trace structure for the upcoming refinement
- Study papers about MoE; run int4 DeepSeek V3 671B on 8 * A800

Last Week
- Communicate with a colleague at Ali to get the desired trace and check the problems in the trace to give feedback.
- Design a standard trace structure for better refining, then start formatting a 12h slice of the trace for testing.
- Study MoE and find an int4-quantized DeepSeek V3 671B to run on 8 * A800.

Next Week
- Format the whole trace to the desired structure.
- Study DeepSeek V3 to see how the experts are parallelized.

phd/weekly-report/25/250302.md (new file)
@@ -0,0 +1,14 @@
Objective
- Serverless KVCache cache

Key Results
- Format traceA and traceB to the standard format and get the chat sessions

Last Week
- Update the processing script to support streaming and format 24h of data for traceA and traceB.
- Prepare the paper sharing.
- Go back to school for the intern defense.

Next Week
- Analyze traceA and traceB in the 24h data.
- Survey the different DeepSeek deployment methods.

phd/weekly-report/25/250309.md (new file)
@@ -0,0 +1,14 @@
Objective
- Serverless KVCache cache

Key Results
- Test traceA and traceB and fix bugs
- Survey the hardware for MoE deployment in a medium-scale cluster

Last Week
- Test traceA and traceB, then fix bugs in the format pass to handle corner cases.
- Learn the calculation details of MLA and MoE to estimate the memory and compute requirements, and compare them against different hardware.

Next Week
- Re-plot all the figures about the trace.
- Survey MoE deployment.

phd/weekly-report/25/250316.md (new file)
@@ -0,0 +1,17 @@
Objective
- Serverless KVCache cache
- DeepSeek deployment study

Key Results
- Refine some trace figures in the 24h trace
- Give a cache policy evaluation method (w/ Jinbo)
- Survey the hardware for MoE deployment in a medium-scale cluster

Last Week
- Finish all the trace cleaning and preprocessing, and re-plot some figures for traceA and traceB on the new trace.
- Communicate with Jinbo to better understand the gap between vLLM cache management and traditional cache policies. Figure out an evaluation method to judge the cache policy.
- Calculate the FLOPs requirement for DeepSeek.

Next Week
- Test and refine the cache policy.
- Try to summarize the challenges of medium-scale deployment.
phd/weekly-report/25/250323.md (new file)
@@ -0,0 +1,15 @@
Objectives
- Serverless KVCache cache
- DeepSeek deployment study

Key Results
- Write a KVCache simulator to speed up policy tests
- Refine S3-FIFO to get some improvement

Last Week
- Write a *naive* KVCache simulator aligned with vLLM's KVCache management; it shows very small bias compared to real vLLM.
- Refine S3-FIFO in vLLM and evaluate it; it gives a small improvement in relatively small cache space.
- Write the middle-stage report for the graduation thesis.

Next Week
- Refine the cache policy.

phd/weekly-report/25/250330.md (new file)
@@ -0,0 +1,13 @@
Objective
- Serverless KVCache cache

Key Result
- Implement the PDF-based workload-aware cache policy in the simulator
- Test the policy with different refinement methods

Last Week
- Implement the workload-aware cache policy via exponential-distribution fitting and get a stable hit-ratio improvement for the first time.
- Try monitoring with a sliding time window, warming up the fitting coefficients, using oracle fitting coefficients, etc., but none of them yields a notable improvement.

Next Week
- Refine the cache policy.

phd/weekly-report/25/250406.md (new file)
@@ -0,0 +1,13 @@
Objective
- Serverless KVCache cache

Key Result
- Analyze the differences between LRU/WA/oracle

Last Week
- Define the difference between cache policies with a reuse rank (for each cache hit, we get the current key's rank under a cache policy). Evaluate different cache policies by reuse rank and draw CDFs.
- Prepare for and attend the mid-term graduation thesis defense.

Next Week
- Do the rebuttal for ATC.
- Implement the WA policy in vLLM and test it.

phd/weekly-report/25/250413.md (new file)
@@ -0,0 +1,13 @@
Objective
- Serverless KVCache cache

Key Result
- Rebuttal for ATC'25
- Refine the cache policy implementation

Last Week
- Finish the rebuttal for ATC'25 w/ Jinbo.
- Fix some bugs in our cache policy and test in the simulator to get a slight hit-ratio improvement.

Next Week
- Implement the WA policy in vLLM and test it.
phd/weekly-report/25/250420.md (new file)
@@ -0,0 +1,14 @@
Objective
- Serverless KVCache cache

Key Result
- Refine the cache policy implementation
- Write the graduation thesis

Last Week
- Fix some bugs in our cache policy and test in the simulator to get a slight hit-ratio improvement.
- Fix bugs in the cache policy and simulator and refine the policy to always (1x, 2x, 4x) get a better cache hit ratio than LRU.
- Write 20 pages of the graduation thesis.

Next Week
- Refine the cache policy to get better performance.

phd/weekly-report/25/250427.md (new file)
@@ -0,0 +1,15 @@
Objective
- Serverless KVCache cache

Key Result
- Refine the cache policy implementation
- Implement and test our workload-aware cache policy in vLLM
- Write the graduation thesis

Last Week
- Refine the cache policy to consider the _cost_ of keeping cache in memory, and get about a 1% to 2% hit-rate improvement under 1k+1k cache blocks.
- Implement the PDF-based workload-aware cache policy in vLLM and profile LRU vs. WA under Qwen2-7B, getting a 25% QTTFT reduction.
- Finish the first draft of the graduation thesis.

Next Week
- Run full tests for different cache policies and under different models.

phd/weekly-report/25/250504.md (new file)
@@ -0,0 +1,11 @@
Objective
- Serverless KVCache cache

Key Result
- Null

Last Week
- Labor Day vacation

Next Week
- TBD

phd/weekly-report/25/250511.md (new file)
@@ -0,0 +1,16 @@
Objective
- Serverless KVCache cache

Key Result
- Preprocess 5 days of trace for comparison
- Draw policy results on Trace A under 7B, 13B, 70B models
- Write the policy algorithm for the paper
- Prepare a version of the trace for open source

Last Week
- Get 5 workdays of trace and preprocess them for future simulator tests.
- Write the paper's policy design part; finish the pseudocode for our cache policy.
- Process and get an anonymized trace for open source.

Next Week
- Finish the final version of the paper.
phd/weekly-report/25/250518.md (new file)
@@ -0,0 +1,15 @@
Objective
- Serverless KVCache cache

Key Result
- Prepare a repo for the open-source Qwen trace
- Write the paper's policy part
- Draw policy test figures

Last Week
- Prepare a trace repo for the open-source flow at Ali.
- Write the paper's policy design and eval parts.
- Rerun the policy tests multiple times to draw figures with shadows (error bars).

Next Week
- Finish the final version of the paper.

phd/weekly-report/25/250525.md (new file)
@@ -0,0 +1,15 @@
Objective
- Serverless KVCache cache

Key Result
- Refine the final version of KV$ cache for ATC'25
- Prepare graduation defense slides

Last Week
- Finish the final version of KV$ cache and send it to the shepherd.
- Finish the slides and submit the materials for the graduation defense.
- Learn from ChinaSys'25.

Next Week
- Go for the graduation defense.
- Polish the camera-ready version of KV$ cache.

phd/weekly-report/25/250601.md (new file)
@@ -0,0 +1,17 @@
Objectives
- Serverless KVCache cache
- MoE autoscaling

Key Results
- [10/10] Refine the final version of KV$ cache for ATC'25
- [10/10] Graduation thesis defense
- [2/10] Run MoE models at Ali
- [0/10] Analyze the pattern of expert loading in the Ali trace

Last Week
- Prepare for and finish the graduation defense.
- Polish the final version of KV$ cache and send it to the shepherd.
- Run Qwen3-32B on the latest vLLM.

Next Week
- Modify vLLM to support tracing the expert load pattern.

phd/weekly-report/25/250608.md (new file)
@@ -0,0 +1,17 @@
Objectives
- Serverless KVCache cache
- MoE autoscaling

Key Results
- [10/10] Refine the final version of KV$ cache for ATC'25
- [8/10] Run MoE models at Ali
- [0/10] Analyze the pattern of expert loading in the Ali trace
- [0/10] Fully understand how EP influences performance

Last Week
- Modify vLLM to support tracing the activated experts and test on the Ali trace with Qwen3-32B.
- Prepare and submit KV$ cache to arXiv.

Next Week
- Analyze the expert patterns.
- Test on more MoE models.
phd/weekly-report/25/250615.md (new file)
@@ -0,0 +1,20 @@
Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance

Key Results
- [0/10] Prepare slides for the ATC'25 presentation w/ Jinbo
- [8/10] Run MoE models at Ali
- [5/10] Analyze the pattern of expert loading in the Ali trace
- [3/10] Analyze the expert pattern in different models
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Develop in vLLM to support tracing expert patterns with PP, distributed with Ray, for DeepSeek-671B.
- Analyze the expert pattern's temporal locality.

Next Week
- Develop fully in vLLM for all models.
- Analyze the expert pattern's correlations between layers.

phd/weekly-report/25/250622.md (new file)
@@ -0,0 +1,22 @@
Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance

Key Results
- [5/10] Prepare slides for the ATC'25 presentation w/ Jinbo
- [1/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [0/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Trace expert patterns with the Qwen trace under Qwen3-235B and DeepSeek-671B.
- Analyze the expert pattern's temporal locality in large models (Qwen3-235B and DeepSeek-671B).
- Prepare the KVCache slides.
- All misc for graduation.

Next Week
- Analyze the expert pattern's correlations between layers.
- Survey current MoE works for more observations to check.

phd/weekly-report/25/250629.md (new file)
@@ -0,0 +1,20 @@
Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance

Key Results
- [9/10] Prepare slides for the ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [0/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Survey MoE works and summarize their key points.
- Refine the KVCache slides w/ Jinbo.
- Nit: support Ali machine usage and write a landing doc.

Next Week
- Check the feasibility of the EP combinatory method.
phd/weekly-report/25/250706.md (new file)
@@ -0,0 +1,19 @@
Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance

Key Results
- [10/10] Prepare slides for the ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Survey the architecture of Bailian: read their docs and learn about their gateway, cluster setup, and some serverless services.
- Refine the KVCache slides w/ Jinbo and Dingyan.

Next Week
- Skip for one week.

phd/weekly-report/25/250713.md (new file)
@@ -0,0 +1,20 @@
Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance

Key Results
- [10/10] Prepare slides for the ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Go for vacation.

Next Week
- Understand the infrastructure of Bailian further.
- Review and write comments for at least 3 papers as a shadow PC.
- Learn about the current MoE network features under different parallelism modes.

phd/weekly-report/25/250720.md (new file)
@@ -0,0 +1,20 @@
Objectives
- MoE pattern feature
- EP design for inference performance

Key Results
- [6/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Survey the infrastructure of Bailian, especially model serving and batching.
- Give a KVCache talk at Ali w/ Jinbo.
- Review 2 papers as a shadow PC.
- Survey the agent workflow for potential system problems.

Next Week
- Survey scheduling for different parallelism setups.
- Review and write comments for all assigned papers.
phd/weekly-report/25/250727.md (new file)
@@ -0,0 +1,18 @@
Objectives
- MoE pattern feature
- EP design for inference performance

Key Results
- [6/10] Survey MoE works and their observations
- [9/10] Analyze the expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance

Last Week
- Survey heterogeneous parallelism config setups for different workloads and SLOs.
- Finish the reviews for all papers as a shadow PC.

Next Week
- Survey the chances and challenges of EP reconfiguration.
- Survey agentic AI infra.

phd/weekly-report/25/250803.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Heterogeneous parallelism in the cluster
- EP design for inference performance

Key Results
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)

Last Week
- [For KR1] Run the latest vLLM with different parallelism configurations (TP, PP, DP, EP) on Qwen-30B with fixed input/output lengths to get their differences.
- [Misc] Write the AIR project conclusion docs for the collaboration at Ali w/ Jinbo.

Next Week
- Test different parallelism configurations with the latest Ali trace.
- Analyze the performance pattern in different workloads.

phd/weekly-report/25/250810.md (new file)
@@ -0,0 +1,22 @@
Objectives
- Heterogeneous parallelism in the cluster
- EP design for inference performance

Key Results
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)

Last Week
- [For KR1] Read the vLLM code and understand how vLLM TP/PP/DP works.
- [For KR1] Run profiling tests with different configs over a more complete search space.
- [Surveying] Understand the bottleneck of autoscaling at Ali.
- [Surveying] The opportunity to profile kernels and get a best compute graph to guide the parallelism config.
- [Misc] Prepare slides for the AIR project conclusion defense.

Next Week
- Survey the possibility of a universal kernel-based parallelism config search (starting from the related works around NanoFlow).
- Check the possibility of using GPU bubbles to run small models.
- Check the challenges of switching parallelism configs with context.
phd/weekly-report/25/250817.md (new file)
@@ -0,0 +1,21 @@
Objectives
- Heterogeneous parallelism in the cluster
- EP design for inference performance

Key Results
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)

Last Week
- [Surveying] Learn about compute graph arrangement in traditional streaming/batch systems, compared to LLM inference systems.
- [KR1] Profile vLLM to get the kernels' time consumption and overlapping status.
- [Misc] Review 3 papers as a shadow PC for Round 2.
- [Misc] Prepare for and finish the AIR project conclusion defense with slides.

Next Week
- Summarize a table of the similarities and challenges in compute graph arrangement optimization between traditional streaming systems and LLM inference systems.

phd/weekly-report/25/250824.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Heterogeneous parallelism in the cluster
- EP design for inference performance [untracked]

Key Results
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP (static/dynamic) influences performance
- [4/10] Analyze correlations between MoE layers [suspended]

Last Week
- [KR2] Learn about Triton (vLLM has many kernels implemented in Triton); run a demo that compiles a Python Triton kernel to PTX, then loads and calls it from Rust.
- [KR2] Try a demo that runs vLLM's flash-attention in Rust.

Next Week
- Find a way to get the full compute flow and data flow in vLLM, then replay them in Rust.

phd/weekly-report/25/250921.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [3/10] Implement a minimal Rust inference framework
- [0/10] Trace the vLLM compute graph and data flow
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR1] Learn about and implement a simple LLM inference in candle.
- [KR1] Debug the float-precision problem in candle, trying to figure out the root cause: the kernel library or Rust float precision.

Next Week
- Think about the structure of the inference framework.
- Continue the Rust code implementation.
phd/weekly-report/25/250928.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [0/10] Trace the vLLM compute graph and data flow
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR2] Rethink the project target and the definition of the IR for automatic distribution optimization.
- [KR2] Learn some category theory for the IR abstraction.
- [KR2] Survey TVM and MLC LLM to learn about their IR abstractions.

Next Week
- Profile the compute and communication time of kernels to show the bubbles in micro-batches under different models and input lengths.

phd/weekly-report/25/251012.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [0/10] Trace the vLLM compute graph and data flow
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- Theoretically analyze the dual-batch-overlap optimization to show that different models on different hardware should apply different execution flows.
- Survey DBO and hybrid KVCache management in vLLM.
- Put some bottom-up items into the roadmap: https://ipads.se.sjtu.edu.cn:1312/wangjh/infer-framework/-/issues/3.

Next Week
- Go through the vLLM codebase to assess the feasibility and challenges of automatically applying an execution flow for different models.

phd/weekly-report/25/251019.md (new file)
@@ -0,0 +1,18 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR1] Learn how vLLM implements DBO. Check the feasibility of applying an execution flow automatically from a generated config.
- [misc] Write a paper commentary for SOSP.

Next Week
- Summarize the optimizations in Qwen.
- Profile the model's different stages (modules) and analyze the overlap status.
phd/weekly-report/25/251026.md (new file)
@@ -0,0 +1,19 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [5/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR1] Summarize the optimizations for Qwen models: fused_moe kernels, attention optimization, and data copy reduction.
- [KR1] Survey the workflow for parallelism config search at Ali.
- [misc] Finish 3 homework assignments for courses.

Next Week
- Explore the possibility of searching configs automatically with AI, like AlphaEvolve.

phd/weekly-report/25/251102.md (new file)
@@ -0,0 +1,19 @@
Objectives
- Auto distributed LLM inference config optimization

Key Results
- [2/10] Build the first version of the auto-tuner system
- [5/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR1] Design the auto-tuner system structure and draw a figure.
- [KR1] Write code for the hardware prober and workload profiler.

Next Week
- Continue building the system: the config generator and config tuner.

phd/weekly-report/25/251109.md (new file)
@@ -0,0 +1,20 @@
Objectives
- Auto LLM inference config tuner

Key Results
- [4/10] Build the first version of the auto-tuner system
- [5/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]

Last Week
- [KR1] Write code for a basic config generator and a benchmark to check the performance.
- [KR1] Try to find a way to tune the config for better performance.

Next Week
- Benchmark the baseline and some human-tuned configs to prove the necessity of config tuning.
- Continue designing a way for auto tuning.
20
phd/weekly-report/25/251116.md
Normal file
@@ -0,0 +1,20 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [5/10] Build the first version of the auto-tuner system
- [5/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- [KR1] Struggle to prepare the running environment for Qwen3-Max-fp4; fix/bypass many dependency and code problems.
- [misc] Prepare the first version of the review agent w/ Yingyi. [0b288d64](https://ipads.se.sjtu.edu.cn:1312/shadowpc/deep-review/-/commit/0b288d643301edcb19be6baf394710ce35a2dd74) ~ [57093ff4](https://ipads.se.sjtu.edu.cn:1312/shadowpc/deep-review/-/commit/57093ff4a5782dbfa6e40456b9c0825df5576f8b)

Next Week

- Think about the key insight behind our system's target.
- Continue implementing the tuner part of our system.
19
phd/weekly-report/25/251123.md
Normal file
@@ -0,0 +1,19 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [6/10] Build the first version of the auto-tuner system
- [7/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- [KR2] Benchmark different configs on different hardware; show that different hardware and different workloads produce different performance trends. [5f2c1ec3](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/5f2c1ec3692586031f3ecd452709a034d8217113) ~ [65d05520](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/65d0552020041d5922e13172d9c40f8ef93a3985)
- [KR1] Build a precise workload generator from real workloads. Benchmarking on _quite similar_ generated workloads shows that even similar workloads still yield different performance.

Next Week

- Find the root cause of the performance gap under similar workloads.
19
phd/weekly-report/25/251130.md
Normal file
@@ -0,0 +1,19 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [7/10] Build the first version of the auto-tuner system
- [7/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- [KR1] Update the workload generator built from real workloads: beyond request timestamps, it now also reproduces input_length, output_length, and KVCache hit ratio. Then benchmark to check whether an abstract spec can replay similar performance (see the sketch after this list). [b0bcfa63](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/b0bcfa6326f69755aaaf859d89ad2def2409cd48)~[fb1f0848](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/fb1f084815342d6b8379f3b191ed152a3c1cda67)
- [KR1] Check the root cause of the performance gap under similar workloads: the difference mainly comes from differing inference load.
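A minimal sketch of what such an abstract workload spec might look like; the field names are hypothetical, not the actual repo code:

```python
import random
from dataclasses import dataclass

@dataclass
class RequestSpec:
    arrival_s: float        # timestamp offset from trace start
    input_len: int          # prompt tokens
    output_len: int         # generated tokens
    cache_hit_ratio: float  # fraction of prompt expected to hit KVCache

def replay(spec: list, jitter_s: float = 0.0):
    """Yield (send_time, request) pairs; jitter models load variation."""
    for r in spec:
        yield r.arrival_s + random.uniform(0.0, jitter_s), r
```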
Next Week

- Update the workload abstraction spec for more precise replayed performance.
20
phd/weekly-report/25/251207.md
Normal file
@@ -0,0 +1,20 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [7/10] Build the first version of the auto-tuner system
- [7/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- [KR1] Update the workload generator from real workloads with a more precise spec abstraction. [c969f366](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/c969f366b05cad03447e1d7bdd9f30785dd792e4)~[7407149d](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/7407149d1052d3d610fd1fb3e51ce60068ba4981)
- [KR1] Benchmark and compare generated workloads against raw workloads; find that when input/output lengths are generated rather than replayed, performance varies a lot (see the sketch below).
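One plausible reason generated lengths diverge from the trace: sampling input and output lengths independently preserves the marginals but discards their joint distribution, which changes batching behavior. A toy illustration (hypothetical, not the repo code):

```python
import random

trace = [(3210, 120), (95, 800), (4100, 64)]  # (input_len, output_len) pairs

def replay_lengths(trace):
    """Exact replay keeps the input/output joint distribution."""
    return list(trace)

def generate_lengths(trace):
    """Independent sampling keeps the marginals but breaks the
    input/output correlation, which can shift performance."""
    ins = [i for i, _ in trace]
    outs = [o for _, o in trace]
    return [(random.choice(ins), random.choice(outs)) for _ in trace]
```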
Next Week

- Find the root cause of the workload performance variation.
- Summarize the intelligence for the auto-tuning path.
20
phd/weekly-report/25/251214.md
Normal file
@@ -0,0 +1,20 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [9/10] Build the first version of the auto-tuner system
- [7/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- [KR1] Study and summarize the system intelligence and learn the basic way to implement an auto tuner.
- [KR1] Implement the naive auto-tuner framework: run vLLM with sampled configs, then aggregate the benchmark results as context for an LLM, which proposes new configs for the next evolution round (see the sketch below). [ad0b0fc3](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/ad0b0fc3eb3dea5f91a2c75efc69894fac011301)~[420afa3c](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/420afa3c7a48d19e2d864f212db0efcd86b40ca8)
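A minimal sketch of that evolve loop. The `launch_vllm`, `run_benchmark`, and `llm_propose` callables are hypothetical placeholders injected by the caller; their real implementations live in the repo:

```python
def tune(initial_configs, launch_vllm, run_benchmark, llm_propose, rounds=5):
    """Evolutionary tuning: benchmark sampled configs, summarize results
    as LLM context, and ask the LLM to propose the next generation."""
    history = []
    configs = list(initial_configs)
    for _ in range(rounds):
        for cfg in configs:
            server = launch_vllm(cfg)        # start a server with this config
            history.append((cfg, run_benchmark(server)))
            server.shutdown()                # assumed handle API
        context = "\n".join(f"{c} -> {score:.2f}" for c, score in history)
        configs = llm_propose(context)       # LLM returns a list of new configs
    return max(history, key=lambda t: t[1])  # best (config, score) seen
```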
Next Week

- Benchmark and summarize the performance of the auto tuner vs. an expert.
- Survey heterogeneous hardware utilization at Ali.
20
phd/weekly-report/25/251221.md
Normal file
@@ -0,0 +1,20 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [9/10] Build the first version of the auto-tuner system
- [7/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [0/10] Trace vLLM's compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with heterogeneous setups [offtrack]

Last Week

- Refine the story: today heterogeneous workloads are classified by labels or input length, which is not enough; we should instead define a classification method that groups workloads with similar performance under the same config (see the sketch after this list).
- Prepare slides summarizing the story and next steps.
- Prepare slides for the IPADS group meeting.
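A hedged sketch of that grouping idea: represent each workload by its performance vector across a fixed set of probe configs, then cluster. The data below is made up for illustration; KMeans stands in for whatever grouping method we end up choosing:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows: workloads; columns: measured goodput under the same probe configs.
perf = np.array([
    [1.0, 0.4, 0.7],
    [0.9, 0.5, 0.6],
    [0.2, 1.1, 0.3],
])

# Workloads in the same cluster behave alike and can share a tuned config.
groups = KMeans(n_clusters=2, n_init=10).fit_predict(perf)
print(groups)
```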
Next Week

- Run benchmarks for the current workload classification to show that different classes need different configs to maximize goodput.
18
phd/weekly-report/25/251228.md
Normal file
@@ -0,0 +1,18 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [2/10] Workload grouping methods
- [9/10] Build the first version of the auto-tuner system
- [8/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences

Last Week

- [KR1] Run benchmarks for different workload classifications and show that the classification method shifts the best config, and that different workload groups need different configs to maximize goodput (see the goodput sketch after this list).
- [misc] Prepare for the IPADS group meeting presentation.
- [misc] Prepare for the ChinaSys presentation.
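For reference, one common way to score a config is SLO-constrained goodput. A toy version with assumed metric field names, not necessarily the repo's exact definition:

```python
def goodput(results, ttft_slo_s=1.0, tpot_slo_s=0.05):
    """Completed requests that met both TTFT and TPOT SLOs, per second.
    `results` is a list of per-request dicts (hypothetical keys)."""
    ok = [r for r in results
          if r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s]
    duration = max(r["finish_s"] for r in results) - min(r["start_s"] for r in results)
    return len(ok) / duration
```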
Next Week

- Define the workload classification space and find a method to group workloads.
17
phd/weekly-report/260104.md
Normal file
@@ -0,0 +1,17 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [2/10] Workload grouping methods
- [9/10] Build the first version of the auto-tuner system
- [8/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences

Last Week

- [KR2] Run the AI Tuner under the current Ali workload groups (input length / label) and look for insights toward building a better AI Tuner.
- [misc] Build the system for the EuroSys Shadow experiment.

Next Week

- Compare the AI Tuner results with Ali's current setup to find more insights for the AI Tuner.
21
phd/weekly-report/260111.md
Normal file
@@ -0,0 +1,21 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [4/10] Build the agentic tuner system
- [10/10] Build the first version of the auto-tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences

Last Week

- [KR1] Refactor the first version of the auto-tuner system to make it more agentic (see the early-stop sketch below). [4e3b15b6](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/4e3b15b60819fb61d04148302be68bb66e9dda7b) ~ [095c1edd](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/095c1edda49bfd8dad70bed20e81564c29ae3e8a)
  - Support a tool library for our tuner system to call
  - Speed up the tuning time
  - Support early stop for bad configs
  - Support LLM prediction of the performance trend, with reflection
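A minimal sketch of the early-stop idea for bad configs: abort a benchmark run once its running score has trailed the incumbent for too long. The `run_step` callable is a hypothetical benchmark interface injected by the caller:

```python
def benchmark_with_early_stop(run_step, best_score, steps=10, patience=3):
    """run_step(i) returns the running score after step i (hypothetical).
    Stop early when the score trails the incumbent for `patience` steps."""
    behind = 0
    score = 0.0
    for i in range(steps):
        score = run_step(i)
        behind = behind + 1 if score < best_score else 0
        if behind >= patience:
            return score, True   # early-stopped; treat as a bad config
    return score, False
```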
Next Week

- Summarize the advantages of the agentic tuner system and continue optimizing it.
20
phd/weekly-report/260118.md
Normal file
@@ -0,0 +1,20 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [6/10] Build the agentic tuner system
- [10/10] Build the first version of the auto-tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences

Last Week

- [KR1] Update the agentic AITuner to support the new trace benchmark, new vLLM flags, and an objective score (see the sketch after this list). [0a012bdd](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/0a012bdda53086cd24277962abb0cb559bd313bb) ~ [788da3d8](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/788da3d8bc546620e8c76800dfb7070372cb3540)
- [KR1] Survey related work: some works build agents for LLM training / storage systems / ..., and one work uses Bayesian optimization (BO) for LLM inference config tuning.
- [misc] Prepare an open-sourced version of the new traces (thinking and coder) and update the README.
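A hedged sketch of what a scalar objective score might look like, trading throughput off against SLO attainment; the weights and metric keys are assumptions for illustration:

```python
def objective_score(m, w_goodput=1.0, w_p99=0.5):
    """Combine benchmark metrics `m` (a dict with hypothetical keys) into
    one scalar the tuner maximizes; penalize p99 TTFT above a 1s SLO."""
    penalty = max(0.0, m["p99_ttft_s"] - 1.0)
    return w_goodput * m["goodput_rps"] - w_p99 * penalty
```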
Next Week

- Optimize the agentic AITuner.
- Test SCOOT as one of the baselines.
21
phd/weekly-report/260125.md
Normal file
@@ -0,0 +1,21 @@
Objectives

- Auto LLM inference config tuner

Key Results

- [8/10] Build the agentic tuner system
- [2/10] Paper outline
- [10/10] Build the first version of the auto-tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of automatically arranging LLM inference compute graphs
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences

Last Week

- [KR1] Update the agentic AITuner to support DP vs. replicas and early-stop error handling, and fix problems/illegal constraints in the large search space (see the legality sketch after this list). [6c0940e7](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/6c0940e7b0a234265290398fe0a7ca7b7f3d4178) ~ [0cbc1727](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/0cbc1727c06589ea9b021b223883d0fd114fd4c7)
- [KR2] Prepare a draft of the paper outline, summarizing the current story and next steps.
- [misc] Prepare a [paper template](https://ipads.se.sjtu.edu.cn:1312/wangjh/paper-ai-tuner).
- [misc] Open-source our new trace and trace replayer at https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon.
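A minimal sketch of the kind of legality check that prunes illegal points from the search space; the knob names are hypothetical placeholders:

```python
def is_legal(cfg, num_gpus=8):
    """Reject configs outside the search space (hypothetical knob names).
    DP here means data-parallel engine shards, as opposed to whole-server
    replicas, so all three degrees multiply into a GPU budget."""
    gpus_needed = (cfg["tensor_parallel_size"]
                   * cfg["pipeline_parallel_size"]
                   * cfg["data_parallel_size"])
    return gpus_needed <= num_gpus and cfg["max_num_seqs"] >= 1
```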
Next Week

- Compare with the configs in Ali's production environment.