Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions
--- a/phd/papers/IMPRESS.figs/250228-095457.png
+++ b/phd/papers/IMPRESS.figs/250228-095457.png
--- a/phd/papers/IMPRESS.figs/250228-164104.png
+++ b/phd/papers/IMPRESS.figs/250228-164104.png
--- a/phd/papers/IMPRESS.figs/250228-170902.png
+++ b/phd/papers/IMPRESS.figs/250228-170902.png
--- a/phd/papers/IMPRESS.figs/250228-175702.png
+++ b/phd/papers/IMPRESS.figs/250228-175702.png
--- a/phd/papers/IMPRESS.figs/250228-175831.png
+++ b/phd/papers/IMPRESS.figs/250228-175831.png
--- a/phd/papers/IMPRESS.figs/250301-134529.png
+++ b/phd/papers/IMPRESS.figs/250301-134529.png
--- a/phd/papers/IMPRESS.figs/250301-151846.png
+++ b/phd/papers/IMPRESS.figs/250301-151846.png
--- a/phd/papers/IMPRESS.figs/250301-155139.png
+++ b/phd/papers/IMPRESS.figs/250301-155139.png
--- a/phd/papers/IMPRESS.figs/250301-155325.png
+++ b/phd/papers/IMPRESS.figs/250301-155325.png
--- a/phd/papers/IMPRESS.figs/250301-155822.png
+++ b/phd/papers/IMPRESS.figs/250301-155822.png
--- a/phd/papers/IMPRESS.figs/250301-155959.png
+++ b/phd/papers/IMPRESS.figs/250301-155959.png
--- a/phd/papers/IMPRESS.figs/250301-160458.png
+++ b/phd/papers/IMPRESS.figs/250301-160458.png
--- a/phd/papers/IMPRESS.figs/250301-162611.png
+++ b/phd/papers/IMPRESS.figs/250301-162611.png
--- a/phd/papers/IMPRESS.md
+++ b/phd/papers/IMPRESS.md
@@ -0,0 +1,139 @@
+# TL;DR
+
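+This work observes that the cost of loading the KV cache from SSD to GPU is hard to overlap with computation; loading accounts for 51%-98% of TTFT. Its starting point is to exploit attention sparsity and load only the important KV cache. This raises two challenges. C1: how to decide which KV cache entries on SSD are important. C2: loads of important KV cache are scattered, so bandwidth is used inefficiently, and because of KV cache sparsity, existing cache policies are also inefficient. The work addresses C1 with the similarity of important tokens across heads, and C2 with KV reordering on SSD plus an importance-score-based cache policy.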
+
+# Motivations
+
+The KV entries of different tokens differ in importance, so we do not need to load the KV of all prefix tokens; loading only the important KV reduces the I/O overhead.
+
+# Challenges
+
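+## Existing methods for identifying important KV still incur large I/O overhead
+
+The most naive approach is to load all KV, compute attention, and then determine which entries are important. But once attention over all tokens has been computed, there is nothing left to gain from selecting the important ones ✖️
+
+The sparsity of $\mathrm{attention}(Q, K, V) = \mathrm{softmax}(QK^T)V$ comes from the sparsity of $\mathrm{softmax}(QK^T)$, so existing work loads only the keys (all of them). This halves the cost, but it is still large ✖️
+
+Record which tokens of a prefix are important and reuse that record directly the next time the prefix is reused. Accuracy suffers: in RAG, for example, the same prefix tokens (docs) have different important tokens when followed by different queries ✖️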
+![[250228-095457.png]]
+
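+## Existing systems cannot use important KV efficiently
+
+To read from disk and use PCIe efficiently, existing systems read in units of chunks. A chunk contains both important and unimportant KVs, which causes read amplification and hurts performance.
+
+Existing KV cache managers use policies such as LRU/LFU, which ignore the differences in importance between KV cache entries. Unimportant KV cache may therefore occupy GPU memory, lowering the hit ratio of important KV cache or causing more transfer overhead.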
+
+# Key Observation
+
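+## Important tokens are similar across heads within the same layer
+
+For two sets $A, B$, the Jaccard score defines similarity as $S = \frac{|A \cap B|}{|A \cup B|}$. In figure (a) below, for example, the similarity is $\frac{1}{3}$. In the paper's experiments with 32 heads, the pairwise similarities are all above 0.9.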
+![[250228-164104.png]]
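As a small illustration of the metric above, the sketch below computes the Jaccard score between the important-token sets of two heads. The helper names (`top_k_tokens`, `jaccard`) and the toy attention weights are made up for this example and do not come from the paper.

```python
def top_k_tokens(weights, k):
    """Return the ids of the k tokens with the largest attention weight."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy attention weights of two heads over the same 4 prefix tokens (made up).
head_0 = [0.50, 0.30, 0.15, 0.05]
head_1 = [0.05, 0.45, 0.40, 0.10]

set_0 = top_k_tokens(head_0, k=2)   # tokens 0 and 1
set_1 = top_k_tokens(head_1, k=2)   # tokens 1 and 2
similarity = jaccard(set_0, set_1)  # one shared token out of three distinct ones
print(set_0, set_1, similarity)
```

The two sets share one token out of three distinct ones, which reproduces the $\frac{1}{3}$ case mentioned in the text.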
+
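+## The similarity holds across different important-KV ratios and model sizes
+
+The important-KV ratio means: for a ratio $r$, the KV of $r \cdot \text{num\_tokens}$ tokens are treated as important KV.
+![[250228-170902.png]]
+The paper does not explain why, when only the top 10% of KV are selected, the similarity drops noticeably in the second half of the layers.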
+
+> Although smaller models and deeper transformer layers tend to exhibit lower similarities, they are still significantly higher than the expected value from random selection in most cases.
+
+# Solution
+
+## For C1, from O1
+
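+Since important KV are similar across the heads of one layer, a subset of heads can serve as probe heads: compute their important KV and let it represent the important KV of the whole layer.
+
+Specifically, the first 3 heads are chosen as probe heads (the paper claims: the choice of which three heads to use has no impact on accuracy due to the similarity). Load their keys, compute each head's set of important token ids, and compute the average similarity among them.
+
+If the similarity exceeds the threshold, return one of the 3 token id sets as the global important token id set (the paper does not say how to pick one of the three; it may pick one at random, or pick the head with the highest average similarity to the other two).
+
+If the similarity is too low, fall back to the default policy (load all). Experiments show that no more than 20% of layers need the fallback.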
+
+![[250228-175702.png]]
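The probe-head procedure described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names and toy scores are hypothetical, and the rule for picking one of the three sets (the head that agrees most with the other two) is only one of the possibilities the notes speculate about.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_tokens(scores, k):
    """Ids of the k tokens with the highest importance score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def select_important_tokens(probe_scores, k, threshold):
    """Return a token id set for the layer, or None to signal fallback (load all).

    probe_scores holds one list of per-token importance scores per probe head,
    computed from that head's keys only.
    """
    sets = [top_k_tokens(s, k) for s in probe_scores]
    pairs = list(combinations(range(len(sets)), 2))
    avg_sim = sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)
    if avg_sim < threshold:
        return None  # probe heads disagree: fall back to loading all KV

    # ASSUMPTION: pick the head that agrees most with the other probe heads.
    def mean_sim_to_others(i):
        others = [jaccard(sets[i], sets[j]) for j in range(len(sets)) if j != i]
        return sum(others) / len(others)

    best = max(range(len(sets)), key=mean_sim_to_others)
    return sets[best]


# Toy scores for 3 probe heads over 4 prefix tokens (made up).
agreeing = [[0.9, 0.8, 0.1, 0.0], [0.7, 0.9, 0.2, 0.1], [0.8, 0.6, 0.3, 0.1]]
disagreeing = [[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, 0.8, 0.9], [0.9, 0.0, 0.1, 0.8]]
print(select_important_tokens(agreeing, k=2, threshold=0.5))
print(select_important_tokens(disagreeing, k=2, threshold=0.5))
```

Returning `None` keeps the fallback decision explicit, so the caller can load all KV for that layer instead of trusting an unreliable probe result.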
+
+### Mysterious hyperparameter choices
+
+> Selecting only one probe head to determine the most important token index may introduce bias, affecting model accuracy. Using two probe heads might fail to identify the most important index through voting when disagreements arise. Therefore, we choose to use three probe heads. Increasing the number of probe heads offers minimal improvements in accuracy but increases the keys loading time, thereby extending the TTFT.
+> 
+> ![[250228-175831.png]]
+>
+> Select $k$ important tokens out of $n$ prefix tokens. For two arbitrary token id sets $A$ and $B$, each choosing $k$ out of $n$:
+> $$j = E(\text{Jaccard}(A, B)) \approx \frac{E(|A \cap B|)}{E(|A \cup B|)} = \frac{E(|A \cap B|)}{E(|A|) + E(|B|) - E(|A \cap B|)} = \frac{n \cdot \frac{k^2}{n^2}}{2k - n \cdot \frac{k^2}{n^2}} = \frac{\frac{k}{n}}{2 - \frac{k}{n}}$$
+> Choose $t = j^{\alpha}$ as the threshold; experiments show that $\alpha = 0.6$ works best.
+ 
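To get a feel for the threshold formula above, the following sketch evaluates $j$ and $t = j^{\alpha}$ for a few important-KV ratios. The function names are hypothetical, and $n = 1000$ is an arbitrary choice; only the formula and $\alpha = 0.6$ come from the notes.

```python
def random_jaccard(n, k):
    """Approximate expected Jaccard score of two random k-of-n token sets."""
    r = k / n
    return r / (2 - r)


def similarity_threshold(n, k, alpha=0.6):
    """Threshold t = j ** alpha, with alpha = 0.6 as reported in the notes."""
    return random_jaccard(n, k) ** alpha


n = 1000  # arbitrary prefix length for illustration
for k in (100, 300, 500):
    j = random_jaccard(n, k)
    t = similarity_threshold(n, k)
    print(k / n, round(j, 4), round(t, 4))
```

Because $0 < j < 1$ and $\alpha < 1$, the threshold always sits above the random-selection baseline $j$, so probe heads must agree noticeably more than chance before their result is trusted.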
+## For C2
+
+### KV cache reorder
+
+![[250301-134529.png]]
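+- A thread periodically (every 10 minutes) reorders within a block (radix node), based on the average token importance
+	- 🤔 Why would past token importance predict future importance?
+- Reason for not reordering across nodes: prefix matching would then operate at a coarser granularity and need to load more useless KV cache, adding I/O overhead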
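The effect of reordering within a block can be illustrated with a toy layout. In the sketch below, `importance_history`, the chunk size, and all numbers are made-up assumptions; the point is only that sorting by average importance packs likely-important tokens into fewer chunks.

```python
def reorder_block(token_ids, importance_history):
    """Order one block's tokens by descending average importance.

    importance_history maps token id -> list of importance scores observed
    in past requests (hypothetical bookkeeping, not the paper's format).
    """
    def avg(tid):
        scores = importance_history.get(tid, [])
        return sum(scores) / len(scores) if scores else 0.0

    # sorted() is stable, so tokens with equal averages keep their original order.
    return sorted(token_ids, key=avg, reverse=True)


def chunks_touched(layout, important, chunk_size):
    """Count the chunks that must be read to fetch all important tokens."""
    return len({pos // chunk_size for pos, tid in enumerate(layout) if tid in important})


block = [0, 1, 2, 3, 4, 5, 6, 7]
history = {1: [0.9, 0.8], 4: [0.7, 0.8], 6: [0.6], 0: [0.1], 3: [0.05]}
important = {1, 4, 6}  # tokens that turn out to be important for the next request

reordered = reorder_block(block, history)
print(reordered)
print(chunks_touched(block, important, chunk_size=4))
print(chunks_touched(reordered, important, chunk_size=4))
```

In this toy case the important tokens span two chunks before reordering and one chunk after, which is the read-amplification reduction the design aims for. Whether past importance predicts the next request is exactly the open question raised in the notes.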
+
+### Score-based Cache Management
+
+Each chunk's score is determined jointly by its access frequency and its proportion of important tokens.
+
+![[250301-151846.png]]
+
+The GPU and the CPU each maintain scores in a min-heap, and no KV is held on both the GPU and the CPU at once. The disk, however, keeps a replica of the full KV, which reduces I/O overhead when evicting from CPU to disk.
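A rough sketch of the two-tier, score-based management described above is given below. The class and function names are hypothetical, the capacities are tiny for readability, and `chunk_score` uses a plain product as a stand-in because these notes do not reproduce the paper's exact scoring formula.

```python
import heapq


class ScoredTier:
    """One cache tier (GPU or CPU) that evicts the lowest-score chunk first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}  # chunk id -> current score
        self.heap = []    # (score, chunk id); may contain stale entries

    def insert(self, chunk, score):
        """Add a chunk; return the list of (chunk, score) evicted to make room."""
        self.scores[chunk] = score
        heapq.heappush(self.heap, (score, chunk))
        evicted = []
        while len(self.scores) > self.capacity:
            s, c = heapq.heappop(self.heap)
            if self.scores.get(c) == s:  # ignore stale heap entries
                del self.scores[c]
                evicted.append((c, s))
        return evicted

    def remove(self, chunk):
        """Drop a chunk; its heap entry becomes stale and is skipped later."""
        self.scores.pop(chunk, None)


def chunk_score(access_freq, important_ratio):
    # ASSUMPTION: a simple product stands in for the paper's scoring formula.
    return access_freq * important_ratio


gpu, cpu = ScoredTier(capacity=2), ScoredTier(capacity=2)


def admit_to_gpu(chunk, score):
    cpu.remove(chunk)  # keep GPU and CPU contents disjoint
    for victim, s in gpu.insert(chunk, score):
        # Demote GPU victims to CPU. CPU victims are simply dropped,
        # because the disk already holds a full replica.
        cpu.insert(victim, s)


for name, freq, ratio in [("a", 5, 0.8), ("b", 1, 0.1), ("c", 3, 0.5), ("d", 4, 0.9)]:
    admit_to_gpu(name, chunk_score(freq, ratio))

print(sorted(gpu.scores), sorted(cpu.scores))
```

Dropping CPU victims without a write-back is what the full replica on disk buys: eviction from CPU costs no disk I/O.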
+
+# Evaluation
+
+## Setup
+
+- models
+	OPT-6.7B, OPT-13B, and OPT-30B
+- hardware
+	2 * AMD EPYC 7763 CPUs (64 cores)
+	1 * 128 GB DRAM
+	1 * NVIDIA A100 GPU with 80GB HBM
+	1 * 2TB Intel SSD whose measured read throughput is around 5GB/s
+	PCIe 4.0 * 16 for GPU and CPU connection
+	During testing, the prefix cache is limited to 10GB on the GPU and 32GB on the CPU
+- dataset
+	PIQA, RTE, COPA, and OpenBookQA
+
+## Baseline
+
+1. ReComp: recompute all prefix KV cache
+2. AS-like: the authors' own implementation of AttentionStore
+3. AS+H2O+LRU
+4. AS+H2O+LFU
+
+## Results
+
+### Model generation quality
+
+![[250301-155139.png]]
+
+### TTFT
+
+1.2x to 2.8x improvement
+![[250301-155325.png]]
+
+1.5x to 3.8x reduction in I/O time
+![[250301-155822.png]]
+
+### Ablation experiment
+
+ITF: similarity-guided important token identification
+RO: reorder
+All: ITF + RO + score-based cache management
+![[250301-155959.png]]
+
+![[250301-160458.png]]
+
+### Sensitivity Analysis
+
+1. alpha
+2. chunk size
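+3. dataset size: why does the relative improvement shrink as the dataset grows? Testing only goes up to 400GB; if the dataset kept growing, would the improvement vanish entirely?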
+4. model type
+
+![[250301-162611.png]]
+
+# Takeaway
+
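+- In my view, the strengths of this work: the paper is written fluently, has many simple examples that help readers follow its ideas, and the experiments are fairly thorough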
--- a/Hypervisor.figs/251014-144626.png
+++ b/Hypervisor.figs/251014-144626.png
--- a/phd/papers/SOSP'25
+++ b/phd/papers/SOSP'25
@@ -0,0 +1,47 @@
+## Ghost in the Android Shell: Pragmatic Test-oracle Specification of a Production Hypervisor
+
+> Kayvan Memarian, Ben Simner, David Kaloper-Meršinjak, Thibaut Pérami, Peter Sewell (University of Cambridge)
+
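+### 1. Background
+
+In modern operating systems and virtualization platforms, the **hypervisor** (virtual machine monitor) is the core component that enforces security isolation. The subject of this paper, **pKVM** (protected KVM), is the production hypervisor Google deploys on Android to protect the isolation between the Android host and guest virtual machines. Traditional hypervisor development usually relies on **testing plus manual reasoning**. The established route to rigorous security guarantees is **formal verification**, but approaches such as seL4, CertiKOS, and IronFleet are too costly for most production development teams. This paper therefore proposes a lighter-weight middle ground: building trust in a production hypervisor through **executable specifications plus runtime testing**.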
+
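+### 2. Problems and Challenges
+
+Applying the "specification plus runtime testing" approach to a real production hypervisor (pKVM) raises a series of challenges:
+
+- pKVM's behavior is tightly bound to the underlying Arm hardware architecture (page tables, exception handling); the specification must be able to describe such implicit behavior (e.g. hardware page-table walks).
+- Multiple hardware threads may enter the hypervisor concurrently; the specification must correctly handle locks and ownership of state.
+- The specification must be loose. For example, pKVM maps host memory on demand, so the behavior of each step cannot be specified exactly; unnecessary implementation details must be abstracted away in the specification.
+- The runtime environment is constrained. Because pKVM runs at EL2, conventional testing tools (such as coverage tools and debuggers) are unavailable, and random invalid inputs may crash the entire system.
+- The specification must be written directly in C (the language pKVM uses) rather than in a formal language such as Coq or Lean; C lacks high-level features such as algebraic data types, a pure-function subset, pattern matching, and generics.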
+
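+### 3. Design
+
+The core design of the paper is to define an executable test-oracle specification, which makes it possible to: 1. generate random test inputs that conform to the runtime environment, so that purely random inputs do not crash the whole system; 2. obtain from the oracle the expected pKVM state after a test runs and compare it with the state pKVM actually reaches.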
+
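+1. Define an abstract state (ghost state) as a high-level mathematical model of the hypervisor's real state. The ghost state contains the information listed below. The abstraction filters out implementation details that are irrelevant to externally observable behavior, such as memory allocation and the internal structure of page tables.
+	- pKVM's own stage-1 mapping
+	- the host stage-2 mapping
+	- each VM's stage-2 mapping and metadata
+	- global constants (number of physical CPUs, address offsets, etc.)
+	- per-CPU local state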
+	![[251014-144626.png]]
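+2. A set of abstraction functions computes the ghost state from pKVM's real state at runtime, while keeping the abstraction strictly tied to lock ownership (a piece of state can be abstracted only while its lock is held).
+3. Implement "pure function" specifications of the hypercalls in C. Record the pre/post ghost state before and after each hypercall executes, then check that the recorded post-state agrees with the result the specification computes. For example, for `host_share_hyp`:
+	- Input: abstract pre-state + call arguments + environment information
+	- Output: abstract post-state
+	- Depends only on the ghost state, not on the real implementation
+4. Loose specification and nondeterminism:
+	- for behavior that does not affect the core semantics, such as out-of-memory errors (`ENOMEM`), the specification does not constrain the implementation;
+	- the specification is parameterized on the implementation's return value to support loose matching;
+	- for interactions through memory shared between the host and pKVM, nondeterminism is eliminated by recording the values actually read.
+5. Check that the abstract state suffers no "out-of-lock interference" between hypercalls, and check that the page-table footprint is not modified out of bounds, giving separation-logic-style isolation guarantees.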
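The pre/post comparison in steps 2 to 4 can be illustrated with a toy model. The paper's specification is written in C against real pKVM state; the sketch below uses Python only to show the shape of the check, and every name in it (`spec_share`, `check_hypercall`, the `owned`/`shared`/`borrowed` labels, the dictionary layout) is a hypothetical simplification rather than pKVM's actual API or data structures.

```python
import copy

ENOMEM = 12  # errno value on Linux


def spec_share(pre, page, impl_ret):
    """Pure toy specification of a 'host shares a page with the hypervisor' call.

    Takes the abstract pre-state, the argument, and the implementation's
    return value; returns the expected (post-state, return value).
    """
    if impl_ret == -ENOMEM:
        # Loose spec: out-of-memory is always allowed and must change nothing.
        return pre, -ENOMEM
    if pre["host"].get(page) != "owned":
        return pre, -1  # toy error code: the host may only share pages it owns
    post = copy.deepcopy(pre)
    post["host"][page] = "shared"
    post["hyp"][page] = "borrowed"
    return post, 0


def check_hypercall(impl, abstract, real_state, page):
    """Record ghost pre/post state around the call and compare with the spec."""
    pre = abstract(real_state)
    ret = impl(real_state, page)
    post = abstract(real_state)
    expected_post, expected_ret = spec_share(pre, page, ret)
    return (post, ret) == (expected_post, expected_ret)


def abstract(real):
    """Abstraction function: keep the mappings, drop allocator details."""
    return {"host": dict(real["host_pt"]), "hyp": dict(real["hyp_pt"])}


def good_impl(real, page):
    if real["host_pt"].get(page) != "owned":
        return -1
    real["host_pt"][page] = "shared"
    real["hyp_pt"][page] = "borrowed"
    real["alloc_count"] += 1  # implementation detail, invisible to the spec
    return 0


def buggy_impl(real, page):
    real["hyp_pt"][page] = "borrowed"  # forgets to update the host mapping
    return 0


def fresh():
    return {"host_pt": {0x1000: "owned"}, "hyp_pt": {}, "alloc_count": 0}


print(check_hypercall(good_impl, abstract, fresh(), 0x1000))
print(check_hypercall(buggy_impl, abstract, fresh(), 0x1000))
```

Passing the implementation's return value into the specification is what makes the loose matching possible: the oracle accepts an `ENOMEM` outcome whenever it occurs, but still pins down the state change for every other outcome.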
+
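+### 4. Testing and Evaluation
+
+This work modifies the Linux kernel to add a "hyp-proxy" interface so that pKVM hypercalls can be invoked directly from user space, and pairs it with a custom coverage tool that monitors line, branch, and function coverage of both the implementation and the specification. Testing falls into three categories:
+1. Handwritten tests: 41 test cases covering the normal and error branches of all hypercalls
+2. Random tests: high-quality random inputs generated from the ghost state to avoid crashing the system
+3. Synthetic-bug tests: injected errors that validate the effectiveness of the test framework
+In practice, testing found 5 real bugs in pKVM, including problems with allocator memory alignment, out-of-bounds memcache access, race conditions, and overlapping IO mappings; the fourth was found by comparison against the specification. The specification is about 8.4K lines, close to the implementation's 11K lines. Under QEMU, memory usage is about 18MB, boot time increases 3.2x, test time increases 11.5x, and the test rate reaches about 200,000 hypercalls per hour. In summary, this work proposes and validates a lightweight verification method that dynamically compares the implementation against an executable specification, giving a production hypervisor (pKVM) practical, low-cost, high-assurance protection without relying on heavyweight formal tools.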
--- a/phd/papers/fMoE.figs/250629-153138.png
+++ b/phd/papers/fMoE.figs/250629-153138.png
--- a/phd/papers/fMoE.figs/250629-161518.png
+++ b/phd/papers/fMoE.figs/250629-161518.png
--- a/phd/papers/fMoE.figs/250629-163825.png
+++ b/phd/papers/fMoE.figs/250629-163825.png
--- a/phd/papers/fMoE.figs/250629-172314.png
+++ b/phd/papers/fMoE.figs/250629-172314.png
--- a/phd/papers/fMoE.figs/250629-173919.png
+++ b/phd/papers/fMoE.figs/250629-173919.png
--- a/phd/papers/fMoE.figs/250629-174619.png
+++ b/phd/papers/fMoE.figs/250629-174619.png
--- a/phd/papers/fMoE.figs/250629-175351.png
+++ b/phd/papers/fMoE.figs/250629-175351.png
--- a/phd/papers/fMoE.figs/250629-175536.png
+++ b/phd/papers/fMoE.figs/250629-175536.png
--- a/phd/papers/fMoE.figs/250629-175858.png
+++ b/phd/papers/fMoE.figs/250629-175858.png
--- a/phd/papers/fMoE.figs/250629-221659.png
+++ b/phd/papers/fMoE.figs/250629-221659.png
--- a/phd/papers/fMoE.figs/250629-222203.png
+++ b/phd/papers/fMoE.figs/250629-222203.png
--- a/phd/papers/fMoE.md
+++ b/phd/papers/fMoE.md
@@ -0,0 +1,119 @@
+[fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://arxiv.org/abs/2502.05370)
+
+# TL;DR
+
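+Because MoE activates experts sparsely and GPU memory on edge devices is usually very tight, systems choose to load only the currently activated experts onto the GPU. This adds substantial inference latency, and a line of work mitigates it with expert prefetching.
+This work proposes fine-grained expert prefetching and caching policies. It introduces a new data structure, the expert map, to store the expert activations of past requests. For a new request, it uses expert similarity matching based on both **semantic** and **trajectory** information to match against the history in the expert map store, and prefetches experts accordingly.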
+![[250629-172314.png]]
+
+# Motivation
+
+Existing expert offloading solutions cannot achieve the optimal latency-memory trade-off.
+![[250629-222203.png]]
+
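+- Most MoE-based LLMs use a decoder-only architecture. Compared with encoder-decoder LLMs, their expert activation patterns are more uniform and expert accesses are less skewed.
+- MoE LLMs are trained with a distinctive load-balancing loss that forces the gating network to balance the number of tokens assigned to each expert within the same MoE layer, ensuring that no expert sits idle during training. This balanced routing weakens the predictability of expert activation patterns and so makes existing solutions less effective.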
+
+Coarse-grained expert offloading leads to:
+- a low expert hit rate
+- ignoring the diversity of MoE models and prompts
+![[250629-221659.png]]
+
+The work therefore sets out to answer three core questions:
+- How to maximize expert hit rate when prefetching and offloading experts?
+- How to adapt to different MoE models and prompts?
+- How to avoid additional system overheads when managing experts?
+
+# Design
+
+1. Collecting information for the expert map
+![[250629-153138.png]]
+
+2. expert map search
+![[250629-161518.png]]
+
+- Semantic-based expert map search
+$$\text{score}_{x, y}^{\text{sem}} = \frac{\text{sem}_x^{\text{new}} \cdot \text{sem}_y^{\text{old}}}{||\text{sem}_x^{\text{new}}|| \cdot ||\text{sem}_y^{\text{old}}||}$$
+$\text{sem} \in \mathbb{R}^{1 \times H}$
+
+
+- Trajectory-based expert map search
+$$\text{score}_{x, y}^{\text{map}} = \frac{\text{map}_x^{\text{new}} \cdot \text{map}_y^{\text{old}}}{||\text{map}_x^{\text{new}}|| \cdot ||\text{map}_y^{\text{old}}||}$$
+When computing layer $l$, $\text{map} \in \mathbb{R}^{1 \times (l - 1)J}$, where $J$ is the number of experts per layer
+
+The work analyzes the correlation between similarity and expert cache hit rate:
+![[250629-163825.png]]
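Both search scores above are cosine similarities, so a lookup in the expert map store reduces to a nearest-neighbor search under cosine similarity. The sketch below shows the trajectory case with plain Python lists. The helper names and toy vectors are hypothetical, and it assumes each map entry holds per-expert gate probabilities flattened over layers $1..l-1$.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query, stored):
    """Return (index, score) of the stored vector most similar to the query."""
    scores = [cosine(query, s) for s in stored]
    idx = max(range(len(stored)), key=scores.__getitem__)
    return idx, scores[idx]


# Trajectory search at layer l = 3 with J = 4 experts per layer:
# each vector covers layers 1..l-1, so it has (l - 1) * J = 8 entries.
new_traj = [0.7, 0.1, 0.1, 0.1, 0.0, 0.6, 0.4, 0.0]
stored_trajs = [
    [0.1, 0.1, 0.1, 0.7, 0.5, 0.0, 0.0, 0.5],
    [0.6, 0.2, 0.1, 0.1, 0.1, 0.5, 0.4, 0.0],
]
idx, score = best_match(new_traj, stored_trajs)
print(idx, round(score, 3))
```

The semantic search has the same shape, with the $1 \times H$ semantic vectors in place of the flattened trajectories. A linear scan is used here for clarity; a real store with tens of thousands of maps would more likely batch this as a matrix product.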
+
+3. expert prefetch
+
+$$\delta_l = \text{Clip}(1 - \text{score}, 0, 1)$$
+
+Select the Top-N experts such that the sum of their probabilities $p_{l, j}$ is at least $\delta_l$:
+$$\sum_{E_{l, j} \in E_{\text{prefetch}}} p_{l, j} \geq \delta_l$$
+subject to $K \leq N \leq J$
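The selection rule above can be sketched as a greedy pass over experts sorted by probability. The function name and toy probabilities are hypothetical, and $K$ is treated here simply as a lower bound on the number of prefetched experts (presumably the model's top-K routing width).

```python
def prefetch_set(probs, score, k):
    """Pick the experts of one layer to prefetch.

    probs: predicted activation probability per expert (length J).
    score: similarity of the matched expert map, used as confidence.
    k:     lower bound K on the number of experts (K <= N <= J).
    """
    delta = min(max(1.0 - score, 0.0), 1.0)  # Clip(1 - score, 0, 1)
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    chosen, total = [], 0.0
    for j in ranked:
        if len(chosen) >= k and total >= delta:
            break  # enough experts and enough probability mass
        chosen.append(j)
        total += probs[j]
    return chosen


probs = [0.05, 0.40, 0.10, 0.30, 0.15]
print(prefetch_set(probs, score=0.9, k=2))  # confident match: fewer experts
print(prefetch_set(probs, score=0.2, k=2))  # weak match: more experts
```

A high similarity score yields a small $\delta_l$, so only the minimum number of experts is prefetched; a weak match widens the prefetch set to hedge against a wrong prediction.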
+
+
+4. expert map storage deduplication
+
+For each $x$ in the new batch and each $y$ in the current expert map storage, compute the redundancy:
+$$\text{RDY}_{x, y} = \frac{d}{L} \text{score}^{\text{sem}}_{x, y} + \frac{L - d}{L} \text{score}^{\text{map}}_{x, y}$$
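The redundancy score is a weighted average of the two similarity scores, with weights set by $d$ and $L$. A tiny sketch follows, using a hypothetical function name and made-up scores, and assuming $d$ is the prefetch distance and $L$ the number of layers.

```python
def redundancy(score_sem, score_map, d, num_layers):
    """RDY = d/L * score_sem + (L - d)/L * score_map."""
    w = d / num_layers
    return w * score_sem + (1.0 - w) * score_map


print(redundancy(score_sem=0.9, score_map=0.5, d=3, num_layers=32))
print(redundancy(score_sem=0.9, score_map=0.5, d=0, num_layers=32))   # trajectory only
print(redundancy(score_sem=0.9, score_map=0.5, d=32, num_layers=32))  # semantic only
```

With $d$ small relative to $L$, the trajectory score dominates, so two maps count as redundant mainly when their expert activations match.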
+
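+Expert map deduplication can be formalized as a Minimum Sphere Covering problem
+
+Prior work proves that maintaining at least $2LJ$ expert maps guarantees that a new input finds an expert map with at least 75% similarity, and $\frac{1}{2} LJ\ln(LJ)$ maps reach 98%. With $L \leq 128, J \leq 96$, no more than 50K expert maps need to be maintained, about 200MB of CPU memory.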
+
+5. expert cache and eviction
+
+- expert prefetching priority
+For each $E_{l, j} \in E_{\text{prefetch}}$:
+$$\text{Pri}_{l, j}^{\text{prefetch}} = \frac{p_{l, j}}{l - l_{\text{now}}}$$
+- expert eviction priority
+$$\text{Pri}_{l, j}^{\text{evict}} = \frac{1}{p_{l, j} \cdot \text{freq}_{l, j}}$$
+Eviction does not use LRU, because LRU conflicts with the inherently sequential forward computation across layers
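The two priorities above can be sketched as follows. The function names and toy numbers are hypothetical, and treating a zero denominator as infinite eviction priority is an assumption of this sketch; the notes do not say how that case is handled.

```python
def prefetch_priority(p, layer, layer_now):
    """Pri_prefetch = p / (l - l_now); only defined for layers ahead of the current one."""
    if layer <= layer_now:
        raise ValueError("can only prefetch for a layer after the current one")
    return p / (layer - layer_now)


def evict_priority(p, freq):
    """Pri_evict = 1 / (p * freq)."""
    denom = p * freq
    # ASSUMPTION: experts that were never predicted or never used go first.
    return float("inf") if denom == 0 else 1.0 / denom


layer_now = 4
# (layer, expert) -> predicted probability
candidates = {(5, 0): 0.6, (5, 3): 0.3, (7, 1): 0.8}
order = sorted(
    candidates,
    key=lambda e: prefetch_priority(candidates[e], e[0], layer_now),
    reverse=True,
)
print(order)

# (layer, expert) -> (probability, access frequency)
cached = {(2, 1): (0.5, 10), (3, 0): (0.1, 2), (3, 2): (0.8, 1)}
victim = max(cached, key=lambda e: evict_priority(*cached[e]))
print(victim)
```

Dividing by the layer distance means a likely expert in the next layer is fetched before an even likelier expert several layers away, which matches the sequential order in which layers are computed.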
+
+# Evaluation
+
+Testbed
+- 6 * RTX 3090 with 24GB GPU memory
+- PCIe 4.0 with 32GB/s
+
+Models
+- Mixtral 8x7B
+- Qwen1.5-MoE
+- Phi3.5-MoE
+
+Traces
+- LMSYS-Chat-1M
+- ShareGPT
+- Online serving: Microsoft Azure
+
+Baselines
+1. MoE-Infinity
+2. ProMoE
+3. Mixtral-Offloading
+4. DeepSpeed-Inference
+
+- Overall performance results
+
+![[250629-173919.png]]
+
+Compared with DeepSpeed-Inference, Mixtral-Offloading, ProMoE, and MoE-Infinity, fMoE reduces average TTFT by 44%, 35%, 33%, and 30%, reduces average TPOT by 70%, 61%, 55%, and 48%, and raises the average expert hit rate by 147%, 11%, 34%, and 63%, respectively
+
+- Online serving results
+![[250629-174619.png]]
+
+- ablation study:
+	- speculate: Mixtral-Offloading & ProMoE
+	- hit count: MoE-Infinity
+	- Map (T): only trajectory
+	- Map (T+S): both trajectory and semantic
+	- Map (T+S+$\delta$): all features
+![[250629-175351.png]]
+
+
+- Sensitivity: prefetch distance (d)
+	When the prefetch distance is small (< 3), fMoE cannot perfectly hide its system delay from the inference process, such as the map matching and expert prefetching, leading to the increase of inference latency. With larger prefetch distances (> 3), fMoE has worse expert hit rates that also degrade the performance. Therefore, we set the prefetch distance d to 3 for evaluating fMoE.
+![[250629-175536.png]]
+
+- Overhead analysis: expert prefetching, map matching, and map update run asynchronously in the background; the part that actually affects E2E latency is < 30ms (5% of each iteration)
+![[250629-175858.png]]