obsidian/phd/papers/IMPRESS.md

# TL;DR

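The paper observes that the cost of loading KV cache from SSD to GPU is hard to overlap with computation, and that loading accounts for 51%-98% of TTFT. Its starting point is to exploit attention sparsity and load only the important KV cache. This raises two challenges. C1: how to decide which KV cache entries on SSD are important. C2: important KV entries are scattered, so loading them uses bandwidth inefficiently, and because of KV cache sparsity, existing cache policies are also inefficient. The paper addresses C1 with the similarity of important tokens across heads, and C2 with KV reordering on SSD plus an importance-score-based cache policy.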

# Motivations

The KV of different tokens differs in importance, so we do not need to load the KV of every prefix token. Loading only the important KV reduces the I/O overhead.

# Challenges

## Existing ways of identifying important KV still incur large I/O overhead

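The most naive approach: load all KV, compute attention, then decide which entries are important. But once every token has been loaded and computed, selecting the important ones brings no benefit ✖️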

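The sparsity of $\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ comes from the sparsity of the softmax term, which depends only on the queries and keys. Existing work therefore loads only the keys (all of them). This halves the cost, but it is still large ✖️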
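The keys-only idea can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it handles a single query vector, uses the usual $1/\sqrt{d}$ scaling, and the names `attention_weights`, `important_token_ids` and `ratio` are made up for this note.

```python
import math

def attention_weights(query, keys):
    """softmax(q . k_i / sqrt(d)) over all cached keys, for one query vector."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def important_token_ids(query, keys, ratio):
    """Ids of the top `ratio` fraction of tokens, ranked by attention weight."""
    weights = attention_weights(query, keys)
    k = max(1, int(ratio * len(keys)))
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:k])

# Toy example: 4 cached keys of dimension 2 (values are arbitrary).
keys = [[4.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.1, 0.1]]
query = [2.0, 0.0]
print(important_token_ids(query, keys, ratio=0.5))
```

Only the keys are touched here, so the values of unimportant tokens never need to be read. Every key still has to be loaded, though, which is the remaining cost noted above.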

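Record which tokens of a prefix are important and reuse that record the next time the prefix is reused. The accuracy is not good enough: in RAG, for example, the same prefix tokens (docs) followed by different queries have different important tokens ✖️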
![[250228-095457.png]]

## Existing systems cannot use important KV efficiently

To read from disk and use PCIe efficiently, existing systems read in units of chunks. A chunk contains both important and unimportant KVs, which causes read amplification and hurts performance.
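A small worked example of the read amplification, under the simplifying assumptions that every chunk holds exactly `chunk_size` tokens and that a chunk is read whole if it contains at least one important token. The helper names are hypothetical.

```python
def chunks_to_read(important_ids, chunk_size):
    """Chunk indices that contain at least one important token."""
    return {token_id // chunk_size for token_id in important_ids}

def read_amplification(important_ids, chunk_size):
    """Tokens actually read divided by tokens actually needed."""
    tokens_read = len(chunks_to_read(important_ids, chunk_size)) * chunk_size
    return tokens_read / len(important_ids)

scattered = {0, 9, 17, 30}   # important tokens spread over 4 chunks
contiguous = {0, 1, 2, 3}    # same number of tokens, all in one chunk

print(read_amplification(scattered, chunk_size=8))   # 4 chunks * 8 tokens / 4 needed
print(read_amplification(contiguous, chunk_size=8))  # 1 chunk * 8 tokens / 4 needed
```

With the same number of important tokens, the scattered layout reads four times as much data as the contiguous one.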

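Existing KV cache managers all use policies such as LRU/LFU, which ignore differences in importance between KV cache entries. Unimportant KV cache may therefore occupy GPU memory, which lowers the hit ratio of important KV cache or causes more transfer overhead.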

# Key Observation

## Important tokens are similar across heads within the same layer

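For two sets $A, B$, the Jaccard score defines their similarity: $S = \frac{|A \cap B|}{|A \cup B|}$. For example, in figure (a) below the similarity is $\frac{1}{3}$. In the paper's experiment with 32 heads, all pairwise similarities are above 0.9.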
![[250228-164104.png]]
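The Jaccard score is easy to compute on token id sets. A minimal sketch: the two sets below are made-up values chosen to give $\frac{1}{3}$ and are not read off the figure.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

head_0 = {1, 2}  # hypothetical important token ids of one head
head_1 = {2, 3}  # hypothetical important token ids of another head
print(jaccard(head_0, head_1))  # one shared token out of three distinct tokens
```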

## The similarity holds across different important-KV ratios and model sizes

The important-KV ratio $r$ means that we treat the KV of $r \cdot \text{num\_tokens}$ tokens as important KV.
![[250228-170902.png]]
The paper does not explain why, when only the top 10% of KV is selected, the similarity drops noticeably in the second half of the layers.

> Although smaller models and deeper transformer layers tend to exhibit lower similarities, they are still significantly higher than the expected value from random selection in most cases.

# Solution

## For C1, from O1

Since important KV is similar across the heads of a layer, a few heads can be chosen as probe heads: compute their important KV and let it represent the important KV of the whole layer.

Concretely, the first 3 heads are chosen as probe heads (the paper claims: the choice of which three heads to use has no impact on accuracy due to the similarity). Load their keys, compute each head's important token id set, and compute the average similarity between these sets.

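If the similarity exceeds the threshold, return one of the 3 token id sets as the important token id set for the whole layer. (The paper does not say how to pick one of the 3. It may be a random pick, or the head with the highest average similarity to the other two.)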

If the similarity is too low, fall back to the default policy (load all). Experiments show that no more than 20% of layers need the fallback.

![[250228-175702.png]]
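The probe-head procedure can be sketched as below. Two points are assumptions of this note, not statements from the paper: the tie-break (pick the probe with the highest mean similarity to the others, one of the two guesses mentioned above) and the use of `None` to signal the load-all fallback. The function and parameter names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def select_important_tokens(head_token_sets, threshold, num_probes=3):
    """Return a representative important-token set, or None to load all KV.

    head_token_sets[i] is the important token id set of head i, computed
    from that head's keys only. At least two probe heads are required.
    """
    probes = head_token_sets[:num_probes]  # the first heads act as probes
    pairs = list(combinations(range(len(probes)), 2))
    avg = sum(jaccard(probes[i], probes[j]) for i, j in pairs) / len(pairs)
    if avg < threshold:
        return None  # heads disagree too much: fall back to loading everything

    # Assumed tie-break: the probe most similar, on average, to the others.
    def mean_similarity(i):
        others = [j for j in range(len(probes)) if j != i]
        return sum(jaccard(probes[i], probes[j]) for j in others) / len(others)

    best = max(range(len(probes)), key=mean_similarity)
    return probes[best]

heads = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4}]  # toy probe results
print(select_important_tokens(heads, threshold=0.5))  # similar enough
print(select_important_tokens(heads, threshold=0.9))  # falls back
```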

### Mysterious hyperparameter choices

> Selecting only one probe head to determine the most important token index may introduce bias, affecting model accuracy. Using two probe heads might fail to identify the most important index through voting when disagreements arise. Therefore, we choose to use three probe heads. Increasing the number of probe heads offers minimal improvements in accuracy but increases the keys loading time, thereby extending the TTFT.
>
> ![[250228-175831.png]]
>
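> For $n$ prefix tokens from which $k$ important tokens are selected, take two arbitrary (independent, uniformly random) $k$-of-$n$ token id sets $A$ and $B$. Each token lies in both sets with probability $\frac{k^2}{n^2}$, so
> $$j = E(\mathrm{Jaccard}(A, B)) \approx \frac{E(|A \cap B|)}{E(|A \cup B|)} = \frac{E(|A \cap B|)}{|A| + |B| - E(|A \cap B|)} = \frac{n \cdot \frac{k^2}{n^2}}{2k - n \cdot \frac{k^2}{n^2}} = \frac{\frac{k}{n}}{2 - \frac{k}{n}}$$
> The threshold is set to $t = j^{\alpha}$. Experimentally, $\alpha = 0.6$ works best.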
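
To get a feel for the threshold, the closed form $j = \frac{k/n}{2 - k/n}$ and $t = j^{\alpha}$ can be evaluated directly. A minimal sketch with hypothetical function names; $n = 1000$ and $k = 100$ are example values, not settings from the paper.

```python
def expected_random_jaccard(n, k):
    """Approximate expected Jaccard score of two random k-of-n sets."""
    ratio = k / n
    return ratio / (2 - ratio)

def similarity_threshold(n, k, alpha=0.6):
    """t = j ** alpha; alpha = 0.6 is the value reported as best."""
    return expected_random_jaccard(n, k) ** alpha

n, k = 1000, 100
print(expected_random_jaccard(n, k))  # random baseline j
print(similarity_threshold(n, k))     # threshold t
```

Because $0 < j < 1$ and $\alpha < 1$, the threshold $t$ is always above the random baseline $j$. A layer passes only if its probe heads agree noticeably more than random selections would.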

## For C2

### KV cache reorder

![[250301-134529.png]]
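- A background thread periodically (every 10 minutes) reorders the KV inside each block (radix node), based on the average token importance
	- 🤔 Why would past token importance predict what happens in the future?
- Reason for not reordering across nodes: prefix matching would then work at a coarser granularity, so more useless KV cache would have to be loaded, adding I/O overhead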
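The effect of the in-block reorder can be sketched as follows. Sorting by descending average importance is this note's reading of "reorder based on the average token importance". The importance values, the function names, and the assumption that a token-to-position index exists after reordering are all illustrative.

```python
def reorder_block(token_ids, avg_importance):
    """New on-disk order for one block: most important tokens first."""
    return sorted(token_ids, key=lambda t: avg_importance[t], reverse=True)

def chunks_needed(layout, wanted, chunk_size):
    """Number of chunks that must be read to fetch all `wanted` tokens."""
    position = {token: pos for pos, token in enumerate(layout)}
    return len({position[t] // chunk_size for t in wanted})

original = list(range(8))  # token ids in their original order
avg_importance = {0: 0.90, 1: 0.10, 2: 0.12, 3: 0.11,
                  4: 0.80, 5: 0.09, 6: 0.05, 7: 0.70}  # made-up scores
reordered = reorder_block(original, avg_importance)

wanted = {0, 4, 7}  # important tokens for the current request
print(chunks_needed(original, wanted, chunk_size=4))   # spread over two chunks
print(chunks_needed(reordered, wanted, chunk_size=4))  # packed into one chunk
```

The benefit depends on the open question above: it only holds if the tokens that were important in the past are also the ones wanted by future requests.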

### Score-based Cache Management

Each chunk's score is determined jointly by its access frequency and its proportion of important tokens.

![[250301-151846.png]]

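The GPU and the CPU each maintain the scores in a min-heap, and the system ensures that the GPU and the CPU hold no overlapping KV. The disk, however, keeps a replica of the full KV, which reduces the I/O cost of CPU -> disk eviction.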
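A minimal sketch of the two-tier, score-based policy, built on Python's `heapq`. The score formula (`access_count * important_ratio`) is a placeholder assumption, since these notes only say that both factors contribute. Capacities are counted in chunks, scores are fixed at admission time, and the class and method names are hypothetical.

```python
import heapq

def chunk_score(access_count, important_ratio):
    # Placeholder combination of the two factors (assumption of this note).
    return access_count * important_ratio

class TwoTierCache:
    def __init__(self, gpu_capacity, cpu_capacity):
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity}
        self.heap = {"gpu": [], "cpu": []}  # min-heaps of (score, chunk_id)

    def _insert(self, tier, score, chunk_id):
        """Insert into a tier; return the evicted (score, chunk_id) or None."""
        heap = self.heap[tier]
        if len(heap) < self.capacity[tier]:
            heapq.heappush(heap, (score, chunk_id))
            return None
        # Tier is full: keep the higher scores, push out the lowest one.
        return heapq.heappushpop(heap, (score, chunk_id))

    def admit(self, chunk_id, score):
        evicted = self._insert("gpu", score, chunk_id)
        if evicted is not None:
            # Demote to CPU. Whatever falls out of the CPU tier is simply
            # dropped, because the disk already holds a full replica.
            self._insert("cpu", *evicted)

    def location(self, chunk_id):
        for tier in ("gpu", "cpu"):
            if any(cid == chunk_id for _, cid in self.heap[tier]):
                return tier
        return "disk"

cache = TwoTierCache(gpu_capacity=2, cpu_capacity=1)
cache.admit("a", chunk_score(5, 0.8))
cache.admit("b", chunk_score(1, 0.1))
cache.admit("c", chunk_score(3, 0.5))  # pushes "b" from GPU down to CPU
print([cache.location(c) for c in ("a", "b", "c")])
```

A chunk lives in exactly one of the two heaps, which mirrors the no-overlap property. Dropping a chunk from the CPU tier needs no write because the disk copy already exists.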

# Evaluation

## Setup

- models
	- OPT-6.7B, OPT-13B, and OPT-30B
- hardware
	- 2 * AMD EPYC 7763 CPUs (64 cores)
	- 1 * 128 GB DRAM
	- 1 * NVIDIA A100 GPU with 80GB HBM
	- 1 * 2TB Intel SSD whose measured read throughput is around 5GB/s
	- PCIe 4.0 x16 for GPU and CPU connection
	- During testing, the prefix cache is limited to 10GB on the GPU and 32GB on the CPU
- dataset
	- PIQA, RTE, COPA, and OpenBookQA

## Baseline

1. ReComp: recompute all prefix KV cache
2. AS-like: the authors' own implementation of AttentionStore
3. AS+H2O+LRU
4. AS+H2O+LFU

## Results

### Model generation quality

![[250301-155139.png]]

### TTFT

1.2x to 2.8x improvement
![[250301-155325.png]]

1.5x to 3.8x reduction in I/O time
![[250301-155822.png]]

### Ablation experiment

ITF: similarity-guided important token identification
RO: reorder
All: ITF + RO + score-based cache management
![[250301-155959.png]]

![[250301-160458.png]]

### Sensitivity Analysis

1. alpha
2. chunk size
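3. dataset size: why does the relative improvement shrink as the dataset grows? The evaluation only goes up to 400GB. If the dataset kept growing, would the improvement disappear entirely?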
4. model type

![[250301-162611.png]]

# Takeaway

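- What I see as the strengths of this work: the paper reads smoothly, it uses many simple examples to help readers follow the ideas, and the experiments are fairly thorough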