# Compute vs. Memory Analysis of Prefill Under High KV Cache Reuse

## Model & GPU

```
Qwen3-Coder-30B-A3B (MoE 128E top-8)
  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
  FFN: 768 intermediate per expert, 8 experts active per token (6144 combined)
  Active params: ~3B per token

H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
  Ridge point: 37 FLOP/byte
```
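The ridge point follows directly from the two hardware figures above. A minimal sketch of the arithmetic (the function and constant names are illustrative, not from any library):

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute time equals memory time."""
    return peak_flops / mem_bandwidth


H20_PEAK_BF16 = 148e12  # FLOP/s, from the spec block above
H20_HBM_BW = 4.0e12     # bytes/s, from the spec block above

print(ridge_point(H20_PEAK_BF16, H20_HBM_BW))  # 37.0
```

A kernel whose AI is above this value is limited by compute throughput; below it, by HBM bandwidth.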

## Key finding: prefill stays compute-bound even at 95% reuse

```
  SeqLen  Reuse%   NewTok   AI (F/B)   Bound       vs Decode AI
   32,000      0%   32,000    23368     COMPUTE      18189x
   32,000     70%    9,600    10045     COMPUTE       7819x
   32,000     90%    3,200     3821     COMPUTE       2974x
   32,000     95%    1,600     1980     COMPUTE       1541x

   64,000      0%   64,000    40758     COMPUTE      26813x
   64,000     70%   19,200    20610     COMPUTE      13559x
   64,000     90%    6,400     8544     COMPUTE       5621x
   64,000     95%    3,200     4549     COMPUTE       2993x

  Decode (always):
   32,000       -        1      1.3     MEMORY           1x
   64,000       -        1      1.5     MEMORY           1x
```

**Key points**:
- Decode arithmetic intensity (AI) is 1.0-1.9 FLOP/byte, far below the ridge point (37), so decode is always memory-bound.
- Prefill at 95% reuse (only 5% new tokens) still has AI > 1000, far above the ridge point, so it remains compute-bound.
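The classification in the table is a plain roofline comparison against the ridge point. A small sketch, using a few AI values copied from the table (the `bound` helper is a hypothetical name introduced here for illustration):

```python
RIDGE = 37.0  # FLOP/byte, H20 ridge point from the spec block


def bound(ai: float, ridge: float = RIDGE) -> str:
    """Roofline classification: at or above the ridge is compute-bound."""
    return "COMPUTE" if ai >= ridge else "MEMORY"


# (label, AI in FLOP/byte) rows copied from the table above
rows = [
    ("prefill 32k, 95% reuse", 1980.0),
    ("prefill 64k, 95% reuse", 4549.0),
    ("decode 32k", 1.3),
    ("decode 64k", 1.5),
]

for label, ai in rows:
    # Ratio to the ridge shows how far each case sits from the crossover.
    print(f"{label:<24} AI={ai:>7.1f}  {bound(ai):<8} ratio_to_ridge={ai / RIDGE:.2f}")
```

Even the lowest prefill row sits tens of times above the ridge, while decode sits well below it.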

## Why is high-reuse prefill still compute-bound?

### Reason: attention compute scales with the full seq_len

With 95% cache reuse (seq_len=64k, new_tokens=3200):
```
  Q projection:   new_tokens × D × (H × D_head)  → only the 3200 new tokens ✓
  K,V projection: new_tokens × D × D_kv          → only the 3200 new tokens ✓
  
  But attention score: new_tokens × seq_len × D_head × H × L
                     = 3200 × 64000 × 128 × 32 × 48
                     → every new query still attends over the full 64k context!

  FFN (MoE):       new_tokens × 3 × D × D_ffn × 2 × K_experts × L
                  = 3200 × 3 × 2048 × 768 × 2 × 8 × 48
                  → 8 experts per new token (D_ffn = 768 each, 6144 combined) is still substantial work
```
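To compare the two terms numerically, here is a rough sketch that evaluates them under stated assumptions: 2 FLOPs per multiply-add, two attention matmuls (QK^T and PV), three FFN matrices per expert, and a per-expert intermediate size passed in as a parameter. The attention expression above omits constant factors; the sketch includes them. Function names are illustrative, and the counts ignore softmax, routing, and causal masking.

```python
def attention_flops(new_tokens: int, seq_len: int,
                    head_dim: int = 128, heads: int = 32, layers: int = 48) -> int:
    """FLOPs for QK^T plus PV when new_tokens queries attend over seq_len keys."""
    matmuls = 2        # QK^T and PV
    flops_per_mac = 2  # one multiply plus one add
    return matmuls * flops_per_mac * new_tokens * seq_len * head_dim * heads * layers


def moe_ffn_flops(new_tokens: int, d_ffn_per_expert: int,
                  hidden: int = 2048, active_experts: int = 8, layers: int = 48) -> int:
    """FLOPs for the gate, up, and down projections of the active experts."""
    matrices = 3
    flops_per_mac = 2
    return (matrices * flops_per_mac * new_tokens * hidden
            * d_ffn_per_expert * active_experts * layers)


# 95% reuse at 64k context, as in the example above.
attn = attention_flops(new_tokens=3200, seq_len=64_000)
# Assumption: 768 per expert, i.e. 6144 across the 8 active experts.
ffn = moe_ffn_flops(new_tokens=3200, d_ffn_per_expert=768)

print(f"attention: {attn:.3e} FLOPs")
print(f"MoE FFN:   {ffn:.3e} FLOPs")
print(f"attention per new token: {attn // 3200:.3e} FLOPs")
```

Under these assumptions the attention cost per new token depends only on `seq_len`, so shrinking `new_tokens` reduces the total but does not make each remaining token cheaper.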

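KV cache reuse removes:
- K/V projection compute for cached tokens (only new tokens are projected)
- KV writes for cached tokens (only new tokens are written)

What it **does not reduce** is the cost per new token:
- Attention over the full context (each new query still attends to all 64k tokens)
- MoE FFN compute (each new token still activates 8 experts)

So prefill FLOPs do fall with reuse, but **the linear terms (projections, FFN) shrink with the new-token count, while the attention term, new_tokens × seq_len, still carries the full context length**.
At long context the attention term dominates. The bytes moved (weights plus cached KV) are read once and amortized over thousands of new tokens, so AI stays far above the ridge point even at 95% reuse.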

## When does prefill become memory-bound?

```
  SeqLen=32,000: new_tokens ≈ 5-10 (reuse > 99.97%) → AI ≈ 37
  SeqLen=64,000: new_tokens ≈ 5-10 → AI ≈ 37
```

Prefill approaches memory-bound only at **near-100% reuse** (just 5-10 new tokens).
In the actual agentic trace, only about 3% of requests reach this point.
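For reference, the reuse fraction implied by a handful of new tokens can be checked with a few lines. This is a minimal sketch; `reuse_fraction` is a hypothetical helper name, and it only converts token counts to a percentage without modeling AI itself.

```python
def reuse_fraction(seq_len: int, new_tokens: int) -> float:
    """Share of the sequence served from the KV cache."""
    return 1.0 - new_tokens / seq_len


for seq_len in (32_000, 64_000):
    for new_tokens in (5, 10):
        print(f"seq_len={seq_len:>6,} new_tokens={new_tokens:>2} "
              f"reuse={reuse_fraction(seq_len, new_tokens):.3%}")
```

At these token counts a prefill step is nearly the same size as a small decode batch, which is consistent with it landing near the ridge point.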

## Implications for PD separation: correcting the earlier analysis

### Earlier incorrect conclusion (now corrected)
> "Most of prefill is cache lookup, not compute"

This is **wrong**. Even at 70% cache reuse, prefill AI is still roughly 7,800x to 13,600x that of decode.
Prefill is compute-bound in all but the near-100% reuse case, and decode is always memory-bound.

### So why did PD separation not help in our experiments?

The correct explanation is not that prefill became memory-bound, but the following:

**1. Cache reuse sharply cuts the absolute prefill compute**
```
  No cache:  avg 33.6k tokens × prefill compute = X FLOPs
  71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
```
Prefill is still compute-bound, but **the total work is only about 28% of the original**.
With 8 instances in parallel plus cache-aware routing, the prefill load on each instance is very light,
not enough to interfere noticeably with decode.
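The 0.28X figure is simply the ratio of average prefilled tokens with and without the cache. A short sketch of that arithmetic follows. It assumes prefill FLOPs are proportional to the number of new tokens, which is a first-order simplification: the attention term still scales with the full context, so the true fraction is likely somewhat higher.

```python
avg_prompt_tokens = 33_600  # average tokens per request without cache
avg_new_tokens = 9_400      # average tokens actually prefilled with cache

# First-order estimate: FLOPs assumed proportional to new tokens.
remaining_fraction = avg_new_tokens / avg_prompt_tokens
print(f"remaining prefill work: {remaining_fraction:.2f}X")  # 0.28X
```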

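**2. Per-token compute is small for an MoE model**
Active params are only 3B (10% of total parameters), so the compute per token is modest.
With a dense 70B model on the same GPUs, prefill-decode interference would be much more severe.

**3. The "load balancing" effect of cache-aware routing**
When a request is routed to an instance with a high cache hit rate, that instance's actual prefill work is smaller,
which naturally reduces P-D contention. This amounts to a "soft PD separation" at the routing layer.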

## Roofline characteristics across workload types

```
                          Prefill AI    Decode AI    PD-Sep value
  Dense 70B, Chatbot:     200-1000x      1-2x        HIGH (compute-heavy P interferes with D)
  Dense 70B, Agent:       100-500x       1-2x        MEDIUM (cache reduces P load)
  MoE 30B, Chatbot:       100-500x       1-2x        MEDIUM
  MoE 30B, Agent:         50-200x        1-2x        LOW (small active params + cache)
  ← our position
```

**The ROI of PD separation falls as the cache hit rate rises and as the model's active params shrink.**
Agentic MoE workloads are unfavorable for PD separation on both counts.

## Prefill bound distribution in the actual trace

```
  With actual trace prefix cache pattern (1000 sampled requests):
    Compute-bound prefills: 961 (96%)
    Memory-bound prefills:  37 (3%)    ← warm requests with near-100% reuse
    (Decode is ALWAYS memory-bound)
```

96% of prefills are still compute-bound, but **absolute compute drops sharply thanks to caching**.
This is a distinctive "compute-bound but lightweight" regime: the bound type is unchanged, but the load is far lighter.