Consolidate analysis into single report with appendix

Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report:

1. TL;DR with key finding (KV cache memory wall)
2. Workload characterization (trace stats + cache reuse)
3. Experiment setup (hardware, software, configs, scripts)
4. Results (main comparison, GPU util, breakdown, ablations)
5. Analysis (DistServe assumptions, roofline, root cause)
6. Conclusions
7. Appendix: all experiment artifacts, data paths, reproducing steps

One document to read, with pointers to data for deeper analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:23:23 +08:00
parent ce616f46d1
commit efa70f05b5
2 changed files with 191 additions and 374 deletions
--- a/analysis/pd_separation_analysis.md
+++ b/analysis/pd_separation_analysis.md
@@ -1,289 +1,236 @@
-# A Systematic Analysis of PD Separation under Agentic Workloads
+# PD Disaggregation for Agentic LLM Workloads: A Systematic Study
-## 1. Trace Characteristics (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
+## TL;DR
-```
+We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:
 Total requests:  2,114,220
 Input tokens:    71.1B (avg 33.6k/req, p50=20k, p90=88k)
 Output tokens:   940M  (avg 445/req, p50=80, p90=811)
 I/O ratio:       75.6x (aggregate), 217.8x (per-req median)
 Prefill share:   98% of total tokens
 Sessions:        1.3M (90% single-turn, 9% multi-turn)
 ```
-**Fundamental differences from a traditional chatbot workload:**
+**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted — it is a **KV cache memory wall** on decode instances.
-| Characteristic | Traditional Chatbot | Agentic Coder (GLM-5.1) |
+| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
-|------|-------------------|------------------------|
+|---|---|---|---|---|
-| I/O ratio | 1-10x | **75.6x** |
+| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
-| Input p50 | 500-2000 tokens | **20,030 tokens** |
+| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |
 | Output p50 | 200-500 tokens | **80 tokens** |
 | Prefill token share | 50-80% | **98%** |
 | >32k input | <5% | **38%** |
 | Multi-turn | 50-80% | **9%** |
-**KV cache reuse characteristics:**
+Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
-```
+---
 Unique hash blocks:        20,650,883
 Shared blocks (ref>1):     9,749,379 (47%)
 Highly shared (ref>10):    2,428,160
 Intra-session reuse:       57%
 Top-10 blocks ref count:   64,754 (system prompt blocks)
 Theoretical cache hit:     71% (infinite cache, first 100k requests)
 ```
-**Input length distribution and token share:**
+## 1. Workload Characterization
-```
+**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours
     <1k:   202,396 reqs ( 9%)       89M tokens ( 0%)
    1-8k:   380,009 reqs (17%)     1.6B tokens ( 2%)
   8-32k:   720,871 reqs (34%)    12.7B tokens (17%)
  32-65k:   405,371 reqs (19%)    19.4B tokens (27%)
 65-131k:   394,014 reqs (18%)    35.7B tokens (50%)
   >131k:    11,559 reqs ( 0%)     1.6B tokens ( 2%)
 ```
-50% of the token compute comes from long-context requests in the 65-131k range.
+| Metric | Value |
 |--------|-------|
 | Requests | 2,114,220 |
 | Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
 | Output tokens | 940M (avg 445, p50=80) |
 | I/O ratio | 75.6x aggregate, 217.8x per-request median |
 | Prefill token share | 98% |
 | Sessions | 1.3M (90% single-turn) |
 | >32k input | 38% of requests, 79% of tokens |
-## 2. Core Assumptions of DistServe-Style PD Separation
+**KV cache reuse**:
-PD separation work such as DistServe (OSDI'24), Splitwise, and TetriInfer rests on the following assumptions:
+| Metric | Value |
 |--------|-------|
 | Theoretical prefix cache hit (infinite, single inst) | 71% |
 | Shared hash blocks (ref>1) | 47% of unique blocks |
 | Intra-session reuse | 57% |
 | Top blocks ref count | 64,754 (system prompt) |
 | Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
 | Actual APC (Round-robin, 8 inst) | 20.8% |
-### Assumption A: Prefill and decode have different compute characteristics
+**Request profile after prefix cache**:
 - **Prefill**: compute-bound, high GPU utilization, benefits from larger batches
 - **Decode**: memory-bandwidth-bound, low GPU utilization, latency-sensitive
-**Validation on the agentic workload**: ✅ Holds, but needs refinement
+| Bucket | Count | Avg new tokens to prefill |
 |--------|-------|--------------------------|
 | >90% cache hit (warm) | 22% | 1,314 |
 | 50-90% cache hit | 14% | 10,052 |
 | 1-50% cache hit | 8% | 38,909 |
 | 0% cache hit (cold) | 55% | 17,696 |
-The roofline analysis shows (see Section 5 for details):
+## 2. Experiment Setup
-```
+**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)
                    Arithmetic Intensity (FLOP/byte)
  Decode:           1.0 - 1.9     (memory-bound, always far below the ridge point)
  Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
  Prefill 70% reuse: 10,000-42,000 (still compute-bound!)
  Prefill 95% reuse: 1,900-10,800  (still compute-bound!)
  Ridge point (H20): 37
 ```
-**Even at 95% KV cache reuse, prefill is still compute-bound.** The absolute amount of compute, however, drops sharply.
+**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv
-### Assumption B: PD co-location causes mutual interference
+**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)
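 - Large-batch prefill compute preempts GPU resources and raises decode TPOT
 - Continuous small decode steps occupy GPU scheduling slots and reduce prefill throughput
-**Validation on the agentic workload**: ⚠️ Interference exists, but **cache-aware routing can eliminate it**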
+**Configurations tested** (all use same cache-aware + token-level LB global scheduler unless noted):
-```
+| Config | Instances | GPU allocation | Scheduler |
-Same cache-aware scheduler, TP=1, 8 GPU:
+|--------|-----------|----------------|-----------|
-  Combined TP=1 DP=8:  TPOT p90 = 0.073s
+| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
-  PD-Sep TP=1 4P+4D:   TPOT p90 = 0.074s
+| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
-  → difference <2%, not significant
+| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
-```
+| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
 | PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |
-Compared with round-robin routing:
+**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`
 ```
  Combined TP=1 DP=8 (RR):          TPOT p90 = 0.086s
  Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s  → -15%
  → the gain from routing exceeds the gain from PD separation
 ```
-**Reason**: cache-aware routing concentrates high-cache-hit requests on specific instances,
+**Trace sampler**: `scripts/sample_trace.py` — random session sampling preserving multi-turn structure + hash_ids
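 so the number of new tokens each instance actually prefills drops sharply (71% served from cache),
 and prefill-decode interference eases naturally as the prefill workload shrinks.
-### Assumption C: KV cache transfer overhead is negligible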
+**Global scheduler**: `scripts/cache_aware_proxy.py` — supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, session affinity for multi-turn.
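The scoring rule above can be sketched as follows. This is an illustration only, not the code in `scripts/cache_aware_proxy.py`: the `Instance` fields, the `pick_instance` helper, the default `alpha=1.0`, and the choice that the lowest score wins are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    ongoing_tokens: int     # tokens currently in flight on this instance
    cache_hit_ratio: float  # share of the request's prefix already cached here (0..1)


def score(inst: Instance, avg_load: float, alpha: float) -> float:
    # Normalized load minus a cache bonus; lower is assumed to be better.
    return inst.ongoing_tokens / avg_load - alpha * inst.cache_hit_ratio


def pick_instance(instances: list[Instance], alpha: float = 1.0) -> Instance:
    total = sum(i.ongoing_tokens for i in instances)
    avg_load = max(total / len(instances), 1.0)  # guard against an idle cluster
    return min(instances, key=lambda i: score(i, avg_load, alpha))


equal_load = [Instance("warm", 1000, 0.9), Instance("cold", 1000, 0.0)]
skewed = [Instance("warm", 10000, 0.9), Instance("cold", 1000, 0.0)]
print(pick_instance(equal_load).name)  # the cached instance wins at equal load
print(pick_instance(skewed).name)      # the load term outweighs the cache bonus
```

Dividing by the average load keeps the load term on the same scale as the cache ratio, so a warm instance cannot keep attracting traffic once it is several times busier than its peers.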
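 - DistServe assumes the P→D KV transfer latency is far smaller than the prefill compute time
 - This holds on high-bandwidth interconnects such as InfiniBand/NVLink
-**Validation on the agentic workload**: ❌ Does not hold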
+## 3. Results
-```
+### 3.1 Main Comparison (unified cache-aware scheduler)
 PD-Sep TTFT p50 = 1.261s  vs  Combined TTFT p50 = 0.731s  (+72%)
 ```
 Reasons:
 1. Agentic workload inputs are extremely long (p50=20k, p90=88k tokens), so the KV cache is large
 2. Per-request KV cache = 20k tokens × 48 layers × 2(K+V) × 512 bytes ≈ 1GB
 3. More importantly, the await-prefill path adds serial latency: proxy → prefill → KV transfer → decode → first token
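The per-request KV size in item 2 can be reproduced with a short calculation. The sketch below is an assumption-laden check, not part of the original analysis: the 512-byte figure equals 4 KV heads × 128 head_dim at one byte per element, and the source does not state which KV cache dtype was used, so the BF16 (2 bytes per element) variant is shown alongside it for comparison. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    # K and V are each stored per token, per layer, per KV head.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return tokens * layers * per_token_per_layer


# Model shape as quoted in this report: 48 layers, 4 KV heads (GQA), head_dim 128.
one_byte = kv_cache_bytes(20_000, 48, 4, 128, bytes_per_elem=1)
bf16 = kv_cache_bytes(20_000, 48, 4, 128, bytes_per_elem=2)

print(f"1 byte/elem: {one_byte / 1e9:.2f} GB")  # matches the ~1GB estimate above
print(f"2 byte/elem: {bf16 / 1e9:.2f} GB")      # roughly double under BF16
```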
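 ### Assumption D: Dedicated prefill nodes improve prefill throughput
 - Prefill nodes do no decode, so GPU utilization is higher
 - Larger batch sizes become possible
 **Validation on the agentic workload**: ⚠️ The benefit is diluted by caching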
 ```
 Theoretical prefix cache hit (infinite cache): 71% of input tokens
 Actual APC (Combined, cache-aware, 8 inst): 44.7%
 ```
 71% cache hit → only 29% of input tokens need actual prefill compute.
 Nominal avg input 33.6k → actual avg new prefill ~9.7k tokens.
 The GPU-utilization advantage of dedicated prefill shrinks as the prefill workload drops.
 ## 3. Roofline Analysis: Compute/Memory Behavior of Prefill under High Cache Reuse
 ### 3.1 Model compute structure
 ```
 Qwen3-Coder-30B-A3B (MoE 128E top-8):
  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
  FFN: 6144 intermediate per expert, 8 experts active per token
  Active params per token: ~3B
 H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
 ```
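The ridge point quoted above is peak compute divided by memory bandwidth. A minimal sketch of that arithmetic and of the bound classification used in the roofline tables follows; the helper names `ridge_point` and `classify` are made up for illustration, and the FLOP and byte counts passed in are the decode figures for a 128,000-token sequence reported in this analysis.

```python
H20_PEAK_FLOPS = 148e12       # BF16 peak, as quoted above
H20_HBM_BYTES_PER_S = 4.0e12  # HBM bandwidth, as quoted above


def ridge_point(peak_flops: float, bytes_per_s: float) -> float:
    # Arithmetic intensity (FLOP/byte) at which compute and memory time are equal.
    return peak_flops / bytes_per_s


def classify(flops: float, bytes_moved: float, ridge: float) -> tuple[float, str]:
    ai = flops / bytes_moved
    return ai, ("COMPUTE" if ai >= ridge else "MEMORY")


ridge = ridge_point(H20_PEAK_FLOPS, H20_HBM_BYTES_PER_S)
decode_ai, decode_bound = classify(8.03e10, 4.26e10, ridge)
print(f"ridge={ridge:.0f} FLOP/byte, decode AI={decode_ai:.1f} -> {decode_bound}")
```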
 ### 3.2 Decode is always memory-bound
 ```
 SeqLen    FLOP        Bytes       AI (F/B)    Bound
 1,000     3.04e+10    3.01e+10    1.0         MEMORY
 16,000    3.63e+10    3.16e+10    1.1         MEMORY
 64,000    5.52e+10    3.63e+10    1.5         MEMORY
 128,000   8.03e+10    4.26e+10    1.9         MEMORY
 ```
 Decode AI is always < 2, far below the ridge point (37). Each decode step processes only 1 token,
 so compute is tiny and the bottleneck is reading the model weights and the full KV cache.
 ### 3.3 Prefill stays compute-bound even at 95% reuse
 ```
 SeqLen   Reuse%  NewTok    AI (F/B)    Bound       vs Decode
 32,000      0%   32,000     23,368     COMPUTE     18,190x
 32,000     50%   16,000     14,899     COMPUTE     11,597x
 32,000     70%    9,600     10,045     COMPUTE      7,819x
 32,000     90%    3,200      3,821     COMPUTE      2,974x
 32,000     95%    1,600      1,980     COMPUTE      1,542x
 64,000      0%   64,000     40,758     COMPUTE     26,813x
 64,000     70%   19,200     20,610     COMPUTE     13,559x
 64,000     90%    6,400      8,544     COMPUTE      5,621x
 64,000     95%    3,200      4,549     COMPUTE      2,993x
 ```
 ### 3.4 Why high reuse does not change the compute-bound nature
 What KV cache reuse reduces:
 - K/V projection compute (new tokens only)
 - KV writes (new tokens only)
 What KV cache reuse **does not reduce**:
 - **Q×K^T attention**: every new Q attends over all seq_len KV entries
  ```
  FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
  ```
  At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
  This quadratic term dominates total compute at long context
 - **MoE FFN**: every new token activates 8 experts
  ```
  FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
  ```
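The attention estimate above is easy to check numerically. The sketch below evaluates only the Q×K^T term exactly as written in the formula (no causal-mask halving, no softmax or value-weighting cost); the function name `attention_flops` is hypothetical, and the shape values are those quoted for the model in this analysis.

```python
def attention_flops(new_tokens: int, seq_len: int, head_dim: int,
                    num_heads: int, num_layers: int) -> int:
    # Each new query token is scored against every KV entry in the sequence;
    # the factor 2 counts one multiply and one add per element.
    return new_tokens * seq_len * head_dim * num_heads * 2 * num_layers


# 95% reuse on a 32k sequence leaves 1,600 new tokens.
flops_95 = attention_flops(1_600, 32_000, head_dim=128, num_heads=32, num_layers=48)
flops_0 = attention_flops(32_000, 32_000, head_dim=128, num_heads=32, num_layers=48)

print(f"95% reuse: {flops_95:.2e} FLOPs")
print(f" 0% reuse: {flops_0:.2e} FLOPs")
print(f"ratio: {flops_0 / flops_95:.0f}x")
```

Under this formula the term falls linearly with the number of new tokens, while each remaining new token still pays for the full sequence length.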
 **Prefill becomes memory-bound only near 100% reuse (< 10 new tokens).**
 ### 3.5 When does prefill become memory-bound
 ```
 SeqLen=32,000: at new_tokens ≈ 5-10 → AI ≈ 37 (ridge point)
 SeqLen=64,000: at new_tokens ≈ 5-10 → AI ≈ 37
 ```
 In the actual agentic trace:
 ```
 Compute-bound prefills: 961 (96%)
 Memory-bound prefills:  37 (3%)   ← extreme warm requests with near-100% reuse
 ```
 ### 3.6 Key insight: "Compute-bound but lightweight"
 Prefill under high cache reuse sits in an unusual regime:
 ```
  Prefill bound type:          Compute-bound (unchanged)
  Prefill absolute work:       Much lower (71% cache → only 29% of tokens computed)
  Prefill-decode interference: Reduced as absolute work drops (no physical isolation needed)
 ```
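 This explains why PD separation did not help:
 - PD separation addresses the problem of prefill being heavy enough to interfere with decode
 - But cache-aware routing has already made the actual prefill workload light enough
 - The benefit of physical isolation (PD separation) is offset by KV transfer overhead
 ## 4. Experiment Results
 ### 4.1 Full experiment matrix
 All experiments use the same cache-aware + token-level load-balanced global scheduler.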
 | Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
 |--------|------|----------|----------|---------|-----|
-| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
+| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
-| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
+| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
-| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
+| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
 | **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
 | TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
-### 4.2 Effect of Cache-Aware Routing
+### 3.2 GPU Utilization (200 req, time_scale=20)
 | Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
 |--------|-------------|-------------|------------|-----------------|
 | Combined 8colo | **30.5%** (active 64%) | — | — | Distributed |
 | PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
 | PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
 ### 3.3 Per-Request Breakdown (6P+2D, await mode)
 | Stage | p50 | % of TTFT |
 |-------|-----|-----------|
 | Prefill (queue + compute + KV push) | 0.108s | 12.3% |
 | Proxy overhead | 0.000s | 0.0% |
 | **KV pull + decode wait** | **109.6s** | **87.7%** |
 | Total TTFT | 110.2s | 100% |
 Root cause of 109.6s `kv+decode`: vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. GPU idle, requests queued for KV cache memory.
 ### 3.4 Ablations
 | Ablation | Change | TTFT | TPOT p90 | Verdict |
 |----------|--------|------|----------|---------|
 | P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
 | Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |
 ## 4. Analysis
 ### 4.1 DistServe's Assumptions vs Agentic Reality
 | Assumption | Chatbot (DistServe) | Agentic (this work) |
 |------------|-------------------|---------------------|
 | A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI >1000x vs decode AI <2 |
 | B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
 | C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k tokens, TTFT +72% from transfer |
 | D. Dedicated prefill improves throughput | ✅ | ❌ 71% cache hit → prefill already lightweight |
 | **E. Decode KV cache not a bottleneck** | **✅ (short context)** | **❌ THE bottleneck: 97% KV cache on decode** |
 ### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
 ```
-Round-robin → Cache-aware (Combined TP=1 DP=8):
+SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)
-  TTFT p50: 1.836s → 0.731s  (-60%)
+
-  TPOT p90: 0.086s → 0.073s  (-15%)
+Reuse%   NewTokens   AI (FLOP/byte)   Bound        vs Decode
-  E2E  p50: 6.673s → 4.480s  (-33%)
+0%       64,000      40,758           COMPUTE      26,813x
-  APC:      20.8%  → 44.7%   (+24pp)
+70%      19,200      20,610           COMPUTE      13,559x
 90%       6,400       8,544           COMPUTE       5,621x
 95%       3,200       4,549           COMPUTE       2,993x
 Decode        1         1.5           MEMORY            1x
 ```
-Cache-aware routing delivers a far larger gain than PD separation.
+Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with `new_tokens × seq_len` (quadratic in context, not just new tokens).
-### 4.3 Fixing Engineering Issues
+But **absolute FLOPs** drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.
-Several PD-separation engineering issues were found and fixed during the experiments:
+### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall
-| Issue | Root cause | Fix |
+PD separation concentrates all decode onto fewer GPUs:
 |------|------|------|
 | Decode engine crash | vLLM scheduler assert: the request is already aborted when the KV transfer callback fires | Patch scheduler.py: assert → graceful skip |
 | Head-of-line blocking | Proxy load-balances by request count and ignores request size | Token-level ongoing_tokens load balancing |
 | "Timeout waiting for P side ready" | Proxy dispatches prefill fire-and-forget, decode waits blindly for KV | Await-prefill + kv_load_failure_policy=recompute |
 | Port collision on startup | 8 Mooncake instances start at once and contend for the torch distributed port | Staggered startup + explicit MASTER_PORT |
 | Cache routing "rich get richer" | score = ongoing - alpha*cached concentrates traffic on a single instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |
-## 5. Conclusions
+| | Combined (8 inst) | PD-Sep 6P+2D |
 |---|---|---|
 | Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
 | Concurrent decode reqs | ~1 per inst | ~4 per inst |
 | KV cache utilization | Low | **97.1%** |
-### 5.1 Why PD Separation Does Not Pay Off on Agentic Workloads
+At 97.1% KV cache usage, a 49-token request (KV = few KB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache.
-1. **Cache reuse sharply reduces absolute prefill work (71% cache hit → only 29% computed)**, so P-D interference is not significant
+This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because KV cache is full.
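 2. **Prefill is still compute-bound** (AI stays >1000 even at 95% reuse), but total FLOPs per request drop sharply as new_tokens shrink
 3. **Cache-aware routing provides "soft PD isolation"**, matching physical isolation without the KV transfer overhead
 4. **KV transfer overhead is not negligible** (TTFT +72%) and cancels the isolation benefit
 5. **The MoE model has few active params** (3B), so per-token compute is light to begin with
-### 5.2 When PD Separation Is Worthwhile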
+### 4.4 Why Cache-Aware Routing Matters More Than PD Separation
-| Condition | Chatbot (worthwhile) | Agentic (not worthwhile) |
+| Change | TTFT impact | TPOT p90 impact | APC impact |
-|------|-----------------|-----------------|
+|--------|-------------|-----------------|------------|
-| Cache hit rate | <10% | **71%** |
+| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
-| Model active params | 70B (dense) | **3B (MoE)** |
+| Combined → PD-Sep | +72% | +1% | -5pp |
 | I/O ratio | 1-10x | **75.6x** |
 | Per-request prefill FLOPs | Very high | **Low (after cache)** |
 | KV transfer cost vs prefill cost | Negligible | **Significant** |
-### 5.3 How Agentic Workloads Should Be Optimized
+Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
-1. **Cache-aware routing** (validated): schedule jointly on ongoing_tokens + prefix_cache_hit,
+## 5. Conclusions
   raising APC from 20.8% (RR) to 44.7% and lowering TPOT p90 by 15%
-2. **Cross-instance KV cache sharing**: let multiple instances share a global KV pool,
+1. **Single-machine PD separation is net negative for agentic workloads** due to decode KV cache memory wall
-   pushing the cache hit rate closer to the theoretical 71%
+2. **Cache-aware routing is the dominant optimization** — improves TTFT by 60%, TPOT by 15%, APC by 24pp
 3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
 4. **PD separation may help in multi-machine settings** where decode has dedicated memory pools (e.g., DRAM-backed Mooncake KV store) not limited by single-GPU HBM
-3. **Prefix pre-warming**: for cold-start requests (55%, 0% cache hit),
+## 6. Patches Applied to vLLM 0.18.1
   precompute the common prefix (system prompt blocks) and distribute it to all instances
-4. **Differentiated handling by workload type**:
+| File | Change | Reason |
-   - Warm requests (22%, >90% cache hit, avg 1.3k new tokens): nearly free, any instance can serve them
+|------|--------|--------|
-   - Cold requests (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy, need enough compute
+| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
-   - Request-type-aware routing could optimize this further
+
 ---
 ## Appendix: Experiment Artifacts
 ### Data on dash0 (`~/agentic-kv/outputs/`)
 | Directory | Config | Requests | Notes |
 |-----------|--------|----------|-------|
 | `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
 | `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
 | `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
 | `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
 | `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
 | `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
 | `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
 | `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
 | `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
 ### Trace on dash0
 | Path | Description |
 |------|-------------|
 | `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
 | `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
 | `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
 ### Key Scripts
 | Script | Purpose |
 |--------|---------|
 | `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
 | `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
 | `replayer/` | Async trace replayer with streaming metrics |
 | `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
 | `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
 | `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
 | `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
 ### Reproducing
 ```bash
 # On dash0, activate env
 cd ~/agentic-kv && source .venv/bin/activate
 # Sample trace
 python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
 # Combined TP=1 DP=8 + cache-aware scheduler
 for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
        --port $((8000+i)) --tp 1 --enable-prefix-caching --enforce-eager &
 done
 python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
 python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
 # Breakdown data
 curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
 ```
--- a/analysis/roofline_analysis.md
+++ b/analysis/roofline_analysis.md
@@ -1,130 +0,0 @@
 # Compute/Memory Analysis of Prefill under High KV Cache Reuse
 ## Model & GPU
 ```
 Qwen3-Coder-30B-A3B (MoE 128E top-8)
  48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
  FFN: 6144 intermediate per expert, 8 experts active per token
  Active params: ~3B per token
 H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
  Ridge point: 37 FLOP/byte
 ```
 ## Key finding: prefill is still compute-bound even at 95% reuse
 ```
  SeqLen  Reuse%   NewTok   AI (F/B)   Bound       vs Decode AI
   32,000      0%   32,000    23368     COMPUTE      18189x
   32,000     70%    9,600    10045     COMPUTE       7819x
   32,000     90%    3,200     3821     COMPUTE       2974x
   32,000     95%    1,600     1980     COMPUTE       1541x
   64,000      0%   64,000    40758     COMPUTE      26813x
   64,000     70%   19,200    20610     COMPUTE      13559x
   64,000     90%    6,400     8544     COMPUTE       5621x
   64,000     95%    3,200     4549     COMPUTE       2993x
  Decode (always):
   32,000       -        1      1.3     MEMORY           1x
   64,000       -        1      1.5     MEMORY           1x
 ```
 **Key points**:
 - Decode arithmetic intensity (AI) = 1.0-1.9, far below the ridge point (37), always memory-bound
 - Prefill at 95% reuse (only 5% new tokens) still has AI >1000, far above the ridge point, so it stays compute-bound
 ## Why is high-reuse prefill still compute-bound?
 ### Reason: attention compute scales with seq_len
 With 95% cache reuse (seq_len=64k, new_tokens=3200):
 ```
  Q projection:   new_tokens × D × D     → only the 3200 new tokens ✓
  K,V projection: new_tokens × D × D_kv  → only the 3200 new tokens ✓
  But attention score: new_tokens × seq_len × D_head × H × L
                    = 3200 × 64000 × 128 × 32 × 48
                    → attention still runs over the full 64k context!
  FFN (MoE):       new_tokens × 3 × D × D_ffn × 2 × K_experts × L
                  = 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
                  → compute for the 8 experts is still large
 ```
 KV cache reuse reduces:
 - K/V projection compute (new tokens only)
 - KV writes (new tokens only)
 It **does not reduce**:
 - Attention of Q over the full context (every new Q attends to all 64k tokens)
 - MoE FFN compute (every new token activates 8 experts)
 So prefill FLOPs do fall with reuse, but **what falls is the linear part (projections), not the quadratic part (attention)**.
 At long context the quadratic part dominates, so AI stays far above the ridge point even at 95% reuse.
 ## When does prefill actually become memory-bound?
 ```
  SeqLen=32,000: at new_tokens ≈ 5-10 (reuse > 99.97%) → AI ≈ 37
  SeqLen=64,000: at new_tokens ≈ 5-10 → AI ≈ 37
 ```
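 Prefill approaches memory-bound only at **near-100% reuse** (just 5-10 new tokens).
 In the actual agentic trace, only 3% of requests reach that point.
 ## Impact on PD separation: correcting the earlier analysis
 ### Earlier incorrect conclusion (now corrected)
 > "Prefill is mostly cache lookup, not compute"
 This is **wrong**. Even at 70% cache reuse, prefill AI is 7,000-14,000x that of decode.
 Prefill is always compute-bound and decode is always memory-bound.
 ### So why did PD separation not help in our experiments?
 The correct explanation is not that prefill became memory-bound, but:
 **1. Cache reuse sharply reduces the absolute prefill compute**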
 ```
  No cache:  avg 33.6k tokens × prefill compute = X FLOPs
  71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
 ```
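 Prefill is still compute-bound, but **total work is only 28% of the original**.
 With 8 parallel instances + cache-aware routing, the prefill load on each instance is very light,
 not enough to interfere noticeably with decode.
 **2. Per-token compute of the MoE model is small to begin with**
 Active params are only 3B (10% of total params), so compute per token is modest.
 With a dense 70B model on the same GPUs, prefill-decode interference would be far more severe.
 **3. The "load-balancing" effect of cache-aware routing**
 When a request is routed to an instance with a high cache hit rate, that instance's actual prefill work is smaller,
 which naturally reduces P-D contention. This amounts to "soft PD separation" at the routing layer.
 ## Roofline characteristics across workload types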
 ```
                          Prefill AI    Decode AI    PD-Sep value
  Dense 70B, Chatbot:     200-1000x      1-2x        HIGH (compute-heavy P interferes with D)
  Dense 70B, Agent:       100-500x       1-2x        MEDIUM (cache reduces P load)
  MoE 30B, Chatbot:       100-500x       1-2x        MEDIUM
  MoE 30B, Agent:         50-200x        1-2x        LOW (small active params + cache)
  ← our position
 ```
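 **The ROI of PD separation falls as the cache hit rate rises and the model's active params shrink.**
 An agentic MoE model is unfavorable for PD separation on both counts.
 ## Prefill bound distribution on the actual trace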
 ```
  With actual trace prefix cache pattern (1000 sampled requests):
    Compute-bound prefills: 961 (96%)
    Memory-bound prefills:  37 (3%)    ← warm requests with near-100% reuse
    (Decode is ALWAYS memory-bound)
 ```
 96% of prefills are still compute-bound, but **absolute compute drops sharply thanks to caching**.
 This is a distinctive "compute-bound but lightweight" regime: the bound type is unchanged, but the intensity is much lower.