Consolidate analysis into single report with appendix
Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report: 1. TL;DR with key finding (KV cache memory wall) 2. Workload characterization (trace stats + cache reuse) 3. Experiment setup (hardware, software, configs, scripts) 4. Results (main comparison, GPU util, breakdown, ablations) 5. Analysis (DistServe assumptions, roofline, root cause) 6. Conclusions 7. Appendix: all experiment artifacts, data paths, reproducing steps One document to read, with pointers to data for deeper analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,289 +1,236 @@
|
|||||||
# PD 分离在 Agentic Workload 下的系统分析
|
# PD Disaggregation for Agentic LLM Workloads: A Systematic Study
|
||||||
|
|
||||||
## 1. Trace 特征 (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
|
## TL;DR
|
||||||
|
|
||||||
```
|
We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:
|
||||||
Total requests: 2,114,220
|
|
||||||
Input tokens: 71.1B (avg 33.6k/req, p50=20k, p90=88k)
|
|
||||||
Output tokens: 940M (avg 445/req, p50=80, p90=811)
|
|
||||||
I/O ratio: 75.6x (aggregate), 217.8x (per-req median)
|
|
||||||
Prefill share: 98% of total tokens
|
|
||||||
Sessions: 1.3M (90% single-turn, 9% multi-turn)
|
|
||||||
```
|
|
||||||
|
|
||||||
**与传统 chatbot workload 的根本区别:**
|
**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted — it is a **KV cache memory wall** on decode instances.
|
||||||
|
|
||||||
| 特征 | Traditional Chatbot | Agentic Coder (GLM-5.1) |
|
| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
|
||||||
|------|-------------------|------------------------|
|
|---|---|---|---|---|
|
||||||
| I/O ratio | 1-10x | **75.6x** |
|
| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
|
||||||
| Input p50 | 500-2000 tokens | **20,030 tokens** |
|
| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |
|
||||||
| Output p50 | 200-500 tokens | **80 tokens** |
|
|
||||||
| Prefill token share | 50-80% | **98%** |
|
|
||||||
| >32k input | <5% | **38%** |
|
|
||||||
| Multi-turn | 50-80% | **9%** |
|
|
||||||
|
|
||||||
**KV Cache 复用特征:**
|
Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
|
||||||
|
|
||||||
```
|
---
|
||||||
Unique hash blocks: 20,650,883
|
|
||||||
Shared blocks (ref>1): 9,749,379 (47%)
|
|
||||||
Highly shared (ref>10): 2,428,160
|
|
||||||
Intra-session reuse: 57%
|
|
||||||
Top-10 blocks ref count: 64,754 (system prompt blocks)
|
|
||||||
Theoretical cache hit: 71% (infinite cache, first 100k requests)
|
|
||||||
```
|
|
||||||
|
|
||||||
**Input length 分布与 token 占比:**
|
## 1. Workload Characterization
|
||||||
|
|
||||||
```
|
**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours
|
||||||
<1k: 202,396 reqs ( 9%) 89M tokens ( 0%)
|
|
||||||
1-8k: 380,009 reqs (17%) 1.6B tokens ( 2%)
|
|
||||||
8-32k: 720,871 reqs (34%) 12.7B tokens (17%)
|
|
||||||
32-65k: 405,371 reqs (19%) 19.4B tokens (27%)
|
|
||||||
65-131k: 394,014 reqs (18%) 35.7B tokens (50%)
|
|
||||||
>131k: 11,559 reqs ( 0%) 1.6B tokens ( 2%)
|
|
||||||
```
|
|
||||||
|
|
||||||
50% 的 token 计算量来自 65-131k 的长 context 请求。
|
| Metric | Value |
|
||||||
|
|--------|-------|
|
||||||
|
| Requests | 2,114,220 |
|
||||||
|
| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
|
||||||
|
| Output tokens | 940M (avg 445, p50=80) |
|
||||||
|
| I/O ratio | 75.6x aggregate, 217.8x per-request median |
|
||||||
|
| Prefill token share | 98% |
|
||||||
|
| Sessions | 1.3M (90% single-turn) |
|
||||||
|
| >32k input | 38% of requests, 79% of tokens |
|
||||||
|
|
||||||
## 2. DistServe 等 PD 分离的核心假设
|
**KV cache reuse**:
|
||||||
|
|
||||||
DistServe (OSDI'24), Splitwise, TetriInfer 等 PD 分离工作基于以下假设:
|
| Metric | Value |
|
||||||
|
|--------|-------|
|
||||||
|
| Theoretical prefix cache hit (infinite, single inst) | 71% |
|
||||||
|
| Shared hash blocks (ref>1) | 47% of unique blocks |
|
||||||
|
| Intra-session reuse | 57% |
|
||||||
|
| Top blocks ref count | 64,754 (system prompt) |
|
||||||
|
| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
|
||||||
|
| Actual APC (Round-robin, 8 inst) | 20.8% |
|
||||||
|
|
||||||
### 假设 A: Prefill 和 Decode 有不同的计算特征
|
**Request profile after prefix cache**:
|
||||||
- **Prefill**: compute-bound, 高 GPU 利用率, batch 越大越好
|
|
||||||
- **Decode**: memory-bandwidth-bound, 低 GPU 利用率, latency-sensitive
|
|
||||||
|
|
||||||
**在 agentic workload 中的验证**: ✅ 成立,但需要细化
|
| Bucket | Count | Avg new tokens to prefill |
|
||||||
|
|--------|-------|--------------------------|
|
||||||
|
| >90% cache hit (warm) | 22% | 1,314 |
|
||||||
|
| 50-90% cache hit | 14% | 10,052 |
|
||||||
|
| 1-50% cache hit | 8% | 38,909 |
|
||||||
|
| 0% cache hit (cold) | 55% | 17,696 |
|
||||||
|
|
||||||
Roofline 分析显示(详见 Section 5):
|
## 2. Experiment Setup
|
||||||
|
|
||||||
```
|
**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)
|
||||||
Arithmetic Intensity (FLOP/byte)
|
|
||||||
Decode: 1.0 - 1.9 (memory-bound, 始终远低于 ridge point)
|
|
||||||
Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
|
|
||||||
Prefill 70% reuse: 10,000-42,000 (仍然 compute-bound!)
|
|
||||||
Prefill 95% reuse: 1,900-10,800 (仍然 compute-bound!)
|
|
||||||
Ridge point (H20): 37
|
|
||||||
```
|
|
||||||
|
|
||||||
**即使 95% KV cache reuse,prefill 仍然是 compute-bound。** 但绝对计算量大幅减少。
|
**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv
|
||||||
|
|
||||||
### 假设 B: PD co-location 导致互相干扰
|
**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)
|
||||||
- Prefill 的大 batch 计算会抢占 GPU 资源,导致 decode 的 TPOT 升高
|
|
||||||
- Decode 的持续小计算会占用 GPU 调度槽位,影响 prefill 吞吐
|
|
||||||
|
|
||||||
**在 agentic workload 中的验证**: ⚠️ 干扰存在,但 **可被 cache-aware routing 消除**
|
**Configurations tested** (all use same cache-aware + token-level LB global scheduler unless noted):
|
||||||
|
|
||||||
```
|
| Config | Instances | GPU allocation | Scheduler |
|
||||||
同一 cache-aware scheduler, TP=1, 8 GPU:
|
|--------|-----------|----------------|-----------|
|
||||||
Combined TP=1 DP=8: TPOT p90 = 0.073s
|
| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
|
||||||
PD-Sep TP=1 4P+4D: TPOT p90 = 0.074s
|
| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
|
||||||
→ 差异 <2%, 不显著
|
| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
|
||||||
```
|
| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
|
||||||
|
| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |
|
||||||
|
|
||||||
对比 round-robin routing:
|
**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`
|
||||||
```
|
|
||||||
Combined TP=1 DP=8 (RR): TPOT p90 = 0.086s
|
|
||||||
Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s → -15%
|
|
||||||
→ routing 改善 > PD 分离改善
|
|
||||||
```
|
|
||||||
|
|
||||||
**原因**: cache-aware routing 让 high-cache-hit 的请求集中到特定 instance,
|
**Trace sampler**: `scripts/sample_trace.py` — random session sampling preserving multi-turn structure + hash_ids
|
||||||
每个 instance 的实际 prefill 新 token 数大幅减少(71% 被 cache),
|
|
||||||
prefill-decode 干扰因 prefill 工作量降低而自然缓解。
|
|
||||||
|
|
||||||
### 假设 C: KV Cache 传输开销可以忽略
|
**Global scheduler**: `scripts/cache_aware_proxy.py` — supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, session affinity for multi-turn.
|
||||||
- DistServe 假设 P→D 的 KV 传输延迟远小于 prefill 计算时间
|
|
||||||
- 在 InfiniBand/NVLink 等高带宽互联下成立
|
|
||||||
|
|
||||||
**在 agentic workload 中的验证**: ❌ 不成立
|
## 3. Results
|
||||||
|
|
||||||
```
|
### 3.1 Main Comparison (unified cache-aware scheduler)
|
||||||
PD-Sep TTFT p50 = 1.261s vs Combined TTFT p50 = 0.731s (+72%)
|
|
||||||
```
|
|
||||||
|
|
||||||
原因:
|
|
||||||
1. Agentic workload 的 input 极长(p50=20k, p90=88k tokens),KV cache 很大
|
|
||||||
2. 单请求 KV cache = 20k tokens × 48 layers × 2(K+V) × 512 bytes ≈ 1GB
|
|
||||||
3. 更重要的是 await-prefill 链路的串行延迟:proxy → prefill → KV transfer → decode → first token
|
|
||||||
|
|
||||||
### 假设 D: 专用 prefill 节点可以提高 prefill 吞吐
|
|
||||||
- Prefill 节点不做 decode,GPU 利用率更高
|
|
||||||
- 可以用更大的 batch size
|
|
||||||
|
|
||||||
**在 agentic workload 中的验证**: ⚠️ 收益被 cache 稀释
|
|
||||||
|
|
||||||
```
|
|
||||||
理论 prefix cache hit (infinite cache): 71% of input tokens
|
|
||||||
实际 APC (Combined, cache-aware, 8 inst): 44.7%
|
|
||||||
```
|
|
||||||
|
|
||||||
71% cache hit → 只有 29% 的 input tokens 需要实际 prefill compute。
|
|
||||||
Nominal avg input 33.6k → Actual avg new prefill ~9.7k tokens。
|
|
||||||
专用 prefill 的 GPU 利用率优势因 prefill 工作量降低而缩小。
|
|
||||||
|
|
||||||
## 3. Roofline 分析:Prefill 在高 Cache Reuse 下的计算/访存特性
|
|
||||||
|
|
||||||
### 3.1 模型计算结构
|
|
||||||
|
|
||||||
```
|
|
||||||
Qwen3-Coder-30B-A3B (MoE 128E top-8):
|
|
||||||
48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
|
|
||||||
FFN: 6144 intermediate per expert, 8 experts active per token
|
|
||||||
Active params per token: ~3B
|
|
||||||
|
|
||||||
H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.2 Decode 永远 memory-bound
|
|
||||||
|
|
||||||
```
|
|
||||||
SeqLen FLOP Bytes AI (F/B) Bound
|
|
||||||
1,000 3.04e+10 3.01e+10 1.0 MEMORY
|
|
||||||
16,000 3.63e+10 3.16e+10 1.1 MEMORY
|
|
||||||
64,000 5.52e+10 3.63e+10 1.5 MEMORY
|
|
||||||
128,000 8.03e+10 4.26e+10 1.9 MEMORY
|
|
||||||
```
|
|
||||||
|
|
||||||
Decode 的 AI 始终 < 2,远低于 ridge point (37)。每个 decode step 只处理 1 个 token,
|
|
||||||
计算量极小,瓶颈在于读取模型权重和全量 KV cache。
|
|
||||||
|
|
||||||
### 3.3 Prefill 即使 95% reuse 仍然 compute-bound
|
|
||||||
|
|
||||||
```
|
|
||||||
SeqLen Reuse% NewTok AI (F/B) Bound vs Decode
|
|
||||||
32,000 0% 32,000 23,368 COMPUTE 18,190x
|
|
||||||
32,000 50% 16,000 14,899 COMPUTE 11,597x
|
|
||||||
32,000 70% 9,600 10,045 COMPUTE 7,819x
|
|
||||||
32,000 90% 3,200 3,821 COMPUTE 2,974x
|
|
||||||
32,000 95% 1,600 1,980 COMPUTE 1,542x
|
|
||||||
|
|
||||||
64,000 0% 64,000 40,758 COMPUTE 26,813x
|
|
||||||
64,000 70% 19,200 20,610 COMPUTE 13,559x
|
|
||||||
64,000 90% 6,400 8,544 COMPUTE 5,621x
|
|
||||||
64,000 95% 3,200 4,549 COMPUTE 2,993x
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.4 为什么高 reuse 不改变 compute-bound 性质
|
|
||||||
|
|
||||||
KV cache reuse 减少的:
|
|
||||||
- K/V projection 计算(只算 new tokens)
|
|
||||||
- KV 写入(只写 new tokens)
|
|
||||||
|
|
||||||
KV cache reuse **不减少**的:
|
|
||||||
- **Q×K^T attention**: 每个 new Q 都要和全部 seq_len 个 KV 做 attention
|
|
||||||
```
|
|
||||||
FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
|
|
||||||
```
|
|
||||||
At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
|
|
||||||
这个二次项在长 context 下主导总计算量
|
|
||||||
|
|
||||||
- **MoE FFN**: 每个 new token 激活 8 experts
|
|
||||||
```
|
|
||||||
FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
|
|
||||||
```
|
|
||||||
|
|
||||||
**Prefill 只在接近 100% reuse (< 10 new tokens) 时才变成 memory-bound。**
|
|
||||||
|
|
||||||
### 3.5 Prefill 什么时候变 memory-bound
|
|
||||||
|
|
||||||
```
|
|
||||||
SeqLen=32,000: new_tokens ≈ 5-10 时 → AI ≈ 37 (ridge point)
|
|
||||||
SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
|
|
||||||
```
|
|
||||||
|
|
||||||
在实际 agentic trace 中:
|
|
||||||
```
|
|
||||||
Compute-bound prefills: 961 (96%)
|
|
||||||
Memory-bound prefills: 37 (3%) ← 近 100% reuse 的极端 warm 请求
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3.6 关键洞察:"Compute-bound but lightweight"
|
|
||||||
|
|
||||||
高 cache reuse 下的 prefill 处于一种独特状态:
|
|
||||||
|
|
||||||
```
|
|
||||||
Prefill bound 类型: Compute-bound (不变)
|
|
||||||
Prefill 绝对工作量: 大幅降低 (71% cache → 只算 29% 的 tokens)
|
|
||||||
Prefill-Decode 干扰: 因绝对工作量降低而减轻 (不需要物理隔离)
|
|
||||||
```
|
|
||||||
|
|
||||||
这解释了为什么 PD 分离没有帮助:
|
|
||||||
- PD 分离解决的是 "prefill 太重干扰 decode" 的问题
|
|
||||||
- 但 cache-aware routing 已经把 prefill 的实际工作量降到足够轻
|
|
||||||
- 物理隔离(PD 分离)的收益被 KV 传输开销抵消
|
|
||||||
|
|
||||||
## 4. 实验结果
|
|
||||||
|
|
||||||
### 4.1 完整实验矩阵
|
|
||||||
|
|
||||||
所有实验使用统一的 cache-aware + token-level load-balanced global scheduler。
|
|
||||||
|
|
||||||
| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|
| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|
||||||
|--------|------|----------|----------|---------|-----|
|
|--------|------|----------|----------|---------|-----|
|
||||||
| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
|
| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
|
||||||
| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
|
| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
|
||||||
| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
|
| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
|
||||||
| **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
|
|
||||||
| TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
|
|
||||||
|
|
||||||
### 4.2 Cache-Aware Routing 的效果
|
### 3.2 GPU Utilization (200 req, time_scale=20)
|
||||||
|
|
||||||
|
| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
|
||||||
|
|--------|-------------|-------------|------------|-----------------|
|
||||||
|
| Combined 8colo | **30.5%** (active 64%) | — | — | Distributed |
|
||||||
|
| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
|
||||||
|
| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
|
||||||
|
|
||||||
|
### 3.3 Per-Request Breakdown (6P+2D, await mode)
|
||||||
|
|
||||||
|
| Stage | p50 | % of TTFT |
|
||||||
|
|-------|-----|-----------|
|
||||||
|
| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
|
||||||
|
| Proxy overhead | 0.000s | 0.0% |
|
||||||
|
| **KV pull + decode wait** | **109.6s** | **87.7%** |
|
||||||
|
| Total TTFT | 110.2s | 100% |
|
||||||
|
|
||||||
|
Root cause of 109.6s `kv+decode`: vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. GPU idle, requests queued for KV cache memory.
|
||||||
|
|
||||||
|
### 3.4 Ablations
|
||||||
|
|
||||||
|
| Ablation | Change | TTFT | TPOT p90 | Verdict |
|
||||||
|
|----------|--------|------|----------|---------|
|
||||||
|
| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
|
||||||
|
| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |
|
||||||
|
|
||||||
|
## 4. Analysis
|
||||||
|
|
||||||
|
### 4.1 DistServe's Assumptions vs Agentic Reality
|
||||||
|
|
||||||
|
| Assumption | Chatbot (DistServe) | Agentic (this work) |
|
||||||
|
|------------|-------------------|---------------------|
|
||||||
|
| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI >1000x vs decode AI <2 |
|
||||||
|
| B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
|
||||||
|
| C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k tokens, TTFT +72% from transfer |
|
||||||
|
| D. Dedicated prefill improves throughput | ✅ | ❌ 71% cache hit → prefill already lightweight |
|
||||||
|
| **E. Decode KV cache not a bottleneck** | **✅ (short context)** | **❌ THE bottleneck: 97% KV cache on decode** |
|
||||||
|
|
||||||
|
### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
|
||||||
|
|
||||||
```
|
```
|
||||||
Round-robin → Cache-aware (Combined TP=1 DP=8):
|
SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)
|
||||||
TTFT p50: 1.836s → 0.731s (-60%)
|
|
||||||
TPOT p90: 0.086s → 0.073s (-15%)
|
Reuse% NewTokens AI (FLOP/byte) Bound vs Decode
|
||||||
E2E p50: 6.673s → 4.480s (-33%)
|
0% 64,000 40,758 COMPUTE 26,813x
|
||||||
APC: 20.8% → 44.7% (+24pp)
|
70% 19,200 20,610 COMPUTE 13,559x
|
||||||
|
90% 6,400 8,544 COMPUTE 5,621x
|
||||||
|
95% 3,200 4,549 COMPUTE 2,993x
|
||||||
|
Decode 1 1.5 MEMORY 1x
|
||||||
```
|
```
|
||||||
|
|
||||||
Cache-aware routing 的提升远大于 PD 分离的提升。
|
Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with `new_tokens × seq_len` (quadratic in context, not just new tokens).
|
||||||
|
|
||||||
### 4.3 修复工程问题的过程
|
But **absolute FLOPs** drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.
|
||||||
|
|
||||||
实验过程中发现并修复了多个 PD 分离的工程问题:
|
### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall
|
||||||
|
|
||||||
| 问题 | 根因 | 修复 |
|
PD separation concentrates all decode onto fewer GPUs:
|
||||||
|------|------|------|
|
|
||||||
| Decode engine crash | vLLM scheduler assert: KV transfer 回调时 request 已 abort | Patch scheduler.py: assert → graceful skip |
|
|
||||||
| Head-of-line blocking | Proxy 按 request count 做 LB,不区分大小请求 | Token-level ongoing_tokens load balancing |
|
|
||||||
| "Timeout waiting for P side ready" | Proxy fire-and-forget prefill, decode 盲等 KV | Await-prefill + kv_load_failure_policy=recompute |
|
|
||||||
| Port collision on startup | 8 Mooncake instances 同时启动争抢 torch distributed port | Staggered startup + explicit MASTER_PORT |
|
|
||||||
| Cache routing "rich get richer" | score = ongoing - alpha*cached 导致流量集中到一个 instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |
|
|
||||||
|
|
||||||
## 5. 结论
|
| | Combined (8 inst) | PD-Sep 6P+2D |
|
||||||
|
|---|---|---|
|
||||||
|
| Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
|
||||||
|
| Concurrent decode reqs | ~1 per inst | ~4 per inst |
|
||||||
|
| KV cache utilization | Low | **97.1%** |
|
||||||
|
|
||||||
### 5.1 PD 分离为什么在 Agentic Workload 不生效
|
At 97.1% KV cache usage, a 49-token request (KV = few KB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache.
|
||||||
|
|
||||||
1. **Cache reuse 大幅降低 prefill 绝对工作量(71% cache hit → 只算 29%)**,使得 P-D 干扰不显著
|
This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because KV cache is full.
|
||||||
2. **Prefill 仍然 compute-bound**(即使 95% reuse,AI 仍 >1000),但每个请求的总 FLOPs 因 new_tokens 减少而大幅降低
|
|
||||||
3. **Cache-aware routing 提供 "软 PD 隔离"**,效果等同于物理隔离但无 KV 传输开销
|
|
||||||
4. **KV 传输开销不可忽略**(TTFT +72%),抵消了隔离收益
|
|
||||||
5. **MoE 模型 active params 小**(3B),per-token compute 本身较轻
|
|
||||||
|
|
||||||
### 5.2 PD 分离在什么条件下有价值
|
### 4.4 Why Cache-Aware Routing Matters More Than PD Separation
|
||||||
|
|
||||||
| 条件 | Chatbot (有价值) | Agentic (无价值) |
|
| Change | TTFT impact | TPOT p90 impact | APC impact |
|
||||||
|------|-----------------|-----------------|
|
|--------|-------------|-----------------|------------|
|
||||||
| Cache hit rate | <10% | **71%** |
|
| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
|
||||||
| Model active params | 70B (dense) | **3B (MoE)** |
|
| Combined → PD-Sep | +72% | +1% | -5pp |
|
||||||
| I/O ratio | 1-10x | **75.6x** |
|
|
||||||
| Per-request prefill FLOPs | Very high | **Low (after cache)** |
|
|
||||||
| KV transfer cost vs prefill cost | Negligible | **Significant** |
|
|
||||||
|
|
||||||
### 5.3 Agentic Workload 应该怎么优化
|
Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
|
||||||
|
|
||||||
1. **Cache-aware routing** (已验证有效): 用 ongoing_tokens + prefix_cache_hit 做联合调度,
|
## 5. Conclusions
|
||||||
将 APC 从 20.8% (RR) 提升到 44.7%,TPOT p90 降低 15%
|
|
||||||
|
|
||||||
2. **Cross-instance KV cache sharing**: 让多个 instance 共享全局 KV pool,
|
1. **Single-machine PD separation is net negative for agentic workloads** due to decode KV cache memory wall
|
||||||
进一步提升 cache hit 率接近理论 71%
|
2. **Cache-aware routing is the dominant optimization** — improves TTFT by 60%, TPOT by 15%, APC by 24pp
|
||||||
|
3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
|
||||||
|
4. **PD separation may help in multi-machine settings** where decode has dedicated memory pools (e.g., DRAM-backed Mooncake KV store) not limited by single-GPU HBM
|
||||||
|
|
||||||
3. **Prefix pre-warming**: 对 cold start 请求(55%,0% cache hit),
|
## 6. Patches Applied to vLLM 0.18.1
|
||||||
预计算 common prefix (system prompt blocks) 并分发到所有 instance
|
|
||||||
|
|
||||||
4. **不同 workload 类型的差异化处理**:
|
| File | Change | Reason |
|
||||||
- Warm 请求 (22%, >90% cache hit, avg 1.3k new tokens): 几乎免费,任何 instance 都能处理
|
|------|--------|--------|
|
||||||
- Cold 请求 (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy,需要有足够 compute
|
| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
|
||||||
- 可以用 request-type-aware routing 进一步优化
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix: Experiment Artifacts
|
||||||
|
|
||||||
|
### Data on dash0 (`~/agentic-kv/outputs/`)
|
||||||
|
|
||||||
|
| Directory | Config | Requests | Notes |
|
||||||
|
|-----------|--------|----------|-------|
|
||||||
|
| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
|
||||||
|
| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
|
||||||
|
| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
|
||||||
|
| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
|
||||||
|
| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
|
||||||
|
| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
|
||||||
|
| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
|
||||||
|
| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
|
||||||
|
| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
|
||||||
|
|
||||||
|
### Trace on dash0
|
||||||
|
|
||||||
|
| Path | Description |
|
||||||
|
|------|-------------|
|
||||||
|
| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
|
||||||
|
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
|
||||||
|
| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
|
||||||
|
|
||||||
|
### Key Scripts
|
||||||
|
|
||||||
|
| Script | Purpose |
|
||||||
|
|--------|---------|
|
||||||
|
| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
|
||||||
|
| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
|
||||||
|
| `replayer/` | Async trace replayer with streaming metrics |
|
||||||
|
| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
|
||||||
|
| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
|
||||||
|
| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
|
||||||
|
| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
|
||||||
|
|
||||||
|
### Reproducing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On dash0, activate env
|
||||||
|
cd ~/agentic-kv && source .venv/bin/activate
|
||||||
|
|
||||||
|
# Sample trace
|
||||||
|
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
|
||||||
|
--output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
|
||||||
|
|
||||||
|
# Combined TP=1 DP=8 + cache-aware scheduler
|
||||||
|
for i in $(seq 0 7); do
|
||||||
|
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
|
||||||
|
--port $((8000+i)) --tp 1 --enable-prefix-caching --enforce-eager &
|
||||||
|
done
|
||||||
|
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
|
||||||
|
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
|
||||||
|
--endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
|
||||||
|
|
||||||
|
# Breakdown data
|
||||||
|
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
|
||||||
|
```
|
||||||
|
|||||||
@@ -1,130 +0,0 @@
|
|||||||
# Prefill 在高 KV Cache Reuse 下的计算/访存分析
|
|
||||||
|
|
||||||
## Model & GPU
|
|
||||||
|
|
||||||
```
|
|
||||||
Qwen3-Coder-30B-A3B (MoE 128E top-8)
|
|
||||||
48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
|
|
||||||
FFN: 6144 intermediate per expert, 8 experts active per token
|
|
||||||
Active params: ~3B per token
|
|
||||||
|
|
||||||
H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
|
|
||||||
Ridge point: 37 FLOP/byte
|
|
||||||
```
|
|
||||||
|
|
||||||
## 核心发现:Prefill 即使 95% reuse 仍然是 compute-bound
|
|
||||||
|
|
||||||
```
|
|
||||||
SeqLen Reuse% NewTok AI (F/B) Bound vs Decode AI
|
|
||||||
32,000 0% 32,000 23368 COMPUTE 18189x
|
|
||||||
32,000 70% 9,600 10045 COMPUTE 7819x
|
|
||||||
32,000 90% 3,200 3821 COMPUTE 2974x
|
|
||||||
32,000 95% 1,600 1980 COMPUTE 1541x
|
|
||||||
|
|
||||||
64,000 0% 64,000 40758 COMPUTE 26813x
|
|
||||||
64,000 70% 19,200 20610 COMPUTE 13559x
|
|
||||||
64,000 90% 6,400 8544 COMPUTE 5621x
|
|
||||||
64,000 95% 3,200 4549 COMPUTE 2993x
|
|
||||||
|
|
||||||
Decode (always):
|
|
||||||
32,000 - 1 1.3 MEMORY 1x
|
|
||||||
64,000 - 1 1.5 MEMORY 1x
|
|
||||||
```
|
|
||||||
|
|
||||||
**关键**:
|
|
||||||
- Decode 的 arithmetic intensity (AI) = 1.0-1.9 — 远低于 ridge point (37),始终 memory-bound
|
|
||||||
- Prefill 即使 95% reuse (只有 5% 新 token),AI 仍然 >1000 — 远高于 ridge point,依然 compute-bound
|
|
||||||
|
|
||||||
## 为什么高 reuse 的 prefill 仍然是 compute-bound?
|
|
||||||
|
|
||||||
### 原因:Attention 的计算量与 seq_len 成正比
|
|
||||||
|
|
||||||
当有 95% cache reuse (seq_len=64k, new_tokens=3200):
|
|
||||||
```
|
|
||||||
Q projection: new_tokens × D × D → 只处理 3200 new tokens ✓
|
|
||||||
K,V projection: new_tokens × D × D_kv → 只处理 3200 new tokens ✓
|
|
||||||
|
|
||||||
但 Attention score: new_tokens × seq_len × D_head × H × L
|
|
||||||
= 3200 × 64000 × 128 × 32 × 48
|
|
||||||
→ 仍然要对全部 64k context 做注意力计算!
|
|
||||||
|
|
||||||
FFN (MoE): new_tokens × 3 × D × D_ffn × 2 × K_experts × L
|
|
||||||
= 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
|
|
||||||
→ 8 个 expert 的计算量仍然很大
|
|
||||||
```
|
|
||||||
|
|
||||||
KV cache reuse 减少的是:
|
|
||||||
- K/V projection 的计算(只算 new tokens)
|
|
||||||
- KV 写入(只写 new tokens)
|
|
||||||
|
|
||||||
但 **不减少的是**:
|
|
||||||
- Q 对全部 context 的 attention(每个 new Q 都要和所有 64k tokens 做 attention)
|
|
||||||
- MoE FFN 的计算(每个 new token 激活 8 个 expert)
|
|
||||||
|
|
||||||
所以 prefill 的 FLOPs 虽然随 reuse 减少,但 **减少的是线性部分(投影),不减少的是二次部分(attention)**。
|
|
||||||
在长 context 下,二次部分主导,使得即使 95% reuse,AI 仍远高于 ridge point。
|
|
||||||
|
|
||||||
## Prefill 什么时候才变成 memory-bound?
|
|
||||||
|
|
||||||
```
|
|
||||||
SeqLen=32,000: new_tokens ≈ 5-10 时 (reuse > 99.97%) → AI ≈ 37
|
|
||||||
SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
|
|
||||||
```
|
|
||||||
|
|
||||||
只有在 **近乎 100% reuse**(仅 5-10 个 new tokens)时,prefill 才接近 memory-bound。
|
|
||||||
在实际 agentic trace 中,只有 3% 的请求达到这个程度。
|
|
||||||
|
|
||||||
## 对 PD 分离的影响:修正之前的分析
|
|
||||||
|
|
||||||
### 之前的错误结论(已修正)
|
|
||||||
> "Prefill 大部分是 cache lookup 不是 compute"
|
|
||||||
|
|
||||||
这是 **错误的**。即使 70% cache reuse,prefill 的 AI 仍然是 decode 的 7000-14000 倍。
|
|
||||||
Prefill 始终是 compute-bound,decode 始终是 memory-bound。
|
|
||||||
|
|
||||||
### 那为什么 PD 分离在我们的实验中没有帮助?
|
|
||||||
|
|
||||||
正确的解释不是 "prefill 变成了 memory-bound",而是:
|
|
||||||
|
|
||||||
**1. Cache reuse 大幅减少了 prefill 的绝对计算量**
|
|
||||||
```
|
|
||||||
无 cache: avg 33.6k tokens × prefill compute = X FLOPs
|
|
||||||
71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
|
|
||||||
```
|
|
||||||
虽然 prefill 仍是 compute-bound,但 **总工作量只有原来的 28%**。
|
|
||||||
在 8 instance 并行 + cache-aware routing 下,每个 instance 的 prefill 负载非常轻,
|
|
||||||
不足以产生对 decode 的显著干扰。
|
|
||||||
|
|
||||||
**2. MoE 模型的 per-token compute 本身较小**
|
|
||||||
Active params 只有 3B(全参数的 10%),单个 token 的计算量不大。
|
|
||||||
对比 Dense 70B 模型,同样的 GPU 上 prefill-decode 干扰会严重得多。
|
|
||||||
|
|
||||||
**3. Cache-aware routing 的 "负载均衡" 效应**
|
|
||||||
当请求被路由到 cache 命中率高的 instance 时,该 instance 的实际 prefill 工作量更小,
|
|
||||||
自然减少了 P-D 争抢。这相当于 routing 层面的 "软 PD 分离"。
|
|
||||||
|
|
||||||
## 对比不同 workload 类型的 roofline 特征
|
|
||||||
|
|
||||||
```
|
|
||||||
Prefill AI Decode AI PD-Sep 价值
|
|
||||||
Dense 70B, Chatbot: 200-1000x 1-2x HIGH (compute-heavy P 干扰 D)
|
|
||||||
Dense 70B, Agent: 100-500x 1-2x MEDIUM (cache reduces P load)
|
|
||||||
MoE 30B, Chatbot: 100-500x 1-2x MEDIUM
|
|
||||||
MoE 30B, Agent: 50-200x 1-2x LOW (small active params + cache)
|
|
||||||
← 我们的位置
|
|
||||||
```
|
|
||||||
|
|
||||||
**PD 分离的 ROI 随着 cache hit 率升高和模型 active params 减少而下降。**
|
|
||||||
Agentic MoE 模型恰好在两个方面都不利于 PD 分离。
|
|
||||||
|
|
||||||
## 实际 trace 的 prefill bound 分布
|
|
||||||
|
|
||||||
```
|
|
||||||
With actual trace prefix cache pattern (1000 sampled requests):
|
|
||||||
Compute-bound prefills: 961 (96%)
|
|
||||||
Memory-bound prefills: 37 (3%) ← 近 100% reuse 的 warm 请求
|
|
||||||
(Decode is ALWAYS memory-bound)
|
|
||||||
```
|
|
||||||
|
|
||||||
96% 的 prefill 仍然是 compute-bound,但 **absolute compute 因 cache 大幅降低**。
|
|
||||||
这是一个 "compute-bound but lightweight" 的独特状态 —— bound 类型没变,但强度大幅降低。
|
|
||||||
Reference in New Issue
Block a user