Consolidate analysis into single report with appendix

Merged roofline_analysis.md into pd_separation_analysis.md.
Restructured as a self-contained research report:

1. TL;DR with key finding (KV cache memory wall)
2. Workload characterization (trace stats + cache reuse)
3. Experiment setup (hardware, software, configs, scripts)
4. Results (main comparison, GPU util, breakdown, ablations)
5. Analysis (DistServe assumptions, roofline, root cause)
6. Conclusions
7. Appendix: all experiment artifacts, data paths, reproducing steps

One document to read, with pointers to data for deeper analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:23:23 +08:00
parent ce616f46d1
commit efa70f05b5
2 changed files with 191 additions and 374 deletions


@@ -1,289 +1,236 @@
# Systematic Analysis of PD Separation under Agentic Workloads
## 1. Trace Characteristics (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
```
Total requests: 2,114,220
Input tokens: 71.1B (avg 33.6k/req, p50=20k, p90=88k)
Output tokens: 940M (avg 445/req, p50=80, p90=811)
I/O ratio: 75.6x (aggregate), 217.8x (per-req median)
Prefill share: 98% of total tokens
Sessions: 1.3M (90% single-turn, 9% multi-turn)
```
**Fundamental differences from a traditional chatbot workload:**

| Characteristic | Traditional Chatbot | Agentic Coder (GLM-5.1) |
|------|-------------------|------------------------|
| I/O ratio | 1-10x | **75.6x** |
| Input p50 | 500-2000 tokens | **20,030 tokens** |
| Output p50 | 200-500 tokens | **80 tokens** |
| Prefill token share | 50-80% | **98%** |
| >32k input | <5% | **38%** |
| Multi-turn | 50-80% | **9%** |

# PD Disaggregation for Agentic LLM Workloads: A Systematic Study
## TL;DR
We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:

**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted; it is a **KV cache memory wall** on decode instances.

| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
|---|---|---|---|---|
| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |

Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.

---

**KV cache reuse characteristics:**
```
Unique hash blocks: 20,650,883
Shared blocks (ref>1): 9,749,379 (47%)
Highly shared (ref>10): 2,428,160
Intra-session reuse: 57%
Top-10 blocks ref count: 64,754 (system prompt blocks)
Theoretical cache hit: 71% (infinite cache, first 100k requests)
```
**Input length distribution and token share:**
```
<1k: 202,396 reqs ( 9%) 89M tokens ( 0%)
1-8k: 380,009 reqs (17%) 1.6B tokens ( 2%)
8-32k: 720,871 reqs (34%) 12.7B tokens (17%)
32-65k: 405,371 reqs (19%) 19.4B tokens (27%)
65-131k: 394,014 reqs (18%) 35.7B tokens (50%)
>131k: 11,559 reqs ( 0%) 1.6B tokens ( 2%)
```
50% of the token compute comes from long-context requests in the 65-131k range.

## 1. Workload Characterization
**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours

| Metric | Value |
|--------|-------|
| Requests | 2,114,220 |
| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
| Output tokens | 940M (avg 445, p50=80) |
| I/O ratio | 75.6x aggregate, 217.8x per-request median |
| Prefill token share | 98% |
| Sessions | 1.3M (90% single-turn) |
| >32k input | 38% of requests, 79% of tokens |
## 2. Core Assumptions of DistServe and Other PD-Separation Work
PD-separation work such as DistServe (OSDI'24), Splitwise, and TetriInfer rests on the following assumptions:

**KV cache reuse**:

| Metric | Value |
|--------|-------|
| Theoretical prefix cache hit (infinite, single inst) | 71% |
| Shared hash blocks (ref>1) | 47% of unique blocks |
| Intra-session reuse | 57% |
| Top blocks ref count | 64,754 (system prompt) |
| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
| Actual APC (Round-robin, 8 inst) | 20.8% |
**Request profile after prefix cache**:

| Bucket | Count | Avg new tokens to prefill |
|--------|-------|--------------------------|
| >90% cache hit (warm) | 22% | 1,314 |
| 50-90% cache hit | 14% | 10,052 |
| 1-50% cache hit | 8% | 38,909 |
| 0% cache hit (cold) | 55% | 17,696 |

## 2. Experiment Setup
**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)

### Assumption A: Prefill and decode have different compute characteristics
- **Prefill**: compute-bound, high GPU utilization, larger batches are better
- **Decode**: memory-bandwidth-bound, low GPU utilization, latency-sensitive

**Validation on the agentic workload**: holds, but needs refinement

Roofline analysis shows (see Section 5 for details):
```
Arithmetic Intensity (FLOP/byte)
Decode: 1.0 - 1.9 (memory-bound, always far below the ridge point)
Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
Prefill 70% reuse: 10,000-42,000 (still compute-bound!)
Prefill 95% reuse: 1,900-10,800 (still compute-bound!)
Ridge point (H20): 37
```
**Even at 95% KV cache reuse, prefill is still compute-bound.** The absolute amount of compute, however, drops sharply.

**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv

**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)

**Configurations tested** (all use same cache-aware + token-level LB global scheduler unless noted):

| Config | Instances | GPU allocation | Scheduler |
|--------|-----------|----------------|-----------|
| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |

**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`

### Assumption B: PD co-location causes mutual interference
- Prefill's large-batch compute preempts GPU resources, raising decode TPOT
- Decode's continuous small computations occupy GPU scheduling slots, reducing prefill throughput

**Validation on the agentic workload**: interference exists, but **it can be eliminated by cache-aware routing**
```
Same cache-aware scheduler, TP=1, 8 GPU:
Combined TP=1 DP=8: TPOT p90 = 0.073s
PD-Sep TP=1 4P+4D: TPOT p90 = 0.074s
→ difference <2%, not significant
```
Compared with round-robin routing:
```
Combined TP=1 DP=8 (RR): TPOT p90 = 0.086s
Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s → -15%
→ routing improvement > PD-separation improvement
```
**Reason**: cache-aware routing concentrates high-cache-hit requests on specific instances,
so the actual number of prefill tokens per instance drops sharply (71% cache),
and prefill-decode interference eases naturally as the prefill workload shrinks.

**Trace sampler**: `scripts/sample_trace.py` (random session sampling preserving multi-turn structure + hash_ids)

**Global scheduler**: `scripts/cache_aware_proxy.py`, which supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, session affinity for multi-turn.

## 3. Results
### 3.1 Main Comparison (unified cache-aware scheduler)

### Assumption C: KV cache transfer overhead is negligible
- DistServe assumes that the KV transfer latency between P and D is far smaller than the prefill compute time
- This holds with high-bandwidth interconnects such as InfiniBand/NVLink

**Validation on the agentic workload**: does not hold
```
PD-Sep TTFT p50 = 1.261s vs Combined TTFT p50 = 0.731s (+72%)
```
Reasons:
1. Agentic workload inputs are very long (p50=20k, p90=88k tokens), so the KV cache is large
2. KV cache per request = 20k tokens × 48 layers × 2 (K+V) × 512 bytes ≈ 1GB
3. More importantly, the await-prefill path adds serial latency: proxy → prefill → KV transfer → decode → first token
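
The per-request KV size estimate above can be reproduced with a short sketch. It assumes the model shape listed in Section 3.1 (48 layers, 4 KV heads, head_dim 128); the function name and the `bytes_per_elem` parameter are illustrative, not part of the repository. The 512-byte figure corresponds to 4 × 128 elements at 1 byte each, and a BF16 cache (2 bytes per element) would double the result.

```python
def kv_cache_bytes(tokens: int, layers: int = 48, kv_heads: int = 4,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache footprint of one request: K and V for every layer and KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return tokens * per_token


GB = 1e9
# 1 byte/element reproduces the 512-bytes-per-layer figure used in the estimate above.
print(f"20k tokens, 1 B/elem: {kv_cache_bytes(20_000, bytes_per_elem=1) / GB:.2f} GB")
# BF16 (2 bytes/element) is an assumption about the cache dtype, not a measured value.
print(f"20k tokens, BF16:     {kv_cache_bytes(20_000) / GB:.2f} GB")
print(f"88k tokens, BF16:     {kv_cache_bytes(88_000) / GB:.2f} GB")
```

Keeping the element size as a parameter makes it easy to check how sensitive the transfer volume is to the KV cache dtype.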
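### Assumption D: Dedicated prefill nodes improve prefill throughput
- Prefill nodes do no decode, so GPU utilization is higher
- Larger batch sizes can be used

**Validation on the agentic workload**: the benefit is diluted by the cache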
```
Theoretical prefix cache hit (infinite cache): 71% of input tokens
Actual APC (Combined, cache-aware, 8 inst): 44.7%
```
71% cache hit → only 29% of input tokens need actual prefill compute.
Nominal avg input 33.6k → actual avg new prefill ~9.7k tokens.
The GPU-utilization advantage of dedicated prefill nodes shrinks as the prefill workload drops.
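
The arithmetic behind the ~9.7k figure is a single multiplication. The sketch below treats the cache hit ratio as token-weighted (as the "71% of input tokens" statement does); the function name is illustrative.

```python
def new_prefill_tokens(avg_input_tokens: float, cache_hit_ratio: float) -> float:
    """Average tokens that still need prefill compute after prefix-cache hits."""
    return avg_input_tokens * (1.0 - cache_hit_ratio)


avg_input = 33_600  # nominal average input length from the trace
for label, hit in [("theoretical, infinite cache", 0.71),
                   ("measured APC, 8 instances", 0.447)]:
    print(f"{label}: ~{new_prefill_tokens(avg_input, hit):,.0f} new tokens/request")
```

With the measured 44.7% APC instead of the theoretical 71%, the remaining prefill work per request is roughly twice as large, which is the gap that cross-instance cache sharing would target.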
## 3. Roofline Analysis: Compute/Memory Characteristics of Prefill under High Cache Reuse
### 3.1 Model Compute Structure
```
Qwen3-Coder-30B-A3B (MoE 128E top-8):
48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
FFN: 6144 intermediate per expert, 8 experts active per token
Active params per token: ~3B
H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
```
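
As a quick check of the ridge point, the sketch below divides peak compute by memory bandwidth and classifies a kernel by its arithmetic intensity. The hardware numbers come from the spec block above; the FLOP and byte counts in the example are the 64,000-token decode row from Section 3.2, and the helper names are illustrative.

```python
PEAK_FLOPS = 148e12     # H20 BF16 peak, FLOP/s
HBM_BANDWIDTH = 4.0e12  # H20 HBM bandwidth, byte/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory time are equal."""
    return peak_flops / bandwidth


def classify(flops: float, bytes_moved: float, ridge: float):
    """Return (arithmetic intensity, 'COMPUTE' or 'MEMORY') for one kernel."""
    ai = flops / bytes_moved
    return ai, ("COMPUTE" if ai >= ridge else "MEMORY")


ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
ai, bound = classify(5.52e10, 3.63e10, ridge)  # one decode step at 64k context
print(f"ridge = {ridge:.0f} FLOP/byte, decode AI = {ai:.1f} -> {bound}")
```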
### 3.2 Decode Is Always Memory-Bound
```
SeqLen FLOP Bytes AI (F/B) Bound
1,000 3.04e+10 3.01e+10 1.0 MEMORY
16,000 3.63e+10 3.16e+10 1.1 MEMORY
64,000 5.52e+10 3.63e+10 1.5 MEMORY
128,000 8.03e+10 4.26e+10 1.9 MEMORY
```
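Decode AI is always < 2, far below the ridge point (37). Each decode step processes only 1 token,
so compute is tiny and the bottleneck is reading the model weights and the full KV cache.
### 3.3 Prefill Stays Compute-Bound Even at 95% Reuse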
```
SeqLen Reuse% NewTok AI (F/B) Bound vs Decode
32,000 0% 32,000 23,368 COMPUTE 18,190x
32,000 50% 16,000 14,899 COMPUTE 11,597x
32,000 70% 9,600 10,045 COMPUTE 7,819x
32,000 90% 3,200 3,821 COMPUTE 2,974x
32,000 95% 1,600 1,980 COMPUTE 1,542x
64,000 0% 64,000 40,758 COMPUTE 26,813x
64,000 70% 19,200 20,610 COMPUTE 13,559x
64,000 90% 6,400 8,544 COMPUTE 5,621x
64,000 95% 3,200 4,549 COMPUTE 2,993x
```
### 3.4 Why High Reuse Does Not Change the Compute-Bound Nature
KV cache reuse reduces:
- K/V projection compute (only for new tokens)
- KV writes (only for new tokens)

KV cache reuse does **not** reduce:
- **Q×K^T attention**: every new Q attends to the KV of the full seq_len
```
FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
```
At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
This quadratic term dominates total compute at long context.
- **MoE FFN**: every new token activates 8 experts
```
FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
```
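
The two FLOP formulas above can be transcribed directly into code. This is a literal transcription under the model shape from Section 3.1, not the repository's `scripts/compute_roofline.py`; function and parameter names are illustrative.

```python
def attention_flops(new_tokens: int, seq_len: int, head_dim: int = 128,
                    num_heads: int = 32, num_layers: int = 48) -> int:
    """Q x K^T cost as written above: every new query attends to the full context."""
    return new_tokens * seq_len * head_dim * num_heads * 2 * num_layers


def moe_ffn_flops(new_tokens: int, hidden: int = 2048, ffn_dim: int = 6144,
                  experts_per_token: int = 8, num_layers: int = 48) -> int:
    """MoE FFN cost as written above: depends on new tokens only, not on context."""
    return new_tokens * 3 * hidden * ffn_dim * 2 * experts_per_token * num_layers


# Worked example from the text: 32k context at 95% reuse leaves 1,600 new tokens.
print(f"attention FLOPs: {attention_flops(1_600, 32_000):.2e}")  # about 2.0e13
```

Only the attention term takes `seq_len` as an input, which is why cached context still costs compute for every new token.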
**Prefill becomes memory-bound only near 100% reuse (< 10 new tokens).**
### 3.5 When Does Prefill Become Memory-Bound?
```
SeqLen=32,000: new_tokens ≈ 5-10 → AI ≈ 37 (ridge point)
SeqLen=64,000: new_tokens ≈ 5-10 → AI ≈ 37
```
In the actual agentic trace:
```
Compute-bound prefills: 961 (96%)
Memory-bound prefills: 37 (3%) ← extreme warm requests with near-100% reuse
```
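### 3.6 Key Insight: "Compute-bound but lightweight"
Prefill under high cache reuse sits in a distinctive regime:
```
Prefill bound type: compute-bound (unchanged)
Prefill absolute workload: sharply reduced (71% cache → only 29% of tokens computed)
Prefill-decode interference: reduced because the absolute workload drops (no physical isolation needed)
```
This explains why PD separation did not help:
- PD separation addresses the problem of "prefill is too heavy and interferes with decode"
- But cache-aware routing has already made the actual prefill workload light enough
- The benefit of physical isolation (PD separation) is cancelled out by the KV transfer overhead
## 4. Experiment Results
### 4.1 Full Experiment Matrix
All experiments use the same cache-aware + token-level load-balanced global scheduler.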
| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|--------|------|----------|----------|---------|-----|
| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
| **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
| TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |

| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|--------|------|----------|----------|---------|-----|
| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |

### 4.2 Effect of Cache-Aware Routing
### 3.2 GPU Utilization (200 req, time_scale=20)
| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
|--------|-------------|-------------|------------|-----------------|
| Combined 8colo | **30.5%** (active 64%) | — | — | Distributed |
| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
### 3.3 Per-Request Breakdown (6P+2D, await mode)
| Stage | p50 | % of TTFT |
|-------|-----|-----------|
| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
| Proxy overhead | 0.000s | 0.0% |
| **KV pull + decode wait** | **109.6s** | **87.7%** |
| Total TTFT | 110.2s | 100% |
Root cause of 109.6s `kv+decode`: vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. GPU idle, requests queued for KV cache memory.
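
The blocking pattern in that log line can be illustrated with a toy model. This is not vLLM's scheduler: it assumes strict FIFO admission, KV capacity counted in tokens, and fixed decode lengths, and every name and number in it is hypothetical.

```python
from collections import deque


def simulate_fifo_admission(requests, kv_capacity_tokens):
    """Toy model: strict FIFO admission gated only by free KV cache capacity.

    requests: list of (name, kv_tokens, decode_steps), all queued at t=0.
    Returns {name: step at which the request was admitted}.
    """
    if any(kv > kv_capacity_tokens for _, kv, _ in requests):
        raise ValueError("a request needs more KV cache than the instance has")
    waiting = deque(requests)
    running = []  # each entry: [name, kv_tokens, remaining_steps]
    used = 0
    admitted = {}
    t = 0
    while waiting or running:
        # Admit from the head of the queue only; a blocked head blocks everyone behind it.
        while waiting and used + waiting[0][1] <= kv_capacity_tokens:
            name, kv, steps = waiting.popleft()
            running.append([name, kv, steps])
            used += kv
            admitted[name] = t
        # Advance one decode step, then release the KV cache of finished requests.
        t += 1
        for req in running:
            req[2] -= 1
        used -= sum(req[1] for req in running if req[2] <= 0)
        running = [req for req in running if req[2] > 0]
    return admitted


# Hypothetical load: 100k-token KV budget, two 64k-token requests, one 49-token request.
result = simulate_fifo_admission(
    [("long-A", 64_000, 500), ("long-B", 64_000, 500), ("short", 49, 5)],
    kv_capacity_tokens=100_000,
)
print(result)
```

In this toy run the 49-token request is admitted only at step 500, once the first long request releases its KV cache, even though it needs almost no memory itself.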
### 3.4 Ablations
| Ablation | Change | TTFT | TPOT p90 | Verdict |
|----------|--------|------|----------|---------|
| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |
## 4. Analysis
### 4.1 DistServe's Assumptions vs Agentic Reality
| Assumption | Chatbot (DistServe) | Agentic (this work) |
|------------|-------------------|---------------------|
| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI >1000x vs decode AI <2 |
| B. PD co-location causes interference | | Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
| C. KV transfer cost negligible | (short input) | Avg 33.6k tokens, TTFT +72% from transfer |
| D. Dedicated prefill improves throughput | | 71% cache hit prefill already lightweight |
| **E. Decode KV cache not a bottleneck** | **(short context)** | **THE bottleneck: 97% KV cache on decode** |
### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
```
SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)

Reuse%   NewTokens   AI (FLOP/byte)   Bound     vs Decode
0%       64,000      40,758           COMPUTE   26,813x
70%      19,200      20,610           COMPUTE   13,559x
90%       6,400       8,544           COMPUTE    5,621x
95%       3,200       4,549           COMPUTE    2,993x
Decode        1         1.5           MEMORY         1x
```
Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with `new_tokens × seq_len` (quadratic in context, not just new tokens).

But **absolute FLOPs** drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.

### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall
PD separation concentrates all decode onto fewer GPUs:

| | Combined (8 inst) | PD-Sep 6P+2D |
|---|---|---|
| Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
| Concurrent decode reqs | ~1 per inst | ~4 per inst |
| KV cache utilization | Low | **97.1%** |

At 97.1% KV cache usage, a 49-token request (KV of a few MB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache.

This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because KV cache is full.

```
Round-robin → Cache-aware (Combined TP=1 DP=8):
TTFT p50: 1.836s → 0.731s (-60%)
TPOT p90: 0.086s → 0.073s (-15%)
E2E p50: 6.673s → 4.480s (-33%)
APC: 20.8% → 44.7% (+24pp)
```
The gain from cache-aware routing is far larger than the gain from PD separation.
### 4.3 Engineering Problems Fixed Along the Way
Several engineering problems with PD separation were found and fixed during the experiments:

| Problem | Root cause | Fix |
|------|------|------|
| Decode engine crash | vLLM scheduler assert: the request has already been aborted when the KV transfer callback fires | Patch scheduler.py: assert → graceful skip |
| Head-of-line blocking | Proxy load-balances by request count and does not distinguish large from small requests | Token-level ongoing_tokens load balancing |
| "Timeout waiting for P side ready" | Proxy fire-and-forget prefill; decode waits blindly for KV | Await-prefill + kv_load_failure_policy=recompute |
| Port collision on startup | 8 Mooncake instances start simultaneously and contend for the torch distributed port | Staggered startup + explicit MASTER_PORT |
| Cache routing "rich get richer" | score = ongoing - alpha*cached funnels traffic to a single instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |

## 5. Conclusions
### 5.1 Why PD Separation Does Not Pay Off for Agentic Workloads
1. **Cache reuse sharply reduces the absolute prefill workload (71% cache hit → only 29% computed)**, so P-D interference is not significant
2. **Prefill is still compute-bound** (AI >1000 even at 95% reuse), but total FLOPs per request drop sharply because new_tokens shrinks
3. **Cache-aware routing provides "soft PD isolation"**, matching physical isolation without the KV transfer overhead
4. **KV transfer overhead is not negligible** (TTFT +72%) and cancels out the isolation benefit
5. **The MoE model has few active params** (3B), so per-token compute is light to begin with
### 5.2 When Is PD Separation Worthwhile?

| Condition | Chatbot (worthwhile) | Agentic (not worthwhile) |
|------|-----------------|-----------------|
| Cache hit rate | <10% | **71%** |
| Model active params | 70B (dense) | **3B (MoE)** |
| I/O ratio | 1-10x | **75.6x** |
| Per-request prefill FLOPs | Very high | **Low (after cache)** |
| KV transfer cost vs prefill cost | Negligible | **Significant** |

### 5.3 How Agentic Workloads Should Be Optimized
1. **Cache-aware routing** (verified effective): schedule jointly on ongoing_tokens + prefix_cache_hit.
   APC rises from 20.8% (RR) to 44.7%; TPOT p90 drops 15%
2. **Cross-instance KV cache sharing**: let multiple instances share a global KV pool
   to push the cache hit rate further toward the theoretical 71%
3. **Prefix pre-warming**: cold-start requests (55%) have 0% cache hit.
   Precompute common prefixes (system prompt blocks) and distribute them to all instances
4. **Differentiated handling per workload type**:
   - Warm requests (22%, >90% cache hit, avg 1.3k new tokens): nearly free; any instance can handle them
   - Cold requests (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy; they need sufficient compute
   - Request-type-aware routing can optimize this further

### 4.4 Why Cache-Aware Routing Matters More Than PD Separation

| Change | TTFT impact | TPOT p90 impact | APC impact |
|--------|-------------|-----------------|------------|
| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
| Combined → PD-Sep | +72% | +1% | -5pp |

Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.

## 5. Conclusions

1. **Single-machine PD separation is net negative for agentic workloads** due to the decode KV cache memory wall
2. **Cache-aware routing is the dominant optimization**: it improves TTFT by 60%, TPOT by 15%, APC by 24pp
3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
4. **PD separation may help in multi-machine settings** where decode has dedicated memory pools (e.g., DRAM-backed Mooncake KV store) not limited by single-GPU HBM

## 6. Patches Applied to vLLM 0.18.1

| File | Change | Reason |
|------|--------|--------|
| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
---
## Appendix: Experiment Artifacts
### Data on dash0 (`~/agentic-kv/outputs/`)
| Directory | Config | Requests | Notes |
|-----------|--------|----------|-------|
| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
### Trace on dash0
| Path | Description |
|------|-------------|
| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
### Key Scripts
| Script | Purpose |
|--------|---------|
| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
| `replayer/` | Async trace replayer with streaming metrics |
| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
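
The scoring rule described for `scripts/cache_aware_proxy.py` (`ongoing_tokens/avg_load - α·cache_hit_ratio`, lowest score wins) can be sketched as follows. This is an illustrative reconstruction from that one formula, not the script itself: the `Instance` fields, the `pick_instance` helper, and the α values are assumptions, and session affinity is omitted.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    ongoing_tokens: int        # tokens of requests currently in flight (hypothetical field)
    cached_prefix_tokens: int  # tokens of the incoming prompt already cached here (hypothetical)


def pick_instance(instances, prompt_tokens: int, alpha: float = 1.0) -> Instance:
    """Return the instance with the lowest normalized-load-minus-cache-hit score."""
    avg_load = sum(i.ongoing_tokens for i in instances) / len(instances)
    avg_load = max(avg_load, 1.0)  # avoid division by zero on an idle cluster

    def score(inst: Instance) -> float:
        hit = inst.cached_prefix_tokens / prompt_tokens if prompt_tokens else 0.0
        return inst.ongoing_tokens / avg_load - alpha * hit

    return min(instances, key=score)


pool = [
    Instance("A", ongoing_tokens=40_000, cached_prefix_tokens=18_000),
    Instance("B", ongoing_tokens=10_000, cached_prefix_tokens=0),
    Instance("C", ongoing_tokens=10_000, cached_prefix_tokens=2_000),
]
print(pick_instance(pool, prompt_tokens=20_000, alpha=1.0).name)
print(pick_instance(pool, prompt_tokens=20_000, alpha=2.0).name)
```

Dividing by the average load keeps the load term on the same scale as the hit ratio, so α alone decides how much extra load a warm cache is worth; in this example the choice moves from the lightly loaded instance C to the busy but warm instance A as α goes from 1 to 2.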
### Reproducing
```bash
# On dash0, activate env
cd ~/agentic-kv && source .venv/bin/activate
# Sample trace
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
# Combined TP=1 DP=8 + cache-aware scheduler
for i in $(seq 0 7); do
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
    --port $((8000+i)) --tensor-parallel-size 1 --enable-prefix-caching --enforce-eager &
done
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
--endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
# Breakdown data
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
```


@@ -1,130 +0,0 @@
# Compute/Memory Analysis of Prefill under High KV Cache Reuse
## Model & GPU
```
Qwen3-Coder-30B-A3B (MoE 128E top-8)
48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
FFN: 6144 intermediate per expert, 8 experts active per token
Active params: ~3B per token
H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
Ridge point: 37 FLOP/byte
```
## Key Finding: Prefill Is Still Compute-Bound Even at 95% Reuse
```
SeqLen Reuse% NewTok AI (F/B) Bound vs Decode AI
32,000 0% 32,000 23368 COMPUTE 18189x
32,000 70% 9,600 10045 COMPUTE 7819x
32,000 90% 3,200 3821 COMPUTE 2974x
32,000 95% 1,600 1980 COMPUTE 1541x
64,000 0% 64,000 40758 COMPUTE 26813x
64,000 70% 19,200 20610 COMPUTE 13559x
64,000 90% 6,400 8544 COMPUTE 5621x
64,000 95% 3,200 4549 COMPUTE 2993x
Decode (always):
32,000 - 1 1.3 MEMORY 1x
64,000 - 1 1.5 MEMORY 1x
```
**Key points:**
- Decode arithmetic intensity (AI) = 1.0-1.9, far below the ridge point (37): always memory-bound
- Even at 95% reuse (only 5% new tokens), prefill AI is still >1000, far above the ridge point: still compute-bound
## Why Prefill Stays Compute-Bound under High Reuse
### Reason: Attention compute is proportional to seq_len
With 95% cache reuse (seq_len=64k, new_tokens=3200):
```
Q projection: new_tokens × D × D → only the 3200 new tokens ✓
K,V projection: new_tokens × D × D_kv → only the 3200 new tokens ✓
But attention score: new_tokens × seq_len × D_head × H × L
= 3200 × 64000 × 128 × 32 × 48
→ still has to attend over the full 64k context!
FFN (MoE): new_tokens × 3 × D × D_ffn × 2 × K_experts × L
= 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
→ compute for 8 experts is still large
```
KV cache reuse reduces:
- K/V projection compute (only for new tokens)
- KV writes (only for new tokens)

**What it does not reduce:**
- Attention of Q over the full context (every new Q attends to all 64k tokens)
- MoE FFN compute (every new token activates 8 experts)

So prefill FLOPs do fall with reuse, but **what shrinks is the linear part (projections); the quadratic part (attention) does not**.
At long context the quadratic part dominates, so AI stays far above the ridge point even at 95% reuse.
## When Does Prefill Become Memory-Bound?
```
SeqLen=32,000: new_tokens ≈ 5-10 (reuse > 99.97%) → AI ≈ 37
SeqLen=64,000: new_tokens ≈ 5-10 → AI ≈ 37
```
Prefill approaches memory-bound only at **near-100% reuse** (just 5-10 new tokens).
In the actual agentic trace, only 3% of requests reach this level.
## Impact on PD Separation: Correcting the Earlier Analysis
### The earlier, incorrect conclusion (now corrected)
> "Prefill is mostly cache lookup, not compute"

This is **wrong**. Even at 70% cache reuse, prefill AI is still 7,000-14,000× that of decode.
Prefill is always compute-bound; decode is always memory-bound.
### So why did PD separation not help in our experiments?
The correct explanation is not that "prefill became memory-bound", but rather:

**1. Cache reuse sharply reduces the absolute prefill compute**
```
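No cache: avg 33.6k tokens × prefill compute = X FLOPs
71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
```
Although prefill is still compute-bound, **the total workload is only 28% of the original**.
With 8 instances in parallel + cache-aware routing, the prefill load on each instance is very light,
not enough to interfere noticeably with decode.

**2. The MoE model's per-token compute is small to begin with**

Only 3B params are active (10% of all parameters), so the compute per token is modest.
Compared with a dense 70B model on the same GPU, prefill-decode interference would be far more severe.

**3. The "load-balancing" effect of cache-aware routing**

When a request is routed to an instance with a high cache hit rate, that instance's actual prefill workload is smaller,
which naturally reduces P-D contention. This amounts to "soft PD separation" at the routing layer.
## Roofline Characteristics across Workload Types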
```
                      Prefill AI   Decode AI   PD-Sep value
Dense 70B, Chatbot:   200-1000x    1-2x        HIGH (compute-heavy P interferes with D)
Dense 70B, Agent:     100-500x     1-2x        MEDIUM (cache reduces P load)
MoE 30B, Chatbot:     100-500x     1-2x        MEDIUM
MoE 30B, Agent:       50-200x      1-2x        LOW (small active params + cache)
                                               ← our position
```
**The ROI of PD separation falls as the cache hit rate rises and as the model's active params shrink.**
Agentic MoE models are unfavorable to PD separation on both counts.
## Prefill Bound Distribution on the Actual Trace
```
With actual trace prefix cache pattern (1000 sampled requests):
Compute-bound prefills: 961 (96%)
Memory-bound prefills: 37 (3%) ← warm requests with near-100% reuse
(Decode is ALWAYS memory-bound)
```
96% of prefills are still compute-bound, but **absolute compute drops sharply thanks to the cache**.
This is a distinctive "compute-bound but lightweight" regime: the bound type is unchanged, but the intensity is far lower.