diff --git a/analysis/pd_separation_analysis.md b/analysis/pd_separation_analysis.md
index decaa38..8f7c01b 100644
--- a/analysis/pd_separation_analysis.md
+++ b/analysis/pd_separation_analysis.md
@@ -1,289 +1,236 @@
-# PD 分离在 Agentic Workload 下的系统分析
+# PD Disaggregation for Agentic LLM Workloads: A Systematic Study
-## 1. Trace 特征 (GLM-5.1 Agentic Coder, 2h, 2.1M requests)
+## TL;DR
-```
-Total requests: 2,114,220
-Input tokens: 71.1B (avg 33.6k/req, p50=20k, p90=88k)
-Output tokens: 940M (avg 445/req, p50=80, p90=811)
-I/O ratio: 75.6x (aggregate), 217.8x (per-req median)
-Prefill share: 98% of total tokens
-Sessions: 1.3M (90% single-turn, 9% multi-turn)
-```
+We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:
-**与传统 chatbot workload 的根本区别:**
+**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted — it is a **KV cache memory wall** on decode instances.
-| 特征 | Traditional Chatbot | Agentic Coder (GLM-5.1) |
-|------|-------------------|------------------------|
-| I/O ratio | 1-10x | **75.6x** |
-| Input p50 | 500-2000 tokens | **20,030 tokens** |
-| Output p50 | 200-500 tokens | **80 tokens** |
-| Prefill token share | 50-80% | **98%** |
-| >32k input | <5% | **38%** |
-| Multi-turn | 50-80% | **9%** |
+| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
+|---|---|---|---|---|
+| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
+| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |
-**KV Cache 复用特征:**
+Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
-```
-Unique hash blocks: 20,650,883
-Shared blocks (ref>1): 9,749,379 (47%)
-Highly shared (ref>10): 2,428,160
-Intra-session reuse: 57%
-Top-10 blocks ref count: 64,754 (system prompt blocks)
-Theoretical cache hit: 71% (infinite cache, first 100k requests)
-```
+---
-**Input length 分布与 token 占比:**
+## 1. Workload Characterization
-```
- <1k: 202,396 reqs ( 9%) 89M tokens ( 0%)
- 1-8k: 380,009 reqs (17%) 1.6B tokens ( 2%)
- 8-32k: 720,871 reqs (34%) 12.7B tokens (17%)
- 32-65k: 405,371 reqs (19%) 19.4B tokens (27%)
- 65-131k: 394,014 reqs (18%) 35.7B tokens (50%)
- >131k: 11,559 reqs ( 0%) 1.6B tokens ( 2%)
-```
+**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours
-50% 的 token 计算量来自 65-131k 的长 context 请求。
+| Metric | Value |
+|--------|-------|
+| Requests | 2,114,220 |
+| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
+| Output tokens | 940M (avg 445, p50=80) |
+| I/O ratio | 75.6x aggregate, 217.8x per-request median |
+| Prefill token share | 98% |
+| Sessions | 1.3M (90% single-turn) |
+| >32k input | 38% of requests, 79% of tokens |
-## 2. DistServe 等 PD 分离的核心假设
+**KV cache reuse**:
-DistServe (OSDI'24), Splitwise, TetriInfer 等 PD 分离工作基于以下假设:
+| Metric | Value |
+|--------|-------|
+| Theoretical prefix cache hit (infinite, single inst) | 71% |
+| Shared hash blocks (ref>1) | 47% of unique blocks |
+| Intra-session reuse | 57% |
+| Top blocks ref count | 64,754 (system prompt) |
+| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
+| Actual APC (Round-robin, 8 inst) | 20.8% |
-### 假设 A: Prefill 和 Decode 有不同的计算特征
-- **Prefill**: compute-bound, 高 GPU 利用率, batch 越大越好
-- **Decode**: memory-bandwidth-bound, 低 GPU 利用率, latency-sensitive
+**Request profile after prefix cache**:
-**在 agentic workload 中的验证**: ✅ 成立,但需要细化
+| Bucket | Count | Avg new tokens to prefill |
+|--------|-------|--------------------------|
+| >90% cache hit (warm) | 22% | 1,314 |
+| 50-90% cache hit | 14% | 10,052 |
+| 1-50% cache hit | 8% | 38,909 |
+| 0% cache hit (cold) | 55% | 17,696 |
-Roofline 分析显示(详见 Section 5):
+## 2. Experiment Setup
-```
- Arithmetic Intensity (FLOP/byte)
- Decode: 1.0 - 1.9 (memory-bound, 始终远低于 ridge point)
- Prefill 0% reuse: 23,000-72,000 (strongly compute-bound)
- Prefill 70% reuse: 10,000-42,000 (仍然 compute-bound!)
- Prefill 95% reuse: 1,900-10,800 (仍然 compute-bound!)
- Ridge point (H20): 37
-```
+**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)
-**即使 95% KV cache reuse,prefill 仍然是 compute-bound。** 但绝对计算量大幅减少。
+**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv
-### 假设 B: PD co-location 导致互相干扰
-- Prefill 的大 batch 计算会抢占 GPU 资源,导致 decode 的 TPOT 升高
-- Decode 的持续小计算会占用 GPU 调度槽位,影响 prefill 吞吐
+**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)
-**在 agentic workload 中的验证**: ⚠️ 干扰存在,但 **可被 cache-aware routing 消除**
+**Configurations tested** (all use same cache-aware + token-level LB global scheduler unless noted):
-```
-同一 cache-aware scheduler, TP=1, 8 GPU:
- Combined TP=1 DP=8: TPOT p90 = 0.073s
- PD-Sep TP=1 4P+4D: TPOT p90 = 0.074s
- → 差异 <2%, 不显著
-```
+| Config | Instances | GPU allocation | Scheduler |
+|--------|-----------|----------------|-----------|
+| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
+| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
+| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
+| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
+| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |
-对比 round-robin routing:
-```
- Combined TP=1 DP=8 (RR): TPOT p90 = 0.086s
- Combined TP=1 DP=8 (cache-aware): TPOT p90 = 0.073s → -15%
- → routing 改善 > PD 分离改善
-```
+**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`
-**原因**: cache-aware routing 让 high-cache-hit 的请求集中到特定 instance,
-每个 instance 的实际 prefill 新 token 数大幅减少(71% 被 cache),
-prefill-decode 干扰因 prefill 工作量降低而自然缓解。
+**Trace sampler**: `scripts/sample_trace.py` — random session sampling preserving multi-turn structure + hash_ids
-### 假设 C: KV Cache 传输开销可以忽略
-- DistServe 假设 P→D 的 KV 传输延迟远小于 prefill 计算时间
-- 在 InfiniBand/NVLink 等高带宽互联下成立
+**Global scheduler**: `scripts/cache_aware_proxy.py` — supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, session affinity for multi-turn.
-**在 agentic workload 中的验证**: ❌ 不成立
+## 3. Results
-```
-PD-Sep TTFT p50 = 1.261s vs Combined TTFT p50 = 0.731s (+72%)
-```
-
-原因:
-1. Agentic workload 的 input 极长(p50=20k, p90=88k tokens),KV cache 很大
-2. 单请求 KV cache = 20k tokens × 48 layers × 2(K+V) × 512 bytes ≈ 1GB
-3. 更重要的是 await-prefill 链路的串行延迟:proxy → prefill → KV transfer → decode → first token
-
-### 假设 D: 专用 prefill 节点可以提高 prefill 吞吐
-- Prefill 节点不做 decode,GPU 利用率更高
-- 可以用更大的 batch size
-
-**在 agentic workload 中的验证**: ⚠️ 收益被 cache 稀释
-
-```
-理论 prefix cache hit (infinite cache): 71% of input tokens
-实际 APC (Combined, cache-aware, 8 inst): 44.7%
-```
-
-71% cache hit → 只有 29% 的 input tokens 需要实际 prefill compute。
-Nominal avg input 33.6k → Actual avg new prefill ~9.7k tokens。
-专用 prefill 的 GPU 利用率优势因 prefill 工作量降低而缩小。
-
-## 3. Roofline 分析:Prefill 在高 Cache Reuse 下的计算/访存特性
-
-### 3.1 模型计算结构
-
-```
-Qwen3-Coder-30B-A3B (MoE 128E top-8):
- 48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
- FFN: 6144 intermediate per expert, 8 experts active per token
- Active params per token: ~3B
-
-H20 GPU: 148 TFLOPS (BF16), 4.0 TB/s HBM → Ridge point: 37 FLOP/byte
-```
-
-### 3.2 Decode 永远 memory-bound
-
-```
-SeqLen FLOP Bytes AI (F/B) Bound
-1,000 3.04e+10 3.01e+10 1.0 MEMORY
-16,000 3.63e+10 3.16e+10 1.1 MEMORY
-64,000 5.52e+10 3.63e+10 1.5 MEMORY
-128,000 8.03e+10 4.26e+10 1.9 MEMORY
-```
-
-Decode 的 AI 始终 < 2,远低于 ridge point (37)。每个 decode step 只处理 1 个 token,
-计算量极小,瓶颈在于读取模型权重和全量 KV cache。
-
-### 3.3 Prefill 即使 95% reuse 仍然 compute-bound
-
-```
-SeqLen Reuse% NewTok AI (F/B) Bound vs Decode
-32,000 0% 32,000 23,368 COMPUTE 18,190x
-32,000 50% 16,000 14,899 COMPUTE 11,597x
-32,000 70% 9,600 10,045 COMPUTE 7,819x
-32,000 90% 3,200 3,821 COMPUTE 2,974x
-32,000 95% 1,600 1,980 COMPUTE 1,542x
-
-64,000 0% 64,000 40,758 COMPUTE 26,813x
-64,000 70% 19,200 20,610 COMPUTE 13,559x
-64,000 90% 6,400 8,544 COMPUTE 5,621x
-64,000 95% 3,200 4,549 COMPUTE 2,993x
-```
-
-### 3.4 为什么高 reuse 不改变 compute-bound 性质
-
-KV cache reuse 减少的:
-- K/V projection 计算(只算 new tokens)
-- KV 写入(只写 new tokens)
-
-KV cache reuse **不减少**的:
-- **Q×K^T attention**: 每个 new Q 都要和全部 seq_len 个 KV 做 attention
- ```
- FLOPs = new_tokens × seq_len × head_dim × num_heads × 2 × num_layers
- ```
- At 95% reuse, 32k seq: 1600 × 32000 × 128 × 32 × 2 × 48 ≈ 2×10^13
- 这个二次项在长 context 下主导总计算量
-
-- **MoE FFN**: 每个 new token 激活 8 experts
- ```
- FLOPs = new_tokens × 3 × D × D_ffn × 2 × K_experts × num_layers
- ```
-
-**Prefill 只在接近 100% reuse (< 10 new tokens) 时才变成 memory-bound。**
-
-### 3.5 Prefill 什么时候变 memory-bound
-
-```
-SeqLen=32,000: new_tokens ≈ 5-10 时 → AI ≈ 37 (ridge point)
-SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
-```
-
-在实际 agentic trace 中:
-```
-Compute-bound prefills: 961 (96%)
-Memory-bound prefills: 37 (3%) ← 近 100% reuse 的极端 warm 请求
-```
-
-### 3.6 关键洞察:"Compute-bound but lightweight"
-
-高 cache reuse 下的 prefill 处于一种独特状态:
-
-```
- Prefill bound 类型: Compute-bound (不变)
- Prefill 绝对工作量: 大幅降低 (71% cache → 只算 29% 的 tokens)
- Prefill-Decode 干扰: 因绝对工作量降低而减轻 (不需要物理隔离)
-```
-
-这解释了为什么 PD 分离没有帮助:
-- PD 分离解决的是 "prefill 太重干扰 decode" 的问题
-- 但 cache-aware routing 已经把 prefill 的实际工作量降到足够轻
-- 物理隔离(PD 分离)的收益被 KV 传输开销抵消
-
-## 4. 实验结果
-
-### 4.1 完整实验矩阵
-
-所有实验使用统一的 cache-aware + token-level load-balanced global scheduler。
+### 3.1 Main Comparison (unified cache-aware scheduler)
 | Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
 |--------|------|----------|----------|---------|-----|
-| TP=8 DP=1 (single instance) | 998/1000 | 0.467s | 0.129s | 3.30s | 53.0% |
-| TP=2 DP=4 (4 inst, RR) | 997/999 | 0.844s | 0.095s | 4.92s | 33.5% |
-| TP=1 DP=8 (8 inst, RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
-| **TP=1 DP=8 (cache-aware)** | **997/999** | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
-| TP=1 PD-Sep 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
+| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
+| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
+| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
-### 4.2 Cache-Aware Routing 的效果
+### 3.2 GPU Utilization (200 req, time_scale=20)
+
+| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
+|--------|-------------|-------------|------------|-----------------|
+| Combined 8colo | **30.5%** (active 64%) | — | — | Distributed |
+| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
+| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
+
+### 3.3 Per-Request Breakdown (6P+2D, await mode)
+
+| Stage | p50 | % of TTFT |
+|-------|-----|-----------|
+| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
+| Proxy overhead | 0.000s | 0.0% |
+| **KV pull + decode wait** | **109.6s** | **87.7%** |
+| Total TTFT | 110.2s | 100% |
+
+Root cause of 109.6s `kv+decode`: vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. GPU idle, requests queued for KV cache memory.
+
+### 3.4 Ablations
+
+| Ablation | Change | TTFT | TPOT p90 | Verdict |
+|----------|--------|------|----------|---------|
+| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
+| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |
+
+## 4. Analysis
+
+### 4.1 DistServe's Assumptions vs Agentic Reality
+
+| Assumption | Chatbot (DistServe) | Agentic (this work) |
+|------------|-------------------|---------------------|
+| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI is >1000x decode AI (<2) |
+| B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
+| C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k tokens, TTFT +72% from transfer |
+| D. Dedicated prefill improves throughput | ✅ | ❌ 71% cache hit → prefill already lightweight |
+| **E. Decode KV cache not a bottleneck** | **✅ (short context)** | **❌ THE bottleneck: 97% KV cache on decode** |
+
+### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
 ```
-Round-robin → Cache-aware (Combined TP=1 DP=8):
- TTFT p50: 1.836s → 0.731s (-60%)
- TPOT p90: 0.086s → 0.073s (-15%)
- E2E p50: 6.673s → 4.480s (-33%)
- APC: 20.8% → 44.7% (+24pp)
+SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)
+
+Reuse% NewTokens AI (FLOP/byte) Bound vs Decode
+0% 64,000 40,758 COMPUTE 26,813x
+70% 19,200 20,610 COMPUTE 13,559x
+90% 6,400 8,544 COMPUTE 5,621x
+95% 3,200 4,549 COMPUTE 2,993x
+Decode 1 1.5 MEMORY 1x
 ```
-Cache-aware routing 的提升远大于 PD 分离的提升。
+Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with `new_tokens × seq_len` (every new query attends to the full context, so the cost grows with total sequence length, not just with new tokens).
-### 4.3 修复工程问题的过程
+But **absolute FLOPs** drop: at the theoretical 71% cache hit rate, only 29% of input tokens need prefill compute. This makes P-D interference negligible without physical separation.
-实验过程中发现并修复了多个 PD 分离的工程问题:
+### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall
-| 问题 | 根因 | 修复 |
-|------|------|------|
-| Decode engine crash | vLLM scheduler assert: KV transfer 回调时 request 已 abort | Patch scheduler.py: assert → graceful skip |
-| Head-of-line blocking | Proxy 按 request count 做 LB,不区分大小请求 | Token-level ongoing_tokens load balancing |
-| "Timeout waiting for P side ready" | Proxy fire-and-forget prefill, decode 盲等 KV | Await-prefill + kv_load_failure_policy=recompute |
-| Port collision on startup | 8 Mooncake instances 同时启动争抢 torch distributed port | Staggered startup + explicit MASTER_PORT |
-| Cache routing "rich get richer" | score = ongoing - alpha*cached 导致流量集中到一个 instance | Normalized scoring: ongoing/avg_load - alpha*cache_ratio |
+PD separation concentrates all decode onto fewer GPUs:
-## 5. 结论
+| | Combined (8 inst) | PD-Sep 6P+2D |
+|---|---|---|
+| Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
+| Concurrent decode reqs | ~1 per inst | ~4 per inst |
+| KV cache utilization | Low | **97.1%** |
-### 5.1 PD 分离为什么在 Agentic Workload 不生效
+At 97.1% KV cache usage, a 49-token request (KV ≈ a few MB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache.
-1. **Cache reuse 大幅降低 prefill 绝对工作量(71% cache hit → 只算 29%)**,使得 P-D 干扰不显著
-2. **Prefill 仍然 compute-bound**(即使 95% reuse,AI 仍 >1000),但每个请求的总 FLOPs 因 new_tokens 减少而大幅降低
-3. **Cache-aware routing 提供 "软 PD 隔离"**,效果等同于物理隔离但无 KV 传输开销
-4. **KV 传输开销不可忽略**(TTFT +72%),抵消了隔离收益
-5. **MoE 模型 active params 小**(3B),per-token compute 本身较轻
+This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because KV cache is full.
-### 5.2 PD 分离在什么条件下有价值
+### 4.4 Why Cache-Aware Routing Matters More Than PD Separation
-| 条件 | Chatbot (有价值) | Agentic (无价值) |
-|------|-----------------|-----------------|
-| Cache hit rate | <10% | **71%** |
-| Model active params | 70B (dense) | **3B (MoE)** |
-| I/O ratio | 1-10x | **75.6x** |
-| Per-request prefill FLOPs | Very high | **Low (after cache)** |
-| KV transfer cost vs prefill cost | Negligible | **Significant** |
+| Change | TTFT impact | TPOT p90 impact | APC impact |
+|--------|-------------|-----------------|------------|
+| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
+| Combined → PD-Sep | +72% | +1% | -5pp |
-### 5.3 Agentic Workload 应该怎么优化
+Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
-1. **Cache-aware routing** (已验证有效): 用 ongoing_tokens + prefix_cache_hit 做联合调度,
- 将 APC 从 20.8% (RR) 提升到 44.7%,TPOT p90 降低 15%
+## 5. Conclusions
-2. **Cross-instance KV cache sharing**: 让多个 instance 共享全局 KV pool,
- 进一步提升 cache hit 率接近理论 71%
+1. **Single-machine PD separation is net negative for agentic workloads** due to decode KV cache memory wall
+2. **Cache-aware routing is the dominant optimization** — improves TTFT by 60%, TPOT by 15%, APC by 24pp
+3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
+4. **PD separation may help in multi-machine settings** where decode has dedicated memory pools (e.g., DRAM-backed Mooncake KV store) not limited by single-GPU HBM
-3. **Prefix pre-warming**: 对 cold start 请求(55%,0% cache hit),
- 预计算 common prefix (system prompt blocks) 并分发到所有 instance
+## 6. Patches Applied to vLLM 0.18.1
-4. **不同 workload 类型的差异化处理**:
- - Warm 请求 (22%, >90% cache hit, avg 1.3k new tokens): 几乎免费,任何 instance 都能处理
- - Cold 请求 (55%, 0% cache hit, avg 17.7k new tokens): prefill-heavy,需要有足够 compute
- - 可以用 request-type-aware routing 进一步优化
+| File | Change | Reason |
+|------|--------|--------|
+| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
+
+---
+
+## Appendix: Experiment Artifacts
+
+### Data on dash0 (`~/agentic-kv/outputs/`)
+
+| Directory | Config | Requests | Notes |
+|-----------|--------|----------|-------|
+| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
+| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
+| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
+| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
+| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
+| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
+| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
+| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
+| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
+
+### Trace on dash0
+
+| Path | Description |
+|------|-------------|
+| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
+| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
+| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
+
+### Key Scripts
+
+| Script | Purpose |
+|--------|---------|
+| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
+| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
+| `replayer/` | Async trace replayer with streaming metrics |
+| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
+| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
+| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
+| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
+
+### Reproducing
+
+```bash
+# On dash0, activate env
+cd ~/agentic-kv && source .venv/bin/activate
+
+# Sample trace
+python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
+    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
+
+# Combined TP=1 DP=8 + cache-aware scheduler
+for i in $(seq 0 7); do
+  MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
+    --port $((8000+i)) --tp 1 --enable-prefix-caching --enforce-eager &
+done
+python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
+python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
+    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
+
+# Breakdown data
+curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
+```
diff --git a/analysis/roofline_analysis.md b/analysis/roofline_analysis.md
deleted file mode 100644
index bd8c065..0000000
--- a/analysis/roofline_analysis.md
+++ /dev/null
@@ -1,130 +0,0 @@
-# Prefill 在高 KV Cache Reuse 下的计算/访存分析
-
-## Model & GPU
-
-```
-Qwen3-Coder-30B-A3B (MoE 128E top-8)
- 48 layers, hidden=2048, heads=32, kv_heads=4 (GQA), head_dim=128
- FFN: 6144 intermediate per expert, 8 experts active per token
- Active params: ~3B per token
-
-H20 GPU: 148 TFLOPS (BF16) / 4.0 TB/s HBM
- Ridge point: 37 FLOP/byte
-```
-
-## 核心发现:Prefill 即使 95% reuse 仍然是 compute-bound
-
-```
- SeqLen Reuse% NewTok AI (F/B) Bound vs Decode AI
- 32,000 0% 32,000 23368 COMPUTE 18189x
- 32,000 70% 9,600 10045 COMPUTE 7819x
- 32,000 90% 3,200 3821 COMPUTE 2974x
- 32,000 95% 1,600 1980 COMPUTE 1541x
-
- 64,000 0% 64,000 40758 COMPUTE 26813x
- 64,000 70% 19,200 20610 COMPUTE 13559x
- 64,000 90% 6,400 8544 COMPUTE 5621x
- 64,000 95% 3,200 4549 COMPUTE 2993x
-
- Decode (always):
- 32,000 - 1 1.3 MEMORY 1x
- 64,000 - 1 1.5 MEMORY 1x
-```
-
-**关键**:
-- Decode 的 arithmetic intensity (AI) = 1.0-1.9 — 远低于 ridge point (37),始终 memory-bound
-- Prefill 即使 95% reuse (只有 5% 新 token),AI 仍然 >1000 — 远高于 ridge point,依然 compute-bound
-
-## 为什么高 reuse 的 prefill 仍然是 compute-bound?
-
-### 原因:Attention 的计算量与 seq_len 成正比
-
-当有 95% cache reuse (seq_len=64k, new_tokens=3200):
-```
- Q projection: new_tokens × D × D → 只处理 3200 new tokens ✓
- K,V projection: new_tokens × D × D_kv → 只处理 3200 new tokens ✓
-
- 但 Attention score: new_tokens × seq_len × D_head × H × L
- = 3200 × 64000 × 128 × 32 × 48
- → 仍然要对全部 64k context 做注意力计算!
-
- FFN (MoE): new_tokens × 3 × D × D_ffn × 2 × K_experts × L
- = 3200 × 3 × 2048 × 6144 × 2 × 8 × 48
- → 8 个 expert 的计算量仍然很大
-```
-
-KV cache reuse 减少的是:
-- K/V projection 的计算(只算 new tokens)
-- KV 写入(只写 new tokens)
-
-但 **不减少的是**:
-- Q 对全部 context 的 attention(每个 new Q 都要和所有 64k tokens 做 attention)
-- MoE FFN 的计算(每个 new token 激活 8 个 expert)
-
-所以 prefill 的 FLOPs 虽然随 reuse 减少,但 **减少的是线性部分(投影),不减少的是二次部分(attention)**。
-在长 context 下,二次部分主导,使得即使 95% reuse,AI 仍远高于 ridge point。
-
-## Prefill 什么时候才变成 memory-bound?
-
-```
- SeqLen=32,000: new_tokens ≈ 5-10 时 (reuse > 99.97%) → AI ≈ 37
- SeqLen=64,000: new_tokens ≈ 5-10 时 → AI ≈ 37
-```
-
-只有在 **近乎 100% reuse**(仅 5-10 个 new tokens)时,prefill 才接近 memory-bound。
-在实际 agentic trace 中,只有 3% 的请求达到这个程度。
-
-## 对 PD 分离的影响:修正之前的分析
-
-### 之前的错误结论(已修正)
-> "Prefill 大部分是 cache lookup 不是 compute"
-
-这是 **错误的**。即使 70% cache reuse,prefill 的 AI 仍然是 decode 的 7000-14000 倍。
-Prefill 始终是 compute-bound,decode 始终是 memory-bound。
-
-### 那为什么 PD 分离在我们的实验中没有帮助?
-
-正确的解释不是 "prefill 变成了 memory-bound",而是:
-
-**1. Cache reuse 大幅减少了 prefill 的绝对计算量**
-```
- 无 cache: avg 33.6k tokens × prefill compute = X FLOPs
- 71% cache: avg 9.4k tokens × prefill compute = 0.28X FLOPs
-```
-虽然 prefill 仍是 compute-bound,但 **总工作量只有原来的 28%**。
-在 8 instance 并行 + cache-aware routing 下,每个 instance 的 prefill 负载非常轻,
-不足以产生对 decode 的显著干扰。
-
-**2. MoE 模型的 per-token compute 本身较小**
-Active params 只有 3B(全参数的 10%),单个 token 的计算量不大。
-对比 Dense 70B 模型,同样的 GPU 上 prefill-decode 干扰会严重得多。
-
-**3. Cache-aware routing 的 "负载均衡" 效应**
-当请求被路由到 cache 命中率高的 instance 时,该 instance 的实际 prefill 工作量更小,
-自然减少了 P-D 争抢。这相当于 routing 层面的 "软 PD 分离"。
-
-## 对比不同 workload 类型的 roofline 特征
-
-```
- Prefill AI Decode AI PD-Sep 价值
- Dense 70B, Chatbot: 200-1000x 1-2x HIGH (compute-heavy P 干扰 D)
- Dense 70B, Agent: 100-500x 1-2x MEDIUM (cache reduces P load)
- MoE 30B, Chatbot: 100-500x 1-2x MEDIUM
- MoE 30B, Agent: 50-200x 1-2x LOW (small active params + cache)
- ← 我们的位置
-```
-
-**PD 分离的 ROI 随着 cache hit 率升高和模型 active params 减少而下降。**
-Agentic MoE 模型恰好在两个方面都不利于 PD 分离。
-
-## 实际 trace 的 prefill bound 分布
-
-```
- With actual trace prefix cache pattern (1000 sampled requests):
- Compute-bound prefills: 961 (96%)
- Memory-bound prefills: 37 (3%) ← 近 100% reuse 的 warm 请求
- (Decode is ALWAYS memory-bound)
-```
-
-96% 的 prefill 仍然是 compute-bound,但 **absolute compute 因 cache 大幅降低**。
-这是一个 "compute-bound but lightweight" 的独特状态 —— bound 类型没变,但强度大幅降低。
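
Reviewer note on the routing score: the new `pd_separation_analysis.md` describes the global scheduler in `scripts/cache_aware_proxy.py` as scoring each instance with `ongoing_tokens/avg_load - α·cache_hit_ratio`. The sketch below illustrates that formula only and is not the proxy's actual code. The `Instance` class, the `pick_instance` function, the default `alpha=1.0`, and the rule that the lowest score wins are all assumptions made for illustration. Session affinity for multi-turn requests is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical view of one serving instance as seen by the proxy."""
    name: str
    ongoing_tokens: int     # tokens of requests currently in flight on this instance
    cache_hit_ratio: float  # estimated share of the request's prefix cached here (0..1)


def pick_instance(instances: List[Instance], alpha: float = 1.0) -> Instance:
    """Pick the instance with the lowest normalized score.

    score = ongoing_tokens / avg_load - alpha * cache_hit_ratio
    Assumption: a lower score is better (less load, more cache reuse).
    """
    avg_load = sum(i.ongoing_tokens for i in instances) / len(instances)
    avg_load = max(avg_load, 1.0)  # guard against division by zero when all are idle

    def score(inst: Instance) -> float:
        return inst.ongoing_tokens / avg_load - alpha * inst.cache_hit_ratio

    return min(instances, key=score)
```

Dividing by the average load keeps the load term near 1 regardless of absolute traffic, so `alpha` trades off load against cache reuse on a comparable scale. With instances `a` (10,000 ongoing tokens, 0.9 hit ratio), `b` (2,000, 0.0) and `c` (6,000, 0.5), `alpha=1.0` selects the lightly loaded `b`, while `alpha=2.0` selects the cache-warm `a`.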