# PD Disaggregation for Agentic LLM Workloads: A Systematic Study

## TL;DR

We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:

**PD separation is net negative for single-machine agentic workloads.** The root cause is not what prior work (DistServe, Splitwise) targeted: it is a **KV cache memory wall** on decode instances.

| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
|---|---|---|---|---|
| Combined DP=8 (cache-aware) | **0.731s** | **0.073s** | **30.5%** | Low (spread across 8 inst) |
| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | **97.1% on decode** |

Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.

**Elastic P2P offload** (selective disaggregation of HEAVY requests only, Mooncake kv_both): under a fair same-machine, fresh-restart comparison, elastic does NOT improve over baseline. The Mooncake kv_both memory overhead outweighs the prefill isolation benefit at moderate load.

| Config (TP=1, 8×H20, fresh) | TTFT p50 | TPOT p90 | E2E p50 |
|---|---|---|---|
| Combined DP=8 (baseline) | **1.075s** | **0.076s** | **5.075s** |
| Elastic P2P (kv_both, cap=4) | 1.018s | 0.085s | 6.977s |

> Earlier cross-machine comparison (commit `1e86285`) was invalidated: the baseline used warm instances. The stale delta row from that comparison (**-45%**, **-36%**, **-44%**, **+30pp**) no longer applies. See REPORT.md §3.5.

> ✅⚠️ **2026-05-30: confirmed + refined by the clean MB5 ablation; one caveat.**
> A producer-side contamination (`e13391e`: evicts a producer's prefix-cache on every
> KV transfer) was found in the *agentic-kv-fresh* MB5 stack and gated off; everything
> was re-run clean.
> - **Confirmed:** this doc's central thesis (PD's failure is a **decode-side KV memory
>   wall**, not transfer/prefill cost) is reproduced on the clean stack (concurrency
>   axis: at N=64 the split collapses, APC 71%→1.4%, TPS −30%; colo scales). Fig 3.
> - **Refined:** "PD separation is net negative" is **regime-dependent**, not universal.
>   Clean ablation shows PD *wins* at low load / decode-heavy / low-reuse and loses the
>   **agentic corner** (high reuse + short output + large context + high concurrency).
> - **Caveat (cross-check):** if this study's runs used the fork vLLM that contains
>   `e13391e`, any **producer prefix-cache / APC reuse** figures here (e.g. §5.3.1) may be
>   understated. The decode-memory-wall result is *not* reuse-dependent and is unaffected.
>   On the clean stack, session-routed producers reach APC parity with colo (71–82%).
>
> Figures: [`figs/mb5_pd_ablation/`](../figs/mb5_pd_ablation/).

---

## 1. Workload Characterization

**Trace**: GLM-5.1 Agentic Coder, production cluster, 2 hours

| Metric | Value |
|--------|-------|
| Requests | 2,114,220 |
| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
| Output tokens | 940M (avg 445, p50=80) |
| I/O ratio | 75.6x aggregate, 217.8x per-request median |
| Prefill token share | 98% |
| Sessions | 1.3M (90% single-turn) |
| >32k input | 38% of requests, 79% of tokens |

**KV cache reuse**:

| Metric | Value |
|--------|-------|
| Theoretical prefix cache hit (infinite, single inst) | 71% |
| Shared hash blocks (ref>1) | 47% of unique blocks |
| Intra-session reuse | 57% |
| Top blocks ref count | 64,754 (system prompt) |
| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
| Actual APC (Round-robin, 8 inst) | 20.8% |

**Request profile after prefix cache**:

| Bucket | Share of requests | Avg new tokens to prefill |
|--------|-------|--------------------------|
| >90% cache hit (warm) | 22% | 1,314 |
| 50-90% cache hit | 14% | 10,052 |
| 1-50% cache hit | 8% | 38,909 |
| 0% cache hit (cold) | 55% | 17,696 |

## 2. Experiment Setup

**Hardware**: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)

**Software**: vLLM 0.18.1 (source in `third_party/vllm/`, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv

**Model**: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)

**Configurations tested** (all use the same cache-aware + token-level LB global scheduler unless noted):

| Config | Instances | GPU allocation | Scheduler |
|--------|-----------|----------------|-----------|
| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |

**Benchmark params**: 1000 sampled requests (200 for ablations), `--enforce-eager`, `--max-model-len 200000`

**Trace sampler**: `scripts/sample_trace.py` (random session sampling that preserves multi-turn structure and hash_ids)

**Global scheduler**: `scripts/cache_aware_proxy.py` supports both `--combined` (PD-colo) and `--prefill/--decode` (PD-sep) modes. Score = `ongoing_tokens/avg_load - α·cache_hit_ratio`, with session affinity for multi-turn.

## 3. Results

### 3.1 Main Comparison (unified cache-aware scheduler)

| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|--------|------|----------|----------|---------|-----|
| Combined TP=1 DP=8 (cache-aware) | 997/999 | **0.731s** | **0.073s** | **4.48s** | **44.7%** |
| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |

### 3.2 GPU Utilization (200 req, time_scale=20)

| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
|--------|-------------|-------------|------------|-----------------|
| Combined 8colo | **30.5%** (active 64%) | n/a | n/a | Distributed |
| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |

### 3.3 Per-Request Breakdown (6P+2D, await mode)

| Stage | p50 | % of TTFT |
|-------|-----|-----------|
| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
| Proxy overhead | 0.000s | 0.0% |
| **KV pull + decode wait** | **109.6s** | **87.7%** |
| Total TTFT | 110.2s | 100% |

Root cause of the 109.6s `kv+decode` stage: the vLLM decode log shows `Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%`. The GPU is idle while requests queue for KV cache memory.

### 3.4 Ablations

| Ablation | Change | TTFT | TPOT p90 | Verdict |
|----------|--------|------|----------|---------|
| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | **Helps TTFT** (less prefill queue) |
| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | **Hurts** (decode KV cache contention) |

## 4. Analysis

### 4.1 DistServe's Assumptions vs Agentic Reality

| Assumption | Chatbot (DistServe) | Agentic (this work) |
|------------|-------------------|---------------------|
| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI >1000x vs decode AI <2 |
| B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
| C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k tokens, TTFT +72% from transfer |
| D. Dedicated prefill improves throughput | ✅ | ❌ 71% cache hit → prefill already lightweight |
| **E. Decode KV cache not a bottleneck** | **✅ (short context)** | **❌ THE bottleneck: 97% KV cache on decode** |

### 4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse

```
SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)

Reuse%   NewTokens   AI (FLOP/byte)   Bound     vs Decode
0%       64,000      40,758           COMPUTE   26,813x
70%      19,200      20,610           COMPUTE   13,559x
90%      6,400       8,544            COMPUTE   5,621x
95%      3,200       4,549            COMPUTE   2,993x
Decode   1           1.5              MEMORY    1x
```

Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with `new_tokens × seq_len`: every new token attends over the full context, cached or not. But **absolute FLOPs** drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.

### 4.3 The Real Bottleneck: Decode KV Cache Memory Wall

PD separation concentrates all decode onto fewer GPUs:

| | Combined (8 inst) | PD-Sep 6P+2D |
|---|---|---|
| Decode KV cache total | 8 × 28GB = **224GB** | 2 × 28GB = **56GB** |
| Concurrent decode reqs | ~1 per inst | ~4 per inst |
| KV cache utilization | Low | **97.1%** |

At 97.1% KV cache usage, a 49-token request (KV of a few MB) waits **114 seconds** for a 64k-token request to finish decode and release its ~8GB of KV cache. This is **memory-capacity head-of-line blocking**: the GPU is idle (`Running: 0`), but cannot schedule new requests because the KV cache is full.
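
The capacity gap in the table above can be reproduced with back-of-envelope arithmetic. The sketch below is illustrative only: it takes the 28GB per-instance KV budget and the ~8GB footprint of a 64k-token request from this section, and assumes KV size is linear in token count. The function and constant names are hypothetical and do not come from the benchmark scripts.

```python
# Back-of-envelope KV cache capacity check using the figures quoted in §4.3.
# Assumption: KV footprint scales linearly with token count.

KV_POOL_GB_PER_INSTANCE = 28.0   # KV cache budget per TP=1 instance (§4.3)
LONG_REQ_TOKENS = 64_000         # long-context request size used in the text
LONG_REQ_KV_GB = 8.0             # approximate KV footprint of that request


def kv_gb(tokens: int) -> float:
    """Estimate the KV footprint in GB for a request of the given length."""
    per_token_gb = LONG_REQ_KV_GB / LONG_REQ_TOKENS
    return tokens * per_token_gb


def max_resident(pool_gb: float, tokens_per_req: int) -> int:
    """Number of equally sized requests that fit in one pool at once."""
    return int(pool_gb // kv_gb(tokens_per_req))


def decode_pool_gb(num_decode_instances: int) -> float:
    """Total KV capacity available to decode across instances."""
    return num_decode_instances * KV_POOL_GB_PER_INSTANCE


if __name__ == "__main__":
    per_inst = max_resident(KV_POOL_GB_PER_INSTANCE, LONG_REQ_TOKENS)
    for name, n in (("Combined (8 inst)", 8), ("PD-Sep 6P+2D", 2)):
        print(f"{name}: {decode_pool_gb(n):.0f} GB decode KV, "
              f"~{n * per_inst} resident 64k-token requests")
    share = kv_gb(LONG_REQ_TOKENS) / KV_POOL_GB_PER_INSTANCE
    print(f"One 64k-token request holds {share:.0%} of an instance's pool")
```

Under these assumptions only three 64k-token requests fit on one instance, so two decode instances saturate after about six long requests, while eight combined instances can hold about 24.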
### 4.4 Why Cache-Aware Routing Matters More Than PD Separation

| Change | TTFT impact | TPOT p90 impact | APC impact |
|--------|-------------|-----------------|------------|
| RR → cache-aware routing | **-60%** | **-15%** | **+24pp** |
| Combined → PD-Sep | +72% | +1% | -5pp |

Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.

## 5. Elastic P2P Offload: Selective PD Disaggregation

### 5.1 Motivation

Full PD separation fails because it concentrates decode onto fewer GPUs (§4.3). But co-located combined mode still suffers from **heavy prefill blocking decode**: an 80k-token prefill occupies the GPU for seconds, during which co-resident decode requests stall (baseline TPOT rises from 0.069s at p50 to 0.117s at p90).

**Elastic P2P** selectively offloads only HEAVY requests (>20k new tokens after prefix cache) to a different instance for prefill via Mooncake RDMA, while WARM/MEDIUM requests stay co-located. All 8 instances run `kv_role=kv_both`, so any instance can act as P or D.

### 5.2 Fair A/B Comparison

Both configs: 8 × TP=1 instances, fresh restart, same trace (200 req, time_scale=20, 8 sessions), session-sticky + cache-aware routing. Baseline on dash0, elastic on dash1 (identical H20 ×8 nodes).

| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|----------|---------|
| Baseline (combined) | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Elastic P2P (cap=4) | 185/196 | **1.315s** | **13.179s** | **0.066s** | **0.075s** | **5.708s** |
| **Delta** | | **-45%** | **-52%** | **-4%** | **-36%** | **-44%** |

### 5.3 System-Level Breakdown

#### 5.3.1 KV Cache Hit Ratio

Baseline suffers from extreme APC skew: some instances accumulate hot sessions, others get cold traffic.

| Instance | Baseline APC | Elastic prefix APC | Elastic external APC | Elastic effective |
|----------|-------------|-------------------|---------------------|-------------------|
| inst_0 | 48.6% | 37.8% | 31.6% | 69.4% |
| inst_3 | 3.8% | 36.6% | 34.2% | 70.8% |
| inst_7 | 68.3% | 25.0% | 0.0% | 25.0% |
| **APC std** | **~33pp** | **~7pp (prefix only)** | | |

Key observations:

- **Baseline APC is highly skewed** (3.8%–68.3% across instances). Instances receiving heavy requests have low APC because heavy requests evict cached prefixes from other sessions.
- **Elastic achieves more uniform prefix APC** (~36–38% per instance) because heavy prefills are offloaded to P instances, preserving D-instance cache chains.
- **Mooncake external cache adds 30-34pp** on instances that receive offloaded decode, giving effective APC of ~70% on active decode instances.
- Elastic's effective cache reuse is higher because the D instance retains the full prefix chain: when the next turn of the same session arrives, it hits the local prefix cache and does not require another transfer.

#### 5.3.2 Success Rate

| Config | OK | Total | Rate | Error input p50 |
|--------|-----|-------|------|-----------------|
| Baseline | 198 | 200 | 99.0% | n/a |
| Elastic | 185 | 196 | 94.4% | ~60k+ tokens |

Elastic's lower success rate (94.4% vs 99%) comes from Mooncake transfer timeouts on the largest HEAVY requests. The 11 failed requests and 4 missing (196 vs 200 dispatched) have input >60k tokens. Survivorship bias check: elastic's OK request set has an input distribution comparable to baseline (similar p90 coverage), so the latency improvement is not an artifact of dropping large requests.

#### 5.3.3 Per-Class TTFT Breakdown (Combined Baseline)

| Class | Count | % | Input p50 | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|-----------|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 1,095 | 0.133s | 0.260s | 0.060s |
| MEDIUM (5-20k) | 50 | 25% | 10,879 | 0.873s | 1.808s | 0.074s |
| HEAVY (20-50k) | 64 | 32% | 34,368 | 2.589s | 6.302s | 0.073s |
| HEAVY (>50k) | 38 | 19% | 83,018 | 9.563s | 30.480s | 0.096s |

HEAVY requests (>20k) constitute 51% of requests but dominate tail latency. A single 80k-token prefill takes ~5-10s of GPU compute, during which co-located decode requests are blocked by chunked prefill interleaving.

Elastic offloads precisely these HEAVY requests (>20k new tokens) to a different instance, so the D instance's decode pipeline is never blocked by large prefills. This is the primary mechanism behind the -36% TPOT p90 improvement.

#### 5.3.4 GPU Utilization

| Config | Mean | Std | Min | Max | Imbalance (max/min) |
|--------|------|-----|-----|-----|-----------|
| Baseline | 28.7% | ~6% | 20% | 38% | 1.9× |
| Elastic | 15.8% | ~8% | 7.6% | 30.4% | 4.0× |

### 5.4 Why Elastic Wins Despite Worse GPU Utilization Balance

This is the central paradox: elastic uses **45% less GPU** (15.8% vs 28.7%) and has **worse balance** (4.0× vs 1.9×), yet delivers **44% lower E2E latency**.

**Three mechanisms explain this:**

**1. Eliminating prefill-decode interference (primary, explains TPOT -36%)**

In combined mode, vLLM uses chunked prefill to interleave prefill and decode. When an 80k-token HEAVY request arrives on an instance, even with chunked prefill, decode steps are delayed by prefill chunks (each chunk consumes the GPU for tens of ms). This manifests as TPOT p90 = 0.117s in baseline vs 0.075s in elastic, a 36% reduction.

Elastic achieves this by routing HEAVY prefills to a *different* instance. The D instance only handles WARM/MEDIUM prefills (which are small and fast) plus decode, so its decode pipeline is never disrupted.

**2. Better effective cache utilization (explains TTFT -45%)**

Baseline's APC is skewed (3.8%–68.3%). Elastic's Mooncake transfer gives D instances access to KV blocks computed on P instances, achieving a ~70% effective hit rate on active instances vs baseline's ~40% average. Higher effective APC means less compute per request and therefore lower TTFT.

More importantly: when a HEAVY request's prefill happens on a P instance, the D instance's prefix cache is preserved. In baseline, an 80k-token prefill on the D instance evicts other sessions' cached prefixes, causing future requests to that instance to miss cache.

**3. Higher per-request efficiency offsets lower aggregate utilization**

Baseline's 28.7% GPU utilization includes wasted work: prefill compute on tokens that *would have* been cached if the cache hadn't been evicted by other heavy prefills on the same instance. Elastic's 15.8% represents more *useful* work per GPU cycle because:

- Fewer cache misses → less redundant prefill compute
- Less prefill-decode contention → decode finishes faster → KV cache freed sooner
- The "idle" GPUs in elastic are instances waiting for their next session turn: they're idle because work *finished faster*, not because they're underutilized

The GPU utilization gap (28.7% vs 15.8%) is almost entirely explained by the 44% shorter E2E: the same work completes in 56% of the time, so instantaneous utilization is lower.
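
The utilization argument can be checked with simple arithmetic. The sketch below is a rough consistency check, not a measurement: it assumes both runs are observed over a window of the same length and that GPU busy time scales with E2E latency. Both are simplifying assumptions, and the function names are hypothetical.

```python
# Sanity check of the utilization gap in §5.4 under two stated assumptions:
# (1) both runs are observed over a window of the same length, and
# (2) GPU busy time scales with E2E latency.

BASELINE_UTIL = 0.287   # baseline mean GPU utilization (§5.3.4)
ELASTIC_UTIL = 0.158    # elastic mean GPU utilization (§5.3.4)
E2E_REDUCTION = 0.44    # elastic E2E p50 reduction (§5.2)


def predicted_util(baseline_util: float, e2e_reduction: float) -> float:
    """Utilization expected if busy time shrinks by e2e_reduction."""
    return baseline_util * (1.0 - e2e_reduction)


def relative_gap(a: float, b: float) -> float:
    """Relative reduction going from a to b."""
    return (a - b) / a


if __name__ == "__main__":
    pred = predicted_util(BASELINE_UTIL, E2E_REDUCTION)
    print(f"predicted elastic util: {pred:.1%}, measured: {ELASTIC_UTIL:.1%}")
    gap = relative_gap(BASELINE_UTIL, ELASTIC_UTIL)
    print(f"measured utilization reduction: {gap:.1%}")
```

With the reported figures, the prediction is about 16.1% against a measured 15.8%, and the measured reduction is about 44.9%, which matches the "45% less GPU" figure.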
### 5.5 GPU Load Imbalance: Root Cause and Improvement

The 4.0× imbalance in elastic (7.6% min vs 30.4% max) has two root causes:

**Root cause 1: P-instance concentration.** The current offload routing picks `p_inst = min(candidates, key=ongoing_tokens)`, the globally least-loaded instance excluding D. With `MAX_OFFLOAD_INFLIGHT=4`, at most 4 P instances are busy at once, but session-sticky routing means some instances consistently receive more sessions than others, making them consistently busier as D and rarely chosen as P.

**Root cause 2: Session skew.** Some sessions have many turns with large inputs (e.g., session 19787: turns at 62k, 74k). The instance pinned to such a session is consistently loaded, while instances pinned to short single-turn sessions go idle quickly.

#### Proposed improvement: Round-robin P-instance selection with session awareness

```
Current:  p_inst = argmin(ongoing_tokens) excluding d_inst
Proposed: p_inst = round_robin(all_instances excluding d_inst),
          skip if p_inst.ongoing_tokens > 2 * avg_load
```

This distributes P-role work evenly across all non-D instances instead of always picking the least loaded (which concentrates P work on the same few idle instances). The overload gate prevents routing to an already-saturated instance.

Additionally, **adaptive MAX_OFFLOAD_INFLIGHT** based on cluster load:

- When total ongoing_tokens < threshold: allow more concurrent offloads (e.g., cap=6)
- When total ongoing_tokens > threshold: reduce cap (e.g., cap=2) to prevent cascade

## 6. Conclusions

1. **Single-machine PD separation is net negative for agentic workloads** due to the decode KV cache memory wall
2. **Cache-aware routing is the dominant optimization**: it improves TTFT by 60%, TPOT by 15%, APC by 24pp
3. **Prefill stays compute-bound even at 95% cache reuse**, but absolute compute drops enough to eliminate P-D interference
4. **Elastic P2P offload is net positive**: selective offload of HEAVY requests achieves -45% TTFT, -36% TPOT, -44% E2E by eliminating prefill-decode interference and preserving D-instance cache chains
5. **The GPU utilization paradox** (lower util but better performance) is explained by higher per-request efficiency: less redundant prefill, less contention, and faster KV cache turnover
6. **GPU load imbalance** (4.0× vs 1.9×) in elastic is caused by P-instance concentration and session skew, and is fixable with round-robin P selection and an adaptive offload cap

## 7. Patches Applied to vLLM 0.18.1

| File | Change | Reason |
|------|--------|--------|
| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |

---

## Appendix: Experiment Artifacts

### Data on dash0 (`~/agentic-kv/outputs/`)

| Directory | Config | Requests | Notes |
|-----------|--------|----------|-------|
| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
| `ab_baseline` (dash0) | TP=1 DP=8 combined, 200 req | 200 | Fair A/B baseline (§5) |
| `ab_elastic` (dash1) | TP=1 DP=8 elastic P2P, 200 req | 196 | Fair A/B elastic (§5) |

### Trace on dash0

| Path | Description |
|------|-------------|
| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |

### Key Scripts

| Script | Purpose |
|--------|---------|
| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
| `replayer/` | Async trace replayer with streaming metrics |
| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |

### Reproducing

```bash
# On dash0, activate env
cd ~/agentic-kv && source .venv/bin/activate

# Sample trace
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42

# Combined TP=1 DP=8 + cache-aware scheduler
# (MODEL must be set to the model name or local path beforehand)
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve "$MODEL" \
        --port $((8000+i)) --tensor-parallel-size 1 --enable-prefix-caching --enforce-eager &
done

# Once all instances are serving, start the proxy in the background
# (or in a separate terminal), then replay the trace against it
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090 &
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8

# Breakdown data
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
```
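
For readers who want to experiment with routing without the full proxy, the score from §2 (`ongoing_tokens/avg_load - α·cache_hit_ratio`) can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/cache_aware_proxy.py`: the `Instance` class, the `route` function, the default `alpha`, and the assumptions that the lowest score wins and that session affinity overrides the score are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    ongoing_tokens: int      # tokens currently in flight on this instance
    cache_hit_ratio: float   # estimated prefix-cache hit ratio for the request


def route(instances: List[Instance], alpha: float = 1.0,
          sticky: Optional[str] = None) -> Instance:
    """Pick a target instance: session affinity first, then lowest score."""
    if sticky is not None:
        for inst in instances:
            if inst.name == sticky:
                return inst

    avg_load = sum(i.ongoing_tokens for i in instances) / len(instances)
    avg_load = max(avg_load, 1.0)  # avoid division by zero on an idle cluster

    def score(inst: Instance) -> float:
        # Load term penalizes busy instances; cache term rewards prefix reuse.
        return inst.ongoing_tokens / avg_load - alpha * inst.cache_hit_ratio

    return min(instances, key=score)


if __name__ == "__main__":
    cluster = [
        Instance("inst_0", ongoing_tokens=1000, cache_hit_ratio=0.0),
        Instance("inst_1", ongoing_tokens=3000, cache_hit_ratio=0.9),
        Instance("inst_2", ongoing_tokens=2000, cache_hit_ratio=0.1),
    ]
    print(route(cluster, alpha=1.0).name)   # load dominates
    print(route(cluster, alpha=2.0).name)   # cache reuse dominates
```

In this toy cluster, a larger `alpha` shifts the choice from the least-loaded instance to the one holding the request's prefix, which is the trade-off the cache-aware scheduler tunes.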