PD Disaggregation for Agentic LLM Workloads: A Systematic Study
TL;DR
We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:
PD separation is net negative for single-machine agentic workloads. The root cause is not what prior work (DistServe, Splitwise) targeted — it is a KV cache memory wall on decode instances.
| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
|---|---|---|---|---|
| Combined DP=8 (cache-aware) | 0.731s | 0.073s | 30.5% | Low (spread across 8 inst) |
| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | 97.1% on decode |
Per-request breakdown shows 87.7% of TTFT is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.
Elastic P2P offload (selective disaggregation of HEAVY requests only) recovers the wins of PD separation without the memory wall:
| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | E2E p50 | Effective APC |
|---|---|---|---|---|
| Combined DP=8 (baseline) | 2.383s | 0.117s | 10.232s | ~40% (skewed) |
| Elastic P2P (cap=4) | 1.315s | 0.075s | 5.708s | ~70% (balanced) |
| Delta | -45% | -36% | -44% | +30pp |
1. Workload Characterization
Trace: GLM-5.1 Agentic Coder, production cluster, 2 hours
| Metric | Value |
|---|---|
| Requests | 2,114,220 |
| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
| Output tokens | 940M (avg 445, p50=80) |
| I/O ratio | 75.6x aggregate, 217.8x per-request median |
| Prefill token share | 98% |
| Sessions | 1.3M (90% single-turn) |
| >32k input | 38% of requests, 79% of tokens |
KV cache reuse:
| Metric | Value |
|---|---|
| Theoretical prefix cache hit (infinite, single inst) | 71% |
| Shared hash blocks (ref>1) | 47% of unique blocks |
| Intra-session reuse | 57% |
| Top blocks ref count | 64,754 (system prompt) |
| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
| Actual APC (Round-robin, 8 inst) | 20.8% |
Request profile after prefix cache:
| Bucket | Share of requests | Avg new tokens to prefill |
|---|---|---|
| >90% cache hit (warm) | 22% | 1,314 |
| 50-90% cache hit | 14% | 10,052 |
| 1-50% cache hit | 8% | 38,909 |
| 0% cache hit (cold) | 55% | 17,696 |
2. Experiment Setup
Hardware: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)
Software: vLLM 0.18.1 (source in third_party/vllm/, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv
Model: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)
Configurations tested (all use same cache-aware + token-level LB global scheduler unless noted):
| Config | Instances | GPU allocation | Scheduler |
|---|---|---|---|
| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |
Benchmark params: 1000 sampled requests (200 for ablations), --enforce-eager, --max-model-len 200000
Trace sampler: scripts/sample_trace.py — random session sampling preserving multi-turn structure + hash_ids
Global scheduler: scripts/cache_aware_proxy.py — supports both --combined (PD-colo) and --prefill/--decode (PD-sep) modes. Score = ongoing_tokens/avg_load - α·cache_hit_ratio, session affinity for multi-turn.
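The scoring rule above can be illustrated with a minimal sketch. This is not the code in scripts/cache_aware_proxy.py; the `Instance` class, the `pick_instance` helper, the default `alpha=1.0`, and the way session affinity short-circuits scoring are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    ongoing_tokens: int     # tokens currently in flight on this instance
    cache_hit_ratio: float  # estimated prefix-cache hit ratio for the incoming request (0..1)


def score(inst: Instance, avg_load: float, alpha: float) -> float:
    """Lower is better: normalized load minus a cache-affinity bonus."""
    load_term = inst.ongoing_tokens / avg_load if avg_load > 0 else 0.0
    return load_term - alpha * inst.cache_hit_ratio


def pick_instance(instances: List[Instance], alpha: float = 1.0,
                  sticky: Optional[Instance] = None) -> Instance:
    """Honor session affinity if a sticky instance is given, else take the lowest score."""
    if sticky is not None:
        return sticky
    avg_load = sum(i.ongoing_tokens for i in instances) / len(instances)
    return min(instances, key=lambda i: score(i, avg_load, alpha))


instances = [
    Instance("inst_0", ongoing_tokens=60_000, cache_hit_ratio=0.9),
    Instance("inst_1", ongoing_tokens=20_000, cache_hit_ratio=0.0),
    Instance("inst_2", ongoing_tokens=40_000, cache_hit_ratio=0.7),
]
chosen = pick_instance(instances, alpha=1.0)
load_only = pick_instance(instances, alpha=0.0)
```

With these made-up numbers, `chosen` is inst_2 (moderate load, good cache hit), while `load_only` with α=0 falls back to the least-loaded inst_1. This shows how α trades load balance against cache reuse.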
3. Results
3.1 Main Comparison (unified cache-aware scheduler)
| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
|---|---|---|---|---|---|
| Combined TP=1 DP=8 (cache-aware) | 997/999 | 0.731s | 0.073s | 4.48s | 44.7% |
| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |
3.2 GPU Utilization (200 req, time_scale=20)
| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
|---|---|---|---|---|
| Combined 8colo | 30.5% (active 64%) | — | — | Distributed |
| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |
3.3 Per-Request Breakdown (6P+2D, await mode)
| Stage | p50 | % of TTFT |
|---|---|---|
| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
| Proxy overhead | 0.000s | 0.0% |
| KV pull + decode wait | 109.6s | 87.7% |
| Total TTFT | 110.2s | 100% |
Root cause of 109.6s kv+decode: vLLM decode log shows Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%. GPU idle, requests queued for KV cache memory.
3.4 Ablations
| Ablation | Change | TTFT | TPOT p90 | Verdict |
|---|---|---|---|---|
| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | Helps TTFT (less prefill queue) |
| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | Hurts (decode KV cache contention) |
4. Analysis
4.1 DistServe's Assumptions vs Agentic Reality
| Assumption | Chatbot (DistServe) | Agentic (this work) |
|---|---|---|
| A. P is compute-bound, D is memory-bound | ✅ | ✅ Even at 95% reuse, prefill AI is >1000x decode AI (<2 FLOP/byte) |
| B. PD co-location causes interference | ✅ | ❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
| C. KV transfer cost negligible | ✅ (short input) | ❌ Avg 33.6k tokens, TTFT +72% from transfer |
| D. Dedicated prefill improves throughput | ✅ | ❌ 71% cache hit → prefill already lightweight |
| E. Decode KV cache not a bottleneck | ✅ (short context) | ❌ THE bottleneck: 97% KV cache on decode |
4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse
SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)
| Reuse% | NewTokens | AI (FLOP/byte) | Bound | vs Decode |
|---|---|---|---|---|
| 0% | 64,000 | 40,758 | COMPUTE | 26,813x |
| 70% | 19,200 | 20,610 | COMPUTE | 13,559x |
| 90% | 6,400 | 8,544 | COMPUTE | 5,621x |
| 95% | 3,200 | 4,549 | COMPUTE | 2,993x |
| Decode | 1 | 1.5 | MEMORY | 1x |
Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention work scales with new_tokens × seq_len: each new token still attends over the full context, so the cost depends on total context length and not only on the number of new tokens.
But absolute FLOPs drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.
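As a rough check of the compute-bound vs memory-bound classification, the sketch below compares the arithmetic-intensity figures quoted above against the ridge point. The H20 peak throughput and bandwidth values are assumptions (they are not stated in this report) chosen because they reproduce the quoted ridge point of 37 FLOP/byte.

```python
# Assumed H20 specs (not from this report): ~148 TFLOPS dense BF16, ~4.0 TB/s HBM bandwidth.
PEAK_FLOPS = 148e12
HBM_BYTES_PER_S = 4.0e12

# Ridge point: the arithmetic intensity at which compute and memory limits meet.
ridge = PEAK_FLOPS / HBM_BYTES_PER_S

# Arithmetic intensity (FLOP/byte) per row, copied from the roofline figures above.
arithmetic_intensity = {
    "0%": 40758,
    "70%": 20610,
    "90%": 8544,
    "95%": 4549,
    "decode": 1.5,
}

bound = {
    row: ("COMPUTE" if ai > ridge else "MEMORY")
    for row, ai in arithmetic_intensity.items()
}
```

Every prefill row lands above the ridge point and decode lands far below it, matching the Bound column.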
4.3 The Real Bottleneck: Decode KV Cache Memory Wall
PD separation concentrates all decode onto fewer GPUs:
| | Combined (8 inst) | PD-Sep 6P+2D |
|---|---|---|
| Decode KV cache total | 8 × 28GB = 224GB | 2 × 28GB = 56GB |
| Concurrent decode reqs | ~1 per inst | ~4 per inst |
| KV cache utilization | Low | 97.1% |
At 97.1% KV cache usage, a 49-token request (KV = few KB) waits 114 seconds for a 64k-token request to finish decode and release its ~8GB of KV cache.
This is memory-capacity head-of-line blocking: the GPU is idle (Running: 0), but cannot schedule new requests because KV cache is full.
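A back-of-the-envelope capacity estimate makes the squeeze concrete. It uses only the 28GB-per-instance and ~8GB-per-64k-request figures quoted above; both are approximate, and the estimate ignores memory held by shorter requests and by prefix-cache blocks.

```python
KV_PER_INSTANCE_GB = 28     # KV cache per instance, from the table above
KV_PER_64K_REQUEST_GB = 8   # approximate "~8GB" figure for a 64k-token request


def max_resident_64k_requests(decode_instances: int) -> int:
    """Upper bound on 64k-token requests whose KV cache fits at once."""
    per_instance = KV_PER_INSTANCE_GB // KV_PER_64K_REQUEST_GB
    return decode_instances * per_instance


combined_capacity = max_resident_64k_requests(8)  # all 8 instances can decode
pd_sep_capacity = max_resident_64k_requests(2)    # only the 2 decode instances
```

Under these assumptions, the combined layout can hold about 24 such requests and 6P+2D only about 6, so a handful of long-context decodes is enough to fill the decode pool.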
4.4 Why Cache-Aware Routing Matters More Than PD Separation
| Change | TTFT impact | TPOT p90 impact | APC impact |
|---|---|---|---|
| RR → cache-aware routing | -60% | -15% | +24pp |
| Combined → PD-Sep | +72% | +1% | -5pp |
Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
5. Elastic P2P Offload: Selective PD Disaggregation
5.1 Motivation
Full PD separation fails because it concentrates decode onto fewer GPUs (§4.3). But co-located combined mode still suffers from heavy prefill blocking decode: an 80k-token prefill occupies the GPU for seconds, during which co-resident decode requests stall (baseline TPOT rises from 0.069s at p50 to 0.117s at p90).
Elastic P2P selectively offloads only HEAVY requests (>20k new tokens after prefix cache) to a different instance for prefill via Mooncake RDMA, while WARM/MEDIUM stay co-located. All 8 instances run kv_role=kv_both — any instance can act as P or D.
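The offload decision can be sketched as follows. The 20k HEAVY threshold comes from the text above; the 5k WARM boundary is taken from the per-class buckets in §5.3.3; the function names and the behavior when the in-flight cap is reached (keep the request co-located) are assumptions, not the proxy's actual logic.

```python
HEAVY_THRESHOLD = 20_000  # new tokens after prefix cache, from the text above
WARM_THRESHOLD = 5_000    # assumed from the WARM (<5k) bucket in §5.3.3


def classify(new_tokens: int) -> str:
    """Bucket a request by the number of tokens that still need prefill."""
    if new_tokens < WARM_THRESHOLD:
        return "WARM"
    if new_tokens <= HEAVY_THRESHOLD:
        return "MEDIUM"
    return "HEAVY"


def should_offload(input_tokens: int, cached_tokens: int,
                   inflight_offloads: int, max_inflight: int = 4) -> bool:
    """Offload only HEAVY requests, and only while under the in-flight cap."""
    new_tokens = max(input_tokens - cached_tokens, 0)
    return classify(new_tokens) == "HEAVY" and inflight_offloads < max_inflight
```

For example, an 80k-token request with 70k tokens already cached needs only 10k new tokens, so it is MEDIUM and stays co-located, while the same request with a cold cache is offloaded.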
5.2 Fair A/B Comparison
Both configs: 8 × TP=1 instances, fresh restart, same trace (200 req, time_scale=20, 8 sessions), session-sticky + cache-aware routing. Baseline on dash0, elastic on dash1 (identical H20 ×8 nodes).
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|---|---|---|---|---|---|---|
| Baseline (combined) | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Elastic P2P (cap=4) | 185/196 | 1.315s | 13.179s | 0.066s | 0.075s | 5.708s |
| Delta | N/A | -45% | -52% | -4% | -36% | -44% |
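The Delta row can be recomputed from the two result rows above as a relative change against baseline, rounded to the nearest percent. The variable and function names here are illustrative only.

```python
# Latency figures in seconds, copied from the A/B table above.
baseline = {"TTFT p50": 2.383, "TTFT p90": 27.622, "TPOT p50": 0.069,
            "TPOT p90": 0.117, "E2E p50": 10.232}
elastic = {"TTFT p50": 1.315, "TTFT p90": 13.179, "TPOT p50": 0.066,
           "TPOT p90": 0.075, "E2E p50": 5.708}


def pct_delta(base: float, new: float) -> int:
    """Relative change of `new` versus `base`, in whole percent."""
    return round((new - base) / base * 100)


deltas = {metric: pct_delta(baseline[metric], elastic[metric]) for metric in baseline}
```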
5.3 System-Level Breakdown
5.3.1 KV Cache Hit Ratio
Baseline suffers from extreme APC skew — some instances accumulate hot sessions, others get cold traffic:
| Instance | Baseline APC | Elastic prefix APC | Elastic external APC | Elastic effective |
|---|---|---|---|---|
| inst_0 | 48.6% | 37.8% | 31.6% | 69.4% |
| inst_3 | 3.8% | 36.6% | 34.2% | 70.8% |
| inst_7 | 68.3% | 25.0% | 0.0% | 25.0% |
| APC std | ~33pp | ~7pp (prefix only) | | |
Key observations:
- Baseline APC is highly skewed (3.8%–68.3% across instances). Instances receiving heavy requests have low APC because heavy requests evict cached prefixes from other sessions.
- Elastic achieves more uniform prefix APC (~36–38% on inst_0 and inst_3, 25.0% on inst_7; std ~7pp vs ~33pp) because heavy prefills are offloaded to P instances, preserving D-instance cache chains.
- Mooncake external cache adds 30-34pp on instances that receive offloaded decode, giving effective APC of ~70% on active decode instances.
- Elastic's effective cache reuse is higher because the D instance retains the full prefix chain — when the next turn of the same session arrives, it hits the local prefix cache (not requiring another transfer).
5.3.2 Success Rate
| Config | OK | Total | Rate | Error input p50 |
|---|---|---|---|---|
| Baseline | 198 | 200 | 99.0% | — |
| Elastic | 185 | 196 | 94.4% | ~60k+ tokens |
Elastic's lower success rate (94.4% vs 99%) comes from Mooncake transfer timeouts on the largest HEAVY requests. The 11 failed requests and the 4 missing ones (196 vs 200 dispatched) have input >60k tokens. Survivorship bias check: elastic's OK request set has comparable input distribution to baseline (p90 coverage similar), so latency improvement is not an artifact of dropping large requests.
5.3.3 Per-Class TTFT Breakdown (Combined Baseline)
| Class | Count | % | Input p50 | TTFT p50 | TTFT p90 | TPOT p90 |
|---|---|---|---|---|---|---|
| WARM (<5k) | 46 | 23% | 1,095 | 0.133s | 0.260s | 0.060s |
| MEDIUM (5-20k) | 50 | 25% | 10,879 | 0.873s | 1.808s | 0.074s |
| HEAVY (20-50k) | 64 | 32% | 34,368 | 2.589s | 6.302s | 0.073s |
| HEAVY (>50k) | 38 | 19% | 83,018 | 9.563s | 30.480s | 0.096s |
HEAVY requests (>20k) constitute 51% of requests but dominate tail latency. A single 80k-token prefill takes ~5-10s of GPU compute, during which co-located decode requests are blocked by chunked prefill interleaving.
Elastic offloads precisely these HEAVY requests (≥20k new tokens) to a different instance, so the D instance's decode pipeline is never blocked by large prefills. This is the primary mechanism behind the -36% TPOT p90 improvement.
5.3.4 GPU Utilization
| Config | Mean | Std | Min | Max | Imbalance |
|---|---|---|---|---|---|
| Baseline | 28.7% | ~6% | 20% | 38% | 1.9× |
| Elastic | 15.8% | ~8% | 7.6% | 30.4% | 3.0× |
5.4 Why Elastic Wins Despite Worse GPU Utilization Balance
This is the central paradox: elastic uses 45% less GPU (15.8% vs 28.7%) and has worse balance (3.0× vs 1.9×), yet delivers 44% lower E2E latency.
Three mechanisms explain this:
1. Eliminating prefill-decode interference (primary, explains TPOT -36%)
In combined mode, vLLM uses chunked prefill to interleave prefill and decode. When an 80k-token HEAVY request arrives on an instance, even with chunked prefill, decode steps are delayed by prefill chunks (each chunk consumes the GPU for tens of ms). This manifests as TPOT p90 = 0.117s in baseline vs 0.075s in elastic, a 36% reduction.
Elastic achieves this by routing HEAVY prefills to a different instance. The D instance only handles WARM/MEDIUM prefills (which are small and fast) plus decode, so its decode pipeline is never disrupted.
2. Better effective cache utilization (explains TTFT -45%)
Baseline's APC is skewed (3.8%–68.3%). Elastic's Mooncake transfer gives D instances access to KV blocks computed on P instances, achieving ~70% effective hit rate on active instances vs baseline's ~40% average. Higher effective APC means less compute per request → lower TTFT.
More importantly: when a HEAVY request's prefill happens on a P instance, the D instance's prefix cache is preserved. In baseline, an 80k-token prefill on the D instance evicts other sessions' cached prefixes, causing future requests to that instance to miss cache.
3. Higher per-request efficiency offsets lower aggregate utilization
Baseline's 28.7% GPU utilization includes wasted work: prefill compute on tokens that would have been cached if the cache hadn't been evicted by other heavy prefills on the same instance. Elastic's 15.8% represents more useful work per GPU cycle because:
- Fewer cache misses → less redundant prefill compute
- Less prefill-decode contention → decode finishes faster → KV cache freed sooner
- The "idle" GPUs in elastic are instances waiting for their next session turn — they're idle because work finished faster, not because they're underutilized
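The GPU utilization gap (28.7% vs 15.8%) is almost entirely explained by the 44% shorter E2E. Assuming the replayed arrival schedule is the same in both runs, each request holds a GPU for roughly 56% as long, so average utilization over the run falls to about 28.7% × 0.56 ≈ 16.1%, close to the measured 15.8%.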
5.5 GPU Load Imbalance: Root Cause and Improvement
The 3.0× imbalance in elastic (7.6% min vs 30.4% max) has two root causes:
Root cause 1: P-instance concentration. The current offload routing picks p_inst = min(candidates, key=ongoing_tokens) — the globally least-loaded instance excluding D. With MAX_OFFLOAD_INFLIGHT=4, at most 4 P instances are busy at once, but session-sticky routing means some instances consistently receive more sessions than others, making some consistently busier as D and rarely chosen as P.
Root cause 2: Session skew. Some sessions have many turns with large inputs (e.g., session 19787: turns at 62k, 74k). The instance pinned to such a session is consistently loaded, while instances pinned to short single-turn sessions go idle quickly.
Proposed improvement: Round-robin P-instance selection with session awareness
Current: p_inst = argmin(ongoing_tokens) excluding d_inst
Proposed: p_inst = round_robin(all_instances excluding d_inst),
skip if p_inst.ongoing_tokens > 2 * avg_load
This distributes P-role work evenly across all non-D instances instead of always picking the least loaded (which concentrates P work on the same few idle instances). The overload gate prevents routing to an already-saturated instance.
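A minimal sketch of the proposed selection rule follows. The class name, the `pick` signature, and returning `None` when every candidate is gated out are assumptions; the 2× average-load gate is taken from the proposal above.

```python
from typing import Dict, List, Optional


class RoundRobinPSelector:
    """Rotate P-role work across instances, skipping the D instance and overloaded ones."""

    def __init__(self, instance_names: List[str], overload_factor: float = 2.0):
        self.names = list(instance_names)
        self.overload_factor = overload_factor
        self._next = 0  # index where the next search starts

    def pick(self, d_inst: str, ongoing_tokens: Dict[str, int]) -> Optional[str]:
        avg_load = sum(ongoing_tokens.values()) / len(ongoing_tokens)
        n = len(self.names)
        for step in range(n):
            idx = (self._next + step) % n
            name = self.names[idx]
            if name == d_inst:
                continue  # never prefill on the decode instance itself
            if ongoing_tokens[name] > self.overload_factor * avg_load:
                continue  # overload gate: skip saturated instances
            self._next = (idx + 1) % n
            return name
        return None  # no eligible instance; caller may fall back to co-located prefill


selector = RoundRobinPSelector(["inst_0", "inst_1", "inst_2", "inst_3"])
loads = {"inst_0": 1000, "inst_1": 9000, "inst_2": 1000, "inst_3": 1000}
picks = [selector.pick("inst_0", loads) for _ in range(3)]
```

In this example inst_1 carries more than twice the average load, so the rotation alternates between inst_2 and inst_3 instead of repeatedly landing on one idle instance.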
Additionally, adaptive MAX_OFFLOAD_INFLIGHT based on cluster load:
- When total ongoing_tokens < threshold: allow more concurrent offloads (e.g., cap=6)
- When total ongoing_tokens > threshold: reduce cap (e.g., cap=2) to prevent cascade
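The adaptive cap could look like the sketch below. The function name and the example threshold are hypothetical, and the behavior exactly at the threshold is unspecified in the proposal; this sketch applies the reduced cap there.

```python
def adaptive_offload_cap(total_ongoing_tokens: int, threshold: int,
                         low_load_cap: int = 6, high_load_cap: int = 2) -> int:
    """Allow more concurrent offloads when the cluster is lightly loaded."""
    if total_ongoing_tokens < threshold:
        return low_load_cap
    return high_load_cap  # at or above the threshold: throttle to avoid a cascade


# Hypothetical threshold, for illustration only.
EXAMPLE_THRESHOLD = 400_000
light_cap = adaptive_offload_cap(100_000, EXAMPLE_THRESHOLD)
heavy_cap = adaptive_offload_cap(900_000, EXAMPLE_THRESHOLD)
```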
6. Conclusions
- Single-machine PD separation is net negative for agentic workloads due to decode KV cache memory wall
- Cache-aware routing is the dominant optimization — improves TTFT by 60%, TPOT by 15%, APC by 24pp
- Prefill stays compute-bound even at 95% cache reuse, but absolute compute drops enough to eliminate P-D interference
- Elastic P2P offload is net positive — selective offload of HEAVY requests achieves -45% TTFT, -36% TPOT, -44% E2E by eliminating prefill-decode interference and preserving D-instance cache chains
- The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency: less redundant prefill, less contention, and faster KV cache turnover
- GPU load imbalance (3.0× vs 1.9×) in elastic is caused by P-instance concentration and session skew — fixable with round-robin P selection and adaptive offload cap
7. Patches Applied to vLLM 0.18.1
| File | Change | Reason |
|---|---|---|
| `v1/core/sched/scheduler.py` | `assert req_id in self.requests` → graceful skip | KV transfer callback races with request abort |
Appendix: Experiment Artifacts
Data on dash0 (~/agentic-kv/outputs/)
| Directory | Config | Requests | Notes |
|---|---|---|---|
| `v18_combined_1000req` | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
| `exp1_combined_tp2_dp4` | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
| `exp2_combined_tp1_dp8` | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
| `exp3_pd_sep_tp1_mooncake` | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
| `gpu_ab_combined` | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_pdsep` | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
| `gpu_ab_6p2d` | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
| `gpu_ab_6p2d_fnf` | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
| `breakdown_await` | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
| `ab_baseline` (dash0) | TP=1 DP=8 combined, 200 req | 200 | Fair A/B baseline (§5) |
| `ab_elastic` (dash1) | TP=1 DP=8 elastic P2P, 200 req | 196 | Fair A/B elastic (§5) |
Trace on dash0
| Path | Description |
|---|---|
| `~/ali-trace/trace-glm5.1/` | Raw production logs (301GB, 4 files × 30min) |
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | Formatted 2h trace (2.1M requests) |
| `~/agentic-kv/traces/sampled_1000req_seed42.jsonl` | Sampled 1000 requests for benchmarks |
Key Scripts
| Script | Purpose |
|---|---|
| `scripts/cache_aware_proxy.py` | Unified global scheduler (combined + PD-sep modes) |
| `scripts/sample_trace.py` | Trace sampler preserving sessions + hash_ids |
| `replayer/` | Async trace replayer with streaming metrics |
| `scripts/compute_roofline.py` | Prefill/decode roofline analysis |
| `scripts/analyze_cache_hit.py` | Theoretical vs actual KV cache hit ratio |
| `scripts/analyze_breakdown.py` | Per-request stage breakdown from proxy |
| `scripts/gpu_monitor.sh` | 5s-interval GPU utilization sampling |
Reproducing
# On dash0, activate env
cd ~/agentic-kv && source .venv/bin/activate
# Sample trace
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42
# Combined TP=1 DP=8 + cache-aware scheduler
for i in $(seq 0 7); do
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
--port $((8000+i)) --tp 1 --enable-prefix-caching --enforce-eager &
done
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
--endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8
# Breakdown data
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin