agentic-kvc/analysis/pd_separation_analysis.md
Gahow Wang fc92410ec9 Invalidate prior A/B results + add proper experiment harness
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00


PD Disaggregation for Agentic LLM Workloads: A Systematic Study

TL;DR

We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:

PD separation is net negative for single-machine agentic workloads. The root cause is not what prior work (DistServe, Splitwise) targeted — it is a KV cache memory wall on decode instances.

| Config (TP=1, 8×H20) | TTFT p50 | TPOT p90 | GPU util | KV cache pressure |
| --- | --- | --- | --- | --- |
| Combined DP=8 (cache-aware) | 0.731s | 0.073s | 30.5% | Low (spread across 8 inst) |
| PD-Sep 6P+2D (cache-aware) | 1.481s | 0.077s | 16.9% | 97.1% on decode |

Per-request breakdown shows 87.7% of TTFT is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.

Elastic P2P offload (selective disaggregation of HEAVY requests only, Mooncake kv_both): under fair same-machine fresh-restart comparison, elastic does NOT improve over baseline. Mooncake kv_both memory overhead outweighs prefill isolation benefit at moderate load.

| Config (TP=1, 8×H20, fresh) | TTFT p50 | TPOT p90 | E2E p50 |
| --- | --- | --- | --- |
| Combined DP=8 (baseline) | 1.075s | 0.076s | 5.075s |
| Elastic P2P (kv_both, cap=4) | 1.018s | 0.085s | 6.977s |

The earlier cross-machine comparison (commit 1e86285) was invalidated because its baseline used warm instances. See REPORT.md §3.5.


1. Workload Characterization

Trace: GLM-5.1 Agentic Coder, production cluster, 2 hours

| Metric | Value |
| --- | --- |
| Requests | 2,114,220 |
| Input tokens | 71.1B (avg 33.6k, p50=20k, p90=88k) |
| Output tokens | 940M (avg 445, p50=80) |
| I/O ratio | 75.6x aggregate, 217.8x per-request median |
| Prefill token share | 98% |
| Sessions | 1.3M (90% single-turn) |
| >32k input | 38% of requests, 79% of tokens |
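
The averages and the aggregate I/O ratio in the table follow directly from the three totals. A minimal check, using the rounded totals as printed above (so the results agree only to the precision shown in the table):

```python
# Rounded totals copied from the workload table above.
requests = 2_114_220
input_tokens = 71.1e9
output_tokens = 940e6

avg_input = input_tokens / requests        # input tokens per request
avg_output = output_tokens / requests      # output tokens per request
io_ratio = input_tokens / output_tokens    # aggregate input/output ratio
prefill_share = input_tokens / (input_tokens + output_tokens)

print(f"avg input  = {avg_input / 1000:.1f}k tokens")
print(f"avg output = {avg_output:.0f} tokens")
print(f"I/O ratio  = {io_ratio:.1f}x aggregate")
print(f"prefill token share = {prefill_share:.1%}")
```

The per-request median ratio (217.8x) cannot be derived from the totals; it needs the per-request data.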

KV cache reuse:

| Metric | Value |
| --- | --- |
| Theoretical prefix cache hit (infinite, single inst) | 71% |
| Shared hash blocks (ref>1) | 47% of unique blocks |
| Intra-session reuse | 57% |
| Top blocks ref count | 64,754 (system prompt) |
| Actual APC (Combined, cache-aware, 8 inst) | 44.7% |
| Actual APC (Round-robin, 8 inst) | 20.8% |

Request profile after prefix cache:

| Bucket | Share of requests | Avg new tokens to prefill |
| --- | --- | --- |
| >90% cache hit (warm) | 22% | 1,314 |
| 50-90% cache hit | 14% | 10,052 |
| 1-50% cache hit | 8% | 38,909 |
| 0% cache hit (cold) | 55% | 17,696 |

2. Experiment Setup

Hardware: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)

Software: vLLM 0.18.1 (source in third_party/vllm/, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv

Model: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)

Configurations tested (all use same cache-aware + token-level LB global scheduler unless noted):

| Config | Instances | GPU allocation | Scheduler |
| --- | --- | --- | --- |
| Combined TP=8 DP=1 | 1 | 8 GPU shared | N/A (single) |
| Combined TP=2 DP=4 | 4 independent | 2 GPU each | RR (legacy) |
| Combined TP=1 DP=8 | 8 independent | 1 GPU each | RR / cache-aware |
| PD-Sep TP=1 4P+4D | 4P + 4D Mooncake | 4 GPU P, 4 GPU D | cache-aware |
| PD-Sep TP=1 6P+2D | 6P + 2D Mooncake | 6 GPU P, 2 GPU D | cache-aware |

Benchmark params: 1000 sampled requests (200 for ablations), --enforce-eager, --max-model-len 200000

Trace sampler: scripts/sample_trace.py — random session sampling preserving multi-turn structure + hash_ids

Global scheduler: scripts/cache_aware_proxy.py — supports both --combined (PD-colo) and --prefill/--decode (PD-sep) modes. Score = ongoing_tokens/avg_load - α·cache_hit_ratio, session affinity for multi-turn.
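
The scoring rule above can be sketched as follows. This is an illustration of the stated formula only, not the code in scripts/cache_aware_proxy.py: the function names, the dict-based inputs, the default `alpha`, and the way session affinity is stored are all assumptions.

```python
def score(ongoing_tokens, avg_load, cache_hit_ratio, alpha):
    """Lower is better: normalized load minus a cache-hit bonus."""
    load_term = ongoing_tokens / avg_load if avg_load > 0 else 0.0
    return load_term - alpha * cache_hit_ratio


def pick_instance(loads, hit_ratios, alpha=1.0, session_id=None, affinity=None):
    """Pick the instance with the lowest score.

    loads:      {instance: ongoing_tokens}
    hit_ratios: {instance: estimated prefix-cache hit ratio for this request}
    affinity:   {session_id: instance}, hypothetical store for multi-turn stickiness
    """
    affinity = affinity if affinity is not None else {}
    # Session affinity: later turns of a session return to the same instance.
    if session_id is not None and session_id in affinity:
        return affinity[session_id]
    avg_load = sum(loads.values()) / len(loads)
    best = min(
        loads,
        key=lambda name: score(loads[name], avg_load, hit_ratios.get(name, 0.0), alpha),
    )
    if session_id is not None:
        affinity[session_id] = best
    return best


loads = {"a": 30_000, "b": 10_000}
hits = {"a": 0.9, "b": 0.0}
print(pick_instance(loads, hits, alpha=1.0))  # load dominates
print(pick_instance(loads, hits, alpha=2.0))  # cache hit dominates
```

With a small α the lightly loaded instance wins even though it has no cached prefix; raising α lets a high cache-hit ratio outweigh a heavier load.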

3. Results

3.1 Main Comparison (unified cache-aware scheduler)

| Config | OK/N | TTFT p50 | TPOT p90 | E2E p50 | APC |
| --- | --- | --- | --- | --- | --- |
| Combined TP=1 DP=8 (cache-aware) | 997/999 | 0.731s | 0.073s | 4.48s | 44.7% |
| PD-Sep TP=1 4P+4D (cache-aware) | 509/564 | 1.261s | 0.074s | 5.61s | 40.2% |
| Combined TP=1 DP=8 (RR) | 997/999 | 1.836s | 0.086s | 6.67s | 20.8% |

3.2 GPU Utilization (200 req, time_scale=20)

| Config | All GPU mean | Prefill GPU | Decode GPU | Decode KV cache |
| --- | --- | --- | --- | --- |
| Combined 8colo | 30.5% (active 64%) | | | Distributed |
| PD-Sep 4P+4D | 12.4% (active 24%) | 16.9% (active 17%) | 7.8% (active 30%) | ~97% |
| PD-Sep 6P+2D | 16.9% (active 28%) | 16.2% (active 16%) | 19.0% (active 64%) | ~97% |

3.3 Per-Request Breakdown (6P+2D, await mode)

| Stage | p50 | % of TTFT |
| --- | --- | --- |
| Prefill (queue + compute + KV push) | 0.108s | 12.3% |
| Proxy overhead | 0.000s | 0.0% |
| KV pull + decode wait | 109.6s | 87.7% |
| Total TTFT | 110.2s | 100% |

Root cause of the 109.6s KV pull + decode wait: the vLLM decode log shows Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%. The GPU is idle while requests queue for KV cache memory.

3.4 Ablations

| Ablation | Change | TTFT | TPOT p90 | Verdict |
| --- | --- | --- | --- | --- |
| P/D ratio: 6P+2D vs 4P+4D | More prefill GPUs | -26% | ~same | Helps TTFT (less prefill queue) |
| Fire-and-forget vs await | Async prefill dispatch | +260% | -44% | Hurts (decode KV cache contention) |

4. Analysis

4.1 DistServe's Assumptions vs Agentic Reality

| Assumption | Chatbot (DistServe) | Agentic (this work) |
| --- | --- | --- |
| A. P is compute-bound, D is memory-bound | | Even at 95% reuse, prefill AI >1000x vs decode AI <2 |
| B. PD co-location causes interference | | Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074) |
| C. KV transfer cost negligible (short input) | | Avg 33.6k tokens, TTFT +72% from transfer |
| D. Dedicated prefill improves throughput | | 71% cache hit → prefill already lightweight |
| E. Decode KV cache not a bottleneck (short context) | | THE bottleneck: 97% KV cache on decode |

4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse

SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)

Reuse%   NewTokens   AI (FLOP/byte)   Bound        vs Decode
0%       64,000      40,758           COMPUTE      26,813x
70%      19,200      20,610           COMPUTE      13,559x
90%       6,400       8,544           COMPUTE       5,621x
95%       3,200       4,549           COMPUTE       2,993x
Decode        1         1.5           MEMORY            1x

Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with new_tokens × seq_len: every new token still attends over the full context, so the cost depends on total context length and not only on the number of new tokens.

But absolute FLOPs drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.
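
The classification in the roofline table reduces to comparing each arithmetic intensity (AI) with the ridge point. A minimal sketch using only the values quoted above (the AI figures are copied from the table, not recomputed; the helper names are illustrative):

```python
RIDGE_POINT = 37.0   # FLOP/byte, H20 ridge point quoted above
SEQ_LEN = 64_000     # context length used in the table


def new_tokens(reuse_pct: int, seq_len: int = SEQ_LEN) -> int:
    """Tokens that still need prefill compute at a given cache reuse percentage."""
    return seq_len * (100 - reuse_pct) // 100


def bound(ai: float, ridge: float = RIDGE_POINT) -> str:
    """A kernel is compute-bound when its AI exceeds the ridge point."""
    return "COMPUTE" if ai > ridge else "MEMORY"


# (reuse %, arithmetic intensity) pairs copied from the table above.
prefill_rows = [(0, 40_758), (70, 20_610), (90, 8_544), (95, 4_549)]
for reuse, ai in prefill_rows:
    print(f"reuse={reuse:>2}%  new_tokens={new_tokens(reuse):>6}  AI={ai:>6}  {bound(ai)}")
print(f"decode     new_tokens={1:>6}  AI={1.5:>6}  {bound(1.5)}")
```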

4.3 The Real Bottleneck: Decode KV Cache Memory Wall

PD separation concentrates all decode onto fewer GPUs:

| | Combined (8 inst) | PD-Sep 6P+2D |
| --- | --- | --- |
| Decode KV cache total | 8 × 28GB = 224GB | 2 × 28GB = 56GB |
| Concurrent decode reqs | ~1 per inst | ~4 per inst |
| KV cache utilization | Low | 97.1% |

At 97.1% KV cache usage, a 49-token request (KV = few KB) waits 114 seconds for a 64k-token request to finish decode and release its ~8GB of KV cache.

This is memory-capacity head-of-line blocking: the GPU is idle (Running: 0), but cannot schedule new requests because KV cache is full.
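
The blocking pattern can be illustrated with a toy admission model. This is a deliberate simplification and not vLLM's scheduler: it assumes strict FIFO admission that stops at the first request whose KV cache does not fit. The 28GB pool, the 97.1% utilization and the ~8GB request come from the text above; the 0.001GB size for the small request is an illustrative stand-in.

```python
from collections import deque


def admit_fifo(pool_gb, used_gb, waiting):
    """Admit (name, kv_gb) requests in FIFO order; stop at the first that does not fit."""
    free = pool_gb - used_gb
    queue = deque(waiting)
    admitted = []
    while queue and queue[0][1] <= free:
        name, need = queue.popleft()
        free -= need
        admitted.append(name)
    blocked = [name for name, _ in queue]
    return admitted, blocked


POOL_GB = 28.0
waiting = [("64k-token request", 8.0), ("49-token request", 0.001)]

# Pool 97.1% full: the large request at the head cannot fit, so the small one waits too.
print(admit_fifo(POOL_GB, POOL_GB * 0.971, waiting))

# After a running request releases ~8GB, both are admitted.
print(admit_fifo(POOL_GB, POOL_GB * 0.971 - 8.0, waiting))
```

In this model the small request's wait is set entirely by when the large allocation is released, which matches the observed pattern of an idle GPU with a non-empty waiting queue.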

4.4 Why Cache-Aware Routing Matters More Than PD Separation

| Change | TTFT impact | TPOT p90 impact | APC impact |
| --- | --- | --- | --- |
| RR → cache-aware routing | -60% | -15% | +24pp |
| Combined → PD-Sep | +72% | +1% | -5pp |

Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
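
The deltas in the table can be reproduced from the §3.1 results. A short check, with latencies as relative change and APC as percentage-point difference (helper names are illustrative):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent."""
    return (after - before) / before * 100.0


def pp_change(before_pct: float, after_pct: float) -> float:
    """Difference in percentage points."""
    return after_pct - before_pct


# Values from §3.1 (TP=1): TTFT p50 [s], TPOT p90 [s], APC [%]
rr = {"ttft": 1.836, "tpot": 0.086, "apc": 20.8}
cache_aware = {"ttft": 0.731, "tpot": 0.073, "apc": 44.7}
pd_sep = {"ttft": 1.261, "tpot": 0.074, "apc": 40.2}

for label, a, b in [("RR -> cache-aware", rr, cache_aware),
                    ("Combined -> PD-Sep", cache_aware, pd_sep)]:
    print(f"{label}: TTFT {pct_change(a['ttft'], b['ttft']):+.1f}%, "
          f"TPOT {pct_change(a['tpot'], b['tpot']):+.1f}%, "
          f"APC {pp_change(a['apc'], b['apc']):+.1f}pp")
```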

5. Elastic P2P Offload: Selective PD Disaggregation

5.1 Motivation

Full PD separation fails because it concentrates decode onto fewer GPUs (§4.3). But co-located combined mode still suffers from heavy prefill blocking decode: an 80k-token prefill occupies the GPU for seconds, during which co-resident decode requests stall (baseline TPOT is 0.069s at p50 but 0.117s at p90).

Elastic P2P selectively offloads only HEAVY requests (>20k new tokens after prefix cache) to a different instance for prefill via Mooncake RDMA, while WARM/MEDIUM stay co-located. All 8 instances run kv_role=kv_both — any instance can act as P or D.
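
The offload decision can be sketched as a classification on new tokens plus an in-flight cap. This is an illustration under stated assumptions, not the proxy's code: the WARM/MEDIUM boundaries are taken from the class table in §5.3.3, the cap of 4 is the MAX_OFFLOAD_INFLIGHT value mentioned in §5.5, and the function names, the treatment of exactly 20k tokens as HEAVY, and the fall-back to co-located prefill when the cap is reached are assumptions.

```python
WARM_MAX = 5_000          # below this: WARM (boundary from the §5.3.3 class table)
HEAVY_MIN = 20_000        # at or above this: HEAVY (boundary handling is an assumption)
MAX_OFFLOAD_INFLIGHT = 4  # cap on concurrent offloaded prefills


def classify(new_tokens: int) -> str:
    """Classify a request by tokens left to prefill after prefix-cache hits."""
    if new_tokens < WARM_MAX:
        return "WARM"
    if new_tokens < HEAVY_MIN:
        return "MEDIUM"
    return "HEAVY"


def should_offload(new_tokens: int, inflight_offloads: int) -> bool:
    """Offload only HEAVY requests, and only while the in-flight cap is not reached."""
    return classify(new_tokens) == "HEAVY" and inflight_offloads < MAX_OFFLOAD_INFLIGHT


for tokens, inflight in [(1_095, 0), (10_879, 0), (34_368, 3), (83_018, 4)]:
    print(tokens, classify(tokens), should_offload(tokens, inflight))
```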

5.2 Fair A/B Comparison

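Both configs: 8 × TP=1 instances, same trace (200 req, time_scale=20, 8 sessions), session-sticky + cache-aware routing. Baseline on dash0, elastic on dash1 (identical H20 ×8 nodes). This is the cross-machine comparison from commit 1e86285, which was later invalidated because the dash0 baseline ran on warm instances with residual KV cache; the same-machine fresh-restart numbers are in the TL;DR and REPORT.md §3.5.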

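| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (combined) | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Elastic P2P (cap=4) | 185/196 | 1.315s | 13.179s | 0.066s | 0.075s | 5.708s |
| Delta | | -45% | -52% | -4% | -36% | -44% |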

5.3 System-Level Breakdown

5.3.1 KV Cache Hit Ratio

Baseline suffers from extreme APC skew — some instances accumulate hot sessions, others get cold traffic:

| Instance | Baseline APC | Elastic prefix APC | Elastic external APC | Elastic effective |
| --- | --- | --- | --- | --- |
| inst_0 | 48.6% | 37.8% | 31.6% | 69.4% |
| inst_3 | 3.8% | 36.6% | 34.2% | 70.8% |
| inst_7 | 68.3% | 25.0% | 0.0% | 25.0% |
| APC std | ~33pp | ~7pp (prefix only) | | |

Key observations:

  • Baseline APC is highly skewed (3.8%–68.3% across instances). Instances receiving heavy requests have low APC because heavy requests evict cached prefixes from other sessions.
  • Elastic achieves more uniform prefix APC (~36–38% per instance) because heavy prefills are offloaded to P instances, preserving D-instance cache chains.
  • Mooncake external cache adds 30-34pp on instances that receive offloaded decode, giving effective APC of ~70% on active decode instances.
  • Elastic's effective cache reuse is higher because the D instance retains the full prefix chain — when the next turn of the same session arrives, it hits the local prefix cache (not requiring another transfer).

5.3.2 Success Rate

| Config | OK | Total | Rate | Error input p50 |
| --- | --- | --- | --- | --- |
| Baseline | 198 | 200 | 99.0% | |
| Elastic | 185 | 196 | 94.4% | ~60k+ tokens |

Elastic's lower success rate (94.4% vs 99%) comes from Mooncake transfer timeouts on the largest HEAVY requests. The 11 failed requests (185 OK of 196) and the 4 missing ones (196 of 200 dispatched) have input >60k tokens. Survivorship bias check: elastic's OK request set has an input distribution comparable to baseline's (similar p90 coverage), so the latency improvement is not an artifact of dropping large requests.

5.3.3 Per-Class TTFT Breakdown (Combined Baseline)

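| Class | Count | % | Input p50 | TTFT p50 | TTFT p90 | TPOT p90 |
| --- | --- | --- | --- | --- | --- | --- |
| WARM (<5k) | 46 | 23% | 1,095 | 0.133s | 0.260s | 0.060s |
| MEDIUM (5-20k) | 50 | 25% | 10,879 | 0.873s | 1.808s | 0.074s |
| HEAVY (20-50k) | 64 | 32% | 34,368 | 2.589s | 6.302s | 0.073s |
| HEAVY (>50k) | 38 | 19% | 83,018 | 9.563s | 30.480s | 0.096s |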

HEAVY requests (>20k) constitute 51% of requests but dominate tail latency. A single 80k-token prefill takes ~5-10s of GPU compute, during which co-located decode requests are blocked by chunked prefill interleaving.

Elastic offloads precisely these HEAVY requests (≥20k new tokens) to a different instance, so the D instance's decode pipeline is never blocked by large prefills. This is the primary mechanism behind the -36% TPOT p90 improvement.

5.3.4 GPU Utilization

| Config | Mean | Std | Min | Max | Imbalance |
| --- | --- | --- | --- | --- | --- |
| Baseline | 28.7% | ~6% | 20% | 38% | 1.9× |
| Elastic | 15.8% | ~8% | 7.6% | 30.4% | 3.0× |

5.4 Why Elastic Wins Despite Worse GPU Utilization Balance

This is the central paradox: elastic uses 45% less GPU (15.8% vs 28.7%) and has worse balance (3.0× vs 1.9×), yet delivers 44% lower E2E latency.

Three mechanisms explain this:

1. Eliminating prefill-decode interference (primary, explains TPOT -36%)

In combined mode, vLLM uses chunked prefill to interleave prefill and decode. When an 80k-token HEAVY request arrives on an instance, decode steps are still delayed by its prefill chunks (each chunk occupies the GPU for tens of ms). This shows up as TPOT p90 = 0.117s in baseline vs 0.075s in elastic, a 36% reduction.

Elastic achieves this by routing HEAVY prefills to a different instance. The D instance only handles WARM/MEDIUM prefills (which are small and fast) plus decode, so its decode pipeline is never disrupted.

2. Better effective cache utilization (explains TTFT -45%)

Baseline's APC is skewed (3.8%–68.3%). Elastic's Mooncake transfer gives D instances access to KV blocks computed on P instances, achieving ~70% effective hit rate on active instances vs baseline's ~40% average. Higher effective APC means less compute per request → lower TTFT.

More importantly: when a HEAVY request's prefill happens on a P instance, the D instance's prefix cache is preserved. In baseline, an 80k-token prefill on the D instance evicts other sessions' cached prefixes, causing future requests to that instance to miss cache.

3. Higher per-request efficiency offsets lower aggregate utilization

Baseline's 28.7% GPU utilization includes wasted work: prefill compute on tokens that would have been cached if the cache hadn't been evicted by other heavy prefills on the same instance. Elastic's 15.8% represents more useful work per GPU cycle because:

  • Fewer cache misses → less redundant prefill compute
  • Less prefill-decode contention → decode finishes faster → KV cache freed sooner
  • The "idle" GPUs in elastic are instances waiting for their next session turn — they're idle because work finished faster, not because they're underutilized

The GPU utilization gap (28.7% vs 15.8%) is almost entirely explained by the 44% shorter E2E: the same work completes in 56% of the time, so instantaneous utilization is lower.

5.5 GPU Load Imbalance: Root Cause and Improvement

The 3.0× imbalance in elastic (7.6% min vs 30.4% max) has two root causes:

Root cause 1: P-instance concentration. The current offload routing picks p_inst = min(candidates, key=ongoing_tokens) — the globally least-loaded instance excluding D. With MAX_OFFLOAD_INFLIGHT=4, at most 4 P instances are busy at once, but session-sticky routing means some instances consistently receive more sessions than others, making some consistently busier as D and rarely chosen as P.

Root cause 2: Session skew. Some sessions have many turns with large inputs (e.g., session 19787: turns at 62k, 74k). The instance pinned to such a session is consistently loaded, while instances pinned to short single-turn sessions go idle quickly.

Proposed improvement: Round-robin P-instance selection with session awareness

Current:  p_inst = argmin(ongoing_tokens) excluding d_inst
Proposed: p_inst = round_robin(all_instances excluding d_inst),
          skip if p_inst.ongoing_tokens > 2 * avg_load

This distributes P-role work evenly across all non-D instances instead of always picking the least loaded (which concentrates P work on the same few idle instances). The overload gate prevents routing to an already-saturated instance.
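
A minimal sketch of the proposed selection rule, assuming per-instance ongoing_tokens is available as a dict. The class name, the `overload_factor` parameter, and returning `None` when no instance qualifies (so the caller can fall back to co-located prefill) are illustrative choices, not the proxy's implementation.

```python
class RoundRobinPSelector:
    """Rotate the P role across instances, skipping the D instance and overloaded ones."""

    def __init__(self, instance_names):
        self.names = list(instance_names)
        self.cursor = 0  # index of the next candidate to consider

    def pick(self, d_inst, ongoing_tokens, overload_factor=2.0):
        avg_load = sum(ongoing_tokens.values()) / len(ongoing_tokens)
        n = len(self.names)
        for step in range(n):
            cand = self.names[(self.cursor + step) % n]
            if cand == d_inst:
                continue  # P must differ from D
            if ongoing_tokens[cand] > overload_factor * avg_load:
                continue  # overload gate: skip saturated instances
            self.cursor = (self.cursor + step + 1) % n
            return cand
        return None  # nothing eligible; caller decides what to do


selector = RoundRobinPSelector(["inst_0", "inst_1", "inst_2", "inst_3"])
idle = {"inst_0": 0, "inst_1": 0, "inst_2": 0, "inst_3": 0}
print([selector.pick("inst_0", idle) for _ in range(4)])
```

Unlike `argmin(ongoing_tokens)`, consecutive offloads rotate through the eligible instances even when one of them is always the least loaded.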

Additionally, adaptive MAX_OFFLOAD_INFLIGHT based on cluster load:

  • When total ongoing_tokens < threshold: allow more concurrent offloads (e.g., cap=6)
  • When total ongoing_tokens > threshold: reduce cap (e.g., cap=2) to prevent cascade
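
The adaptive cap can be sketched as a single threshold rule. The threshold value is not specified above, so it is a required parameter here; the function name and the choice to use the lower cap when load equals the threshold are assumptions.

```python
def adaptive_offload_cap(total_ongoing_tokens, threshold, low_load_cap=6, high_load_cap=2):
    """Return the MAX_OFFLOAD_INFLIGHT value to use for the current cluster load."""
    if total_ongoing_tokens < threshold:
        return low_load_cap   # light load: allow more concurrent offloads
    return high_load_cap      # heavy load: throttle offloads to prevent cascade


# Hypothetical threshold of 400k ongoing tokens across the cluster.
print(adaptive_offload_cap(150_000, threshold=400_000))
print(adaptive_offload_cap(650_000, threshold=400_000))
```

A single threshold can oscillate when load hovers near it; separate raise and lower thresholds would be one way to damp that.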

6. Conclusions

  1. Single-machine PD separation is net negative for agentic workloads due to decode KV cache memory wall
  2. Cache-aware routing is the dominant optimization — improves TTFT by 60%, TPOT by 15%, APC by 24pp
  3. Prefill stays compute-bound even at 95% cache reuse, but absolute compute drops enough to eliminate P-D interference
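  4. Elastic P2P offload does not improve over baseline under a fair same-machine fresh-restart comparison (TTFT p50 1.018s vs 1.075s, TPOT p90 0.085s vs 0.076s, E2E p50 6.977s vs 5.075s): Mooncake kv_both memory overhead outweighs the prefill-isolation benefit at moderate load. The earlier cross-machine gains (-45% TTFT, -36% TPOT, -44% E2E) were invalidated because the baseline ran on warm instances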
  5. The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency: less redundant prefill, less contention, and faster KV cache turnover
  6. GPU load imbalance (3.0× vs 1.9×) in elastic is caused by P-instance concentration and session skew — fixable with round-robin P selection and adaptive offload cap

7. Patches Applied to vLLM 0.18.1

| File | Change | Reason |
| --- | --- | --- |
| v1/core/sched/scheduler.py | assert req_id in self.requests → graceful skip | KV transfer callback races with request abort |

Appendix: Experiment Artifacts

Data on dash0 (~/agentic-kv/outputs/)

| Directory | Config | Requests | Notes |
| --- | --- | --- | --- |
| v18_combined_1000req | TP=8 DP=1, 16 sess, 120s TO | 1000 | Baseline with /metrics APC |
| exp1_combined_tp2_dp4 | TP=2 DP=4, RR, 8 sess | 999 | No summary (killed) |
| exp2_combined_tp1_dp8 | TP=1 DP=8, cache-aware, 8 sess | 999 | Unified scheduler baseline |
| exp3_pd_sep_tp1_mooncake | TP=1 4P+4D Mooncake, cache-aware | ~560 | Multiple iterations |
| gpu_ab_combined | TP=1 DP=8 cache-aware, 200 req | 200 | GPU util CSV + metrics |
| gpu_ab_pdsep | TP=1 4P+4D cache-aware, 200 req | 200 | GPU util CSV + metrics |
| gpu_ab_6p2d | TP=1 6P+2D cache-aware, 200 req | 200 | Ablation 1: P/D ratio |
| gpu_ab_6p2d_fnf | TP=1 6P+2D fire-and-forget, 200 req | 67 | Ablation 2: scheduling |
| breakdown_await | TP=1 6P+2D await, 50 req | 50 | Per-stage breakdown |
| ab_baseline (dash0) | TP=1 DP=8 combined, 200 req | 200 | Fair A/B baseline (§5) |
| ab_elastic (dash1) | TP=1 DP=8 elastic P2P, 200 req | 196 | Fair A/B elastic (§5) |

Trace on dash0

| Path | Description |
| --- | --- |
| ~/ali-trace/trace-glm5.1/ | Raw production logs (301GB, 4 files × 30min) |
| ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl | Formatted 2h trace (2.1M requests) |
| ~/agentic-kv/traces/sampled_1000req_seed42.jsonl | Sampled 1000 requests for benchmarks |

Key Scripts

| Script | Purpose |
| --- | --- |
| scripts/cache_aware_proxy.py | Unified global scheduler (combined + PD-sep modes) |
| scripts/sample_trace.py | Trace sampler preserving sessions + hash_ids |
| replayer/ | Async trace replayer with streaming metrics |
| scripts/compute_roofline.py | Prefill/decode roofline analysis |
| scripts/analyze_cache_hit.py | Theoretical vs actual KV cache hit ratio |
| scripts/analyze_breakdown.py | Per-request stage breakdown from proxy |
| scripts/gpu_monitor.sh | 5s-interval GPU utilization sampling |

Reproducing

# On dash0, activate env
cd ~/agentic-kv && source .venv/bin/activate

# Sample trace
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42

# Combined TP=1 DP=8 + cache-aware scheduler
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL \
        --port $((8000+i)) --tp 1 --enable-prefix-caching --enforce-eager &
done
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8

# Breakdown data
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin