Gahow Wang fc92410ec9 Invalidate prior A/B results + add proper experiment harness

Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:54:21 +08:00

PD Disaggregation for Agentic LLM Workloads: A Systematic Study

TL;DR

We benchmarked PD separation (prefill-decode disaggregation) against PD co-location on a production agentic-coder trace (GLM-5.1, 2.1M requests, avg 33.6k input tokens). Under a fair comparison with the same cache-aware global scheduler:

PD separation is net negative for single-machine agentic workloads. The root cause is not what prior work (DistServe, Splitwise) targeted — it is a KV cache memory wall on decode instances.

Config (TP=1, 8×H20)	TTFT p50	TPOT p90	GPU util	KV cache pressure
Combined DP=8 (cache-aware)	0.731s	0.073s	30.5%	Low (spread across 8 inst)
PD-Sep 6P+2D (cache-aware)	1.481s	0.077s	16.9%	97.1% on decode

Per-request breakdown shows 87.7% of TTFT is spent waiting for KV cache memory on decode instances, not prefill compute or KV transfer.

Elastic P2P offload (selective disaggregation of HEAVY requests only, Mooncake kv_both): under fair same-machine fresh-restart comparison, elastic does NOT improve over baseline. Mooncake kv_both memory overhead outweighs prefill isolation benefit at moderate load.

Config (TP=1, 8×H20, fresh)	TTFT p50	TPOT p90	E2E p50
Combined DP=8 (baseline)	1.075s	0.076s	5.075s
Elastic P2P (kv_both, cap=4)	1.018s	0.085s	6.977s

Earlier cross-machine comparison (commit 1e86285) was invalidated because the baseline used warm instances. See REPORT.md §3.5.

1. Workload Characterization

Trace: GLM-5.1 Agentic Coder, production cluster, 2 hours

Metric	Value
Requests	2,114,220
Input tokens	71.1B (avg 33.6k, p50=20k, p90=88k)
Output tokens	940M (avg 445, p50=80)
I/O ratio	75.6x aggregate, 217.8x per-request median
Prefill token share	98%
Sessions	1.3M (90% single-turn)
>32k input	38% of requests, 79% of tokens

KV cache reuse:

Metric	Value
Theoretical prefix cache hit (infinite, single inst)	71%
Shared hash blocks (ref>1)	47% of unique blocks
Intra-session reuse	57%
Top blocks ref count	64,754 (system prompt)
Actual APC (Combined, cache-aware, 8 inst)	44.7%
Actual APC (Round-robin, 8 inst)	20.8%

Request profile after prefix cache:

Bucket	Share of requests	Avg new tokens to prefill
>90% cache hit (warm)	22%	1,314
50-90% cache hit	14%	10,052
1-50% cache hit	8%	38,909
0% cache hit (cold)	55%	17,696

2. Experiment Setup

Hardware: 8× NVIDIA H20 (96GB HBM, NVLink, 4× ConnectX-7 200Gbps RDMA)

Software: vLLM 0.18.1 (source in third_party/vllm/, patched scheduler assert), Mooncake 0.3.10 (RDMA KV transfer), uv-managed Python venv

Model: Qwen3-Coder-30B-A3B-Instruct (MoE 128E top-8, 3B active params)

Configurations tested (all use same cache-aware + token-level LB global scheduler unless noted):

Config	Instances	GPU allocation	Scheduler
Combined TP=8 DP=1	1	8 GPU shared	N/A (single)
Combined TP=2 DP=4	4 independent	2 GPU each	RR (legacy)
Combined TP=1 DP=8	8 independent	1 GPU each	RR / cache-aware
PD-Sep TP=1 4P+4D	4P + 4D Mooncake	4 GPU P, 4 GPU D	cache-aware
PD-Sep TP=1 6P+2D	6P + 2D Mooncake	6 GPU P, 2 GPU D	cache-aware

Benchmark params: 1000 sampled requests (200 for ablations), --enforce-eager, --max-model-len 200000

Trace sampler: scripts/sample_trace.py — random session sampling preserving multi-turn structure + hash_ids

Global scheduler: scripts/cache_aware_proxy.py — supports both --combined (PD-colo) and --prefill/--decode (PD-sep) modes. Score = ongoing_tokens/avg_load - α·cache_hit_ratio, session affinity for multi-turn.
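The scoring rule can be sketched as below. This is a minimal illustration, not the code in cache_aware_proxy.py: it assumes the lowest score wins and that avg_load is the mean ongoing_tokens across instances. The `Instance` class, the `ALPHA` value, and `pick_instance` are hypothetical names, and session affinity is left out.

```python
from dataclasses import dataclass

ALPHA = 1.0  # hypothetical weight on cache hit ratio; the real value is not stated here


@dataclass
class Instance:
    name: str
    ongoing_tokens: int      # tokens currently in flight on this instance
    cache_hit_ratio: float   # estimated prefix-cache hit ratio for the request, 0..1


def score(inst: Instance, avg_load: float, alpha: float = ALPHA) -> float:
    """Score = ongoing_tokens/avg_load - alpha * cache_hit_ratio (lower is better, by assumption)."""
    load_term = inst.ongoing_tokens / avg_load if avg_load > 0 else 0.0
    return load_term - alpha * inst.cache_hit_ratio


def pick_instance(instances: list[Instance], alpha: float = ALPHA) -> Instance:
    """Return the instance with the lowest score for this request."""
    avg_load = sum(i.ongoing_tokens for i in instances) / len(instances)
    return min(instances, key=lambda i: score(i, avg_load, alpha))


instances = [
    Instance("inst_0", ongoing_tokens=40_000, cache_hit_ratio=0.9),
    Instance("inst_1", ongoing_tokens=10_000, cache_hit_ratio=0.0),
    Instance("inst_2", ongoing_tokens=10_000, cache_hit_ratio=0.5),
]
chosen = pick_instance(instances)
print(chosen.name)
```

In this toy input the busiest instance has the best cache hit ratio but still loses to a lightly loaded instance with a partial hit, which is the load/cache trade-off that α controls.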

3. Results

3.1 Main Comparison (unified cache-aware scheduler)

Config	OK/N	TTFT p50	TPOT p90	E2E p50	APC
Combined TP=1 DP=8 (cache-aware)	997/999	0.731s	0.073s	4.48s	44.7%
PD-Sep TP=1 4P+4D (cache-aware)	509/564	1.261s	0.074s	5.61s	40.2%
Combined TP=1 DP=8 (RR)	997/999	1.836s	0.086s	6.67s	20.8%

3.2 GPU Utilization (200 req, time_scale=20)

Config	All GPU mean	Prefill GPU	Decode GPU	Decode KV cache
Combined 8colo	30.5% (active 64%)	—	—	Distributed
PD-Sep 4P+4D	12.4% (active 24%)	16.9% (active 17%)	7.8% (active 30%)	~97%
PD-Sep 6P+2D	16.9% (active 28%)	16.2% (active 16%)	19.0% (active 64%)	~97%

3.3 Per-Request Breakdown (6P+2D, await mode)

Stage	p50	% of TTFT
Prefill (queue + compute + KV push)	0.108s	12.3%
Proxy overhead	0.000s	0.0%
KV pull + decode wait	109.6s	87.7%
Total TTFT	110.2s	100%

Root cause of 109.6s kv+decode: vLLM decode log shows Running: 0 reqs, Waiting: 6 reqs, KV cache: 97.1%. GPU idle, requests queued for KV cache memory.

3.4 Ablations

Ablation	Change	TTFT	TPOT p90	Verdict
P/D ratio: 6P+2D vs 4P+4D	More prefill GPUs	-26%	~same	Helps TTFT (less prefill queue)
Fire-and-forget vs await	Async prefill dispatch	+260%	-44%	Hurts (decode KV cache contention)

4. Analysis

4.1 DistServe's Assumptions vs Agentic Reality

Assumption	Chatbot (DistServe)	Agentic (this work)
A. P is compute-bound, D is memory-bound	✅	✅ Even at 95% reuse, prefill AI >1000x vs decode AI <2
B. PD co-location causes interference	✅	❌ Cache-aware routing eliminates interference (TPOT 0.073 vs 0.074)
C. KV transfer cost negligible	✅ (short input)	❌ Avg 33.6k tokens, TTFT +72% from transfer
D. Dedicated prefill improves throughput	✅	❌ 71% cache hit → prefill already lightweight
E. Decode KV cache not a bottleneck	✅ (short context)	❌ THE bottleneck: 97% KV cache on decode

4.2 Roofline: Prefill Stays Compute-Bound Under High Cache Reuse

SeqLen=64k, Model=Qwen3-30B-A3B MoE, GPU=H20 (ridge point=37 FLOP/byte)

Reuse%   NewTokens   AI (FLOP/byte)   Bound        vs Decode
0%       64,000      40,758           COMPUTE      26,813x
70%      19,200      20,610           COMPUTE      13,559x
90%       6,400       8,544           COMPUTE       5,621x
95%       3,200       4,549           COMPUTE       2,993x
Decode        1         1.5           MEMORY            1x

Even at 95% reuse, prefill AI = 4549 >> ridge point 37. Prefill remains compute-bound because Q×K^T attention scales with new_tokens × seq_len (each new token still attends over the full context, cached prefix included).

But absolute FLOPs drop: 71% cache → only 29% of tokens need compute. This makes P-D interference negligible without physical separation.
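The bound classification in the roofline table follows from comparing each arithmetic intensity against the ridge point. A small sketch using only the numbers listed above (the `bound` helper is a hypothetical name, not part of compute_roofline.py):

```python
RIDGE_POINT = 37.0  # FLOP/byte, H20 ridge point as stated above

# reuse % -> arithmetic intensity (FLOP/byte), copied from the roofline table
PREFILL_AI = {0: 40_758, 70: 20_610, 90: 8_544, 95: 4_549}
DECODE_AI = 1.5


def bound(ai: float, ridge: float = RIDGE_POINT) -> str:
    """A kernel whose AI exceeds the ridge point is compute-bound, otherwise memory-bound."""
    return "COMPUTE" if ai > ridge else "MEMORY"


for reuse, ai in PREFILL_AI.items():
    print(f"reuse={reuse}%  AI={ai}  bound={bound(ai)}  margin={ai / RIDGE_POINT:.0f}x ridge")
print(f"decode  AI={DECODE_AI}  bound={bound(DECODE_AI)}")
```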

4.3 The Real Bottleneck: Decode KV Cache Memory Wall

PD separation concentrates all decode onto fewer GPUs:

	Combined (8 inst)	PD-Sep 6P+2D
Decode KV cache total	8 × 28GB = 224GB	2 × 28GB = 56GB
Concurrent decode reqs	~1 per inst	~4 per inst
KV cache utilization	Low	97.1%

At 97.1% KV cache usage, a 49-token request (KV of only a few MB at the same per-token rate) waits 114 seconds for a 64k-token request to finish decode and release its ~8GB of KV cache.

This is memory-capacity head-of-line blocking: the GPU is idle (Running: 0), but cannot schedule new requests because KV cache is full.
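The capacity arithmetic behind the table can be checked with a short sketch. It uses only the figures quoted above (28GB KV pool per instance, ~8GB for one 64k-token request); the helper names are hypothetical.

```python
KV_POOL_GB = 28.0         # per-instance KV cache pool, from the table above
KV_PER_64K_REQ_GB = 8.0   # approximate KV footprint of one 64k-token request, from the text


def total_decode_pool_gb(decode_instances: int) -> float:
    """Total KV memory available to decode across the given number of instances."""
    return decode_instances * KV_POOL_GB


def max_resident_64k(pool_gb: float = KV_POOL_GB) -> int:
    """How many 64k-token requests fit in one instance's KV pool at the same time."""
    return int(pool_gb // KV_PER_64K_REQ_GB)


combined_gb = total_decode_pool_gb(8)  # combined mode: all 8 instances decode
pd_sep_gb = total_decode_pool_gb(2)    # 6P+2D: only the 2 decode instances hold decode KV
print(f"combined={combined_gb:.0f}GB  pd_sep={pd_sep_gb:.0f}GB  "
      f"ratio={combined_gb / pd_sep_gb:.0f}x  64k reqs per instance={max_resident_64k()}")
```

With about three long-context requests filling a decode instance, a fourth arrival at ~4 concurrent requests per instance is enough to push the pool to the observed saturation.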

4.4 Why Cache-Aware Routing Matters More Than PD Separation

Change	TTFT impact	TPOT p90 impact	APC impact
RR → cache-aware routing	-60%	-15%	+24pp
Combined → PD-Sep	+72%	+1%	-5pp

Cache-aware routing provides "soft PD isolation" by reducing per-instance prefill workload through better cache utilization, without the KV transfer overhead or decode memory wall of physical PD separation.
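The deltas in this table can be recomputed from the §3.1 results. A quick check, with the metric values copied from that table (`pct_change` is a hypothetical helper):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the metric went down."""
    return (after - before) / before * 100.0


# TTFT p50 (s), TPOT p90 (s), APC (%) from the 3.1 main comparison
rr = {"ttft": 1.836, "tpot": 0.086, "apc": 20.8}
cache_aware = {"ttft": 0.731, "tpot": 0.073, "apc": 44.7}
pd_sep = {"ttft": 1.261, "tpot": 0.074, "apc": 40.2}

for label, a, b in [("RR -> cache-aware", rr, cache_aware),
                    ("Combined -> PD-Sep", cache_aware, pd_sep)]:
    print(f"{label}: TTFT {pct_change(a['ttft'], b['ttft']):+.1f}%  "
          f"TPOT {pct_change(a['tpot'], b['tpot']):+.1f}%  "
          f"APC {b['apc'] - a['apc']:+.1f}pp")
```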

5. Elastic P2P Offload: Selective PD Disaggregation

5.1 Motivation

Full PD separation fails because it concentrates decode onto fewer GPUs (§4.3). But co-located combined mode still suffers from heavy prefill blocking decode: an 80k-token prefill occupies the GPU for seconds, during which co-resident decode requests stall (baseline TPOT is 0.069s at p50 but 0.117s at p90).

Elastic P2P selectively offloads only HEAVY requests (>20k new tokens after prefix cache) to a different instance for prefill via Mooncake RDMA, while WARM/MEDIUM stay co-located. All 8 instances run kv_role=kv_both — any instance can act as P or D.
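The offload decision can be sketched as a classifier plus a cap check. This is an illustrative sketch only: the 20k threshold and cap=4 come from this document, the 5k WARM/MEDIUM boundary is taken from the per-class table in §5.3.3, and the function names and the behaviour when the cap is reached (stay co-located) are assumptions rather than the proxy's actual code.

```python
HEAVY_THRESHOLD = 20_000   # new tokens after prefix cache, from the text above
WARM_THRESHOLD = 5_000     # WARM/MEDIUM boundary, assumed from the per-class table
MAX_OFFLOAD_INFLIGHT = 4   # cap=4, as in the elastic config


def classify(new_tokens: int) -> str:
    """Bucket a request by the number of tokens it still has to prefill."""
    if new_tokens > HEAVY_THRESHOLD:
        return "HEAVY"
    if new_tokens >= WARM_THRESHOLD:
        return "MEDIUM"
    return "WARM"


def should_offload(new_tokens: int, offloads_inflight: int) -> bool:
    """Offload only HEAVY requests, and only while under the in-flight cap.

    Assumption: a HEAVY request arriving at the cap is prefilled co-located.
    """
    return classify(new_tokens) == "HEAVY" and offloads_inflight < MAX_OFFLOAD_INFLIGHT


for tokens, inflight in [(1_095, 0), (10_879, 0), (34_368, 1), (83_018, 4)]:
    print(tokens, classify(tokens), should_offload(tokens, inflight))
```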

5.2 Fair A/B Comparison

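Both configs: 8 × TP=1 instances, same trace (200 req, time_scale=20, 8 sessions), session-sticky + cache-aware routing. Baseline on dash0, elastic on dash1 (identical H20 ×8 nodes). This is the cross-machine comparison that the TL;DR reports as invalidated: the dash0 baseline ran on warm instances rather than from a fresh restart, so the figures in §5.2 through §5.5 should be read against the same-machine results in the TL;DR.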

Config	OK/N	TTFT p50	TTFT p90	TPOT p50	TPOT p90	E2E p50
Baseline (combined)	198/200	2.383s	27.622s	0.069s	0.117s	10.232s
Elastic P2P (cap=4)	185/196	1.315s	13.179s	0.066s	0.075s	5.708s
Delta		-45%	-52%	-4%	-36%	-44%

5.3 System-Level Breakdown

5.3.1 KV Cache Hit Ratio

Baseline suffers from extreme APC skew — some instances accumulate hot sessions, others get cold traffic:

Instance	Baseline APC	Elastic prefix APC	Elastic external APC	Elastic effective
inst_0	48.6%	37.8%	31.6%	69.4%
inst_3	3.8%	36.6%	34.2%	70.8%
inst_7	68.3%	25.0%	0.0%	25.0%
APC std	~33pp	~7pp (prefix only)

Key observations:

Baseline APC is highly skewed (3.8%–68.3% across instances). Instances receiving heavy requests have low APC because heavy requests evict cached prefixes from other sessions.
Elastic achieves more uniform prefix APC (~36–38% per instance) because heavy prefills are offloaded to P instances, preserving D-instance cache chains.
Mooncake external cache adds 30-34pp on instances that receive offloaded decode, giving effective APC of ~70% on active decode instances.
Elastic's effective cache reuse is higher because the D instance retains the full prefix chain — when the next turn of the same session arrives, it hits the local prefix cache (not requiring another transfer).

5.3.2 Success Rate

Config	OK	Total	Rate	Error input p50
Baseline	198	200	99.0%	—
Elastic	185	196	94.4%	~60k+ tokens

Elastic's lower success rate (94.4% vs 99%) comes from Mooncake transfer timeouts on the largest HEAVY requests. The 11 failed requests (185 OK of 196) and 4 missing (196 vs 200 dispatched) have input >60k tokens. Survivorship bias check: elastic's OK request set has comparable input distribution to baseline (p90 coverage similar), so latency improvement is not an artifact of dropping large requests.

5.3.3 Per-Class TTFT Breakdown (Combined Baseline)

Class	Count	%	Input p50	TTFT p50	TTFT p90	TPOT p90
WARM (<5k)	46	23%	1,095	0.133s	0.260s	0.060s
MEDIUM (5-20k)	50	25%	10,879	0.873s	1.808s	0.074s
HEAVY (20-50k)	64	32%	34,368	2.589s	6.302s	0.073s
HEAVY (>50k)	38	19%	83,018	9.563s	30.480s	0.096s

HEAVY requests (>20k) constitute 51% of requests but dominate tail latency. A single 80k-token prefill takes ~5-10s of GPU compute, during which co-located decode requests are blocked by chunked prefill interleaving.

Elastic offloads precisely these HEAVY requests (≥20k new tokens) to a different instance, so the D instance's decode pipeline is never blocked by large prefills. This is the primary mechanism behind the -36% TPOT p90 improvement.

5.3.4 GPU Utilization

Config	Mean	Std	Min	Max	Imbalance
Baseline	28.7%	~6%	20%	38%	1.9×
Elastic	15.8%	~8%	7.6%	30.4%	4.0×

5.4 Why Elastic Wins Despite Worse GPU Utilization Balance

This is the central paradox: elastic uses 45% less GPU (15.8% vs 28.7%) and has worse balance (4.0× vs 1.9× max/min), yet delivers 44% lower E2E latency.

Three mechanisms explain this:

1. Eliminating prefill-decode interference (primary, explains TPOT -36%)

In combined mode, vLLM uses chunked prefill to interleave prefill and decode. When an 80k-token HEAVY request arrives on an instance, even with chunked prefill, decode steps are delayed by prefill chunks (each chunk consumes the GPU for tens of ms). This manifests as TPOT p90 = 0.117s in baseline vs 0.075s in elastic, a 36% reduction.

Elastic achieves this by routing HEAVY prefills to a different instance. The D instance only handles WARM/MEDIUM prefills (which are small and fast) plus decode, so its decode pipeline is never disrupted.

2. Better effective cache utilization (explains TTFT -45%)

Baseline's APC is skewed (3.8%–68.3%). Elastic's Mooncake transfer gives D instances access to KV blocks computed on P instances, achieving ~70% effective hit rate on active instances vs baseline's ~40% average. Higher effective APC means less compute per request → lower TTFT.

More importantly: when a HEAVY request's prefill happens on a P instance, the D instance's prefix cache is preserved. In baseline, an 80k-token prefill on the D instance evicts other sessions' cached prefixes, causing future requests to that instance to miss cache.

3. Higher per-request efficiency offsets lower aggregate utilization

Baseline's 28.7% GPU utilization includes wasted work: prefill compute on tokens that would have been cached if the cache hadn't been evicted by other heavy prefills on the same instance. Elastic's 15.8% represents more useful work per GPU cycle because:

Fewer cache misses → less redundant prefill compute
Less prefill-decode contention → decode finishes faster → KV cache freed sooner
The "idle" GPUs in elastic are instances waiting for their next session turn — they're idle because work finished faster, not because they're underutilized

The GPU utilization gap (28.7% vs 15.8%) is almost entirely explained by the 44% shorter E2E: the same work completes in 56% of the time, so utilization averaged over the replay window is lower.

5.5 GPU Load Imbalance: Root Cause and Improvement

The 4.0× imbalance in elastic (7.6% min vs 30.4% max) has two root causes:

Root cause 1: P-instance concentration. The current offload routing picks p_inst = min(candidates, key=ongoing_tokens) — the globally least-loaded instance excluding D. With MAX_OFFLOAD_INFLIGHT=4, at most 4 P instances are busy at once, but session-sticky routing means some instances consistently receive more sessions than others, making some consistently busier as D and rarely chosen as P.

Root cause 2: Session skew. Some sessions have many turns with large inputs (e.g., session 19787: turns at 62k, 74k). The instance pinned to such a session is consistently loaded, while instances pinned to short single-turn sessions go idle quickly.

Proposed improvement: Round-robin P-instance selection with session awareness

Current:  p_inst = argmin(ongoing_tokens) excluding d_inst
Proposed: p_inst = round_robin(all_instances excluding d_inst),
          skip if p_inst.ongoing_tokens > 2 * avg_load

This distributes P-role work evenly across all non-D instances instead of always picking the least loaded (which concentrates P work on the same few idle instances). The overload gate prevents routing to an already-saturated instance.
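The proposed selection rule can be sketched as follows. This is a hypothetical illustration of the pseudocode above, not code from cache_aware_proxy.py: the `RoundRobinPSelector` class is an invented name, and returning `None` when every candidate is skipped (so the caller can fall back to co-located prefill) is an assumption.

```python
from typing import Optional


class RoundRobinPSelector:
    """Round-robin P-instance choice that skips the D instance and overloaded instances."""

    def __init__(self, instance_names: list[str], overload_factor: float = 2.0):
        self._names = list(instance_names)
        self._cursor = 0                      # next position to try
        self._overload_factor = overload_factor

    def pick(self, d_inst: str, ongoing_tokens: dict[str, int]) -> Optional[str]:
        avg_load = sum(ongoing_tokens.values()) / len(ongoing_tokens)
        n = len(self._names)
        for step in range(n):
            idx = (self._cursor + step) % n
            name = self._names[idx]
            if name == d_inst:
                continue  # never prefill on the request's own D instance
            if ongoing_tokens[name] > self._overload_factor * avg_load:
                continue  # overload gate: ongoing_tokens > 2 * avg_load by default
            self._cursor = (idx + 1) % n      # advance past the chosen instance
            return name
        return None  # no eligible P instance (assumed: caller keeps the prefill co-located)


selector = RoundRobinPSelector(["inst_0", "inst_1", "inst_2", "inst_3"])
loads = {"inst_0": 100, "inst_1": 100, "inst_2": 1000, "inst_3": 100}
picks = [selector.pick("inst_0", loads) for _ in range(3)]
print(picks)
```

Unlike argmin, consecutive offloads rotate across eligible instances even when one of them stays the least loaded.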

Additionally, adaptive MAX_OFFLOAD_INFLIGHT based on cluster load:

When total ongoing_tokens < threshold: allow more concurrent offloads (e.g., cap=6)
When total ongoing_tokens > threshold: reduce cap (e.g., cap=2) to prevent cascade
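A minimal sketch of the adaptive cap, assuming a single cluster-wide threshold. The function name, the example threshold, and the behaviour at exactly the threshold (treated as high load) are assumptions; the cap values 6 and 2 are the examples given above.

```python
def adaptive_offload_cap(total_ongoing_tokens: int,
                         threshold: int,
                         low_load_cap: int = 6,
                         high_load_cap: int = 2) -> int:
    """Return the MAX_OFFLOAD_INFLIGHT value to use for the current cluster load."""
    if total_ongoing_tokens < threshold:
        return low_load_cap   # light load: allow more concurrent offloads
    return high_load_cap      # heavy load: throttle offloads to prevent cascade


THRESHOLD = 400_000  # hypothetical value; it would need tuning against the real trace
print(adaptive_offload_cap(150_000, THRESHOLD), adaptive_offload_cap(900_000, THRESHOLD))
```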

6. Conclusions

Single-machine PD separation is net negative for agentic workloads due to decode KV cache memory wall
Cache-aware routing is the dominant optimization — improves TTFT by 60%, TPOT by 15%, APC by 24pp
Prefill stays compute-bound even at 95% cache reuse, but absolute compute drops enough to eliminate P-D interference
Elastic P2P offload showed -45% TTFT, -36% TPOT, -44% E2E in the cross-machine A/B of §5, but that comparison was invalidated (warm baseline); under the fair same-machine fresh-restart comparison in the TL;DR, elastic does not improve over baseline
The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency: less redundant prefill, less contention, and faster KV cache turnover
GPU load imbalance (4.0× vs 1.9×) in elastic is caused by P-instance concentration and session skew, and is fixable with round-robin P selection and an adaptive offload cap

7. Patches Applied to vLLM 0.18.1

File	Change	Reason
`v1/core/sched/scheduler.py`	`assert req_id in self.requests` → graceful skip	KV transfer callback races with request abort

Appendix: Experiment Artifacts

Data on dash0 (`~/agentic-kv/outputs/`)

Directory	Config	Requests	Notes
`v18_combined_1000req`	TP=8 DP=1, 16 sess, 120s TO	1000	Baseline with /metrics APC
`exp1_combined_tp2_dp4`	TP=2 DP=4, RR, 8 sess	999	No summary (killed)
`exp2_combined_tp1_dp8`	TP=1 DP=8, cache-aware, 8 sess	999	Unified scheduler baseline
`exp3_pd_sep_tp1_mooncake`	TP=1 4P+4D Mooncake, cache-aware	~560	Multiple iterations
`gpu_ab_combined`	TP=1 DP=8 cache-aware, 200 req	200	GPU util CSV + metrics
`gpu_ab_pdsep`	TP=1 4P+4D cache-aware, 200 req	200	GPU util CSV + metrics
`gpu_ab_6p2d`	TP=1 6P+2D cache-aware, 200 req	200	Ablation 1: P/D ratio
`gpu_ab_6p2d_fnf`	TP=1 6P+2D fire-and-forget, 200 req	67	Ablation 2: scheduling
`breakdown_await`	TP=1 6P+2D await, 50 req	50	Per-stage breakdown
`ab_baseline` (dash0)	TP=1 DP=8 combined, 200 req	200	Fair A/B baseline (§5)
`ab_elastic` (dash1)	TP=1 DP=8 elastic P2P, 200 req	196	Fair A/B elastic (§5)

Trace on dash0

Path	Description
`~/ali-trace/trace-glm5.1/`	Raw production logs (301GB, 4 files × 30min)
`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`	Formatted 2h trace (2.1M requests)
`~/agentic-kv/traces/sampled_1000req_seed42.jsonl`	Sampled 1000 requests for benchmarks

Key Scripts

Script	Purpose
`scripts/cache_aware_proxy.py`	Unified global scheduler (combined + PD-sep modes)
`scripts/sample_trace.py`	Trace sampler preserving sessions + hash_ids
`replayer/`	Async trace replayer with streaming metrics
`scripts/compute_roofline.py`	Prefill/decode roofline analysis
`scripts/analyze_cache_hit.py`	Theoretical vs actual KV cache hit ratio
`scripts/analyze_breakdown.py`	Per-request stage breakdown from proxy
`scripts/gpu_monitor.sh`	5s-interval GPU utilization sampling

Reproducing

# On dash0, activate env
cd ~/agentic-kv && source .venv/bin/activate

# Sample trace
python scripts/sample_trace.py --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/sampled_1000req_seed42.jsonl --target-requests 1000 --seed 42

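# Combined TP=1 DP=8 + cache-aware scheduler
# MODEL must be set to the model path or ID (the study used Qwen3-Coder-30B-A3B-Instruct)
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i vllm serve "$MODEL" \
        --port $((8000+i)) --tensor-parallel-size 1 --enable-prefix-caching --enforce-eager &
done
# Run the proxy in the background so the replayer can start from the same shell
python scripts/cache_aware_proxy.py --combined http://127.0.0.1:800{0..7} --port 9090 &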
python -m replayer --trace traces/sampled_1000req_seed42.jsonl \
    --endpoint http://localhost:9090 --time-scale 10 --max-inflight-sessions 8

# Breakdown data
curl http://localhost:9090/breakdown | python scripts/analyze_breakdown.py /dev/stdin
