Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073s to 0.106s (+45%), with MEDIUM TPOT at 0.197s (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.
Elastic RDMA at 16 sessions doesn't help: only 13/500 requests were offloaded
(the cache gate behaves correctly for cold turn-1), and kv_both adds ~16%
TPOT overhead at high concurrency.
Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.
Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.
H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.
Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks all hypotheses tested during elastic PD disaggregation research:
- H1 (kv_both overhead): REJECTED — zero overhead at idle
- H2 (PS cold prefill): REJECTED — PS slower than cached C
- H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117%
- H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY
- H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer
- H6 (session migration): TODO — verify D's APC after migration
Key insight: the offload decision should be cache-aware (new_tokens),
not size-based (total_input). An 80k-token request with a 90% cache hit
needs only 8k tokens of prefill.
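
A minimal sketch of the cache-aware decision, for illustration only. The
function names and the 20k-token threshold are assumptions (the threshold is
borrowed from the HEAVY definition used elsewhere in this log), not the
proxy's actual code:

```python
def new_tokens(total_input: int, cached_tokens: int) -> int:
    """Tokens that still need prefill after prefix-cache hits."""
    return max(total_input - cached_tokens, 0)


def should_offload(total_input: int, cached_tokens: int,
                   heavy_threshold: int = 20_000) -> bool:
    """Gate on uncached tokens, not on raw input size (threshold is assumed)."""
    return new_tokens(total_input, cached_tokens) > heavy_threshold


# 80k-token request with 90% of its prefix cached: only 8k tokens to prefill.
warm = should_offload(total_input=80_000, cached_tokens=72_000)
# The same request with a cold cache needs the full 80k-token prefill.
cold = should_offload(total_input=80_000, cached_tokens=0)
```

A size-based gate would treat both requests identically; the cache-aware
gate only flags the cold one.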
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.
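
The gate above can be sketched roughly as follows. The load metric (for
example, pending prefill tokens), the choice of which instances the average
is taken over, and all names here are assumptions made for illustration:

```python
from statistics import mean


def elastic_gate(p_load: float, d_load: float, all_loads: list[float],
                 overload_factor: float = 1.5) -> bool:
    """Return True if a HEAVY prefill may be offloaded to instance P.

    p_load / d_load: current load of the candidate P and the session's D.
    all_loads: loads of every instance, used for the average (assumed scope).
    """
    less_loaded_than_d = p_load < d_load
    not_overloaded = p_load < overload_factor * mean(all_loads)
    return less_loaded_than_d and not_overloaded


# P is lighter than D and well under 1.5x the average: offload.
ok = elastic_gate(p_load=10, d_load=50, all_loads=[10, 20, 30])
# P is busier than D: keep the prefill on D.
busy = elastic_gate(p_load=60, d_load=50, all_loads=[60, 20, 30])
# P is lighter than D but above 1.5x the average: keep it on D.
hot = elastic_gate(p_load=40, d_load=50, all_loads=[40, 10, 10])
```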
Result (67/200 processed, 75% success):
TTFT p50: 0.551s (-49% vs baseline 1.080s)
TTFT p90: 4.135s (-56% vs baseline 9.410s)
TPOT p90: 0.074s (same as baseline)
E2E p50: 2.938s (-45% vs baseline 5.306s)
25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.
Also: added external_prefix_cache metrics tracking to replayer summary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of the 10.1pp APC gap: multi-turn sessions' KV is evicted between
turns by cold-start prefills (66% of the loss). The inter-turn gap is only 2
requests at p50, but the LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.
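
The capacity shortfall can be checked with simple arithmetic. This sketch
assumes every session needs its full 93 blocks resident and that sessions
share no blocks, which is a simplification:

```python
CACHE_BLOCKS = 550        # LRU cache capacity, from the analysis above
BLOCKS_PER_SESSION = 93   # KV blocks needed per multi-turn session


def sessions_that_fit(cache_blocks: int, blocks_per_session: int) -> int:
    """Number of sessions whose KV fits in the cache at the same time."""
    return cache_blocks // blocks_per_session


def blocks_needed(sessions: int, blocks_per_session: int) -> int:
    """Total blocks required to keep every session's KV resident."""
    return sessions * blocks_per_session


fit = sessions_that_fit(CACHE_BLOCKS, BLOCKS_PER_SESSION)   # 5 sessions
need_low = blocks_needed(14, BLOCKS_PER_SESSION)            # 1302 blocks
need_high = blocks_needed(21, BLOCKS_PER_SESSION)           # 1953 blocks
```

Under these assumptions only 5 sessions fit, while 14-21 concurrent sessions
would need roughly 2.4x to 3.6x the available blocks.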
Three approaches designed:
A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
B. Two-tier KV cache: GPU + DRAM offload via Mooncake
C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)
Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.
This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.
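
A rough sketch of the classification step. Only the 20k-token HEAVY cutoff
comes from the description above; the WARM/MEDIUM boundary (2k here) and all
names are hypothetical placeholders:

```python
from enum import Enum


class RequestClass(Enum):
    WARM = "WARM"
    MEDIUM = "MEDIUM"
    HEAVY = "HEAVY"


def classify(estimated_new_tokens: int,
             medium_threshold: int = 2_000,    # hypothetical boundary
             heavy_threshold: int = 20_000) -> RequestClass:
    """Classify by tokens left to prefill after prefix-cache hits."""
    if estimated_new_tokens > heavy_threshold:
        return RequestClass.HEAVY
    if estimated_new_tokens > medium_threshold:
        return RequestClass.MEDIUM
    return RequestClass.WARM


def needs_offload(estimated_new_tokens: int) -> bool:
    """Only HEAVY requests leave their co-located instance."""
    return classify(estimated_new_tokens) is RequestClass.HEAVY
```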
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).
Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (arithmetic
intensity >1000x vs <2 for decode), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation
Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>