With balanced session-sticky routing:
LRU APC = 49.2% (only 1.8pp below infinite 51.0%)
LFU APC = 43.5% (worse than LRU!)
SessionProtLRU = 49.0% (no improvement)
The previous 10.1pp gap was from routing imbalance (all traffic to inst_0),
not from cache eviction policy. Balanced routing recovers 5.9pp of the gap.
Multi-turn sessions reach 80.1% APC with plain LRU + session-sticky routing
because the inter-turn gap is only 2 requests, so LRU naturally keeps each
session's KV blocks warm.
Conclusion: fix routing balance, not cache policy.
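
A toy illustration of why routing balance matters more than eviction policy.
This is a minimal sketch, not the simulator used for the numbers above: the
class and function names, the 4-block capacity, and the two-session trace are
all made up for illustration. New sessions go to the instance with the fewest
sessions, and later turns stick to that instance.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Toy per-instance prefix cache keyed by block id, evicting LRU blocks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block_ids):
        # Count the leading run of blocks that are already cached (prefix hit).
        hits = 0
        for b in block_ids:
            if b not in self.blocks:
                break
            hits += 1
        # Touch or insert every block, evicting the least recently used one.
        for b in block_ids:
            self.blocks[b] = True
            self.blocks.move_to_end(b)
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)
        return hits

def simulate(requests, n_instances, capacity):
    """Replay (session_id, block_ids) requests; return (hit_blocks, total_blocks)."""
    caches = [LRUBlockCache(capacity) for _ in range(n_instances)]
    sessions_per_instance = [0] * n_instances
    assignment = {}  # session_id -> instance index (sticky)
    hits = total = 0
    for session_id, block_ids in requests:
        if session_id not in assignment:
            inst = sessions_per_instance.index(min(sessions_per_instance))
            assignment[session_id] = inst
            sessions_per_instance[inst] += 1
        hits += caches[assignment[session_id]].access(block_ids)
        total += len(block_ids)
    return hits, total

# Two interleaved sessions; turn 2 extends turn 1's prefix by one block.
requests = [
    ("s0", ["a1", "a2", "a3"]),
    ("s1", ["b1", "b2", "b3"]),
    ("s0", ["a1", "a2", "a3", "a4"]),
    ("s1", ["b1", "b2", "b3", "b4"]),
]
balanced = simulate(requests, n_instances=2, capacity=4)
single = simulate(requests, n_instances=1, capacity=4)
print("balanced:", balanced, "single instance:", single)
```

With the same LRU policy and per-instance capacity, the single-instance run
evicts each session's prefix before its next turn, while the balanced run keeps
it resident.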
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of the 10.1pp APC gap: multi-turn sessions' KV is evicted between
turns by cold-start prefills (66% of the loss). The inter-turn gap is only 2
requests at p50, but the LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.
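
The capacity argument above can be checked with back-of-the-envelope
arithmetic. The sketch below uses only the figures quoted in this message; the
variable names are illustrative.

```python
CACHE_BLOCKS = 550        # LRU cache size quoted above
BLOCKS_PER_SESSION = 93   # KV blocks one multi-turn session needs

# Blocks needed to keep every concurrent session resident.
demand = {n: n * BLOCKS_PER_SESSION for n in (14, 21)}
# Whole sessions that fit in the cache at once.
max_resident_sessions = CACHE_BLOCKS // BLOCKS_PER_SESSION

for n, blocks in demand.items():
    print(f"{n} sessions need {blocks} blocks "
          f"({blocks / CACHE_BLOCKS:.1f}x the cache)")
print(f"at most {max_resident_sessions} sessions fit at once")
```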
Three approaches designed:
A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
B. Two-tier KV cache: GPU + DRAM offload via Mooncake
C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)
Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to instance with least decode load; WARM/MEDIUM
route by cache-hit + token-level LB as before.
Result: no significant difference vs baseline on single-machine combined mode.
TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)
Per-class TTFT breakdown shows the optimization target:
WARM (75 req): p50=0.198s (cache hit, nearly free)
MEDIUM (72 req): p50=1.356s
HEAVY (54 req): p50=7.124s (36x slower than WARM)
Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).
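
A minimal sketch of the routing rule described above, for readers who do not
have cache_aware_proxy.py open. It is not the proxy's code: `Instance`,
`pick_instance`, and the load fields are hypothetical, and combining cache hit
and token-level load as a lexicographic key is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    decode_load: int     # assumed metric: in-flight decode work
    pending_tokens: int  # assumed metric: queued tokens for load balancing

def pick_instance(prompt_tokens, cached_prefix, instances, heavy_threshold):
    """cached_prefix maps instance name -> tokens of this prompt cached there."""
    def new_tokens(inst):
        return max(prompt_tokens - cached_prefix.get(inst.name, 0), 0)

    # HEAVY: even the best cache hit leaves >= threshold new tokens to prefill.
    if min(new_tokens(i) for i in instances) >= heavy_threshold:
        return min(instances, key=lambda i: i.decode_load)
    # WARM/MEDIUM: prefer the best cache hit, break ties by token-level load.
    return min(instances, key=lambda i: (new_tokens(i), i.pending_tokens))

instances = [Instance("a", decode_load=10, pending_tokens=0),
             Instance("b", decode_load=2, pending_tokens=5)]
heavy_pick = pick_instance(30_000, {}, instances, heavy_threshold=20_000)
warm_pick = pick_instance(1_000, {"a": 900}, instances, heavy_threshold=20_000)
print(heavy_pick.name, warm_pick.name)
```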
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.
This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.
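
The classification step might look like the sketch below. Only the >20k
new-token HEAVY cutoff comes from this message; the WARM/MEDIUM boundary and
the function names are assumptions for illustration.

```python
HEAVY_MIN_NEW_TOKENS = 20_000  # from the text: HEAVY is >20k new tokens
WARM_MAX_NEW_TOKENS = 1_000    # assumed boundary, not taken from the source

def classify(prompt_tokens, cached_prefix_tokens):
    """Classify by estimated new tokens left after the prefix-cache hit."""
    new_tokens = max(prompt_tokens - cached_prefix_tokens, 0)
    if new_tokens > HEAVY_MIN_NEW_TOKENS:
        return "HEAVY"
    if new_tokens <= WARM_MAX_NEW_TOKENS:
        return "WARM"
    return "MEDIUM"

def should_offload(request_class):
    # Only HEAVY prefills leave the co-located path; the rest need no KV transfer.
    return request_class == "HEAVY"

for cached in (0, 20_000, 33_000):
    cls = classify(33_600, cached)
    print(cached, cls, should_offload(cls))
```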
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:
vllm/v1/core/sched/scheduler.py:
Replace fatal assert with graceful skip when KV transfer callback
arrives for an already-aborted request during PD disaggregated serving.
Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.
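
The shape of the fix (skip instead of assert) can be shown with a toy
scheduler. This is not vLLM's code and does not use vLLM's names; `ToyScheduler`
and its methods are hypothetical and only illustrate the race.

```python
import logging

logger = logging.getLogger("toy_scheduler")

class ToyScheduler:
    """Hypothetical stand-in for a scheduler tracking requests awaiting KV."""
    def __init__(self):
        self.requests = {}  # request_id -> state

    def add(self, request_id):
        self.requests[request_id] = "WAITING_FOR_KV"

    def abort(self, request_id):
        self.requests.pop(request_id, None)

    def on_kv_transfer_done(self, request_id):
        # Before the fix this would have been: assert request_id in self.requests
        if request_id not in self.requests:
            # The request was aborted while its KV transfer was in flight.
            logger.debug("KV transfer finished for aborted request %s", request_id)
            return False
        self.requests[request_id] = "READY"
        return True

sched = ToyScheduler()
sched.add("r1")
sched.abort("r1")                         # client disconnects mid-transfer
late = sched.on_kv_transfer_done("r1")    # late callback is skipped, no crash
sched.add("r2")
normal = sched.on_kv_transfer_done("r2")
print(late, normal)
```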
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
patches/0001-fix-kv-transfer-abort-race.patch:
Fix scheduler assert crash when KV transfer callback arrives
after request abort in PD-disaggregated serving.
patches/README.md:
How to apply patches to source tree or installed package.
Per-patch description with problem/fix/impact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Breakdown profiling at proxy level captures:
t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token
Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill.
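
A sketch of how the five timestamps could be turned into a per-phase TTFT
share. The grouping of intervals into phases is an assumption, and the sample
timestamps are invented for illustration, not taken from the trace.

```python
def ttft_breakdown(ts):
    """Split TTFT into proxy, prefill, and kv+decode shares from proxy timestamps."""
    ttft = ts["t_first_token"] - ts["t_proxy_recv"]
    prefill = ts["t_prefill_done"] - ts["t_prefill_sent"]
    # Proxy overhead: routing before prefill plus hand-off before decode.
    proxy = (ts["t_prefill_sent"] - ts["t_proxy_recv"]) + \
            (ts["t_decode_sent"] - ts["t_prefill_done"])
    # Everything after the decode request is sent: KV transfer, queuing, first token.
    kv_decode = ts["t_first_token"] - ts["t_decode_sent"]
    return {
        "ttft_s": ttft,
        "proxy_share": proxy / ttft,
        "prefill_share": prefill / ttft,
        "kv_decode_share": kv_decode / ttft,
    }

sample = {  # illustrative values only
    "t_proxy_recv": 0.00,
    "t_prefill_sent": 0.01,
    "t_prefill_done": 0.11,
    "t_decode_sent": 0.12,
    "t_first_token": 1.00,
}
breakdown = ttft_breakdown(sample)
for key, value in breakdown.items():
    print(f"{key}: {value:.3f}")
```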
Root cause: decode instance KV cache memory saturation (97.1% usage).
With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache.
Large agentic requests (avg 33.6k tokens) fill this quickly.
Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache
to be freed by large requests completing decode.
vLLM log confirms: Running=0, Waiting=6, KV cache=97.1%
GPU is idle but requests queue for KV cache memory, not compute.
This is the fundamental bottleneck of single-machine PD separation
for long-context agentic workloads: concentrating decode onto fewer
GPUs creates a KV cache memory wall.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added --fire-and-forget flag to cache_aware_proxy.py for async prefill dispatch.
Results on 6P+2D config:
  Await: TTFT=1.48s  TPOT=0.066s  E2E=5.95s  94% success
  FnF:   TTFT=5.32s  TPOT=0.037s  E2E=11.9s  85% success
Fire-and-forget improves TPOT by 44% (pipeline overlap) but raises TTFT by
260% (the decode instance waits for KV internally, which is less efficient
than awaiting at the proxy) and drops the success rate from 94% to 85%
because of KV race conditions.
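
The percentages follow directly from the two result rows. The helper below is
only a worked check of that arithmetic and is not taken from
analyze_ablations.py.

```python
def pct_change(before, after):
    """Relative change from the Await run to the fire-and-forget run, in percent."""
    return (after - before) / before * 100

tpot = pct_change(0.066, 0.037)  # TPOT: lower is better
ttft = pct_change(1.48, 5.32)    # TTFT: lower is better
e2e = pct_change(5.95, 11.9)     # E2E latency

print(f"TPOT {tpot:+.1f}%  TTFT {ttft:+.1f}%  E2E {e2e:+.1f}%")
```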
Full 4-way ablation summary in analyze_ablations.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6P+2D gives more GPUs to prefill, fewer to decode:
- Decode util: 7.8% (4D) -> 19.0% (2D), less waste
- TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing
- But Combined (30.5% util, TTFT 1.01s) still best overall
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).
Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds 72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (arithmetic
  intensity, AI, >1000x vs decode AI <2), but absolute FLOPs drop 71% from
  cache hits
- For agentic MoE workloads, cache-aware routing > PD separation
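
To make the prefill-vs-decode arithmetic-intensity gap concrete, here is a
deliberately simplified model: one dense fp16 GEMM of n_tokens x d_model
activations against a d_model x d_model weight. It is not the roofline tool
from this repo, it ignores attention and MoE routing, and the hidden size and
token counts are assumptions chosen for illustration.

```python
def gemm_arithmetic_intensity(n_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for (n_tokens x d_model) @ (d_model x d_model) in fp16."""
    flops = 2 * n_tokens * d_model * d_model
    # Read the input activations and weights once, write the output once.
    elems_moved = 2 * n_tokens * d_model + d_model * d_model
    return flops / (bytes_per_elem * elems_moved)

D_MODEL = 4096                                        # assumed hidden size
decode_ai = gemm_arithmetic_intensity(1, D_MODEL)        # one token per step
prefill_ai = gemm_arithmetic_intensity(20_000, D_MODEL)  # cold 20k-token prefill

print(f"decode AI  ~ {decode_ai:.2f} FLOPs/byte")
print(f"prefill AI ~ {prefill_ai:.0f} FLOPs/byte")
```

In this model decode is dominated by reading the weights, so its intensity
stays near 1 FLOP/byte, while a long prefill amortizes the same weights over
many tokens.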
Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>