Breakdown profiling at proxy level captures:
t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token
Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill.
Root cause: decode instance KV cache memory saturation (97.1% usage).
With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache.
Large agentic requests (avg 33.6k tokens) fill this quickly.
Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache
to be freed by large requests completing decode.
vLLM log confirms: Running=0, Waiting=6, KV cache=97.1%
GPU is idle but requests queue for KV cache memory, not compute.
This is the fundamental bottleneck of single-machine PD separation
for long-context agentic workloads: concentrating decode onto fewer
GPUs creates a KV cache memory wall.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>