agentic-kvc

Go to file

Gahow Wang 1b9268ba4c P2P prefill offload: TTFT p50 -13% but p90 +59% (median-vs-tail tradeoff)

Fixed race condition in P instance selection (all going to inst_0).
P2P design: HEAVY requests prefill on least-loaded OTHER instance,
KV transfer via Mooncake, decode on session-sticky instance.

Result (200 req, fresh restart, vs baseline):
  TTFT p50: 1.080 -> 0.939 (-13%)   <- median improves (decode not disrupted)
  TTFT p90: 9.410 -> 14.987 (+59%)  <- tail worsens (KV transfer on large req)
  TPOT p90: 0.076 -> 0.075 (-1%)    <- unchanged (not the bottleneck)
  E2E p50: 5.306 -> 5.565 (+5%)     <- slightly worse overall

The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because
their instance isn't blocked by a heavy prefill) but hurts HEAVY requests
(extra KV transfer latency). This is a median-vs-tail tradeoff.

For SLOs targeting p50: P2P offload helps.
For SLOs targeting p90/p99: baseline combined is better.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 12:28:24 +08:00

analysis

Update report: adaptive v2 confirms no KV transfer helps single-machine

2026-05-22 10:15:08 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Balanced session-sticky routing + agentic workload pattern analysis

2026-05-22 01:50:27 +08:00

scripts

P2P prefill offload: TTFT p50 -13% but p90 +59% (median-vs-tail tradeoff)