Gahow Wang 1b9268ba4c P2P prefill offload: TTFT p50 -13% but p90 +59% (median-vs-tail tradeoff)
Fixed a race condition in P (prefill) instance selection (all selections
were going to inst_0).
P2P design: HEAVY requests are prefilled on the least-loaded OTHER instance,
the KV cache is transferred via Mooncake, and decode runs on the
session-sticky instance.
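
A minimal sketch of the selection rule described above, not the code in this
commit. The names (PrefillRouter, pick_prefill_instance, release) and the
in-flight counter are hypothetical. It also assumes the race came from
reading load without reserving a slot, which the message does not confirm.

```python
import threading


class PrefillRouter:
    """Hypothetical router: picks where a request's prefill should run."""

    def __init__(self, instance_ids):
        # In-flight prefill count per instance (assumed load metric).
        self.loads = {inst: 0 for inst in instance_ids}
        self._lock = threading.Lock()

    def pick_prefill_instance(self, sticky, heavy):
        """Return the instance that should prefill this request.

        Non-heavy requests stay on their session-sticky instance.
        Heavy requests go to the least-loaded OTHER instance.
        """
        with self._lock:
            target = sticky
            if heavy:
                others = [i for i in self.loads if i != sticky]
                if others:
                    target = min(others, key=lambda i: self.loads[i])
            # Reserve the slot while still holding the lock. If selection
            # and reservation were separate steps, concurrent callers could
            # all observe the same loads and pick the same instance.
            self.loads[target] += 1
            return target

    def release(self, inst):
        """Call when the prefill (and any KV transfer) has finished."""
        with self._lock:
            self.loads[inst] -= 1


router = PrefillRouter(["inst_0", "inst_1", "inst_2"])
first = router.pick_prefill_instance(sticky="inst_0", heavy=True)
second = router.pick_prefill_instance(sticky="inst_0", heavy=True)
warm = router.pick_prefill_instance(sticky="inst_0", heavy=False)
```

With the reservation inside the lock, two back-to-back heavy requests land on
different instances instead of piling onto one.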

Result (200 req, fresh restart, vs baseline):
  TTFT p50: 1.080 -> 0.939 (-13%)   <- median improves (decode not disrupted)
  TTFT p90: 9.410 -> 14.987 (+59%)  <- tail worsens (KV transfer on large req)
  TPOT p90: 0.076 -> 0.075 (-1%)    <- unchanged (not the bottleneck)
  E2E p50: 5.306 -> 5.565 (+5%)     <- slightly worse overall
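
The relative changes above can be reproduced from the raw numbers. This is a
small sketch; pct_change and the results dict are illustrative names, and the
values are copied from the table.

```python
def pct_change(before, after):
    """Relative change from baseline to P2P offload, in percent."""
    return (after - before) / before * 100.0


# (baseline, P2P offload) pairs taken from the result table.
results = {
    "TTFT p50": (1.080, 0.939),
    "TTFT p90": (9.410, 14.987),
    "TPOT p90": (0.076, 0.075),
    "E2E p50": (5.306, 5.565),
}

for name, (before, after) in results.items():
    print(f"{name}: {before} -> {after} ({pct_change(before, after):+.0f}%)")
```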

The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because
their instance isn't blocked by a heavy prefill) but hurts HEAVY requests
(extra KV transfer latency). This is a median-vs-tail tradeoff.

For SLOs targeting p50: P2P offload helps.
For SLOs targeting p90/p99: the combined baseline (no P2P offload) is better.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 12:28:24 +08:00