Design: offload HEAVY prefill only when P instance is less loaded than D AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P. Needs stricter elastic gate or higher timeout. But successful requests show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4.5 KiB
Elastic P2P Offload Design
Date: 2026-05-22
Context: P2P offload TTFT p50 improved 13% but p90 worsened 59%. Root cause: P instance overloaded (serving its own requests + heavy offload). KV transfer itself is only 0.5s, not the bottleneck. External KV correctly registered in prefix cache (no bug).
1. Problem
Current P2P offload: HEAVY requests ALWAYS offloaded to a different instance.
- When P instance is idle → offload is beneficial (isolates heavy prefill from D's decode)
- When P instance is busy → offload queues behind P's own work → TTFT p90 explodes
2. Design: Elastic Offload with Load-Aware Decision
Core idea: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per-request:
For each HEAVY request:
  1. Compute offload_benefit = estimated decode disruption saved on D instance
  2. Compute offload_cost = P instance queue delay + KV transfer time
  3. if offload_benefit > offload_cost → OFFLOAD
     else → COLOCATE (do P+D on session-sticky instance)
2.1 Offload Decision Function
# PREFILL_THROUGHPUT (tokens/s) is a module-level setting; see --prefill-throughput in section 4.

def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Decide whether to offload this HEAVY request."""
    # Cost: extra time added by going through P (queue wait + KV transfer).
    # Prefill compute time (estimated_new_tokens / PREFILL_THROUGHPUT) is the
    # same on either instance, so it is left out of the comparison.
    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
    kv_transfer_time = 0.5  # empirical constant from our measurements
    offload_cost = p_queue_time + kv_transfer_time

    # Benefit: how much would colocated prefill disrupt D's decode?
    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
    # If D is idle, there's no disruption to avoid.
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption_time = (estimated_new_tokens / PREFILL_THROUGHPUT) if d_is_decoding else 0
    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode

    return offload_benefit > offload_cost
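Rearranging the inequality in this cost model gives a break-even point that is easier to reason about than the function itself. When D is decoding, the condition 0.5 * N / T > Q / T + 0.5 simplifies to N > 2Q + T, where N is estimated_new_tokens, Q is P's ongoing tokens, and T is the prefill throughput. The sketch below computes that threshold. The helper name break_even_tokens is hypothetical, and the 8000 tokens/s throughput is an illustrative assumption rather than a measured value.

```python
KV_TRANSFER_TIME = 0.5    # seconds, the empirical constant used in section 2.1
DISRUPTION_FACTOR = 0.5   # chunked-prefill discount used in section 2.1


def break_even_tokens(p_ongoing_tokens, prefill_throughput):
    """Prefill size (tokens) above which offloading wins, assuming D is decoding.

    Derived from: DISRUPTION_FACTOR * N / T > Q / T + KV_TRANSFER_TIME
    """
    return (p_ongoing_tokens + KV_TRANSFER_TIME * prefill_throughput) / DISRUPTION_FACTOR


if __name__ == "__main__":
    throughput = 8000.0  # tokens/s, assumed for illustration only
    for queued in (0, 4000, 16000):
        print(queued, break_even_tokens(queued, throughput))
```

Under these assumptions, even an idle P only pays off for prefills that take longer than one second of prefill work, and every token already queued on P raises the bar by two tokens.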
2.2 Simplified Heuristic (for implementation)
The above is complex. Simpler version:
from statistics import mean

def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Offload only if P is less loaded than D and not overloaded."""
    # Don't offload if P is at least as loaded as D (would make things worse)
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False

    # Don't offload if P is already heavily loaded (queue too long).
    # all_instances is the proxy's list of instance records.
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False

    # Offload if D is currently busy with decode
    if d_inst.ongoing_decode_tokens > 0:
        return True

    # D is idle: no benefit from offloading
    return False
2.3 Key Properties
- HEAVY + P idle + D busy → OFFLOAD (best case: P has capacity, D benefits from isolation)
- HEAVY + P busy → COLOCATE (P would queue, no benefit)
- HEAVY + D idle → COLOCATE (no decode to disrupt)
- WARM/MEDIUM → always COLOCATE (small prefill, not worth transfer overhead)
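The four properties above can be checked against the simplified heuristic with a small table of cases. The sketch below is a variant of should_offload() that also returns a reason string, using the offload_decision_reason labels listed under Metrics. The Instance dataclass, the decide() name, the HEAVY_THRESHOLD value, and the choice to report both P-side gates as "p_overloaded" are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean

HEAVY_THRESHOLD = 4096  # tokens; placeholder value, not taken from the design


@dataclass
class Instance:
    """Hypothetical stand-in for the proxy's per-instance load record."""
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0


def decide(estimated_new_tokens, d_inst, p_inst, all_instances):
    """Return (offload, reason) following the simplified heuristic."""
    if estimated_new_tokens < HEAVY_THRESHOLD:
        return False, "below_threshold"   # WARM/MEDIUM: always colocate
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if (p_inst.ongoing_tokens >= d_inst.ongoing_tokens
            or p_inst.ongoing_tokens > avg_load * 1.5):
        return False, "p_overloaded"      # assumed label for both P-side gates
    if d_inst.ongoing_decode_tokens == 0:
        return False, "d_idle"            # nothing on D to protect
    return True, "p_idle_d_busy"


idle = Instance()
prefill_only = Instance(ongoing_tokens=6000, ongoing_decode_tokens=0)
busy = Instance(ongoing_tokens=20000, ongoing_decode_tokens=500)
busier = Instance(ongoing_tokens=30000, ongoing_decode_tokens=200)
fleet = [idle, prefill_only, busy, busier]

print(decide(8000, busy, idle, fleet))          # HEAVY, P idle, D busy
print(decide(8000, busy, busier, fleet))        # HEAVY, P busy
print(decide(8000, prefill_only, idle, fleet))  # HEAVY, D not decoding
print(decide(1000, busy, idle, fleet))          # WARM/MEDIUM
```

One consequence worth noting: a D instance with no ongoing tokens at all never reaches the "d_idle" branch, because the first P-side gate (P at least as loaded as D) already rejects the offload.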
2.4 Expected Behavior Under Load
Low load (few concurrent requests):
  Most instances idle → P always available → most HEAVY offloaded

Medium load (8 concurrent sessions):
  Some instances busy → offload only when P is free
  ~50% of HEAVY offloaded, ~50% colocated

High load (all instances busy):
  No instance has spare capacity → almost nothing offloaded
  Falls back to pure combined mode (which is optimal under high load)
This naturally adapts: offload when there's spare capacity, colocate when system is saturated.
3. Metrics to Track
Per-request breakdown (proxy-level):
- route_class: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
- offload_decision_reason: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
- t_proxy_recv, t_prefill_sent, t_prefill_done, t_first_token, t_done
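The five timestamps are enough to split each request's latency into phases. A minimal sketch follows, assuming all timestamps are seconds from the same clock and that t_prefill_sent and t_prefill_done are only set for offloaded requests. The Breakdown class and its derived methods are hypothetical and do not reflect the proxy's actual log schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Breakdown:
    """Hypothetical per-request record built from the proxy timestamps."""
    route_class: str
    offload_decision_reason: str
    t_proxy_recv: float
    t_first_token: float
    t_done: float
    t_prefill_sent: Optional[float] = None   # assumed unset when colocated
    t_prefill_done: Optional[float] = None   # assumed unset when colocated

    def ttft(self) -> float:
        """Time to first token as seen by the proxy."""
        return self.t_first_token - self.t_proxy_recv

    def e2e(self) -> float:
        """End-to-end latency as seen by the proxy."""
        return self.t_done - self.t_proxy_recv

    def remote_prefill(self) -> Optional[float]:
        """Time between sending to P and P reporting done; None if colocated."""
        if self.t_prefill_sent is None or self.t_prefill_done is None:
            return None
        return self.t_prefill_done - self.t_prefill_sent

    def handoff(self) -> Optional[float]:
        """Gap from prefill done on P to first token on D; None if colocated."""
        if self.t_prefill_done is None:
            return None
        return self.t_first_token - self.t_prefill_done
```

Comparing remote_prefill() across "p_idle_d_busy" decisions would show directly whether queuing on P, rather than KV transfer, is what drives the TTFT tail.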
Per-instance (from vLLM /metrics + logs):
- prefix_cache_hit_rate (local)
- external_prefix_cache_hit_rate (Mooncake KV)
- Combined: local + external = total effective APC
GPU utilization (5s sampling):
- Per-GPU util%, memory usage
- Detect load imbalance early
4. Implementation
Changes to cache_aware_proxy.py:
- Replace fixed if estimated_new >= HEAVY_THRESHOLD with should_offload() function
- Track ongoing_decode_tokens per instance (already have this)
- Add offload_decision_reason to breakdown log
- Add --prefill-throughput parameter (tokens/s, for cost estimation)
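The new flag could be wired up with argparse roughly as follows. This is a sketch only: the 8000 tokens/s default is a placeholder rather than a measured value, build_parser and queue_delay_seconds are hypothetical helpers, and how cache_aware_proxy.py actually parses its arguments is not shown in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the flag proposed in this section."""
    parser = argparse.ArgumentParser(description="cache-aware proxy (sketch)")
    parser.add_argument(
        "--prefill-throughput",
        type=float,
        default=8000.0,  # placeholder; measure on the target hardware
        help="Prefill throughput in tokens/s, used for offload cost estimation",
    )
    return parser


def queue_delay_seconds(ongoing_tokens: int, prefill_throughput: float) -> float:
    """Estimated wait on P before a new prefill starts (the queue cost term)."""
    if prefill_throughput <= 0:
        raise ValueError("prefill throughput must be positive")
    return ongoing_tokens / prefill_throughput


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(queue_delay_seconds(6000, args.prefill_throughput))
```

Rejecting non-positive values up front avoids a division by zero inside the cost estimate if the flag is misconfigured.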