# Elastic P2P Offload Design

**Date**: 2026-05-22

**Context**: P2P offload TTFT p50 improved 13% but p90 worsened 59%. Root cause: P instance overloaded (serving its own requests + heavy offload). KV transfer itself is only 0.5s, not the bottleneck. External KV correctly registered in prefix cache (no bug).

---

## 1. Problem

Current P2P offload: HEAVY requests are ALWAYS offloaded to a different instance.

- When P instance is idle → offload is beneficial (isolates heavy prefill from D's decode)
- When P instance is busy → offload queues behind P's own work → TTFT p90 explodes

## 2. Design: Elastic Offload with Load-Aware Decision

**Core idea**: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per request:

```
For each HEAVY request:
  1. Compute offload_benefit = estimated decode disruption saved on D instance
  2. Compute offload_cost = P instance queue delay + KV transfer time
  3. if offload_benefit > offload_cost → OFFLOAD
     else → COLOCATE (do P+D on session-sticky instance)
```

### 2.1 Offload Decision Function

```python
def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Decide whether to offload this HEAVY request."""
    # PREFILL_THROUGHPUT (tokens/s) comes from the --prefill-throughput parameter.
    # Cost: extra delay caused by offloading (P queue wait + KV transfer).
    # Prefill compute time is the same either way, so it is left out.
    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
    kv_transfer_time = 0.5  # seconds; empirical constant from our measurements
    offload_cost = p_queue_time + kv_transfer_time

    # Benefit: how much would colocated prefill disrupt D's decode?
    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
    # If D is idle, there's no disruption to avoid.
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption_time = (estimated_new_tokens / PREFILL_THROUGHPUT) if d_is_decoding else 0
    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode

    return offload_benefit > offload_cost
```

### 2.2 Simplified Heuristic (for implementation)

The cost model above is complex. A simpler version for implementation:

```python
from statistics import mean

def should_offload(estimated_new_tokens, d_inst, p_inst, all_instances):
    """Offload only if P is significantly less loaded than D."""
    # Don't offload if P is at least as loaded as D (would make things worse)
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False

    # Don't offload if P is already heavily loaded (queue too long)
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False

    # Offload if D is currently busy with decode
    if d_inst.ongoing_decode_tokens > 0:
        return True

    # D is idle: no benefit from offloading
    return False
```

### 2.3 Key Properties

1. **HEAVY + P idle + D busy** → OFFLOAD (best case: P has capacity, D benefits from isolation)
2. **HEAVY + P busy** → COLOCATE (P would queue, no benefit)
3. **HEAVY + D idle** → COLOCATE (no decode to disrupt)
4. **WARM/MEDIUM** → always COLOCATE (small prefill, not worth transfer overhead)

### 2.4 Expected Behavior Under Load

```
Low load (few concurrent requests):
  Most instances idle → P always available → most HEAVY offloaded

Medium load (8 concurrent sessions):
  Some instances busy → offload only when P is free
  ~50% of HEAVY offloaded, ~50% colocated

High load (all instances busy):
  No instance has spare capacity → almost nothing offloaded
  Falls back to pure combined mode (which is optimal under high load)
```

This adapts naturally: offload when there is spare capacity, colocate when the system is saturated.

## 3. Metrics to Track

Per-request breakdown (proxy-level):

- `route_class`: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
- `offload_decision_reason`: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
- `t_proxy_recv`, `t_prefill_sent`, `t_prefill_done`, `t_first_token`, `t_done`

Per-instance (from vLLM /metrics + logs):

- `prefix_cache_hit_rate` (local)
- `external_prefix_cache_hit_rate` (Mooncake KV)
- Combined: local + external = total effective APC

GPU utilization (5s sampling):

- Per-GPU util%, memory usage
- Detect load imbalance early

## 4. Implementation

Changes to `cache_aware_proxy.py`:

- Replace the fixed `if estimated_new >= HEAVY_THRESHOLD` check with the `should_offload()` function
- Track `ongoing_decode_tokens` per instance (already have this)
- Add `offload_decision_reason` to the breakdown log
- Add a `--prefill-throughput` parameter (tokens/s, for cost estimation)
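
A minimal sketch of how the decision and its logged reason could be produced together, so that `route_class` and `offload_decision_reason` always agree. It follows the simplified heuristic from section 2.2. The `Instance` dataclass, the `decide_offload()` and `breakdown_fields()` helpers, the `HEAVY_THRESHOLD` value, and the mapping of each early return to a reason string are assumptions for illustration, not the actual contents of `cache_aware_proxy.py`.

```python
from dataclasses import dataclass
from statistics import mean

HEAVY_THRESHOLD = 4096  # hypothetical value; the real threshold is set in the proxy


@dataclass
class Instance:
    """Hypothetical per-instance load snapshot tracked by the proxy."""
    name: str
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0


def decide_offload(estimated_new, d_inst, p_inst, all_instances):
    """Return (offload, reason) using the simplified heuristic."""
    # Not a HEAVY request: WARM/MEDIUM always colocate.
    if estimated_new < HEAVY_THRESHOLD:
        return False, "below_threshold"

    # P is no less loaded than D, or well above the fleet average.
    avg_load = mean([inst.ongoing_tokens for inst in all_instances])
    if (p_inst.ongoing_tokens >= d_inst.ongoing_tokens
            or p_inst.ongoing_tokens > avg_load * 1.5):
        return False, "p_overloaded"

    # D has no decode in flight, so there is nothing to protect.
    if d_inst.ongoing_decode_tokens == 0:
        return False, "d_idle"

    return True, "p_idle_d_busy"


def breakdown_fields(estimated_new, d_inst, p_inst, all_instances):
    """Build the decision-related fields for the per-request breakdown log."""
    offload, reason = decide_offload(estimated_new, d_inst, p_inst, all_instances)
    fields = {"offload_decision_reason": reason}
    # Only HEAVY requests get a P2P/COLO route class here; the WARM/MEDIUM
    # split is decided elsewhere and is not modeled in this sketch.
    if reason != "below_threshold":
        fields["route_class"] = "HEAVY_P2P" if offload else "HEAVY_COLO"
    return fields


# Example: inst0 is decoding and loaded, inst1 is nearly idle.
instances = [Instance("inst0", 1000, 200), Instance("inst1", 100, 0)]
print(breakdown_fields(8000, instances[0], instances[1], instances))
```

Returning the reason from the same function that makes the decision avoids a second code path that re-derives it for logging, which is where the two could drift apart.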